WO2023170973A1 - 強化学習装置、強化学習方法、及び強化学習プログラム - Google Patents
強化学習装置、強化学習方法、及び強化学習プログラム Download PDFInfo
- Publication number
- WO2023170973A1 WO2023170973A1 PCT/JP2022/011121 JP2022011121W WO2023170973A1 WO 2023170973 A1 WO2023170973 A1 WO 2023170973A1 JP 2022011121 W JP2022011121 W JP 2022011121W WO 2023170973 A1 WO2023170973 A1 WO 2023170973A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- agent model
- reinforcement learning
- reward
- simulation
- state
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- the disclosed technology relates to a reinforcement learning device, a reinforcement learning method, and a reinforcement learning program.
- Reinforcement learning is a method that can learn better behavior in an unknown environment. It can also handle continuous-valued actions, and when dealing with continuous actions, the probability density function of the policy can be treated as a normal distribution with mean μ and variance σ² (see, for example, Non-Patent Literature 1). In this case, the larger σ is, the greater the variation in the calculated action, and the wider the search that is performed.
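- As a minimal illustration of such a Gaussian policy for continuous actions (a sketch that is not part of this publication; the function name and the numeric values are hypothetical), a larger σ simply spreads the sampled actions more widely around the policy mean:

```python
import numpy as np

def sample_action(mu, sigma, rng=None):
    """Draw a continuous action from a normal policy N(mu, sigma^2).

    A larger sigma produces larger variation in the sampled actions,
    i.e. a wider search around the policy mean mu.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=mu, scale=sigma)

rng = np.random.default_rng(0)
mu = 22.0  # e.g. a temperature set point used as the policy mean
print([round(sample_action(mu, 0.1, rng), 2) for _ in range(5)])  # narrow search
print([round(sample_action(mu, 2.0, rng), 2) for _ in range(5)])  # wide search
```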
- The first problem is that reinforcement learning takes a long time to converge. Therefore, if the search can be performed efficiently, with as few trials as possible and without excessively increasing the computation time per trial, the computation time can be reduced and learning can be made to converge.
- The second problem is that in reinforcement learning, which searches by trial and error based on a policy, an incomplete policy may cause the search to fail and fall into a local solution, making optimal control impossible.
- The disclosed technology has been made in view of the above points, and an object thereof is to provide a reinforcement learning device, a reinforcement learning method, and a reinforcement learning program that can dynamically adjust the search space according to a predicted reward in reinforcement learning targeting a continuous action space.
- A first aspect of the present disclosure is a reinforcement learning device that performs reinforcement learning targeting a continuous action space, wherein predetermined settings for a simulation and an agent model are stored, and in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired. The device includes an agent model estimation unit that inputs the state acquired by the simulation into the agent model and acquires a policy, an action determination unit that calculates the action based on the policy and a predefined search amount, and a search amount estimation unit for estimating the search amount. The agent model estimation unit updates the agent model according to the settings of the agent model based on the state, the reward, the flag, and the action; the search amount estimation unit updates the search amount based on a predicted reward obtained for the reward and the search amount in the previous trial; and the calculation of the action, the update of the agent model, and the update of the search amount are repeated until a predetermined condition according to the flag and the settings is satisfied.
- A second aspect of the present disclosure is a reinforcement learning method that performs reinforcement learning targeting a continuous action space, wherein predetermined settings for a simulation and an agent model are stored, and in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired. The method causes a computer to execute a process of: inputting the state acquired by the simulation into the agent model and acquiring a policy; calculating the action based on the policy and a predefined search amount; further updating the agent model according to the settings of the agent model based on the state, the reward, the flag, and the action; updating the search amount based on a predicted reward obtained for the reward and the search amount in the previous trial; and repeating the calculation of the action, the update of the agent model, and the update of the search amount until a predetermined condition according to the flag and the settings is satisfied.
- A third aspect of the present disclosure is a reinforcement learning program that performs reinforcement learning targeting a continuous action space, wherein predetermined settings for a simulation and an agent model are stored, and in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired. The program causes a computer to execute a process of: inputting the state acquired by the simulation into the agent model and acquiring a policy; calculating the action based on the policy and a predefined search amount; further updating the agent model according to the settings of the agent model based on the state, the reward, the flag, and the action; updating the search amount based on a predicted reward obtained for the reward and the search amount in the previous trial; and repeating the calculation of the action, the update of the agent model, and the update of the search amount until a predetermined condition according to the flag and the settings is satisfied.
- According to the disclosed technology, the search space can be dynamically adjusted according to the predicted reward.
- FIG. 1 is a block diagram showing the hardware configuration of a reinforcement learning device.
- FIG. 2 is a block diagram showing the functional configuration of the reinforcement learning device according to the present embodiment. FIG. 3 is an example of data stored in the learning setting storage unit. FIG. 4 is an example of agent model data stored in the model storage unit. FIG. 5 is an example of behavior data stored in the behavior storage unit. FIG. 6 is a flowchart showing the flow of reinforcement learning processing by the reinforcement learning device.
- FIG. 1 is a block diagram showing the hardware configuration of the reinforcement learning device 100.
- The reinforcement learning device 100 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. These components are communicably connected to one another via a bus 19.
- the CPU 11 is a central processing unit that executes various programs and controls various parts. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes the program using the RAM 13 as a work area. The CPU 11 controls each of the above components and performs various arithmetic operations according to programs stored in the ROM 12 or the storage 14. In this embodiment, the ROM 12 or the storage 14 stores a reinforcement learning program.
- the ROM 12 stores various programs and various data.
- the RAM 13 temporarily stores programs or data as a work area.
- the storage 14 is constituted by a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs including an operating system and various data.
- the input unit 15 includes a pointing device such as a mouse and a keyboard, and is used to perform various inputs.
- the display unit 16 is, for example, a liquid crystal display, and displays various information.
- the display section 16 may employ a touch panel system and function as the input section 15.
- The communication interface 17 is an interface for communicating with other devices such as terminals. For example, a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark) is used.
- FIG. 2 is a block diagram showing the functional configuration of the reinforcement learning device of this embodiment.
- Each functional configuration is realized by the CPU 11 reading out a reinforcement learning program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
- the reinforcement learning device 100 performs reinforcement learning targeting a continuous action space.
- the reinforcement learning device 100 includes a learning setting storage section 110, an agent model estimating section 111, a search amount estimating section 112, and a behavior determining section 113.
- This configuration is the main configuration 100A of the reinforcement learning device 100.
- the reinforcement learning device 100 includes a setting input section 101, a simulation execution section 102, a model storage section 103, a behavior storage section 104, and an operation output section 105 as processing sections responsible for input/output functions.
- the settings input unit 101 stores data received through input from the user in the learning settings storage unit 110. Note that the setting input section 101 corresponds to the input section 15 in terms of hardware.
- the learning settings storage unit 110 stores data received from the user through the settings input unit 101 as settings.
- FIG. 3 is an example of data stored in the learning setting storage unit 110.
- Information is saved for each column of "setting item”, “setting content”, and “setting target”.
- “Setting contents” are settings or setting values for “setting items.”
- the “setting target” is the target processing unit of the reinforcement learning device 100.
- "Setting items" include the "search amount estimator parameters" and the "initial search amount", which are used by the search amount estimation unit 112. As the "search amount estimator parameters", the parameters α, β, and C used in equations (1) and (2) described later are determined.
- "setting item” is a setting for ⁇ reinforcement learning algorithm name ⁇ , which is used by the agent model estimation unit 111.
- ⁇ Reinforcement learning algorithm name ⁇ selects a reinforcement learning algorithm (hereinafter simply referred to as "algorithm” refers to a reinforcement learning algorithm) that determines the processing content of the agent model estimation unit 111.
- algorithm a reinforcement learning algorithm
- “setting items” required for each algorithm are stored, including settings for ⁇ maximum number of steps ⁇ and ⁇ agent model storage frequency ⁇ .
- "setting items” are ⁇ simulation type name ⁇ that selects the processing content of the simulation execution unit 102, ⁇ simulation initialization parameters ⁇ , and ⁇ initial behavior value ⁇ that indicates the action value at the start of execution. The settings for are saved.
- the learning setting storage unit 110 transmits each saved setting value to the simulation execution unit 102, agent model estimating unit 111, and search amount estimating unit 112, which are targets for setting each setting value.
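- For illustration, the stored settings could be represented as in the following sketch (the item names follow the columns of FIG. 3 as described above, while the concrete values and the parameter names α, β, and C are assumptions, not values from the publication):

```python
# Hypothetical contents of the learning setting storage unit 110.
# Keys mirror the "setting item" column; values are example "setting contents";
# the trailing comments note the "setting target" processing unit.
learning_settings = {
    "search amount estimator parameters": {"alpha": 0.1, "beta": 1.0, "C": 0.5},  # search amount estimation unit 112
    "initial search amount": 1.0,                                                 # search amount estimation unit 112
    "reinforcement learning algorithm name": "SAC",                               # agent model estimation unit 111
    "maximum number of steps": 100_000,                                           # agent model estimation unit 111
    "agent model storage frequency": 1_000,                                       # agent model estimation unit 111
    "simulation type name": "indoor temperature/humidity reproduction env",       # simulation execution unit 102
    "simulation initialization parameters": {"start_time": "08:00"},              # simulation execution unit 102
    "initial behavior value": 22.0,                                               # simulation execution unit 102
}
```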
- the simulation execution unit 102 executes a simulation for the input action a.
- As for the action a, which triggers execution of the simulation, the initial action value is received from the learning setting storage unit 110 at the time of the first operation and is set as the action a.
- Thereafter, the action a is received from the action determination unit 113.
- The simulation execution unit 102 outputs a state s (next state s) observed as a result of the simulation, a reward r defined for the state, and a flag d, and transmits these outputs to the agent model estimation unit 111.
- the flag d is a truth value indicating whether the simulation has ended and the simulation environment should be reset.
- The internal algorithm of the simulation execution unit 102 is set according to the "simulation type name" stored in the learning setting storage unit 110.
- As the internal algorithm, for example, a simulator, or something equivalent to a video game or a board game whose screen or state transitions in response to a specific operation, can be used.
- A simulator is one prepared by the user in advance that reproduces the state changes that occur when equipment is operated in a specific state.
- For example, a simulator that reproduces changes in indoor temperature and humidity when air conditioning is controlled can be used.
- the actual environment is, for example, an environment where there is a building where air conditioning can be controlled, and where changes in indoor temperature and humidity can be measured using sensors or the like and data can be collected.
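- A minimal sketch of the interface such a simulation execution unit could expose is shown below (the reset/step convention and all class and method names are assumptions for illustration, not taken from the publication):

```python
class SimulationExecutionUnit:
    """Hypothetical wrapper corresponding to the simulation execution unit 102.

    reset() returns an initial state s; step(a) returns the next state s,
    the reward r defined for that state, and the flag d indicating whether
    the simulation has ended and the environment should be reset.
    """

    def __init__(self, simulator, reward_rule):
        self.simulator = simulator      # e.g. an indoor temperature/humidity model
        self.reward_rule = reward_rule  # maps a state to a reward r
        self.state = None

    def reset(self, **init_params):
        self.state = self.simulator.initialize(**init_params)
        return self.state

    def step(self, action):
        next_state, done = self.simulator.advance(self.state, action)
        reward = self.reward_rule(next_state)
        self.state = next_state
        return next_state, reward, done
```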
- the agent model estimation unit 111 receives the output (state s, reward r, and flag d) transmitted from the simulation execution unit 102, and receives the behavior a of the previous trial transmitted from the behavior determination unit 113. Further, the agent model estimation unit 111 reads various setting values stored in the learning setting storage unit 110 and extracts an agent model stored in the model storage unit 103.
- The agent model estimation unit 111 inputs the state s obtained from the simulation execution unit 102 into the agent model, and obtains the policy π as part of the output of the agent model.
- The agent model estimation unit 111 transmits the acquired policy π to the action determination unit 113.
- the agent model estimation unit 111 inputs the state s, reward r, flag d, and action a to the extracted agent model, and updates the agent model.
- The state s used to update the agent model is the next state s for the next time (trial), as described later, together with the reward r and the flag d corresponding to that next state s.
- The action a is the one updated after the policy π is obtained.
- The internal algorithm (reinforcement learning algorithm) used to calculate the agent model is defined by the "reinforcement learning algorithm name" stored in the learning setting storage unit 110.
- Existing technology may be used as the reinforcement learning algorithm, and an algorithm that targets continuous value behavior may be used.
- the agent models defined by the algorithms are in the form of functions or neural networks, and their hyperparameters and weighting coefficients of the neural networks are updated in a manner defined by each algorithm.
- the history of the state s, reward r, and action a may be saved in the agent model estimation unit and used to update the model.
- The agent model estimation unit 111 transmits the updated agent model to the model storage unit 103 based on the setting values.
- The search amount estimation unit 112 receives the reward r transmitted from the simulation execution unit 102, and updates the predicted reward r_pred based on the reward r, for example, by equation (1).
- The parameter α in equation (1) is the learning rate of the predicted reward, and the corresponding setting value of the "search amount estimator parameters" stored in the learning setting storage unit 110 is used.
- r_pred on the right-hand side is the predicted reward before the update. Note that an arbitrary value such as 0 is used as the initial value of r_pred on the right-hand side. ...(1)
- The search amount estimation unit 112 determines the search amount σ from equation (2) based on the predicted reward r_pred, and transmits it to the action determination unit 113.
- The parameters β and C in equation (2) use the corresponding setting values of the "search amount estimator parameters" stored in the learning setting storage unit 110. ...(2)
- Equations (1) and (2) are simplified models of the mechanism that dynamically adjusts movement variability in animal and human motor learning.
- The action determination unit 113 calculates and determines the action a for the next trial based on the policy π transmitted from the agent model estimation unit 111 and the search amount σ output from the search amount estimation unit 112, and transmits the action a to the simulation execution unit 102.
- The probability density function of the action a can be expressed as shown in equation (3) below, using the search amount σ output from the search amount estimation unit 112.
- x is a random variable.
- Action a is stochastically determined according to a probability density function. This allows reinforcement learning to be performed in a continuous action space. ...(3)
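- Because equations (1) to (3) themselves are not reproduced in this text, the sketch below only assumes one plausible form consistent with the description: the predicted reward r_pred is tracked as an exponential moving average with learning rate α, the search amount σ shrinks as the predicted reward grows (parameters β and C), and the action is sampled from a normal density as in equation (3). The concrete formulas are assumptions, not the publication's equations.

```python
import numpy as np

class SearchAmountEstimator:
    """Hypothetical sketch of the search amount estimation unit 112.

    Assumed forms (NOT the publication's equations):
      (1) r_pred <- r_pred + alpha * (r - r_pred)        # predicted reward update
      (2) sigma  <- C / (1.0 + beta * max(r_pred, 0.0))  # search amount update
    They only illustrate the stated behavior: the search amount widens
    when the (predicted) reward is small and narrows when it is large.
    """

    def __init__(self, alpha, beta, C, initial_sigma, initial_r_pred=0.0):
        self.alpha, self.beta, self.C = alpha, beta, C
        self.sigma = initial_sigma
        self.r_pred = initial_r_pred

    def update(self, reward):
        self.r_pred += self.alpha * (reward - self.r_pred)               # cf. equation (1)
        self.sigma = self.C / (1.0 + self.beta * max(self.r_pred, 0.0))  # cf. equation (2)
        return self.sigma

def determine_action(policy_mean, sigma, rng):
    """Action determination unit 113: sample a ~ N(policy_mean, sigma^2), cf. equation (3)."""
    return rng.normal(loc=policy_mean, scale=sigma)
```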
- the agent model updated by the agent model estimation unit 111 is stored in the model storage unit 103.
- FIG. 4 is an example of agent model data stored in the model storage unit 103.
- The model may basically be saved every time it is updated. If an "agent model storage frequency" is written in the learning setting storage unit 110, that setting is followed.
- the behavior storage unit 104 stores the behavior at each time transmitted from the behavior determination unit 113.
- FIG. 5 is an example of behavior data stored in the behavior storage unit 104.
- the operation output unit 105 extracts the behavior for a specific period stored in the behavior storage unit 104, and outputs the control content to the target controller.
- FIG. 6 is a flowchart showing the flow of reinforcement learning processing by the reinforcement learning device 100. Reinforcement learning processing is performed by the CPU 11 reading a reinforcement learning program from the ROM 12 or the storage 14, loading it onto the RAM 13, and executing it.
- In step S100, the CPU 11 initializes the agent model and the search amount.
- the agent model is initialized by the agent model estimating section 111, and the search amount is initialized by the search amount estimating section 112.
- The agent model estimation unit 111 initializes the agent model based on setting items such as the "reinforcement learning algorithm name" stored in the learning setting storage unit 110.
- If a model corresponding to the combination of the "reinforcement learning algorithm name" and the "simulation type name" stored in the learning setting storage unit 110 already exists in the model storage unit 103, the model with the largest number of steps among the corresponding models is read out as the weights of the agent model.
- In that case, the current number of steps is defined by the number of steps of the read model. If no model stored in the model storage unit 103 is read out, the current number of steps is set to 0. The current number of steps is held within the agent model estimation unit 111.
- The search amount estimation unit 112 extracts the "search amount estimator parameters" and the "initial search amount" stored in the learning setting storage unit 110, and initializes the search amount to the value of the initial search amount.
- In step S102, the CPU 11 initializes the simulator in the simulation execution unit 102 and obtains the state s.
- The simulation execution unit 102 reads the "simulation type name" and the "simulation initialization parameters" stored in the learning setting storage unit 110, and initializes the simulation environment corresponding to the simulation type name using the simulation initialization parameters. Upon initialization, the simulation execution unit 102 outputs an initial state s to the agent model estimation unit 111. The state s is also stored inside the simulation execution unit 102.
- In step S104, the CPU 11, as the agent model estimation unit 111, inputs the state s obtained from the simulation execution unit 102 into the agent model, and obtains the policy π as an output.
- In step S106, the CPU 11, as the action determination unit 113, calculates and determines the action a based on the policy π output from the agent model estimation unit 111 and the search amount σ defined by the search amount estimation unit 112.
- The action a is output to the simulation execution unit 102 and the behavior storage unit 104, and is also stored within the agent model estimation unit 111.
- In step S108, the CPU 11 adds 1 to the current step stored in the agent model estimation unit 111.
- In step S110, the CPU 11, as the simulation execution unit 102, obtains the next state s, the reward r, and the flag d.
- The simulation execution unit 102 acquires the next state s for the next time (next trial) based on the action a acquired from the action determination unit 113 and the state s stored inside the simulation execution unit 102.
- the simulation execution unit 102 obtains the next state s, and also calculates a reward r and a flag d indicating whether the simulation execution has ended, depending on the next state s.
- In step S112, the CPU 11, as the agent model estimation unit 111, updates the agent model.
- The update is executed according to the "reinforcement learning algorithm name", based on the state s, reward r, and flag d acquired from the simulation execution unit 102 and the action a stored inside the agent model estimation unit 111.
- The updated agent model is stored in the model storage unit 103 when the frequency described in the "agent model storage frequency" is met.
- Depending on the algorithm, there are cases where the agent model is updated each time a simulation is executed, and cases where the model is not updated every time but is updated once every several executions.
- If the algorithm registered in the "reinforcement learning algorithm name" is one that performs a batch update once every several executions, the data is stored inside the agent model estimation unit 111 without updating the model. That is, at times other than the update timing defined by the algorithm, the agent model is not updated; instead, the state s, reward r, and flag d obtained from the simulation execution unit 102 are stored inside the agent model estimation unit 111.
- At the update timing, the agent model is updated based on the history of the state s, reward r, flag d, and action a stored in the agent model estimation unit 111.
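- For an algorithm that performs such a batch update once every several executions, the buffering described above might look like the following minimal sketch (the buffer size and the update condition are illustrative assumptions):

```python
class TransitionBuffer:
    """Holds (state, action, reward, next_state, flag) tuples inside the
    agent model estimation unit 111 until the algorithm's update timing."""

    def __init__(self, update_every=64):
        self.update_every = update_every
        self.history = []

    def add(self, state, action, reward, next_state, done):
        self.history.append((state, action, reward, next_state, done))

    def ready(self):
        # Update timing as defined by the algorithm (here: every `update_every` transitions).
        return len(self.history) >= self.update_every

    def drain(self):
        # Hand the accumulated history to the agent model update and clear the buffer.
        batch, self.history = self.history, []
        return batch
```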
- In step S114, the CPU 11, as the search amount estimation unit 112, updates the search amount based on the predicted reward obtained from the reward r acquired from the simulation execution unit 102 and the search amount σ at the previous time stored in the search amount estimation unit 112.
- Specifically, the search amount is updated by calculating the predicted reward using equation (1) above and then calculating the search amount σ using equation (2).
- the updated search amount is stored inside the search amount estimation section 112 and used when the action determination section 113 operates. By updating the search amount in this way, it is possible to update the search space to widen it when the reward amount is small.
- In step S116, the CPU 11 determines whether the flag d in the simulation execution unit 102 is True or False. If it is True, the process returns to step S102, where the simulation execution unit 102 is initialized, and the subsequent processing is executed again. If the flag d is False, the process moves to step S118. Being False is an example of satisfying the predetermined condition regarding the flag in the present disclosure.
- In step S118, the CPU 11 determines whether the current step, which is a variable held in the agent model estimation unit 111, exceeds the maximum number of steps stored in the learning setting storage unit 110. If it does not exceed the maximum number of steps, the processing from step S104 onward is executed again; if it exceeds it, all processing ends. Exceeding the maximum number of steps is an example of satisfying the predetermined condition according to the settings in the present disclosure.
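- Putting steps S100 to S118 together, the overall flow of FIG. 6 could be sketched as follows (hypothetical Python pseudocode; env, agent, search_estimator, and determine_action stand for the simulation execution unit 102, the agent model estimation unit 111, the search amount estimation unit 112, and the action determination unit 113, and their method names are assumptions):

```python
import numpy as np

def run_reinforcement_learning(env, agent, search_estimator, determine_action,
                               max_steps, seed=0):
    """Hypothetical sketch of the processing flow of FIG. 6 (steps S100 to S118)."""
    rng = np.random.default_rng(seed)
    sigma = search_estimator.sigma              # S100: agent model and search amount initialized beforehand
    state = env.reset()                         # S102: initialize the simulator, obtain state s
    step = 0
    while True:
        policy_mean = agent.policy(state)       # S104: input s into the agent model, obtain the policy
        action = determine_action(policy_mean, sigma, rng)      # S106: calculate the action a
        step += 1                               # S108: advance the current step counter
        next_state, reward, done = env.step(action)             # S110: next state s, reward r, flag d
        agent.update(state, action, reward, next_state, done)   # S112: update the agent model
        sigma = search_estimator.update(reward)                 # S114: update the search amount
        if done:                                # S116: flag d is True -> back to S102
            state = env.reset()
            continue
        if step >= max_steps:                   # S118: stop once the maximum number of steps is exceeded
            return agent
        state = next_state                      # otherwise continue from S104 with the next state
```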
- As described above, in reinforcement learning targeting a continuous action space, the search space can be dynamically adjusted according to the predicted reward.
- While rewards are being obtained, the time required for learning to converge is shortened by not expanding the search space; in situations where no reward can be obtained, the search space is expanded so that optimal control can be realized without falling into a local solution.
- When rewards are being obtained, the current policy can already perform good control, so the demand for an extensive search is low.
- On the other hand, when the policy reaches a point where no reward is obtained, the search space is expanded, and by trying a wide range of actions, the agent escapes from the local solution and searches for the optimal solution.
- With the method of the present disclosure, it is possible to shorten the time until learning converges by performing the search efficiently in reinforcement learning targeting a continuous action space, which solves the first problem. Furthermore, by widening the search space when the reward amount is small, it is possible to learn a policy that obtains more reward without falling into a local solution, which solves the second problem.
- <When used for air conditioning control> In this usage mode, a simulator that predicts future room temperature changes and heat consumption from inputs such as weather data, the number of visitors, past room temperatures, and air conditioning control data is used as the simulation execution unit 102, and air conditioning control set values are handled as actions. This makes it possible to create an agent model that learns optimal air conditioning control that achieves energy savings while maintaining comfort.
- Temperature prediction by the simulation execution unit 102 can be achieved by using a neural network or regression model that inputs various data and outputs room temperature.
- heat consumption can be predicted by using a regression model that predicts the amount of heat required by inputting weather data, the number of visitors, and air conditioner settings. Moreover, these can also be used in combination.
- The simulation execution unit 102 internally holds data acquired from various sensors, such as weather data, the number of visitors, past room temperatures, and the air conditioning control history (referred to here as environmental data). It is assumed that the simulation execution unit 102 has used these environmental data to learn, in advance, a model that reproduces environmental changes at a future time and a model that estimates the amount of heat (heat consumption) consumed by the air conditioning equipment under a given air conditioning control. It is further assumed that rules for evaluating whether the estimated temperature/humidity and heat consumption are comfortable and energy-saving are determined in advance.
- step S100 is as described in the processing flow above.
- The simulation execution unit 102 reads the "simulation type name" and the "simulation initialization parameters" registered in the learning setting storage unit 110. Since the device is used here for air conditioning control, the name of a simulator (e.g., indoor temperature/humidity reproduction env) is specified as the "simulation type name".
- The simulation execution unit 102 is initialized according to the "simulation initialization parameters". For example, one day is randomly selected from among the dates for which environmental data exists and the simulation can be executed, and the indoor temperature and humidity from time t of that day are reproduced, using the time t specified in the simulation initialization parameters.
- The environmental data necessary for this reproduction is loaded and held in the simulation execution unit 102. In addition, the environmental data necessary for estimating heat consumption from time t of that day is similarly loaded and held in the simulation execution unit 102. As the initial state, the indoor temperature and humidity data at time t are acquired and output to the agent model estimation unit 111.
- Steps S104, S108, S112 and subsequent steps are as described in the processing flow above.
- Step S106 is as described in the processing flow above.
- action a indicates an air conditioning control method at a certain time, and indicates set values for each air conditioning device as shown in FIG.
- the simulation execution unit 102 predicts the indoor temperature and humidity at the next time (for example, 10 minutes later).
- the indoor temperature and humidity are predicted based on the behavior a (that is, the air conditioning control method) acquired from the behavior determination unit 113, the state s stored in the simulation execution unit 102, and the environmental data loaded in advance. Further, the amount of heat consumed by the air conditioner during air conditioning control is estimated using the state s and the environmental data loaded in advance.
- the reward is determined based on a predetermined rule for evaluating whether the condition is good from the viewpoint of comfort and energy saving, based on the state s representing the indoor temperature and humidity and the predicted heat consumption. If the time at which the simulation was performed is the last time for which data exists on the date, the flag d indicating whether the simulation has ended is set to True; otherwise, it is set to False.
- the state s, reward r, and flag d are output to the agent model estimator 111 and the search amount estimator 112.
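- The comfort/energy evaluation rule mentioned above could, for instance, look like the following sketch (the comfort bands, the weight, and the function name are invented for illustration and are not part of the publication):

```python
def air_conditioning_reward(temperature, humidity, heat_consumption,
                            comfort_temp=(20.0, 26.0), comfort_hum=(40.0, 60.0),
                            energy_weight=0.01):
    """Hypothetical reward r: positive while the room stays in the comfort band,
    penalized in proportion to the predicted heat consumption."""
    comfortable = (comfort_temp[0] <= temperature <= comfort_temp[1]
                   and comfort_hum[0] <= humidity <= comfort_hum[1])
    comfort_term = 1.0 if comfortable else -1.0
    return comfort_term - energy_weight * heat_consumption
```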
- In this usage mode, the simulation execution unit 102 uses a simulator that predicts the future state of equipment from inputs indicating the equipment state and the equipment operation, and handles equipment operation commands (such as motor operations and equipment movement instructions) as actions.
- Information indicating the state of the device includes joint angles and speeds, robot position information, etc. This makes it possible to create an agent model that learns optimal device control to achieve the desired behavior.
- It is assumed that the simulation execution unit 102 has been trained in advance so that it can predict changes in the device state from previously measured data, or that it can predict changes in the device state using a physics simulator. It is further assumed that rules for evaluating whether the motion is the desired motion are determined in advance.
- step S100 is as described in the processing flow above.
- The simulation execution unit 102 reads the "simulation type name" and the "simulation initialization parameters" registered in the learning setting storage unit 110.
- Since the device is used here for robot control, the name of a simulator (e.g., robot arm env) that predicts the next state from the previous state and the device operation is specified as the "simulation type name".
- The simulation execution unit 102 is initialized according to the "simulation initialization parameters".
- Steps S104, S108, S112 and subsequent steps are as described in the processing flow above.
- Step S106 is as described in the processing flow above.
- action a indicates a device control method at a certain time.
- the simulation execution unit 102 predicts a state change at the next time (for example, one second later).
- the state change is predicted based on the action a (that is, the device control method) acquired from the action determining unit 113, the state s saved in the simulation execution unit, and the environment data loaded in advance.
- the reward is determined based on a rule that evaluates whether the movement has a predetermined purpose.
- When the desired motion is achieved or the operation fails, the flag d indicating whether the simulation has ended is set to True; otherwise, it is set to False.
- the state s, reward r, and end flag d are output to the agent model estimator and the search amount estimator. Operation failures include, for example, dropping an object when carrying it with a robot arm, or moving the robot out of the operation target area.
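- A corresponding evaluation for the robot control case might be sketched as follows (purely illustrative; the state fields, thresholds, and penalty are assumptions):

```python
def robot_step_outcome(state):
    """Hypothetical rule: reward progress toward a target position and end the
    episode (flag d = True) on success or on an operation failure, such as a
    dropped object or leaving the operation target area."""
    distance = state["distance_to_target"]
    reward = -distance                        # closer to the target -> larger reward
    failed = state["object_dropped"] or state["out_of_area"]
    if failed:
        reward -= 10.0                        # penalty for an operation failure
    done = failed or distance < 0.01          # success threshold is illustrative
    return reward, done
```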
- the simulation execution unit 102 uses as a simulator a game in which the state changes by inputting information indicating the state (game screen, etc.) and game operations, and handles the game operations as actions. This makes it possible to create an agent model that learns game operations that will yield high scores. At this time, it is assumed that the rules of the game are determined in advance and can be obtained as rewards.
- step S100 is as described in the processing flow above.
- The simulation execution unit 102 reads the "simulation type name" and the "simulation initialization parameters" registered in the learning setting storage unit 110.
- Since the device is used here for game operation, the name of the simulator (e.g., block break env) is specified as the "simulation type name".
- Steps S104, S108, S112 and subsequent steps are as described in the processing flow above.
- Step S106 is as described in the processing flow above.
- The action a indicates a game operation at a certain time.
- the simulation execution unit 102 executes the game using the action a (that is, the game operation) acquired from the action determination unit 113, and obtains the state change at the next time (for example, after one frame).
- rewards are obtained based on predetermined game rules.
- When the game has ended, the flag d indicating whether the simulation has ended is set to True; otherwise, it is set to False.
- the state s, reward r, and end flag d are output to the agent model estimator and the search amount estimator.
- The reinforcement learning process that the CPU executes by reading software (a program) in the above embodiment may be executed by various processors other than the CPU.
- Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), a GPU (Graphics Processing Unit), and a dedicated electric circuit that is a processor having a circuit configuration designed exclusively for executing specific processing, such as an ASIC (Application Specific Integrated Circuit).
- The reinforcement learning process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA).
- More specifically, the hardware structure of these various processors is an electric circuit combining circuit elements such as semiconductor elements.
- The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
Abstract
Description
The method using the reinforcement learning device 100 according to the present disclosure can be applied in various industrial fields, so each case is described below with examples of use.
A memory; and
at least one processor connected to the memory,
wherein the processor constitutes
a reinforcement learning device that performs reinforcement learning targeting a continuous action space, wherein
predetermined settings for a simulation and an agent model are stored,
in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired,
the state acquired by the simulation is input into the agent model and a policy is acquired,
the action is calculated based on the policy and a predefined search amount,
the agent model is further updated according to the settings of the agent model based on the state, the reward, the flag, and the action,
the search amount is updated based on a predicted reward obtained for the reward and the search amount in the previous trial, and
the calculation of the action, the update of the agent model, and the update of the search amount are repeated until a predetermined condition according to the flag and the settings is satisfied.
A reinforcement learning device configured as described above.
A non-transitory storage medium storing a program executable by a computer so as to perform reinforcement learning processing, wherein
the program is a reinforcement learning program that performs reinforcement learning targeting a continuous action space,
predetermined settings for a simulation and an agent model are stored,
in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired,
the state acquired by the simulation is input into the agent model and a policy is acquired,
the action is calculated based on the policy and a predefined search amount,
the agent model is further updated according to the settings of the agent model based on the state, the reward, the flag, and the action,
the search amount is updated based on a predicted reward obtained for the reward and the search amount in the previous trial, and
the calculation of the action, the update of the agent model, and the update of the search amount are repeated until a predetermined condition according to the flag and the settings is satisfied.
A non-transitory storage medium.
100 Reinforcement learning device
101 Setting input unit
102 Simulation execution unit
103 Model storage unit
104 Behavior storage unit
105 Operation output unit
110 Learning setting storage unit
111 Agent model estimation unit
112 Search amount estimation unit
113 Action determination unit
Claims (6)
- A reinforcement learning device that performs reinforcement learning targeting a continuous action space, wherein predetermined settings for a simulation and an agent model are stored, and in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired, the reinforcement learning device comprising: an agent model estimation unit that inputs the state acquired by the simulation into the agent model and acquires a policy; an action determination unit that calculates the action based on the policy and a predefined search amount; and a search amount estimation unit for estimating the search amount, wherein the agent model estimation unit updates the agent model according to the settings of the agent model based on the state, the reward, the flag, and the action, the search amount estimation unit updates the search amount based on a predicted reward obtained for the reward and the search amount in the previous trial, and the calculation of the action, the update of the agent model, and the update of the search amount are repeated until a predetermined condition according to the flag and the settings is satisfied.
- The reinforcement learning device according to claim 1, wherein the search amount estimation unit calculates the predicted reward based on the reward and a learning-rate parameter of the predicted reward defined in the settings, and updates the search amount based on the calculated predicted reward and parameters for search amount estimation in the settings.
- The reinforcement learning device according to claim 1 or 2, wherein the action determined by the action determination unit is stochastically determined according to a probability density function using a random variable and the mean and variance of a normal distribution represented by the policy.
- The reinforcement learning device according to claim 1 or 2, wherein the action is an air conditioning control method.
- A reinforcement learning method that performs reinforcement learning targeting a continuous action space, wherein predetermined settings for a simulation and an agent model are stored, and in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired, the reinforcement learning method causing a computer to execute a process of: inputting the state acquired by the simulation into the agent model and acquiring a policy; calculating the action based on the policy and a predefined search amount; further updating the agent model according to the settings of the agent model based on the state, the reward, the flag, and the action; updating the search amount based on a predicted reward obtained for the reward and the search amount in the previous trial; and repeating the calculation of the action, the update of the agent model, and the update of the search amount until a predetermined condition according to the flag and the settings is satisfied.
- A reinforcement learning program that performs reinforcement learning targeting a continuous action space, wherein predetermined settings for a simulation and an agent model are stored, and in a simulation based on the settings in the reinforcement learning, a predefined action is input and a state in the next trial, a reward corresponding to the state, and a flag indicating whether execution of the simulation has ended are acquired, the reinforcement learning program causing a computer to execute a process of: inputting the state acquired by the simulation into the agent model and acquiring a policy; calculating the action based on the policy and a predefined search amount; further updating the agent model according to the settings of the agent model based on the state, the reward, the flag, and the action; updating the search amount based on a predicted reward obtained for the reward and the search amount in the previous trial; and repeating the calculation of the action, the update of the agent model, and the update of the search amount until a predetermined condition according to the flag and the settings is satisfied.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/011121 WO2023170973A1 (ja) | 2022-03-11 | 2022-03-11 | 強化学習装置、強化学習方法、及び強化学習プログラム |
JP2024505854A JPWO2023170973A1 (ja) | 2022-03-11 | 2022-03-11 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/011121 WO2023170973A1 (ja) | 2022-03-11 | 2022-03-11 | 強化学習装置、強化学習方法、及び強化学習プログラム |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023170973A1 true WO2023170973A1 (ja) | 2023-09-14 |
Family
ID=87936387
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/011121 WO2023170973A1 (ja) | 2022-03-11 | 2022-03-11 | 強化学習装置、強化学習方法、及び強化学習プログラム |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2023170973A1 (ja) |
WO (1) | WO2023170973A1 (ja) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013225192A (ja) * | 2012-04-20 | 2013-10-31 | Nippon Telegr & Teleph Corp <Ntt> | 報酬関数推定装置、報酬関数推定方法、およびプログラム |
US20200113017A1 (en) * | 2018-10-05 | 2020-04-09 | Airspan Networks Inc. | Apparatus and method for configuring a communication link |
US20200110964A1 (en) * | 2018-10-04 | 2020-04-09 | Seoul National University R&Db Foundation | Method and device for reinforcement learning using novel centering operation based on probability distribution |
CN111782871A (zh) * | 2020-06-18 | 2020-10-16 | 湖南大学 | 基于时空强化学习的跨模态视频时刻定位方法 |
-
2022
- 2022-03-11 JP JP2024505854A patent/JPWO2023170973A1/ja active Pending
- 2022-03-11 WO PCT/JP2022/011121 patent/WO2023170973A1/ja active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013225192A (ja) * | 2012-04-20 | 2013-10-31 | Nippon Telegr & Teleph Corp <Ntt> | 報酬関数推定装置、報酬関数推定方法、およびプログラム |
US20200110964A1 (en) * | 2018-10-04 | 2020-04-09 | Seoul National University R&Db Foundation | Method and device for reinforcement learning using novel centering operation based on probability distribution |
US20200113017A1 (en) * | 2018-10-05 | 2020-04-09 | Airspan Networks Inc. | Apparatus and method for configuring a communication link |
CN111782871A (zh) * | 2020-06-18 | 2020-10-16 | 湖南大学 | 基于时空强化学习的跨模态视频时刻定位方法 |
Non-Patent Citations (4)
Title |
---|
"Actor-Critic Algorithm Using Appropriateness History For Actor: Reinforcement Learning Under Incomplete Value-Function", ARTIFICIAL INTELLIGENCE, vol. 15, no. 2, 2000, pages 267 - 275 |
GABRIEL DULAC-ARNOLD; DANIEL MANKOWITZ; TODD HESTER: "Challenges of Real-World Reinforcement Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 April 2019 (2019-04-29), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081268717 * |
KOBAYASHI, TAISUKE: "Machine Learning and Control: Reinforcement Learning in Continuous Action Space", JOURNAL OF THE SOCIETY OF INSTRUMENT AND CONTROL ENGINEERS, vol. 58, no. 10, 10 October 2019 (2019-10-10), pages 806 - 810, XP009548991, ISSN: 0453-4662, DOI: 10.11499/sicejl.58.806 * |
KOICHIRO MORI; HAYATO YAMANA: "Kyoka-gakushu heiretsuka ni yoru gakushu no kosokuka (in Japanese) (Accelerating Learning by Reinforcement Learning Parallelization)", IPSJ SIG TECHNICAL REPORTS, INTELLIGENCE AND COMPLEX SYSTEM (ICS), 2004, pages 89 - 94
Also Published As
Publication number | Publication date |
---|---|
JPWO2023170973A1 (ja) | 2023-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hao et al. | Exploration in deep reinforcement learning: From single-agent to multiagent domain | |
Wang et al. | Multi-population following behavior-driven fruit fly optimization: A Markov chain convergence proof and comprehensive analysis | |
US20230131283A1 (en) | Method for generating universal learned model | |
JP7279445B2 (ja) | 予測方法、予測プログラムおよび情報処理装置 | |
Abed-Alguni et al. | A comparison study of cooperative Q-learning algorithms for independent learners | |
Duell et al. | Solving partially observable reinforcement learning problems with recurrent neural networks | |
CN110481536A (zh) | 一种应用于混合动力汽车的控制方法及设备 | |
CN112016678B (zh) | 用于增强学习的策略生成网络的训练方法、装置和电子设备 | |
Ma et al. | A novel APSO-aided weighted LSSVM method for nonlinear hammerstein system identification | |
Hafez et al. | Topological Q-learning with internally guided exploration for mobile robot navigation | |
Wu et al. | Torch: Strategy evolution in swarm robots using heterogeneous–homogeneous coevolution method | |
JP2018528511A (ja) | 生産システムにおける出力効率の最適化 | |
Pan et al. | Additional planning with multiple objectives for reinforcement learning | |
Hafez et al. | Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space | |
JP6947029B2 (ja) | 制御装置、それを使用する情報処理装置、制御方法、並びにコンピュータ・プログラム | |
WO2023170973A1 (ja) | 強化学習装置、強化学習方法、及び強化学習プログラム | |
CN115453860A (zh) | 环境参数控制设备集群控制方法、装置、设备及存储介质 | |
CN117478538A (zh) | 一种基于深度强化学习的物联网设备探测与控制方法 | |
Tan et al. | Q-learning with heterogeneous update strategy | |
Du et al. | Reinforcement learning | |
Fu et al. | Federated Reinforcement Learning for Adaptive Traffic Signal Control: A Case Study in New York City | |
Du et al. | A novel locally regularized automatic construction method for RBF neural models | |
Li et al. | A novel estimation of distribution algorithm using graph-based chromosome representation and reinforcement learning | |
Ansari et al. | Language expansion in text-based games | |
Chandak et al. | Equilibrium bandits: Learning optimal equilibria of unknown dynamics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22930953 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2024505854 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022930953 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2022930953 Country of ref document: EP Effective date: 20241011 |