WO2023037504A1

WO2023037504A1 - Reinforced learning system, reinforced learning device, and reinforced learning method

Info

Publication number: WO2023037504A1
Application number: PCT/JP2021/033360
Authority: WO
Inventors: 裕志吉田
Original assignee: 日本電気株式会社
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2023-03-16
Also published as: JPWO2023037504A1

Abstract

In order to select a more suitable action, this reinforced learning system (1) comprises an acquisition unit (11) for acquiring a first state in an environment that is the object of reinforced learning, a generation unit (12) for generating a second state by adding noise to the first state, a calculation unit (13) for calculating a first action value function in accordance with the second state, and a selection unit (14) for selecting an action in accordance with the first action value function.

Description

Reinforcement learning system, reinforcement learning device, and reinforcement learning method

The present invention relates to a reinforcement learning system, a reinforcement learning device, and a reinforcement learning method.

Reinforcement learning research is underway to learn actions that maximize the reward obtained when the next action is taken in a certain state. Japanese Patent Laid-Open No. 2002-200002 describes a technique for learning images of concave parts and convex parts and control amounts when combining parts by reinforcement learning in assembly work by a robot arm. Further, Patent Literature 2 describes a technique of learning an accelerator operation amount using reinforcement learning, and selecting an action based on a throttle opening command value and a retardation amount according to the state. Patent Literature 2 also describes that a function approximator may be used for the action-value function Q.

International Publication No. 2018/146770 Japanese Patent Application Laid-Open No. 2021-67193

However, the techniques described in

Patent Documents

1 and 2 have room for improvement in terms of selecting more suitable actions. This is because an appropriate action can be selected if the action-value function can be accurately estimated in reinforcement learning, but the action-value function estimated in the techniques described in

Patent Documents

1 and 2 includes errors. Especially when the state-action space is huge, it is difficult to accurately estimate the action-value function.

One aspect of the present invention has been made in view of the above problems, and an example of its purpose is to provide a technique that allows selection of more suitable actions.

A reinforcement learning system according to one aspect of the present invention includes acquisition means for acquiring a first state in an environment that is a target of reinforcement learning, and adding noise to the first state to generate a second state. A generating means, a calculating means for calculating a first action-value function according to the second state, and a selecting means for selecting an action according to the first action-value function.

A reinforcement learning device according to one aspect of the present invention includes acquisition means for acquiring a first state in an environment that is a target of reinforcement learning, and adding noise to the first state to generate a second state. A generating means, a calculating means for calculating a first action-value function according to the second state, and a selecting means for selecting an action according to the first action-value function.

A reinforcement learning method according to one aspect of the present invention includes generating a second state by adding noise to the first state in an environment that is a target of reinforcement learning, and depending on the second state, calculating a first action-value function; and selecting an action in response to said first action-value function.

According to one aspect of the present invention, a more suitable action can be selected.

1 is a block diagram showing the configuration of a reinforcement learning system according to Exemplary Embodiment 1 of the present invention; FIG. FIG. 2 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 1 of the present invention; 1 is a block diagram illustrating the configuration of a reinforcement learning system according to exemplary embodiment 1 of the present invention; FIG. 1 is a block diagram showing an example of a device configuration for realizing exemplary embodiment 1 of the present invention; FIG. FIG. 4 is a block diagram showing the configuration of a reinforcement learning system according to exemplary embodiment 2 of the present invention; FIG. 5 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 2 of the present invention; FIG. 11 is a diagram showing an example of a game screen according to exemplary embodiment 3 of the present invention; Fig. 3 illustrates a first state according to exemplary embodiment 3 of the present invention; FIG. 10 is a diagram showing an example of an evaluation result according to an application example of exemplary embodiment 3 of the present invention; FIG. 10 is a diagram showing an example of an evaluation result according to an application example of exemplary embodiment 3 of the present invention; FIG. 10 is a diagram showing an example of an evaluation result according to an application example of exemplary embodiment 3 of the present invention; FIG. 10 is a diagram showing an example of an evaluation result according to an application example of exemplary embodiment 3 of the present invention; 3 is a block diagram showing configurations of a computer functioning as a reinforcement learning device, a terminal 20, and a server 30 according to exemplary embodiments 1 to 7 of the present invention; FIG.

[Exemplary embodiment 1]
A first exemplary embodiment of the invention will now be described in detail with reference to the drawings. This exemplary embodiment is the basis for the exemplary embodiments described later.

<Configuration of reinforcement learning system>
A configuration of a reinforcement learning system 1 according to this exemplary embodiment will be described with reference to FIG. FIG. 1 is a block diagram showing the configuration of a reinforcement learning system 1. As shown in FIG. The reinforcement learning system 1 is a system that selects actions by reinforcement learning. The reinforcement learning system 1 is, for example, a system for controlling construction operations of construction machines such as excavators, a system for controlling transportation by a transportation device, or a system for autonomous play of computer games. However, the reinforcement learning of the reinforcement learning system 1 is not limited to the example described above, and the reinforcement learning performed by the reinforcement learning system 1 can be applied to various systems. The action is the action of an agent in reinforcement learning, and examples thereof include excavating motion control of an excavator, transport motion control of a transport device, or autonomous play control of a computer game. However, the actions are not limited to these examples, and may be other than the above.

The reinforcement learning system 1 includes an acquisition unit 11, a generation unit 12, a calculation unit 13, and a selection unit 14, as shown in FIG. The acquisition unit 11 is a configuration that implements acquisition means in this exemplary embodiment. The generation unit 12 is a configuration that implements generation means in this exemplary embodiment. The calculator 13 is a configuration that implements a calculator in this exemplary embodiment. The selection unit 14 is a configuration that realizes selection means in this exemplary embodiment.

The acquisition unit 11 acquires the first state. The first state is the state in the environment that is the object of reinforcement learning. For example, if the reinforcement learning system 1 is a system for selecting an excavating operation of an excavator, the first state may be, for example, the posture and position of the excavator that excavates the earth and sand, the shape of the earth and sand to be excavated, and the excavation including part or all of the amount of soil in the bucket of the machine. Further, when the reinforcement learning system 1 is a system for selecting the transport operation of the transport device, the first state includes, for example, the position, moving direction, speed and angle of the transport device, the position of the passage, and the static location and velocity of dynamic obstacles or dynamic obstacles. Moreover, if the reinforcement learning system 1 is a system for autonomous play of a computer game, the first state includes, as an example, the state of an object that affects the progress of the game in the computer game. However, the first state is not limited to the one described above, and may be another state. The first condition may include environmental conditions such as temperature or weather, for example.

The generator 12 generates the second state by adding noise to the first state. Noise is, for example, random numbers such as normal random numbers or uniform random numbers. However, the noise added to the first state by the generator 12 is not limited to these, and may be noise other than the above. The generator 12 may add noise to all of the elements included in the first state, or may add noise to some of the elements included in the first state.

The calculation unit 13 calculates the first action value function according to the second state. As an example, the calculation unit 13 calculates the first action value function using a state sequence including a plurality of second states. Further, the calculation unit 13 may calculate the first action value function using a state sequence including the first state and one or more second states. In other words, the state sequence used by the calculator 13 to calculate the first action-value function includes one or more second states, and the states included in the state sequence are either the first state or the first state. 2 state. In the following description, when there is no need to distinguish between the first state and the second state, they may simply be referred to as "states."

The first action value function is a function for evaluating actions in a state. The first action-value function is, for example, an action-value function used in Q-learning, and is updated by the following equation (1), for example. However, the first action-value function is not limited to that given by Equation (1), and may be another function.

In equation (1), s _t ⁽ⁱ⁾ (1≤i≤n; i and n are natural numbers) is the state included in the state sequence (i.e., the first state or the second state), and a is the action , and Q(s _t ⁽ⁱ⁾ , a) is the first action-value function. α is the learning rate, s _t+1 ⁽ⁱ⁾ is the post-transition state, r _t+1 is the reward the agent receives when it transitions to the state s _t+1 ⁽ⁱ⁾ , and γ (0≦γ≦1) is the discount rate. Also, a′εA, set A is the set of possible actions of the agent in state s _t ⁽ⁱ⁾ .

A reward is a reward obtained from the environment by an agent's actions. For example, the reward is added or subtracted according to the amount of excavation by the excavator, the time required for excavation, the time required for transportation, whether or not there was contact with an obstacle during transportation, the win or loss of the game, or the score of the game. is the value to be However, the reward is not limited to these examples, and may be other than the above.

When formula (1) is used, the calculation unit 13 calculates the first action value function for each state included in the state string. In other words, the calculation unit 13 calculates the first action-value function by the number of states included in the state sequence.

The selection unit 14 selects an action according to the first action value function. As an example, the selection unit 14 selects an action that maximizes the first action-value function. The selection unit 14 may select an action by an ε greedy method, roulette selection used in genetic algorithms, a softmax method using Boltzmann distribution, or the like.

Further, when using a plurality of first action-value functions, the selection unit 14 may select an action using any one of the plurality of first action-value functions, as an example, and the calculation unit 13 may A second action-value function may be calculated using a plurality of calculated first action-value functions, and an action may be selected using the calculated second action-value function. A second action value function is a function for evaluating actions in a state. The second action-value function may be, for example, expected values of a plurality of first action-value functions. may also be a function with a small value. The second action value function is given by the following formula (2) or formula (3) as an example. However, the second action value function is not limited to the one given by Equation (2) or (3), and may be other functions.

In equations (2) and (3), J(s _t , a) is the second action-value function, s _t is the first state, a is the action, θ is the hyperparameter, Q(s _t ⁽ⁱ⁾ , a) is the first action-value function, s _t ⁽ⁱ⁾ is the state included in the state sequence, and E is the expected value. Equation (3) is obtained by Taylor-expanding Equation (2), adopting terms up to the second order, and discarding the third and subsequent terms.

When the selection unit 14 calculates the second action-value function, the selection unit 14 selects an action that maximizes the second action-value function, using the policy given by Equation (4) as an example. Note that the policy for selecting an action is not limited to the policy given by Equation (4), and may be another policy. The selection unit 14 may select actions by, for example, the ε greedy method, roulette selection used in genetic algorithms, or the softmax method using the Boltzmann distribution. When using the ε-greedy approach, the policy is given by the following equation (5) as an example.

In equations (4) and (5), π is the next action to choose, and a′ is the possible action of the agent in the first state s _t . Also, in equation (5), ε (0<ε<1) is a constant, and v is a random number that satisfies 0≦v≦1.

<Effect of Reinforcement Learning System>
According to the reinforcement learning system 1 according to the present exemplary embodiment, by calculating the action value function using the second state obtained by adding noise to the first state, the first action considering the variation of the state A value function can be calculated. By selecting an action using this first action-value function, the reinforcement learning system 1 can select a more suitable action.

<Flow of reinforcement learning method>
FIG. 2 is a flowchart showing the flow of the reinforcement learning method S1 executed by the reinforcement learning system 1. As shown in FIG. The reinforcement learning system 1 repeatedly selects an action by repeating the reinforcement learning method S1. In addition, the description about the content already demonstrated is not repeated.

The reinforcement learning method S1 includes steps S11 to S14. In step S11, the obtaining unit 11 obtains the first state. In step S12, the generator 12 generates a second state by adding noise to the first state.

In step S13, the calculation unit 13 calculates the first action value function according to the second state. Here, in the n-th repetition (n is a natural number), the data that the calculation unit 13 refers to in order to calculate the first action-value function is, for example, the state accumulated up to the (n-1)th repetition. , behavior, and reward are used. In step S14, the selection unit 14 selects an action according to the first action value function.

<Effect of reinforcement learning method>
According to the reinforcement learning method S1 according to the present exemplary embodiment, by calculating the action-value function using the second state in which noise is added to the first state, the action-value function considering the variation of the states can be calculated. can be calculated. By selecting an action using this action-value function, a more suitable action can be selected.

<Device configuration example of reinforcement learning system>
Next, a device configuration example of the reinforcement learning system 1 according to this exemplary embodiment will be described with reference to the drawings. FIG. 3 is a block diagram showing an example of the configuration of the reinforcement learning system 1. As shown in FIG. In the example of FIG. 3, the reinforcement learning system 1 includes a reinforcement learning device 10. FIG. The reinforcement learning device 10 includes an acquisition unit 11 , a generation unit 12 , a calculation unit 13 and a selection unit 14 . The reinforcement learning device 10 is, for example, a server device, a personal computer, or a game device, but is not limited to these, and may be a device other than the above. As an example, the reinforcement learning device 10 may acquire the first state by receiving the first state via a communication interface.

FIG. 4 is a block diagram showing another example of the configuration of the reinforcement learning system 1. As shown in FIG. In the example of FIG. 4, the reinforcement learning system 1 includes a terminal 20 and a server 30. FIG. The terminal 20 is, for example, a personal computer or a game machine, but is not limited to these, and may be a device other than the above. The terminal 20 has an acquisition unit 11 . The server 30 includes a generator 12 , a calculator 13 and a selector 14 . The terminal 20 acquires the first state and supplies the acquired first state to the server 30 .

Although FIGS. 3 and 4 are illustrated as configuration examples of the reinforcement learning system 1 in this exemplary embodiment, the configuration of the reinforcement learning system 1 is not limited to those illustrated in FIGS. Various other configurations are applicable.

[Exemplary embodiment 2]
A second exemplary embodiment of the invention will now be described in detail with reference to the drawings. Components having the same functions as the components described in exemplary embodiment 1 are denoted by the same reference numerals, and description thereof will not be repeated.

<Configuration of reinforcement learning system>
FIG. 5 is a block diagram showing the configuration of the reinforcement learning system 2. As shown in FIG. As shown in FIG. 5, the reinforcement learning system 2 includes a terminal 40 and a reinforcement learning device 50. FIG. The terminal 40 and the reinforcement learning device 50 are configured to be communicable via a communication line N. FIG. The specific configuration of the communication line N does not limit this exemplary embodiment, but as an example, it may be a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, or a mobile data communication network. , or a combination of these networks can be used.

The terminal 40 is, for example, a general-purpose computer, and more specifically, for example, a control device for controlling construction machinery such as an excavator, a management device for managing transportation by a transportation device, or a game device for playing a computer game. is. Note that the terminal 40 is not limited to these, and may be a device other than the above. The reinforcement learning device 50 is, for example, a server device.

<Device configuration>
The terminal 40 includes a communication section 41 , a control section 42 and an input reception section 43 . The communication unit 41 transmits and receives information to and from the reinforcement learning device 50 via the communication line N under the control of the control unit 42 . Hereinafter, the transmission/reception of information between the control unit 42 and the reinforcement learning device 50 via the communication unit 41 is simply referred to as the transmission/reception of information between the control unit 42 and the reinforcement learning device 50 .

The control unit 42 includes a state provision unit 421, an action execution unit 422, and a reward provision unit 423. The state providing unit 421 acquires the first state and provides the acquired first state to the reinforcement learning device 50 . In this exemplary embodiment, the first state obtained by the state provider 421 includes a plurality of elements accompanied by attributes. The attribute is information indicating the characteristics and/or type of the element, and includes, for example, information indicating whether the element is a dynamic element that moves within the environment or a static element that does not move within the environment. Attributes may also be information indicating the types of elements such as people, automobiles, bicycles, and buildings. However, the attribute is not limited to the above example, and may be information other than the above.

As an example, the state providing unit 421 may acquire, as the first state, sensor information output by a sensor that detects the operation of a construction machine, transport device, or the like. Also, as an example, the state providing unit 421 may acquire the first state of an object that affects the progress of a computer game. However, the first state acquired by the state providing unit 421 is not limited to the example described above, and may be a state other than the above.

As an example, the state providing unit 421 receives input of the first state via the input receiving unit 43 and provides the received first state to the reinforcement learning device 50 . Also, as an example, the state providing unit 421 may receive the first state from another device connected via the communication unit 41 and provide the received first state to the reinforcement learning device 50 .

The action execution unit 422 executes the action determined by the reinforcement learning device 50. As an example, the action execution unit 422 outputs control information for causing the construction machine, the transport device, or the like to perform the action determined by the reinforcement learning device 50 . Also, as an example, the action execution unit 422 controls the action of an object that is the target of a user's operation in a computer game. However, the actions executed by the reinforcement learning device 50 are not limited to the examples described above, and may be actions other than the above.

The reward providing unit 423 provides the reinforcement learning device 50 with the reward obtained when the agent executes the action determined by the reinforcement learning device 50 . As an example, the reward providing unit 423 may determine the amount of excavation by the excavator, the time required for excavation, the time required for transportation by the transportation device, the presence or absence of contact with obstacles during transportation, the win or loss of the game, or the score of the game. The indicated information is provided to the reinforcement learning device 50 as a reward. However, the reward provided by the reward providing unit 423 is not limited to the example described above, and may be other rewards than the above.

As an example, the reward providing unit 423 provides the reinforcement learning device 50 with the reward obtained via the input receiving unit 43 . Also, the reward providing unit 423 may receive a reward from another device connected via the communication unit 41 and provide the received reward to the reinforcement learning device 50 .

The input reception unit 43 receives various inputs to the terminal 40. The specific configuration of the input reception unit 43 does not limit this exemplary embodiment, but as an example, the input reception unit 43 can be configured to include an input device such as a keyboard and a touch pad. Further, the input reception unit 43 may be configured to include a data scanner that reads data via electromagnetic waves such as infrared rays and radio waves, a sensor that senses the state of the environment, and the like. As an example, the reward providing unit 423 measures the time required for transportation by the transport device based on the sensing result acquired by the input receiving unit 43, and provides the reinforcement learning device 50 with a reward indicating the measurement result.

The input reception unit 43 supplies the input-received information to the control unit 42 via the above-described input device, data scanner, sensor, and the like. As an example, the input reception unit 43 acquires the above-described state and reward, and supplies the acquired state and reward to the control unit 42 .

<Configuration of reinforcement learning device>
The reinforcement learning device 50 includes a communication section 51 , a control section 52 and a storage section 53 . The communication unit 51 transmits and receives information to and from the reinforcement learning device 50 via the communication line N under the control of the control unit 52 . In the following description, the transmission and reception of information between the control unit 52 and the terminal 40 via the communication unit 51 is simply referred to as the transmission and reception of information between the control unit 52 and the terminal 40 .

The control unit 52 includes a reward acquisition unit 521, a state observation unit 522, a state randomization unit 523, a learning unit 524, an estimation unit 525, and a selection unit 526. The state observer 522 is a configuration that implements an acquisition means in this exemplary embodiment. The state randomization unit 523 is a configuration that implements the generating means in this exemplary embodiment. The estimating unit 525 is a configuration that implements the calculating means in this exemplary embodiment. The selection unit 526 is a configuration that implements selection means in this exemplary embodiment.

The reward acquisition unit 521 acquires the reward provided by the terminal 40 via the communication unit 51. The state observation unit 522 acquires the first state provided by the terminal 40 via the communication unit 51 . State randomization section 523 generates one or more second states by adding noise to the first state obtained by state observation section 522 . The learning unit 524 learns the action-value function model 531 for updating the first action-value function. The action-value function model 531 is used to estimate the first action-value function.

The estimation unit 525 calculates a first action value function according to a state sequence including a first state and one or more second states or a state sequence including a plurality of second states. . Also, the estimation unit 525 calculates a second action-value function using the first action-value function.

The selection unit 526 selects an action using the second action value function, stores information indicating the selected action in the storage unit 53, and transmits information indicating the selected action to the terminal 40.

The storage unit 53 stores various data that the control unit 52 refers to. As an example, the storage unit 53 stores an action-value function model 531 and learning data 532 . The action-value function model 531 is a learning model for updating the first action-value function. The learning data 532 is data used in reinforcement learning performed by the reinforcement learning device 50 . Learning data 532 includes, by way of example, first states, second states, actions, and rewards.

<Flow of reinforcement learning method>
FIG. 6 is a flowchart showing the flow of the reinforcement learning method S2 executed by the reinforcement learning system 2. As shown in FIG. The reinforcement learning system 2 repeatedly selects an action by repeating steps S21 to S29. Note that some steps may be performed in parallel or out of order.

In step S<b>21 , the state providing unit 421 acquires the first state s _t and provides the acquired first state s _t to the reinforcement learning device 50 . In step S<b>22 , the state observation unit 522 acquires the first state s _t from the terminal 40 .

In step S23, the state randomization unit 523 generates one or more second states by adding noise to the first state _st . The noise added to the first state _st by the state randomization unit 523 is, for example, a normal random number or a uniform random number. However, the noise added to the first state _st by the state randomization unit 523 is not limited to these, and may be noise other than the above. The second state to which noise is added represents a state in which the first state _st is slightly blurred.

In this operation example, the state randomization unit 523 generates the second state by selectively adding noise to a plurality of elements included in the first state _st according to the attribute. As an example, the state randomization unit 523 adds noise to elements associated with attributes that satisfy a predetermined condition. The predetermined condition is, for example, an attribute indicating a dynamic element or an attribute indicating a static element. However, the predetermined condition is not limited to the example described above, and may be another condition.

The state randomization unit 523 also generates a state sequence {s _t ⁽ⁱ⁾ } (1≦i≦n; i is a natural number and n is a natural number of 2 or more) including the generated second state. A state sequence {s _t ⁽ⁱ⁾ } is a state sequence including a first state s _t and one or more second states, or a state sequence including a plurality of second states. In other words, the state sequence {s _t ⁽ⁱ⁾ } includes at least the second state and may or may not include the first state s _t .

In step S24, the estimation unit 525 calculates the first action value function Q(s _t ⁽ⁱ⁾ , a) according to the state sequence {s _t ⁽ⁱ⁾ }. As an example, the estimation unit 525 calculates a first action value function Q(s _t (i) , a ⁾ for each of a plurality of states s _t (i) included in the state sequence {s _t ⁽ⁱ ⁾ }. . More specifically, as an example, the estimation unit 525 updates the first action-value function Q(s _t ⁽ⁱ⁾ , a) for the state s _t ⁽ⁱ⁾ according to Equation (1) above. In this operation example, the first action value function (s _t ⁽ⁱ⁾ , a) is an m-dimensional vector (m is an integer of 2 or more), where m is the number of elements in set A (that is, the number of types of action a). ).

In step S25, the estimation unit 525 calculates the second action-value function J(s _t , a) based on the plurality of calculated first action-value functions Q(s _t ⁽ⁱ⁾ , a). The second action-value function J(s _t , a) is given by the above equation (2) or (3) as an example. In other words, the estimator 525 calculates the second action value function given by Equation (2) or Equation (3) above. The second action-value function given by the above formula (2) or the above-mentioned formula (3) is such that the larger the variation of the plurality of first action-value functions Q(s _t ⁽ⁱ⁾ , a), the more the first action-value function It is a function that is lower than the expected value of the function Q(s _t ⁽ⁱ⁾ , a).

^In step _S26 , the selection unit 526 selects the behavior _a to select. As an example, the selection unit 526 selects the action a by the policy given by the above equation (4). Note that the policy for selecting the action a is not limited to the policy given by the above formula (4), and other policies such as the ε-greedy policy and the softmax technique may be used. The selection unit 526 notifies the terminal 40 of the selected action a.

In step S<b>27 , the action execution unit 422 executes the action a notified from the reinforcement learning device 50 . In step S<b>28 , the reward providing unit 423 provides the reinforcement learning device 50 with the reward _rt obtained by executing the action selected by the reinforcement learning device 50 . In step S29, the reward obtaining unit 521 accumulates learning data including the state sequence {s _t ⁽ⁱ⁾ } and the reward r _t .

<Effect of Reinforcement Learning System>
In reinforcement learning, even if the state is slightly different, the value of the action-value function may be greatly different. In other words, a small difference in state can have a large impact on the value of the action-value function. In this exemplary embodiment, by calculating the first action value function Q using the second state in which some noise is added to the first state _st , the first An action-value function Q can be calculated. By selecting the action a using this first action-value function Q, the action a can be selected more appropriately according to this exemplary embodiment.

Further, in the reinforcement learning system 2 according to the present exemplary embodiment, a configuration is adopted in which the first action value function Q is calculated according to a plurality of states s _t ⁽ⁱ⁾ including the second state to which noise is added. It is Therefore, according to the reinforcement learning system 2 according to the present exemplary embodiment, in addition to the effect of the reinforcement learning system 1 according to the first exemplary embodiment, an effect that a more appropriate action a can be selected can be obtained.

Also, in this exemplary embodiment, when the reinforcement learning system 2 calculates the second action-value function J using the above equation (2), the second action-value function J is It is an index sensitive to the risk (variation) By the reinforcement learning system 2 selecting the action a using the second action-value function J, the action a that is more sensitive to risk can be selected.

Further, in this exemplary embodiment, when the reinforcement learning system 2 calculates the second action value function J using the above equation (3), the calculation process is overflow does not occur in By the reinforcement learning system 2 selecting the action a using the second action-value function J, the action a can be selected more appropriately, and the processing load associated with the selection of the action a can be reduced.

[Exemplary embodiment 3]
An exemplary embodiment 3 of the present invention will be described with reference to the drawings. Components having the same functions as those described in

exemplary embodiments

1 and 2 are denoted by the same reference numerals, and description thereof will not be repeated.

<Configuration of reinforcement learning system>
A reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 3") is obtained by applying the reinforcement learning system 2 according to the second exemplary embodiment to autonomous play of a computer game. The reinforcement learning system 3 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of reinforcement learning system 3 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.

In this exemplary embodiment, the first state _st includes, as an example, the state of an object that affects the progress of the game in a computer game. Action a includes, for example, the action of an object operated by a computer game player. The reward r _t includes, for example, a reward for winning or losing a game or a game score.

FIG. 7 is a diagram showing a screen SC1, which is an example of a game screen of a computer game related to the reinforcement learning system 3. FIG. The screen SC1 includes a first dynamic object C11, second dynamic objects C21-C23, first static objects C31-C34, and a second static object C4. The first dynamic object C11, the second dynamic objects C21-C23, the first static objects C31-C34, and the second static object C4 are examples of objects that affect the progress of the game.

In the computer game shown in FIG. 7, the player designates the moving direction of the first dynamic object C11 moving in the maze, and the second dynamic object C11 placed in the maze while dodging the tracking of the second dynamic objects C21 to C23. In this game, the round is cleared when the static objects C31 to C34 are collected.

The first dynamic object C11 and the second dynamic objects C21 to C23 are objects that move on the screen while the game is in progress, and are examples of dynamic elements that move within the environment. On the other hand, the first static objects C31 to C34 and the second static object C4 are objects that do not move on the screen while the game is in progress, and are examples of static elements that do not move within the environment. The first dynamic object C11 is an object to be operated by the player. The first dynamic object C11 moves in the maze at a constant speed during the progress of the game, and changes its moving direction according to the player's operation. The second dynamic objects C21 to C23 are objects that move following the first dynamic object C11 during the progress of the game. Although three second dynamic objects C21 to C23 are illustrated in FIG. 7, the number of second dynamic objects is not limited to three, and may be more or less.

The first static objects C31 to C34 are objects placed in the maze and collected by the first dynamic object C11. When the first dynamic object C11 collides with the first static objects C31-C34, the first static objects C31-C34 are recovered by the first dynamic object C11. Although four first static objects C31 to C34 are illustrated in FIG. 7, the number of first static objects is not limited to four, and may be more or less. A second static object C4 is a wall forming a maze.

In the example of FIG. 7, the first state s _t includes states for a first dynamic object C11, second dynamic objects C21-C23, first static objects C31-C34, and a second static object C4. . In other words, the first state includes states for dynamic elements that move within the environment and states for static elements that do not move within the environment. More specifically, the first state s _t includes the position of the first dynamic object C11, the positions of the second dynamic objects C21-C23, the positions of the first static objects C31-C34, and the second static the position of object C4.

In this exemplary embodiment, the first state s _t is an image representing a game play screen.
FIG. 8 is a diagram showing an image Img11 as an example of the first state _st . The image Img11 is a grayscale image in which the elements included in the game screen are represented by pixel values from 0 to 255. The image Img11 is divided into a predetermined number of squares, and each square is represented by a pixel value corresponding to the attribute of the element located in each square. As an example, the position of the first dynamic object C11 has a pixel value of 255, the positions of the second dynamic objects C21 to C23 have a pixel value of 160, and the positions of the first static objects C31 to C34 have a pixel value of 128. The position of the path formed by the two static objects C4 is represented by a pixel value of 64, and the immovable place is represented by a pixel value of 0.

In this exemplary embodiment, the action a is the movement of the first dynamic object C11, and there are four types of movement: move up, move down, move right, and move left. The reward r _t is, for example, a predetermined additional value obtained when the score increases (eg, +1), and a predetermined subtracted value obtained when captured by the second dynamic objects C21 to C23 (eg, -10). A predetermined additional value (for example, +1) may be obtained as a reward _rt when the score is increased by the action, regardless of the degree of increase in the score in one action.

<Flow of reinforcement learning method>
The reinforcement learning system 3 executes the reinforcement learning method S2 of FIG. 6 according to the exemplary embodiment 2 described above. The characteristic operation of this exemplary embodiment will be mainly described below, and the description of the contents described in the second exemplary embodiment will not be repeated.

In this exemplary embodiment, in step S23, the state randomizer 523 generates the second state by adding noise to the states of the dynamic elements included in the first state _st . As an example, the state randomization unit 523 generates a second state by randomizing the position of the first dynamic object C11 and the positions of the second dynamic objects C21 to C23 by random walk.

More specifically, as an example, the state randomization unit 523 divides the game screen into a predetermined number of squares (for example, into 33×33 squares), and divides the game screen into front, rear, left, and right directions (directions with roads) in which it can proceed. The probability of advancing/not advancing by one square is selected with equal probability. The state randomization unit 523 performs random walk σ ² times (σ is an integer equal to or greater than 1) for the position of the first dynamic object C11 and the second dynamic objects C21 to C23. Performing σ ² random walks moves the dynamic element by σ squares on average.

The state sequence {s _t ⁽ⁱ⁾ } generated by the state randomization unit 523 in step S23 includes the first state s _t and (n−1) second states obtained by randomizing the first state s _t . , and a total of n states. In addition, since there are four types of actions a: move up, move down, move right, and move left, the first action value function (s _t ⁽ⁱ⁾ , a ) is a four-dimensional vector.

In step S26, the selection unit 526 selects one of four types of action a, up, down, left, and right, as the movement direction of the first dynamic object C11 at an intersection or corner (that is, a point where the movement direction can be changed). However, the selection unit 526 excludes directions in which the first dynamic object C11 cannot move.

<Evaluation of this exemplary embodiment>
9 to 12 are diagrams showing examples of evaluation results of autonomous play of a computer game according to the reinforcement learning system 3, respectively. In the computer game according to this exemplary embodiment, the life of the first dynamic object is one, and the game is over when the first dynamic object is captured by the second dynamic object. Also, the number of stages was one, and the game ended when the game was cleared, that is, when all the first static objects were collected.

In the examples of FIGS. 9 to 12, the results of the autonomous play of the computer game by the reinforcement learning system 3 were evaluated under multiple conditions in which the values of σ and θ in the reinforcement learning of the reinforcement learning system 3 were changed. Moreover, the result of the autonomous play by the conventional reinforcement learning method, which is not the reinforcement learning system 3, was also used for comparison. As a conventional reinforcement learning method, a DQN (deep Q-network) method with an improved action selection policy is used.

FIG. 9 is a graph showing scores from autonomous play when σ=2. σ is the average number of movements in the random walk as described above. In FIG. 9, the vertical axis indicates the score. A graph g91 indicates an average score of autonomous play by conventional reinforcement learning. Graphs g11 to g14 represent average values of scores obtained by autonomous play of the reinforcement learning system 3. FIG. Graphs g11 to g14 have different values of the hyperparameter θ in the formula representing the second action-value function J (formula (2) or formula (3) above). Graphs g11 to g14 are graphs representing the average values of the scores when the hyperparameter θ is set to 0, 0.001, 0.01, and 0.1, respectively.

Comparing the graph g91 with the graphs g11 to g14, the score of the reinforcement learning system 3 according to the present exemplary embodiment is higher than the score of the conventional reinforcement learning. , the score is higher.

FIG. 10 is a graph showing the collection rate of the first static object by autonomous play when σ=2. In FIG. 10, the vertical axis indicates the recovery rate. A graph g92 shows the average recovery rate of autonomous play by conventional reinforcement learning. Graphs g21 to g24 represent the average recovery rates of the reinforcement learning system 3 through autonomous play. Graphs g21 to g24 have different values of the hyperparameter θ of the formula representing the second action-value function J (formula (2) or formula (3) above). Graphs g21 to g24 are graphs showing average recovery rates when the hyperparameter θ is set to 0, 0.001, 0.01, and 0.1, respectively.

Comparing the graph g92 with the graphs g21 to g24, the recovery rate of the reinforcement learning system 3 according to this exemplary embodiment tends to be higher than the recovery rate of conventional reinforcement learning. is "0.01", the score is high.

FIG. 11 is a graph showing the relationship between the score and σ in autonomous play. In FIG. 11, the horizontal axis indicates σ and the vertical axis indicates the score. Graphs g31 to g34 respectively show the average values of the scores when σ is 1 to 5 when the hyperparameter θ is 0, 0.001, 0.01, and 0.1. show. In addition, the average value of the score of the autonomous play by the conventional reinforcement learning is "2009."

In the example of FIG. 11, the score values when the value of σ is 1 to 3 are often higher than the scores obtained by conventional reinforcement learning. In particular, the score for θ=0.01 and σ=2 is higher than the others.

FIG. 12 is a graph showing the relationship between the collection rate of autonomous play and σ. In FIG. 12, the horizontal axis indicates σ and the vertical axis indicates recovery rate. Graphs g41 to g44 respectively represent the average values of the recovery rate for each value of σ when the hyperparameter θ is "0", "0.001", "0.01", and "0.1". The average collection rate of autonomous play by conventional reinforcement learning is 67.5%.

In the example of FIG. 12, the recovery rate when the value of σ is 1 to 3 is often higher than the recovery rate of conventional reinforcement learning. In particular, when θ=0.01 and σ=2, the recovery rate is higher than the others.

As described above, according to this exemplary embodiment, the reinforcement learning system 3 computes the first action-value function using the second state obtained by adding noise to the first state, so that the computer game action selection in the autonomous play can be performed more preferably.

[Exemplary embodiment 4]
An exemplary embodiment 4 of the present invention will now be described. Components having the same functions as the components described in exemplary embodiments 1 to 3 are denoted by the same reference numerals, and description thereof will not be repeated.

A reinforcement learning system according to the present exemplary embodiment (hereinafter referred to as "reinforcement learning system 4") applies the reinforcement learning system 2 according to the second exemplary embodiment to control construction machinery such as an excavator that excavates earth and sand. It is what I did. The reinforcement learning system 3 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of reinforcement learning system 4 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.

The reinforcement learning system 4 selects the operation of the construction machine, such as the excavation operation when the hydraulic excavator excavates earth and sand, through reinforcement learning. As an example, the purpose of actions in reinforcement learning is to excavate a bucket full of earth and sand so that the vehicle does not tilt or drag during excavation.

In this exemplary embodiment, the first state s _t includes, as an example, the attitude and position of a construction machine such as a hydraulic excavator, the shape of the earth and sand to be excavated (3D data, etc.), and the inside of the bucket of the excavator. Including part or all of the amount of sediment. The posture of the construction machine includes, for example, the angles of the bucket, arm, boom, and rotating body of the construction machine. The position of the construction machine includes, for example, the position and direction of the crawler of the construction machine.

The action a includes, for example, attitude control of the construction machine (bucket, arm, boom, angle control of the rotating body, etc.). For example, the reward _rt is a positive reward whose absolute value increases as the amount of excavation increases, and its absolute value increases as the degree of inclination of the body of the construction machine, the degree of dragging, or the time required for excavation increases. large negative rewards, including in part or in whole.

The state randomization unit 523 may add noise to all of the multiple elements included in the first state _st , or may add noise to some of the elements. When noise is added to some elements, the elements to which noise is added may include, for example, hydraulic excavator posture and 3D data of observed earth and sand.

According to this exemplary embodiment, the reinforcement learning system 4 computes the first action-value function using the second state obtained by adding noise to the first state s _t , thereby determining the operation of the construction machine. The selection can be made better.

[Exemplary embodiment 5]
Exemplary Embodiment 5 of the present invention will now be described. Components having the same functions as the components described in exemplary embodiments 1 to 4 are denoted by the same reference numerals, and the description thereof will not be repeated.

A reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 5") applies the reinforcement learning system 2 according to the second exemplary embodiment to control of a transport device that transports packages. . The transport device is, for example, an automated guided vehicle (AGV) that runs automatically. The reinforcement learning system 5 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of reinforcement learning system 5 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.

Reinforcement learning system 5 shortens the transportation time as much as possible (increases the transportation speed) when transporting a load from a predetermined position to another position, and avoids static obstacles (shelves, loads, etc.) and moving objects on the way. Choose actions so that there is no contact with physical obstacles (people, other robots, etc.).

In this exemplary embodiment, the first state s _t includes, by way of example, the position, direction of movement, speed and angular velocity of a conveying device conveying the goods, the position of passages, the position of static obstacles, and the dynamic Including part or all of the position and movement speed of the obstacle. Action a includes, for example, velocity control and angular velocity control of the conveying device. The reward _rt is, for example, a part or all of a positive reward obtained when transportation is completed, a negative reward obtained when contacting an obstacle, or a negative reward whose absolute value increases as the transportation time increases. including.

State randomization section 523 may add noise to all of the plurality of elements included in _st in the first state, or may add noise to some of the elements. When adding noise to some elements, the elements to which noise is added may include, for example, the position, orientation, velocity and angular velocity of the transport device, and also the position of static obstacles, or the position of dynamic obstacles. may include the position and velocity of In addition, the state randomization unit 523, for example, adds noise to obstacles positioned in the traveling direction of the conveying device or on the traveling route, and gives noise to obstacles positioned outside the traveling direction or outside the traveling route. You can choose not to.

According to this exemplary embodiment, the reinforcement learning system 5 computes the first action-value function using the second state obtained by adding noise to the first state _st , thereby controlling the transport control of the transport device. can be performed more preferably.

[Exemplary embodiment 6]
An exemplary embodiment 6 of the present invention will now be described. Components having the same functions as the components described in the exemplary embodiments 1 to 5 are denoted by the same reference numerals, and the description thereof will not be repeated.

The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 6") applies the reinforcement learning system 2 according to the second exemplary embodiment to the control of a forklift. The reinforcement learning system 6 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of reinforcement learning system 6 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.

When transporting a pallet from a predetermined position to another position, the reinforcement learning system 6 shortens the transport time as much as possible (increases the transport speed) and avoids static obstacles (shelves, luggage, etc.) Choose actions so that there is no contact with physical obstacles (people, other rags, etc.).

In this exemplary embodiment, the first state s _t includes, by way of example, the position, direction of movement, speed and angular velocity of the forklift, the position of the path, the position of static obstacles, and the position and speed of dynamic obstacles. , including part or all of Action a includes, for example, speed control and angular speed control of a forklift. The reward _rt is, for example, a part or all of a positive reward obtained when transportation is completed, a negative reward obtained when contacting an obstacle, or a negative reward whose absolute value increases as the transportation time increases. including.

State randomization section 523 may add noise to all of the plurality of elements included in _st in the first state, or may add noise to some of the elements. When adding noise to some elements, the elements to which noise is added may include, for example, the position, orientation, velocity and angular velocity of a forklift, and also the position of static obstacles, or the May include position and velocity. In addition, the state randomization unit 523, for example, adds noise to obstacles positioned in the traveling direction of the forklift or on the traveling route, and does not add noise to obstacles positioned outside the traveling direction or outside the traveling route. You may do so.

According to this exemplary embodiment, the reinforcement learning system 5 calculates a first action-value function using a second state obtained by adding noise to the first state _st , thereby making forklift control more suitable. can be done.

[Exemplary Embodiment 7]
A seventh exemplary embodiment of the present invention will now be described. Components having the same functions as the components described in exemplary embodiments 1 to 6 are denoted by the same reference numerals, and the description thereof will not be repeated.

The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 7") has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment. The components of reinforcement learning system 6 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.

In this exemplary embodiment, the first state s _t includes a plurality of elements accompanied by attributes. In addition, the state randomization unit 523, when adding noise to the first state _st , gives different weightings for adding noise depending on attributes. As an example, the state randomization unit 523 may increase the weighting of dynamic elements that move within the environment and decrease the weighting of static elements that do not move within the environment. Also, as an example, the state randomization unit 523 may weight the position of a person among the dynamic elements that move within the environment more than weight the other dynamic elements.

According to this exemplary embodiment, by calculating the first action value function using the second state to which noise is added with weighting according to the attribute of the element, variation in data according to the attribute is taken into account. A first action value function can be calculated. By selecting an action using this first action-value function, it is possible to select an action in consideration of variations in data according to attributes.

In addition, in this exemplary embodiment, the state randomization unit 523 may change the weighting of noise addition during execution of reinforcement learning. As an example, the state randomization unit 523 performs control such that when a dynamic element is moving in the environment, the weight is increased, and when the dynamic element is not moving in the environment, the weight is decreased. may

[Example of realization by software]
Some or all of the functions of the reinforcement learning device 10, the terminal 20, the server 30, the terminal 40, and the reinforcement learning device 50 (hereinafter referred to as "reinforcement learning device 10, etc.") are realized by hardware such as integrated circuits (IC chips). may be implemented by software.

In the latter case, the reinforcement learning device 10 and the like are implemented by, for example, a computer that executes program instructions that are software that implements each function. An example of such a computer (hereinafter referred to as computer C) is shown in FIG. Computer C comprises at least one processor C1 and at least one memory C2. A program P for operating the computer C as the reinforcement learning device 10 or the like is recorded in the memory C2. In the computer C, the processor C1 reads the program P from the memory C2 and executes it, thereby realizing each function of the reinforcement learning device 10 and the like.

As the processor C1, for example, CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit) , a microcontroller, or a combination thereof. As the memory C2, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination thereof can be used.

Note that the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data. Computer C may further include a communication interface for sending and receiving data to and from other devices. Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.

In addition, the program P can be recorded on a non-temporary tangible recording medium M that is readable by the computer C. As such a recording medium M, for example, a tape, disk, card, semiconductor memory, programmable logic circuit, or the like can be used. The computer C can acquire the program P via such a recording medium M. Also, the program P can be transmitted via a transmission medium. As such a transmission medium, for example, a communication network or broadcast waves can be used. Computer C can also obtain program P via such a transmission medium.

[Appendix 1]
The present invention is not limited to the above-described embodiments, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the embodiments described above are also included in the technical scope of the present invention.

[Appendix 2]
Some or all of the above-described embodiments may also be described as follows. However, the present invention is not limited to the embodiments described below.

(Appendix 1)
Acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state;
selection means for selecting an action according to the first action-value function;
A reinforcement learning system comprising:

According to the above configuration, a more suitable action can be selected by calculating the first action value function using the second state obtained by adding noise to the first state.

(Appendix 2)
The calculation means calculates the first action value function according to the first state and the second state,
The reinforcement learning system according to Appendix 1.

According to the above configuration, a more suitable action can be selected by calculating the first action value function using a plurality of states including the second state to which noise is added.

(Appendix 3)
The calculation means calculates the first action-value function for each of the first state and the second state,
The selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
The reinforcement learning system according to appendix 2.

According to the above configuration, a more suitable action can be selected by using the second action-value function calculated using the plurality of first action-value functions.

(Appendix 4)
The first state is at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. including one
The reinforcement learning system according to any one of Appendices 1 to 3.

According to the above configuration, it is possible to more preferably select the transport operation of the transport device by reinforcement learning.

(Appendix 5)
The first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
The reinforcement learning system according to any one of Appendices 1 to 3.

According to the above configuration, it is possible to more preferably select the construction motion of the construction machine by reinforcement learning.

(Appendix 6)
the first state includes a plurality of elements accompanied by attributes;
The generation means generates the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute.
6. The reinforcement learning system according to any one of Appendices 1 to 5.

According to the above configuration, it is possible to calculate the first action value function that takes into account the variation in the data of the elements associated with the attribute that satisfies the predetermined condition.

(Appendix 7)
the first state includes a state related to a dynamic element moving within the environment;
The generating means generates the second state by adding noise to states of the dynamic elements included in the first state.
The reinforcement learning system according to appendix 6.

According to the above configuration, it is possible to calculate the first action value function that takes into account the variation in the data of the dynamic elements.

(Appendix 8)
Acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state;
selection means for selecting an action according to the first action-value function;
A reinforcement learning device comprising:

(Appendix 9)
The calculation means calculates the first action value function according to the first state and the second state,
The reinforcement learning device according to appendix 8.

(Appendix 10)
The calculation means calculates the first action value function for each of the plurality of states included in the state sequence,
The selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
The reinforcement learning device according to appendix 9.

(Appendix 11)
The first state is at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. including one
The reinforcement learning device according to any one of Appendices 8 to 10.

(Appendix 12)
The first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
The reinforcement learning device according to any one of Appendices 8 to 10.

(Appendix 13)
the first state includes a plurality of elements accompanied by attributes;
The generation means generates the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute.
13. The reinforcement learning device according to any one of Appendices 8 to 12.

(Appendix 14)
the first state includes a state related to a dynamic element moving within the environment;
The generating means generates the second state by adding noise to states of the dynamic elements included in the first state.
The reinforcement learning device according to appendix 13.

(Appendix 15)
Obtaining a first state in an environment that is subject to reinforcement learning;
generating a second state by adding noise to the first state;
calculating a first action-value function according to the second state;
selecting an action according to the first action-value function;
Reinforcement learning methods including.

(Appendix 16)
In calculating the first action value function,
calculating the first action value function according to the first state and the second state;
The reinforcement learning method according to appendix 15.

(Appendix 17)
In calculating the first action-value function, calculating the first action-value function for each of a plurality of states included in the state sequence;
selecting the action according to a second action-value function calculated based on a plurality of the first action-value functions;
The reinforcement learning method according to appendix 16.

(Appendix 18)
The first state is at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. including one
18. The reinforcement learning method according to any one of Appendices 15 to 17.

(Appendix 19)
The first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
18. The reinforcement learning method according to any one of Appendices 15-17.

(Appendix 20)
the first state includes a plurality of elements accompanied by attributes;
In generating the second state, generating the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute;
20. The reinforcement learning method according to any one of Appendices 15 to 19.

(Appendix 21)
the first state includes a state related to a dynamic element moving within the environment;
generating the second state by adding noise to states of the dynamic element included in the first state in generating the second state;
The reinforcement learning method according to appendix 20.

(Appendix 22)
the first state includes a plurality of elements accompanied by attributes;
The generating means weights the addition of the noise differently depending on the attribute.
6. The reinforcement learning system according to any one of Appendices 1 to 5.

According to the above configuration, by using the second state in which noise is added with weighting according to the attributes of the elements, it is possible to calculate the first action value function that takes into account the variation in the data according to the attributes.

(Appendix 23)
The first state includes part or all of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
the action includes attitude control of the construction machine;
19. The reinforcement learning system according to any one of Appendices 1 to 6 and 19.

According to the above configuration, by calculating the first action value function using the second state obtained by adding noise to the first state, the selection of the excavating motion of the excavator is more preferably performed by reinforcement learning. be able to.

(Appendix 24)
The first state includes part or all of the position, direction of movement, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. ,
the action includes velocity control and angular velocity control of the conveying device;
19. The reinforcement learning system according to any one of Appendices 1 to 6 and 19.

According to the above configuration, by calculating the first action value function using the second state obtained by adding noise to the first state, the selection of the conveying operation of the conveying device by reinforcement learning is performed more preferably. be able to.

(Appendix 25)
the first state includes a state of an object that affects the progress of the game in the computer game;
the action includes an action of an object operated by a player of the computer game;
19. The reinforcement learning system according to any one of Appendices 1 to 6 and 19.

According to the above configuration, by calculating the first action value function using the second state obtained by adding noise to the first state, it is possible to more preferably select the action of the object in the autonomous play of the computer game. It can be carried out.

(Appendix 26)
A program that causes a computer to function as a reinforcement learning device,
The program causes the computer to:
Acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state;
selection means for selecting an action according to the first action-value function;
A program characterized by functioning as

(Appendix 27)
The calculation means calculates the first action value function according to the first state and the second state,
27. The program according to appendix 26, characterized by:

(Appendix 28)
The calculation means calculates the first action-value function for each of the first state and the second state,
The selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
The program according to appendix 27, characterized by:

(Appendix 29)
The first state is at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. including one
29. A program according to any one of appendices 26-28.

(Appendix 30)
The first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
29. A program according to any one of appendices 26-28.

(Appendix 31)
the first state includes a plurality of elements accompanied by attributes;
The generating means generates the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute.
31. The program according to any one of appendices 26 to 30, characterized in that:

(Appendix 32)
the first state includes a state related to a dynamic element moving within the environment;
The generating means generates the second state by adding noise to states of the dynamic elements included in the first state.
The program according to appendix 27, characterized by:

[Appendix 3]
Some or all of the embodiments described above can also be expressed as follows.

at least one processor, said processor comprising:
Acquisition processing for acquiring a first state in an environment that is a target of reinforcement learning;
a generating process for generating a second state by adding noise to the first state;
a calculation process for calculating a first action-value function according to the second state;
a selection process for selecting an action according to the first action-value function;
Reinforcement learning device that executes

The reinforcement learning device may further include a memory, and the memory stores a program for causing the processor to execute the acquisition process, the generation process, the calculation process, and the selection process. may be stored. Also, this program may be recorded in a computer-readable non-temporary tangible recording medium.

1, 2, 3, 4, 5, 6, 7

Reinforcement learning system

10, 50 Reinforcement learning device 11 Acquisition unit 12 Generation unit 13

Calculation unit

14, 526

Selection unit

20, 40 Terminal 30

Server

41, 51

Communication unit

42, 52 Control unit 43 Input reception unit 53 Storage unit 421 State provision unit 422 Action execution unit 423 Reward provision unit 521 Reward acquisition unit 522 State observation unit 523 State randomization unit 524 Learning unit 525 Estimation unit

Claims

Acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state;
selection means for selecting an action according to the first action-value function;
A reinforcement learning system comprising:
The calculation means calculates the first action value function according to the first state and the second state,
The reinforcement learning system according to claim 1.
The calculation means calculates the first action-value function for each of the first state and the second state,
The selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
The reinforcement learning system according to claim 2.
The first state is at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. including one
The reinforcement learning system according to any one of claims 1 to 3.
The first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
The reinforcement learning system according to any one of claims 1 to 3.
the first state includes a plurality of elements accompanied by attributes;
The generating means generates the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute.
The reinforcement learning system according to any one of claims 1 to 5.
the first state includes a state related to a dynamic element moving within the environment;
The generating means generates the second state by adding noise to states of the dynamic elements included in the first state.
The reinforcement learning system according to claim 6.
Acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state;
selection means for selecting an action according to the first action-value function;
A reinforcement learning device comprising:
The calculation means calculates the first action value function according to the first state and the second state,
The reinforcement learning device according to claim 8.
The calculation means calculates the first action-value function for each of the first state and the second state,
The selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
The reinforcement learning device according to claim 9.
The first state is at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. including one
The reinforcement learning device according to any one of claims 8 to 10.
The first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
The reinforcement learning device according to any one of claims 8 to 10.
the first state includes a plurality of elements accompanied by attributes;
The generation means generates the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute.
The reinforcement learning device according to any one of claims 8 to 12.
the first state includes a state related to a dynamic element moving within the environment;
The generating means generates the second state by adding noise to states of the dynamic elements included in the first state.
The reinforcement learning device according to claim 13.
Obtaining a first state in an environment that is subject to reinforcement learning;
generating a second state by adding noise to the first state;
calculating a first action-value function according to the second state;
selecting an action according to the first action-value function;
Reinforcement learning methods including.
In calculating the first action value function,
calculating the first action value function according to the first state and the second state;
The reinforcement learning method according to claim 15.
calculating the first action-value function for each of the first state and the second state in calculating the first action-value function;
selecting the action according to a second action-value function calculated based on a plurality of the first action-value functions;
The reinforcement learning method according to claim 16.
The first state is at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles. including one
The reinforcement learning method according to any one of claims 15-17.
The first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
The reinforcement learning method according to any one of claims 15-17.
the first state includes a plurality of elements accompanied by attributes;
In generating the second state, generating the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute;
The reinforcement learning method according to any one of claims 15-19.
the first state includes a state related to a dynamic element moving within the environment;
generating the second state by adding noise to states of the dynamic element included in the first state in generating the second state;
The reinforcement learning method according to claim 20.