WO2023037504A1 - Reinforcement learning system, reinforcement learning device, and reinforcement learning method - Google Patents

Reinforcement learning system, reinforcement learning device, and reinforcement learning method

Info

Publication number
WO2023037504A1
Authority
WO
WIPO (PCT)
Prior art keywords
state
action
reinforcement learning
value function
learning system
Prior art date
Application number
PCT/JP2021/033360
Other languages
French (fr)
Japanese (ja)
Inventor
裕志 吉田
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to PCT/JP2021/033360
Priority to JP2023546676A
Publication of WO2023037504A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • the present invention relates to a reinforcement learning system, a reinforcement learning device, and a reinforcement learning method.
  • Patent Literature 1 (Patent Laid-Open No. 2002-200002) describes a technique in which, in assembly work by a robot arm, images of concave and convex parts and the control amounts used when fitting the parts together are learned by reinforcement learning. Patent Literature 2 describes a technique of learning an accelerator operation amount using reinforcement learning and selecting an action based on a throttle opening command value and a retardation amount according to the state. Patent Literature 2 also describes that a function approximator may be used for the action-value function Q.
  • the techniques described in Patent Literatures 1 and 2 have room for improvement in terms of selecting more suitable actions. This is because an appropriate action can be selected if the action-value function can be accurately estimated in reinforcement learning, but the action-value function estimated by the techniques described in Patent Literatures 1 and 2 includes errors. In particular, when the state-action space is huge, it is difficult to accurately estimate the action-value function.
  • One aspect of the present invention has been made in view of the above problems, and an example of its purpose is to provide a technique that allows selection of more suitable actions.
  • a reinforcement learning system includes acquisition means for acquiring a first state in an environment that is a target of reinforcement learning, generation means for generating a second state by adding noise to the first state, calculation means for calculating a first action-value function according to the second state, and selection means for selecting an action according to the first action-value function.
  • a reinforcement learning device includes acquisition means for acquiring a first state in an environment that is a target of reinforcement learning, generation means for generating a second state by adding noise to the first state, calculation means for calculating a first action-value function according to the second state, and selection means for selecting an action according to the first action-value function.
  • a reinforcement learning method includes acquiring a first state in an environment that is a target of reinforcement learning, generating a second state by adding noise to the first state, calculating a first action-value function according to the second state, and selecting an action according to the first action-value function.
  • a more suitable action can be selected.
  • FIG. 1 is a block diagram showing the configuration of a reinforcement learning system according to exemplary embodiment 1 of the present invention.
  • FIG. 2 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 1 of the present invention.
  • FIG. 3 is a block diagram illustrating an example of the configuration of the reinforcement learning system according to exemplary embodiment 1 of the present invention.
  • FIG. 4 is a block diagram showing another example of the configuration of the reinforcement learning system according to exemplary embodiment 1 of the present invention.
  • FIG. 5 is a block diagram showing the configuration of a reinforcement learning system according to exemplary embodiment 2 of the present invention.
  • FIG. 6 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 2 of the present invention.
  • FIG. 7 is a diagram showing an example of a game screen according to exemplary embodiment 3 of the present invention.
  • FIG. 8 is a diagram illustrating a first state according to exemplary embodiment 3 of the present invention.
  • FIGS. 9 to 12 are diagrams each showing an example of an evaluation result according to an application example of exemplary embodiment 3 of the present invention.
  • FIG. 13 is a block diagram showing the configuration of a computer functioning as a reinforcement learning device, a terminal 20, and a server 30 according to exemplary embodiments 1 to 7 of the present invention.
  • FIG. 1 is a block diagram showing the configuration of a reinforcement learning system 1.
  • the reinforcement learning system 1 is a system that selects actions by reinforcement learning.
  • the reinforcement learning system 1 is, for example, a system for controlling construction operations of construction machines such as excavators, a system for controlling transportation by a transportation device, or a system for autonomous play of computer games.
  • the reinforcement learning of the reinforcement learning system 1 is not limited to the example described above, and the reinforcement learning performed by the reinforcement learning system 1 can be applied to various systems.
  • the action is the action of an agent in reinforcement learning, and examples thereof include excavating motion control of an excavator, transport motion control of a transport device, or autonomous play control of a computer game.
  • the actions are not limited to these examples, and may be other than the above.
  • the reinforcement learning system 1 includes an acquisition unit 11, a generation unit 12, a calculation unit 13, and a selection unit 14, as shown in FIG.
  • the acquisition unit 11 is a configuration that implements acquisition means in this exemplary embodiment.
  • the generation unit 12 is a configuration that implements generation means in this exemplary embodiment.
  • the calculation unit 13 is a configuration that implements calculation means in this exemplary embodiment.
  • the selection unit 14 is a configuration that implements selection means in this exemplary embodiment.
  • the acquisition unit 11 acquires the first state.
  • the first state is the state in the environment that is the object of reinforcement learning.
  • the first state may include, for example, part or all of the posture and position of the excavator that excavates the earth and sand, the shape of the earth and sand to be excavated, and the amount of soil in the bucket of the excavator.
  • when the reinforcement learning system 1 is a system for selecting the transport operation of a transport device, the first state includes, for example, the position, moving direction, speed, and angular velocity of the transport device, the position of passages, and the positions and velocities of static or dynamic obstacles.
  • the first state includes, as an example, the state of an object that affects the progress of the game in the computer game.
  • the first state is not limited to the one described above, and may be another state.
  • the first state may include environmental conditions such as temperature or weather, for example.
  • the generator 12 generates the second state by adding noise to the first state.
  • Noise is, for example, random numbers such as normal random numbers or uniform random numbers.
  • the noise added to the first state by the generator 12 is not limited to these, and may be noise other than the above.
  • the generator 12 may add noise to all of the elements included in the first state, or may add noise to some of the elements included in the first state.
  • the calculation unit 13 calculates the first action value function according to the second state.
  • the calculation unit 13 calculates the first action value function using a state sequence including a plurality of second states.
  • the calculation unit 13 may calculate the first action value function using a state sequence including the first state and one or more second states.
  • the state sequence used by the calculation unit 13 to calculate the first action-value function includes one or more second states, and each state included in the state sequence is either the first state or a second state.
  • when there is no need to distinguish between the first state and the second state, they may simply be referred to as "states."
  • the first action value function is a function for evaluating actions in a state.
  • the first action-value function is, for example, an action-value function used in Q-learning, and is updated by the following equation (1), for example.
  • the first action-value function is not limited to that given by Equation (1), and may be another function.
  • in Equation (1), s_t^(i) (1 ≤ i ≤ n; i and n are natural numbers) is a state included in the state sequence (that is, the first state or a second state), a is the action, and Q(s_t^(i), a) is the first action-value function.
  • α is the learning rate,
  • s_{t+1}^(i) is the post-transition state,
  • r_{t+1} is the reward the agent receives when it transitions to the state s_{t+1}^(i), and
  • γ (taking a value between 0 and 1) is the discount rate.
  • a′ ∈ A, where the set A is the set of actions the agent can take in the state s_t^(i).
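  • Equation (1) itself is not reproduced in this text; assuming it is the standard Q-learning update applied to each state of the state sequence, a reconstruction using the symbols defined above is:

    Q(s_t^{(i)}, a) \leftarrow Q(s_t^{(i)}, a) + \alpha \left[ r_{t+1} + \gamma \max_{a' \in A} Q(s_{t+1}^{(i)}, a') - Q(s_t^{(i)}, a) \right]    (1)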
  • a reward is a value obtained from the environment as a result of the agent's action. For example, the reward is a value that is added or subtracted according to the amount excavated by the excavator, the time required for excavation, the time required for transportation, whether there was contact with an obstacle during transportation, the win or loss of a game, or the score of a game. However, the reward is not limited to these examples and may be other than the above.
  • the calculation unit 13 calculates the first action-value function for each state included in the state sequence. In other words, the calculation unit 13 calculates as many first action-value functions as there are states included in the state sequence.
  • the selection unit 14 selects an action according to the first action value function. As an example, the selection unit 14 selects an action that maximizes the first action-value function.
  • the selection unit 14 may select an action by an ε-greedy method, roulette selection used in genetic algorithms, a softmax method using the Boltzmann distribution, or the like.
  • as an example, the selection unit 14 may select an action using any one of the plurality of first action-value functions. Alternatively, the calculation unit 13 may calculate a second action-value function using the plurality of calculated first action-value functions, and the selection unit 14 may select an action using the calculated second action-value function.
  • a second action value function is a function for evaluating actions in a state.
  • the second action-value function may be, for example, the expected value of the plurality of first action-value functions, or may be a function whose value becomes smaller as the variation among the plurality of first action-value functions becomes larger.
  • the second action-value function is given by the following Equation (2) or Equation (3) as an example. However, the second action-value function is not limited to those given by Equation (2) or Equation (3), and may be another function.
  • in Equation (2) and Equation (3), J(s_t, a) is the second action-value function, s_t is the first state, a is the action, λ is the hyperparameter, Q(s_t^(i), a) is the first action-value function, s_t^(i) is a state included in the state sequence, and E is the expected value. Equation (3) is obtained by Taylor-expanding Equation (2), adopting terms up to the second order, and discarding the third- and higher-order terms.
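  • Equations (2) and (3) are likewise not reproduced in this text. One form consistent with the surrounding description (a risk-sensitive index that falls below the expectation as the variation of Q grows, with Equation (3) as the second-order Taylor expansion of Equation (2)) is the exponential risk measure below; this is an assumption, with the hyperparameter written as λ:

    J(s_t, a) = -\frac{1}{\lambda} \ln \mathbb{E}_i\left[ \exp\left( -\lambda \, Q(s_t^{(i)}, a) \right) \right]    (2)

    J(s_t, a) \approx \mathbb{E}_i\left[ Q(s_t^{(i)}, a) \right] - \frac{\lambda}{2} \, \mathrm{Var}_i\left[ Q(s_t^{(i)}, a) \right]    (3)

  • under this assumed form, the exponential in Equation (2) can overflow when |λ Q| is large, whereas Equation (3) uses only a mean and a variance, which is consistent with the later remark that no overflow occurs when Equation (3) is used.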
  • the selection unit 14 selects an action that maximizes the second action-value function, using the policy given by Equation (4) as an example.
  • the policy for selecting an action is not limited to the policy given by Equation (4), and may be another policy.
  • the selection unit 14 may select actions by, for example, the ⁇ greedy method, roulette selection used in genetic algorithms, or the softmax method using the Boltzmann distribution.
  • the policy is given by the following equation (5) as an example.
  • in Equation (5), the left-hand side is the next action to be selected,
  • a′ is an action that the agent can take in the first state s_t,
  • ε (taking a value between 0 and 1) is a constant, and
  • v is a random number that satisfies 0 ≤ v ≤ 1.
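  • Equations (4) and (5) are not reproduced in this text either; reconstructions consistent with the description (greedy selection maximizing J for Equation (4), and an ε-greedy variant using the constant ε and the random number v for Equation (5)) are, as assumptions:

    a_t = \arg\max_{a' \in A} J(s_t, a')    (4)

    a_t = \arg\max_{a' \in A} J(s_t, a') \ \text{if } v > \varepsilon, \quad \text{otherwise an action drawn uniformly at random from } A    (5)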
  • as described above, in the reinforcement learning system 1, by calculating the action-value function using the second state obtained by adding noise to the first state, a first action-value function that takes the variation of the state into account can be calculated. By selecting an action using this first action-value function, the reinforcement learning system 1 can select a more suitable action.
  • FIG. 2 is a flow diagram showing the flow of the reinforcement learning method S1 executed by the reinforcement learning system 1. The reinforcement learning system 1 repeatedly selects an action by repeating the reinforcement learning method S1. Description of content that has already been explained is not repeated.
  • the reinforcement learning method S1 includes steps S11 to S14.
  • in step S11, the acquisition unit 11 acquires the first state.
  • in step S12, the generation unit 12 generates a second state by adding noise to the first state.
  • in step S13, the calculation unit 13 calculates the first action-value function according to the second state.
  • as the data that the calculation unit 13 refers to in order to calculate the first action-value function, for example, the states, actions, and rewards accumulated up to the (n−1)-th repetition are used.
  • in step S14, the selection unit 14 selects an action according to the first action-value function.
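  • as a non-authoritative illustration, the following sketch shows one way steps S11 to S14 could be realized in Python; the function names (acquire_state, q_value), the use of normal random numbers as noise, and the simple averaging of the per-state values (standing in for the richer aggregation of Equations (2) and (3) described above) are all assumptions and are not taken from the patent.

    import numpy as np

    def reinforcement_learning_step(acquire_state, q_value, actions,
                                    n_copies=4, noise_std=0.1, rng=None):
        """One pass of method S1: acquire (S11), randomize (S12), evaluate (S13), select (S14)."""
        rng = rng or np.random.default_rng()

        # S11: acquire the first state from the environment.
        first_state = np.asarray(acquire_state(), dtype=float)

        # S12: generate second states by adding noise (here, normal random numbers).
        second_states = [first_state + rng.normal(0.0, noise_std, first_state.shape)
                         for _ in range(n_copies)]
        state_sequence = [first_state] + second_states  # the first state may also be omitted

        # S13: calculate the first action-value function for each state in the sequence.
        q_per_state = np.array([[q_value(s, a) for a in actions] for s in state_sequence])

        # S14: select an action according to the (here, averaged) first action-value functions.
        return actions[int(np.argmax(q_per_state.mean(axis=0)))]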
  • FIG. 3 is a block diagram showing an example of the configuration of the reinforcement learning system 1.
  • the reinforcement learning system 1 includes a reinforcement learning device 10.
  • the reinforcement learning device 10 includes an acquisition unit 11, a generation unit 12, a calculation unit 13, and a selection unit 14.
  • the reinforcement learning device 10 is, for example, a server device, a personal computer, or a game device, but is not limited to these, and may be a device other than the above.
  • the reinforcement learning device 10 may acquire the first state by receiving the first state via a communication interface.
  • FIG. 4 is a block diagram showing another example of the configuration of the reinforcement learning system 1.
  • the reinforcement learning system 1 includes a terminal 20 and a server 30.
  • the terminal 20 is, for example, a personal computer or a game machine, but is not limited to these, and may be a device other than the above.
  • the terminal 20 has an acquisition unit 11 .
  • the server 30 includes a generator 12 , a calculator 13 and a selector 14 .
  • the terminal 20 acquires the first state and supplies the acquired first state to the server 30 .
  • although FIGS. 3 and 4 are illustrated as configuration examples of the reinforcement learning system 1 in this exemplary embodiment, the configuration of the reinforcement learning system 1 is not limited to those illustrated in FIGS. 3 and 4, and various other configurations are applicable.
  • FIG. 5 is a block diagram showing the configuration of the reinforcement learning system 2.
  • the reinforcement learning system 2 includes a terminal 40 and a reinforcement learning device 50.
  • the terminal 40 and the reinforcement learning device 50 are configured to be communicable via a communication line N.
  • the specific configuration of the communication line N does not limit this exemplary embodiment; as an example, a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, a mobile data communication network, or a combination of these networks can be used.
  • the terminal 40 is, for example, a general-purpose computer; more specifically, it is, for example, a control device for controlling construction machinery such as an excavator, a management device for managing transportation by a transportation device, or a game device for playing a computer game. Note that the terminal 40 is not limited to these, and may be a device other than the above.
  • the reinforcement learning device 50 is, for example, a server device.
  • the terminal 40 includes a communication section 41 , a control section 42 and an input reception section 43 .
  • the communication unit 41 transmits and receives information to and from the reinforcement learning device 50 via the communication line N under the control of the control unit 42 .
  • the transmission/reception of information between the control unit 42 and the reinforcement learning device 50 via the communication unit 41 is simply referred to as the transmission/reception of information between the control unit 42 and the reinforcement learning device 50 .
  • the control unit 42 includes a state provision unit 421, an action execution unit 422, and a reward provision unit 423.
  • the state providing unit 421 acquires the first state and provides the acquired first state to the reinforcement learning device 50 .
  • the first state obtained by the state provider 421 includes a plurality of elements accompanied by attributes.
  • the attribute is information indicating the characteristics and/or type of the element, and includes, for example, information indicating whether the element is a dynamic element that moves within the environment or a static element that does not move within the environment. Attributes may also be information indicating the types of elements such as people, automobiles, bicycles, and buildings. However, the attribute is not limited to the above example, and may be information other than the above.
  • the state providing unit 421 may acquire, as the first state, sensor information output by a sensor that detects the operation of a construction machine, transport device, or the like. Also, as an example, the state providing unit 421 may acquire the first state of an object that affects the progress of a computer game.
  • the first state acquired by the state providing unit 421 is not limited to the example described above, and may be a state other than the above.
  • the state providing unit 421 receives input of the first state via the input receiving unit 43 and provides the received first state to the reinforcement learning device 50 . Also, as an example, the state providing unit 421 may receive the first state from another device connected via the communication unit 41 and provide the received first state to the reinforcement learning device 50 .
  • the action execution unit 422 executes the action determined by the reinforcement learning device 50.
  • the action execution unit 422 outputs control information for causing the construction machine, the transport device, or the like to perform the action determined by the reinforcement learning device 50 .
  • the action execution unit 422 controls the action of an object that is the target of a user's operation in a computer game.
  • the actions that the action execution unit 422 executes are not limited to the examples described above, and may be actions other than the above.
  • the reward providing unit 423 provides the reinforcement learning device 50 with the reward obtained when the agent executes the action determined by the reinforcement learning device 50 .
  • as an example, the reward providing unit 423 provides the reinforcement learning device 50, as a reward, with information indicating the amount of excavation by the excavator, the time required for excavation, the time required for transportation by the transportation device, the presence or absence of contact with obstacles during transportation, the win or loss of a game, or the score of a game.
  • the reward provided by the reward providing unit 423 is not limited to the example described above, and may be other rewards than the above.
  • the reward providing unit 423 provides the reinforcement learning device 50 with the reward obtained via the input receiving unit 43 . Also, the reward providing unit 423 may receive a reward from another device connected via the communication unit 41 and provide the received reward to the reinforcement learning device 50 .
  • the input reception unit 43 receives various inputs to the terminal 40.
  • the specific configuration of the input reception unit 43 does not limit this exemplary embodiment, but as an example, the input reception unit 43 can be configured to include an input device such as a keyboard and a touch pad. Further, the input reception unit 43 may be configured to include a data scanner that reads data via electromagnetic waves such as infrared rays and radio waves, a sensor that senses the state of the environment, and the like.
  • the reward providing unit 423 measures the time required for transportation by the transport device based on the sensing result acquired by the input receiving unit 43, and provides the reinforcement learning device 50 with a reward indicating the measurement result.
  • the input reception unit 43 supplies, to the control unit 42, information whose input has been received via the above-described input device, data scanner, sensor, or the like.
  • the input reception unit 43 acquires the above-described state and reward, and supplies the acquired state and reward to the control unit 42 .
  • the reinforcement learning device 50 includes a communication section 51 , a control section 52 and a storage section 53 .
  • the communication unit 51 transmits and receives information to and from the terminal 40 via the communication line N under the control of the control unit 52.
  • the transmission and reception of information between the control unit 52 and the terminal 40 via the communication unit 51 is simply referred to as the transmission and reception of information between the control unit 52 and the terminal 40 .
  • the control unit 52 includes a reward acquisition unit 521, a state observation unit 522, a state randomization unit 523, a learning unit 524, an estimation unit 525, and a selection unit 526.
  • the state observer 522 is a configuration that implements an acquisition means in this exemplary embodiment.
  • the state randomization unit 523 is a configuration that implements the generating means in this exemplary embodiment.
  • the estimating unit 525 is a configuration that implements the calculating means in this exemplary embodiment.
  • the selection unit 526 is a configuration that implements selection means in this exemplary embodiment.
  • the reward acquisition unit 521 acquires the reward provided by the terminal 40 via the communication unit 51.
  • the state observation unit 522 acquires the first state provided by the terminal 40 via the communication unit 51 .
  • State randomization section 523 generates one or more second states by adding noise to the first state obtained by state observation section 522 .
  • the learning unit 524 learns the action-value function model 531 for updating the first action-value function.
  • the action-value function model 531 is used to estimate the first action-value function.
  • the estimation unit 525 calculates a first action value function according to a state sequence including a first state and one or more second states or a state sequence including a plurality of second states. Also, the estimation unit 525 calculates a second action-value function using the first action-value function.
  • the selection unit 526 selects an action using the second action value function, stores information indicating the selected action in the storage unit 53, and transmits information indicating the selected action to the terminal 40.
  • the storage unit 53 stores various data that the control unit 52 refers to.
  • the storage unit 53 stores an action-value function model 531 and learning data 532 .
  • the action-value function model 531 is a learning model for updating the first action-value function.
  • the learning data 532 is data used in reinforcement learning performed by the reinforcement learning device 50 .
  • Learning data 532 includes, by way of example, first states, second states, actions, and rewards.
  • FIG. 6 is a flow diagram showing the flow of the reinforcement learning method S2 executed by the reinforcement learning system 2. The reinforcement learning system 2 repeatedly selects an action by repeating steps S21 to S29. Note that some steps may be performed in parallel or in a different order.
  • in step S21, the state providing unit 421 acquires the first state s_t and provides the acquired first state s_t to the reinforcement learning device 50.
  • in step S22, the state observation unit 522 acquires the first state s_t from the terminal 40.
  • in step S23, the state randomization unit 523 generates one or more second states by adding noise to the first state s_t.
  • the noise added to the first state s_t by the state randomization unit 523 is, for example, a normal random number or a uniform random number.
  • the noise added to the first state s_t by the state randomization unit 523 is not limited to these, and may be noise other than the above.
  • the second state to which noise is added represents a state in which the first state s_t is slightly blurred.
  • the state randomization unit 523 generates the second state by selectively adding noise to a plurality of elements included in the first state st according to the attribute.
  • the state randomization unit 523 adds noise to elements associated with attributes that satisfy a predetermined condition.
  • the predetermined condition is, for example, an attribute indicating a dynamic element or an attribute indicating a static element.
  • the predetermined condition is not limited to the example described above, and may be another condition.
  • the state randomization unit 523 also generates a state sequence {s_t^(i)} (1 ≤ i ≤ n; i is a natural number and n is a natural number of 2 or more) including the generated second states.
  • the state sequence {s_t^(i)} is a state sequence including the first state s_t and one or more second states, or a state sequence including a plurality of second states.
  • in other words, the state sequence {s_t^(i)} includes at least one second state, and may or may not include the first state s_t.
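  • a minimal sketch of the selective, attribute-dependent noise addition described above, assuming the first state is held as a list of elements each carrying an attribute label; the dictionary layout and the attribute value "dynamic" used as the predetermined condition are illustrative assumptions.

    import numpy as np

    def randomize_state(elements, noise_std=0.1, rng=None):
        """Return a second state: noise is added only to elements whose attribute
        satisfies the predetermined condition (here, attribute == "dynamic")."""
        rng = rng or np.random.default_rng()
        second_state = []
        for elem in elements:
            value = np.asarray(elem["value"], dtype=float)
            if elem["attribute"] == "dynamic":
                value = value + rng.normal(0.0, noise_std, value.shape)
            second_state.append({"attribute": elem["attribute"], "value": value})
        return second_state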
  • in step S24, the estimation unit 525 calculates the first action-value function Q(s_t^(i), a) according to the state sequence {s_t^(i)}.
  • the estimation unit 525 calculates a first action-value function Q(s_t^(i), a) for each of the plurality of states s_t^(i) included in the state sequence {s_t^(i)}.
  • the estimation unit 525 updates the first action-value function Q(s_t^(i), a) for the state s_t^(i) according to Equation (1) above.
  • the first action-value function Q(s_t^(i), a) is an m-dimensional vector (m is an integer of 2 or more), where m is the number of elements of the set A (that is, the number of types of the action a).
  • in step S25, the estimation unit 525 calculates the second action-value function J(s_t, a) based on the plurality of calculated first action-value functions Q(s_t^(i), a).
  • the second action-value function J(s_t, a) is given by Equation (2) or Equation (3) above as an example. In other words, the estimation unit 525 calculates the second action-value function given by Equation (2) or Equation (3) above.
  • the second action-value function given by Equation (2) or Equation (3) above is a function whose value falls further below the expected value of the first action-value functions Q(s_t^(i), a) as the variation among the plurality of first action-value functions Q(s_t^(i), a) becomes larger.
  • in step S26, the selection unit 526 selects the action a.
  • the selection unit 526 selects the action a by the policy given by Equation (4) above.
  • the policy for selecting the action a is not limited to the policy given by Equation (4) above, and other policies such as the ε-greedy policy and the softmax method may be used.
  • the selection unit 526 notifies the terminal 40 of the selected action a.
  • in step S27, the action execution unit 422 executes the action a notified from the reinforcement learning device 50.
  • in step S28, the reward providing unit 423 provides the reinforcement learning device 50 with the reward r_t obtained by executing the action selected by the reinforcement learning device 50.
  • in step S29, the reward acquisition unit 521 accumulates learning data including the state sequence {s_t^(i)} and the reward r_t.
  • as described above, in the reinforcement learning system 2 according to this exemplary embodiment, a configuration is adopted in which the first action-value function Q is calculated according to a plurality of states s_t^(i) including second states to which noise has been added. Therefore, according to the reinforcement learning system 2 according to this exemplary embodiment, in addition to the effect of the reinforcement learning system 1 according to exemplary embodiment 1, the effect that a more appropriate action a can be selected is obtained.
  • when the reinforcement learning system 2 calculates the second action-value function J using Equation (2) above, the second action-value function J is an index that is sensitive to risk (variation). By selecting the action a using this second action-value function J, the reinforcement learning system 2 can select the action a in a manner that is more sensitive to risk.
  • when the reinforcement learning system 2 calculates the second action-value function J using Equation (3) above, overflow does not occur in the calculation process. By selecting the action a using this second action-value function J, the reinforcement learning system 2 can select the action a more appropriately while reducing the processing load associated with the selection of the action a.
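  • the following sketch contrasts the two aggregation routes, under the same assumption as above that Equation (2) has an exponential (entropic-risk) form and Equation (3) is its mean-minus-variance expansion; the function names are illustrative.

    import numpy as np

    def j_eq2(q_values, lam):
        """Assumed form of Equation (2): the exponential can overflow when |lam * Q| is large."""
        q = np.asarray(q_values, dtype=float)
        return -np.log(np.mean(np.exp(-lam * q))) / lam

    def j_eq3(q_values, lam):
        """Assumed form of Equation (3): mean minus (lam / 2) * variance; no exponential, so no overflow."""
        q = np.asarray(q_values, dtype=float)
        return q.mean() - 0.5 * lam * q.var()

    # Example: Q(s_t^(i), a) estimated over a state sequence of n = 5 states.
    q = [12.0, 11.5, 13.2, 10.8, 12.4]
    print(j_eq2(q, lam=0.01), j_eq3(q, lam=0.01))  # both slightly below the plain mean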
  • a reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 3") is obtained by applying the reinforcement learning system 2 according to exemplary embodiment 2 to autonomous play of a computer game.
  • the reinforcement learning system 3 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above.
  • the components of reinforcement learning system 3 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.
  • the first state s_t includes, as an example, the state of an object that affects the progress of the game in a computer game.
  • the action a includes, for example, the action of an object operated by a computer game player.
  • the reward r_t includes, for example, a reward for winning or losing a game or a game score.
  • FIG. 7 is a diagram showing a screen SC1, which is an example of a game screen of a computer game related to the reinforcement learning system 3.
  • the screen SC1 includes a first dynamic object C11, second dynamic objects C21-C23, first static objects C31-C34, and a second static object C4.
  • the first dynamic object C11, the second dynamic objects C21-C23, the first static objects C31-C34, and the second static object C4 are examples of objects that affect the progress of the game.
  • in this computer game, the player designates the moving direction of the first dynamic object C11 moving in the maze, and the round is cleared when the first static objects C31 to C34 placed in the maze are collected while dodging the tracking of the second dynamic objects C21 to C23.
  • the first dynamic object C11 and the second dynamic objects C21 to C23 are objects that move on the screen while the game is in progress, and are examples of dynamic elements that move within the environment.
  • the first static objects C31 to C34 and the second static object C4 are objects that do not move on the screen while the game is in progress, and are examples of static elements that do not move within the environment.
  • the first dynamic object C11 is an object to be operated by the player.
  • the first dynamic object C11 moves in the maze at a constant speed during the progress of the game, and changes its moving direction according to the player's operation.
  • the second dynamic objects C21 to C23 are objects that move following the first dynamic object C11 during the progress of the game. Although three second dynamic objects C21 to C23 are illustrated in FIG. 7, the number of second dynamic objects is not limited to three, and may be more or less.
  • the first static objects C31 to C34 are objects placed in the maze and collected by the first dynamic object C11. When the first dynamic object C11 collides with the first static objects C31-C34, the first static objects C31-C34 are recovered by the first dynamic object C11. Although four first static objects C31 to C34 are illustrated in FIG. 7, the number of first static objects is not limited to four, and may be more or less.
  • a second static object C4 is a wall forming a maze.
  • the first state s_t includes states for the first dynamic object C11, the second dynamic objects C21 to C23, the first static objects C31 to C34, and the second static object C4.
  • the first state includes states for dynamic elements that move within the environment and states for static elements that do not move within the environment. More specifically, the first state s_t includes the position of the first dynamic object C11, the positions of the second dynamic objects C21 to C23, the positions of the first static objects C31 to C34, and the position of the second static object C4.
  • the first state s_t is an image representing the game play screen.
  • FIG. 8 is a diagram showing an image Img11 as an example of the first state s_t.
  • the image Img11 is a grayscale image in which the elements included in the game screen are represented by pixel values from 0 to 255.
  • the image Img11 is divided into a predetermined number of squares, and each square is represented by a pixel value corresponding to the attribute of the element located in each square.
  • the position of the first dynamic object C11 has a pixel value of 255
  • the positions of the second dynamic objects C21 to C23 have a pixel value of 160
  • the positions of the first static objects C31 to C34 have a pixel value of 128.
  • the positions of the paths formed by the second static object C4 are represented by a pixel value of 64, and
  • immovable places are represented by a pixel value of 0.
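  • a sketch of how such a grayscale first state could be assembled, using the pixel values listed above; the 33 × 33 grid size is taken from the randomization step described later, and the helper names (player, chaser, item, path) are shorthand for the first dynamic object C11, the second dynamic objects C21 to C23, the first static objects C31 to C34, and the paths formed by the second static object C4.

    import numpy as np

    GRID = 33
    PIXEL = {"player": 255, "chaser": 160, "item": 128, "path": 64, "wall": 0}

    def encode_first_state(player_pos, chaser_positions, item_positions, path_cells):
        """Build the grayscale image: one pixel value per square of the maze."""
        img = np.full((GRID, GRID), PIXEL["wall"], dtype=np.uint8)
        for r, c in path_cells:
            img[r, c] = PIXEL["path"]
        for r, c in item_positions:
            img[r, c] = PIXEL["item"]
        for r, c in chaser_positions:
            img[r, c] = PIXEL["chaser"]
        img[player_pos] = PIXEL["player"]
        return img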
  • the action a is the movement of the first dynamic object C11, and there are four types of movement: move up, move down, move right, and move left.
  • the reward r_t is, for example, a predetermined additional value (for example, +1) obtained when the score increases, and a predetermined subtracted value (for example, −10) obtained when the first dynamic object C11 is captured by the second dynamic objects C21 to C23.
  • a predetermined additional value (for example, +1) may be obtained as the reward r_t when the score is increased by an action, regardless of the degree of the increase in the score in that one action.
  • the reinforcement learning system 3 executes the reinforcement learning method S2 of FIG. 6 according to the exemplary embodiment 2 described above.
  • the characteristic operation of this exemplary embodiment will be mainly described below, and the description of the contents described in the second exemplary embodiment will not be repeated.
  • in step S23, the state randomization unit 523 generates the second state by adding noise to the states of the dynamic elements included in the first state s_t.
  • the state randomization unit 523 generates a second state by randomizing the position of the first dynamic object C11 and the positions of the second dynamic objects C21 to C23 by random walk.
  • the state randomization unit 523 divides the game screen into a predetermined number of squares (for example, into 33 × 33 squares), and, in the random walk, whether to advance one square in each of the directions in which movement is possible (front, back, left, and right directions where there is a path) or to stay in place is selected with equal probability.
  • the state randomization unit 523 performs a random walk of δ² steps (δ is an integer equal to or greater than 1) for the positions of the first dynamic object C11 and of the second dynamic objects C21 to C23. Performing δ² random-walk steps moves a dynamic element by approximately δ squares on average.
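  • a sketch of this randomization for a single dynamic element, under the interpretation that δ² random-walk steps are taken and that, at each step, staying in place or moving one square in any direction with a path are chosen with equal probability; the data layout (square coordinates and a set of passable squares) is an assumption.

    import numpy as np

    def random_walk_position(position, passable, delta, rng=None):
        """Randomize one dynamic element's square by a random walk of delta**2 steps,
        which displaces it by roughly delta squares on average."""
        rng = rng or np.random.default_rng()
        r, c = position
        for _ in range(delta ** 2):
            candidates = [(r, c)] + [(r + dr, c + dc)
                                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                                     if (r + dr, c + dc) in passable]
            r, c = candidates[rng.integers(len(candidates))]
        return (r, c)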
  • the state sequence {s_t^(i)} generated by the state randomization unit 523 in step S23 includes the first state s_t and (n − 1) second states obtained by randomizing the first state s_t, for a total of n states.
  • the first action-value function Q(s_t^(i), a) is a four-dimensional vector.
  • in step S26, the selection unit 526 selects one of the four types of action a (move up, move down, move left, and move right) as the movement direction of the first dynamic object C11 at an intersection or a corner (that is, a point where the movement direction can be changed). However, the selection unit 526 excludes directions in which the first dynamic object C11 cannot move.
  • <Evaluation of this exemplary embodiment> FIGS. 9 to 12 are diagrams each showing an example of evaluation results of autonomous play of the computer game by the reinforcement learning system 3.
  • in this evaluation, the number of lives of the first dynamic object was one, and the game was over when the first dynamic object was captured by a second dynamic object. The number of stages was also one, and the game ended when the game was cleared, that is, when all the first static objects were collected.
  • the results of autonomous play of the computer game by the reinforcement learning system 3 were evaluated under multiple conditions in which the values of λ and δ in the reinforcement learning of the reinforcement learning system 3 were changed. The result of autonomous play by a conventional reinforcement learning method, not by the reinforcement learning system 3, was also used for comparison.
  • as the conventional reinforcement learning method, a DQN (deep Q-network) method with an improved action selection policy was used.
  • as described above, δ is the average number of squares moved in the random walk.
  • in FIG. 9, the vertical axis indicates the score.
  • a graph g91 indicates the average score of autonomous play by the conventional reinforcement learning.
  • Graphs g11 to g14 represent the average values of the scores obtained by autonomous play of the reinforcement learning system 3.
  • Graphs g11 to g14 differ in the value of the hyperparameter λ in the formula representing the second action-value function J (Equation (2) or Equation (3) above).
  • Graphs g11 to g14 are graphs representing the average values of the scores when the hyperparameter λ is set to 0, 0.001, 0.01, and 0.1, respectively.
  • as shown in FIG. 9, the score of the reinforcement learning system 3 according to this exemplary embodiment tends to be higher than the score of the conventional reinforcement learning, and for some values of the hyperparameter λ the score is higher still.
  • in FIG. 10, the vertical axis indicates the recovery rate.
  • a graph g92 shows the average recovery rate of autonomous play by the conventional reinforcement learning.
  • Graphs g21 to g24 represent the average recovery rates of autonomous play by the reinforcement learning system 3.
  • Graphs g21 to g24 differ in the value of the hyperparameter λ of the formula representing the second action-value function J (Equation (2) or Equation (3) above).
  • Graphs g21 to g24 are graphs showing the average recovery rates when the hyperparameter λ is set to 0, 0.001, 0.01, and 0.1, respectively.
  • as shown in FIG. 10, the recovery rate of the reinforcement learning system 3 tends to be higher than the recovery rate of the conventional reinforcement learning. In particular, when the hyperparameter λ is "0.01", the recovery rate is high.
  • FIG. 11 is a graph showing the relationship between the score of autonomous play and δ.
  • in FIG. 11, the horizontal axis indicates δ and the vertical axis indicates the score.
  • Graphs g31 to g34 respectively show the average values of the scores for δ of 1 to 5 when the hyperparameter λ is 0, 0.001, 0.01, and 0.1.
  • the average value of the score of autonomous play by the conventional reinforcement learning is "2009".
  • as shown in FIG. 11, the scores when the value of δ is 1 to 3 are often higher than the scores obtained by the conventional reinforcement learning.
  • FIG. 12 is a graph showing the relationship between the recovery rate of autonomous play and δ.
  • in FIG. 12, the horizontal axis indicates δ and the vertical axis indicates the recovery rate.
  • Graphs g41 to g44 respectively represent the average values of the recovery rate for each value of δ when the hyperparameter λ is "0", "0.001", "0.01", and "0.1".
  • the average recovery rate of autonomous play by the conventional reinforcement learning is 67.5%.
  • as shown in FIG. 12, the recovery rates when the value of δ is 1 to 3 are often higher than the recovery rate of the conventional reinforcement learning, and for some of these values the recovery rate is higher than the others.
  • as described above, the reinforcement learning system 3 calculates the first action-value function using the second state obtained by adding noise to the first state, so that action selection in autonomous play of the computer game can be performed more suitably.
  • a reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 4") is obtained by applying the reinforcement learning system 2 according to exemplary embodiment 2 to control of construction machinery such as an excavator that excavates earth and sand.
  • the reinforcement learning system 4 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in exemplary embodiment 2 described above.
  • the components of the reinforcement learning system 4 are the same as those of the reinforcement learning system 2, and the description thereof will not be repeated here.
  • the reinforcement learning system 4 selects the operation of the construction machine, such as the excavation operation when the hydraulic excavator excavates earth and sand, through reinforcement learning.
  • the purpose of the actions in this reinforcement learning is to excavate a bucket full of earth and sand while preventing the vehicle body from tilting or being dragged during excavation.
  • the first state s_t includes, as an example, part or all of the attitude and position of a construction machine such as a hydraulic excavator, the shape of the earth and sand to be excavated (3D data, etc.), and the amount of earth and sand in the bucket of the excavator.
  • the posture of the construction machine includes, for example, the angles of the bucket, arm, boom, and rotating body of the construction machine.
  • the position of the construction machine includes, for example, the position and direction of the crawler of the construction machine.
  • the action a includes, for example, attitude control of the construction machine (angle control of the bucket, arm, boom, rotating body, and the like).
  • the reward r_t includes, in part or in whole, a positive reward whose absolute value increases as the amount of excavation increases, and a negative reward whose absolute value increases as the degree of inclination of the body of the construction machine, the degree of dragging, or the time required for excavation increases.
  • the state randomization unit 523 may add noise to all of the plurality of elements included in the first state s_t, or may add noise to some of the elements.
  • the elements to which noise is added may include, for example, the posture of the hydraulic excavator and the 3D data of the observed earth and sand.
  • as described above, the reinforcement learning system 4 calculates the first action-value function using the second state obtained by adding noise to the first state s_t, thereby enabling the operation of the construction machine to be selected more suitably.
  • a reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 5") is obtained by applying the reinforcement learning system 2 according to exemplary embodiment 2 to control of a transport device that transports packages.
  • the transport device is, for example, an automated guided vehicle (AGV) that runs automatically.
  • the reinforcement learning system 5 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above.
  • the components of reinforcement learning system 5 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.
  • when transporting a load from a predetermined position to another position, the reinforcement learning system 5 selects actions so that the transportation time is shortened as much as possible (the transportation speed is increased) and there is no contact with static obstacles (shelves, loads, etc.) or dynamic obstacles (people, other robots, etc.) along the way.
  • the first state s_t includes, by way of example, part or all of the position, direction of movement, speed, and angular velocity of the conveying device conveying the goods, the position of passages, the position of static obstacles, and the position and movement speed of dynamic obstacles.
  • Action a includes, for example, velocity control and angular velocity control of the conveying device.
  • the reward r_t includes, for example, part or all of a positive reward obtained when transportation is completed, a negative reward obtained when the transport device contacts an obstacle, and a negative reward whose absolute value increases as the transportation time increases.
  • the state randomization unit 523 may add noise to all of the plurality of elements included in the first state s_t, or may add noise to some of the elements.
  • the elements to which noise is added may include, for example, the position, orientation, velocity, and angular velocity of the transport device, and may also include the position of static obstacles or the position and velocity of dynamic obstacles.
  • as an example, the state randomization unit 523 may add noise to obstacles positioned in the traveling direction of the transport device or on the traveling route, and may refrain from adding noise to obstacles positioned outside the traveling direction or off the traveling route.
  • as described above, the reinforcement learning system 5 calculates the first action-value function using the second state obtained by adding noise to the first state s_t, thereby enabling the transport control of the transport device to be performed more suitably.
  • a reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 6") is obtained by applying the reinforcement learning system 2 according to exemplary embodiment 2 to the control of a forklift.
  • the reinforcement learning system 6 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above.
  • the components of reinforcement learning system 6 are the same as those of reinforcement learning system 2, and the description thereof will not be repeated here.
  • when transporting a pallet from a predetermined position to another position, the reinforcement learning system 6 selects actions so that the transport time is shortened as much as possible (the transport speed is increased) and there is no contact with static obstacles (shelves, luggage, etc.) or dynamic obstacles (people, other forklifts, etc.).
  • the first state s_t includes, by way of example, the position, direction of movement, speed, and angular velocity of the forklift, the position of the path, the position of static obstacles, and the position and speed of dynamic obstacles.
  • Action a includes, for example, speed control and angular speed control of a forklift.
  • the reward r_t includes, for example, part or all of a positive reward obtained when transportation is completed, a negative reward obtained when the forklift contacts an obstacle, and a negative reward whose absolute value increases as the transportation time increases.
  • the state randomization unit 523 may add noise to all of the plurality of elements included in the first state s_t, or may add noise to some of the elements.
  • the elements to which noise is added may include, for example, the position, orientation, velocity, and angular velocity of the forklift, and may also include the position of static obstacles or the position and velocity of dynamic obstacles.
  • as an example, the state randomization unit 523 may add noise to obstacles positioned in the traveling direction of the forklift or on the traveling route, and may refrain from adding noise to obstacles positioned outside the traveling direction or off the traveling route.
  • as described above, the reinforcement learning system 6 calculates the first action-value function using the second state obtained by adding noise to the first state s_t, thereby enabling the forklift control to be performed more suitably.
  • a reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 7") has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in exemplary embodiment 2.
  • the components of the reinforcement learning system 7 are the same as those of the reinforcement learning system 2, and the description thereof will not be repeated here.
  • the first state s_t includes a plurality of elements accompanied by attributes.
  • when adding noise to the first state s_t, the state randomization unit 523 applies different weightings for adding noise depending on the attributes.
  • the state randomization unit 523 may increase the weighting of dynamic elements that move within the environment and decrease the weighting of static elements that do not move within the environment.
  • the state randomization unit 523 may weight the position of a person among the dynamic elements that move within the environment more than weight the other dynamic elements.
  • by calculating the first action-value function using a second state to which noise is added with weighting according to the attribute of each element, a first action-value function that takes into account variation in the data according to the attributes can be calculated. By selecting an action using this first action-value function, it is possible to select an action in consideration of variations in the data according to the attributes.
  • the state randomization unit 523 may change the weighting of noise addition during execution of reinforcement learning.
  • the state randomization unit 523 performs control such that when a dynamic element is moving in the environment, the weight is increased, and when the dynamic element is not moving in the environment, the weight is decreased.
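  • a minimal sketch of attribute-dependent weighting of the noise; the weight table and the attribute labels are illustrative assumptions, and the weights could equally be changed at run time as described above.

    import numpy as np

    NOISE_WEIGHT = {"person": 3.0, "dynamic": 1.5, "static": 0.2}  # larger weight -> larger noise

    def randomize_with_weights(elements, base_std=0.1, rng=None):
        """Add noise whose standard deviation is scaled by a per-attribute weight."""
        rng = rng or np.random.default_rng()
        out = []
        for elem in elements:
            value = np.asarray(elem["value"], dtype=float)
            std = base_std * NOISE_WEIGHT.get(elem["attribute"], 1.0)
            out.append({"attribute": elem["attribute"],
                        "value": value + rng.normal(0.0, std, value.shape)})
        return out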
  • some or all of the functions of the reinforcement learning device 10, the terminal 20, the server 30, the terminal 40, and the reinforcement learning device 50 (hereinafter referred to as "the reinforcement learning device 10 and the like") may be realized by hardware such as integrated circuits (IC chips), or may be implemented by software.
  • the reinforcement learning device 10 and the like are realized by, for example, a computer that executes instructions of a program that is software for realizing each function.
  • an example of such a computer (hereinafter referred to as computer C) is shown in FIG. 13.
  • Computer C comprises at least one processor C1 and at least one memory C2.
  • a program P for operating the computer C as the reinforcement learning device 10 or the like is recorded in the memory C2.
  • the processor C1 reads the program P from the memory C2 and executes it, thereby realizing each function of the reinforcement learning device 10 and the like.
  • as the processor C1, for example, a CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit), a microcontroller, or a combination of these can be used.
  • as the memory C2, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination of these can be used.
  • the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data.
  • Computer C may further include a communication interface for sending and receiving data to and from other devices.
  • Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
  • the program P can be recorded on a non-temporary tangible recording medium M that is readable by the computer C.
  • as the recording medium M, for example, a tape, a disk, a card, a semiconductor memory, a programmable logic circuit, or the like can be used.
  • the computer C can acquire the program P via such a recording medium M.
  • the program P can be transmitted via a transmission medium.
  • as the transmission medium, for example, a communication network, broadcast waves, or the like can be used.
  • Computer C can also obtain program P via such a transmission medium.
  • Appendix 1 A reinforcement learning system comprising: acquisition means for acquiring a first state in an environment that is a target of reinforcement learning; generation means for generating a second state by adding noise to the first state; calculation means for calculating a first action-value function according to the second state; and selection means for selecting an action according to the first action-value function.
  • a more suitable action can be selected by calculating the first action value function using the second state obtained by adding noise to the first state.
  • Appendix 2 The calculation means calculates the first action value function according to the first state and the second state, The reinforcement learning system according to Appendix 1.
  • a more suitable action can be selected by calculating the first action value function using a plurality of states including the second state to which noise is added.
  • the calculation means calculates the first action-value function for each of the first state and the second state,
  • the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
  • the reinforcement learning system according to appendix 2.
  • a more suitable action can be selected by using the second action-value function calculated using the plurality of first action-value functions.
  • the first state includes at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of static or dynamic obstacles,
  • the reinforcement learning system according to any one of Appendices 1 to 3.
  • the first state includes at least one of the posture and position of the construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator,
  • the reinforcement learning system according to any one of Appendices 1 to 3.
  • the first state includes a plurality of elements accompanied by attributes;
  • the generation means generates the second state by selectively adding noise to a plurality of elements included in the first state according to the attribute.
  • the reinforcement learning system according to any one of Appendices 1 to 5.
  • the first state includes a state related to a dynamic element moving within the environment;
  • the generating means generates the second state by adding noise to states of the dynamic elements included in the first state.
  • the reinforcement learning system according to appendix 6.
  • A reinforcement learning device comprising: acquisition means for acquiring a first state in an environment that is a target of reinforcement learning; generation means for generating a second state by adding noise to the first state; calculation means for calculating a first action-value function according to the second state; and selection means for selecting an action according to the first action-value function.
  • a more suitable action can be selected by calculating the first action value function using the second state obtained by adding noise to the first state.
  • The reinforcement learning device according to Appendix 8, wherein the calculation means calculates the first action-value function according to the first state and the second state.
  • a more suitable action can be selected by calculating the first action value function using a plurality of states including the second state to which noise is added.
  • The reinforcement learning device according to Appendix 9, wherein the calculation means calculates the first action-value function for each of the plurality of states included in the state sequence, and the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
  • a more suitable action can be selected by using the second action-value function calculated using the plurality of first action-value functions.
  • The reinforcement learning device according to any one of Appendices 8 to 10, wherein the first state includes at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of a static or dynamic obstacle.
  • The reinforcement learning device according to any one of Appendices 8 to 10, wherein the first state includes at least one of the posture and position of a construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
  • The reinforcement learning device according to any one of Appendices 8 to 12, wherein the first state includes a plurality of elements accompanied by attributes, and the generation means generates the second state by selectively adding noise to the plurality of elements included in the first state according to the attributes.
  • The reinforcement learning device according to Appendix 13, wherein the first state includes a state related to a dynamic element moving within the environment, and the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.
  • a more suitable action can be selected by calculating the first action value function using the second state obtained by adding noise to the first state.
  • a more suitable action can be selected by calculating the first action value function using a plurality of states including the second state to which noise is added.
  • a more suitable action can be selected by using the second action-value function calculated using the plurality of first action-value functions.
  • The reinforcement learning method according to any one of Appendices 15 to 17, wherein the first state includes at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of a static or dynamic obstacle.
  • The reinforcement learning method according to any one of Appendices 15 to 17, wherein the first state includes at least one of the posture and position of a construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
  • The reinforcement learning method according to any one of Appendices 15 to 19, wherein the first state includes a plurality of elements accompanied by attributes, and generating the second state includes selectively adding noise to the plurality of elements included in the first state according to the attributes.
  • The reinforcement learning method according to Appendix 20, wherein the first state includes a state related to a dynamic element moving within the environment, and generating the second state includes adding noise to the state of the dynamic element included in the first state.
  • The reinforcement learning system according to any one of Appendices 1 to 5, wherein the first state includes a plurality of elements accompanied by attributes, and the generation means weights the addition of the noise differently depending on the attribute.
  • The reinforcement learning system according to any one of Appendices 1 to 6 and 19, wherein the first state includes part or all of the posture and position of a construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator, and the action includes attitude control of the construction machine.
  • According to the above configuration, the excavating motion of the excavator can be selected more suitably by reinforcement learning.
  • The reinforcement learning system according to any one of Appendices 1 to 6 and 19, wherein the first state includes part or all of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of a static or dynamic obstacle, and the action includes velocity control and angular velocity control of the conveying device.
  • According to the above configuration, the conveying operation of the conveying device can be selected more suitably by reinforcement learning.
  • The reinforcement learning system according to any one of Appendices 1 to 6 and 19, wherein the first state includes a state of an object that affects the progress of a game in a computer game, and the action includes an action of an object operated by a player of the computer game.
  • A program that causes a computer to function as a reinforcement learning device, the program causing the computer to function as: acquisition means for acquiring a first state in an environment that is a target of reinforcement learning; generation means for generating a second state by adding noise to the first state; calculation means for calculating a first action-value function according to the second state; and selection means for selecting an action according to the first action-value function.
  • The calculation means calculates the first action-value function according to the first state and the second state.
  • The calculation means calculates the first action-value function for each of the first state and the second state, and the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
  • The first state includes at least one of the position, moving direction, speed, and angular speed of a conveying device that conveys a conveyed object, the position of a path, and the position and speed of a static or dynamic obstacle.
  • The first state includes at least one of the posture and position of a construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
  • The program according to any one of Appendices 26 to 30, wherein the first state includes a plurality of elements accompanied by attributes, and the generation means generates the second state by selectively adding noise to the plurality of elements included in the first state according to the attributes.
  • The first state includes a state related to a dynamic element moving within the environment, and the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.
  • A reinforcement learning device comprising a processor, the processor executing: an acquisition process of acquiring a first state in an environment that is a target of reinforcement learning; a generation process of generating a second state by adding noise to the first state; a calculation process of calculating a first action-value function according to the second state; and a selection process of selecting an action according to the first action-value function.
  • The reinforcement learning device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process, the generation process, the calculation process, and the selection process. This program may also be recorded in a computer-readable non-transitory tangible recording medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In order to select a more suitable action, this reinforced learning system (1) comprises an acquisition unit (11) for acquiring a first state in an environment that is the object of reinforced learning, a generation unit (12) for generating a second state by adding noise to the first state, a calculation unit (13) for calculating a first action value function in accordance with the second state, and a selection unit (14) for selecting an action in accordance with the first action value function.

Description

Reinforcement learning system, reinforcement learning device, and reinforcement learning method
 The present invention relates to a reinforcement learning system, a reinforcement learning device, and a reinforcement learning method.
 Research is underway on reinforcement learning, which learns actions that maximize the reward obtained when the next action is taken in a certain state. Patent Document 1 describes a technique for learning, by reinforcement learning, images of a concave part and a convex part and control amounts for fitting the parts together in assembly work by a robot arm. Patent Document 2 describes a technique of learning an accelerator operation amount using reinforcement learning and selecting an action consisting of a throttle opening command value and a retardation amount according to the state. Patent Document 2 also describes that a function approximator may be used for the action-value function Q.
Patent Document 1: International Publication No. 2018/146770
Patent Document 2: Japanese Patent Application Laid-Open No. 2021-67193
 However, the techniques described in Patent Documents 1 and 2 have room for improvement in terms of selecting more suitable actions. This is because an appropriate action can be selected if the action-value function can be accurately estimated in reinforcement learning, but the action-value function estimated in the techniques described in Patent Documents 1 and 2 includes errors. Especially when the state-action space is huge, it is difficult to accurately estimate the action-value function.
 One aspect of the present invention has been made in view of the above problems, and an example of its purpose is to provide a technique that allows selection of more suitable actions.
 A reinforcement learning system according to one aspect of the present invention includes acquisition means for acquiring a first state in an environment that is a target of reinforcement learning, generation means for generating a second state by adding noise to the first state, calculation means for calculating a first action-value function according to the second state, and selection means for selecting an action according to the first action-value function.
 A reinforcement learning device according to one aspect of the present invention includes acquisition means for acquiring a first state in an environment that is a target of reinforcement learning, generation means for generating a second state by adding noise to the first state, calculation means for calculating a first action-value function according to the second state, and selection means for selecting an action according to the first action-value function.
 A reinforcement learning method according to one aspect of the present invention includes generating a second state by adding noise to a first state in an environment that is a target of reinforcement learning, calculating a first action-value function according to the second state, and selecting an action according to the first action-value function.
 According to one aspect of the present invention, a more suitable action can be selected.
 FIG. 1 is a block diagram showing the configuration of a reinforcement learning system according to exemplary embodiment 1 of the present invention. FIG. 2 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 1 of the present invention. FIG. 3 is a block diagram illustrating a configuration example of the reinforcement learning system according to exemplary embodiment 1 of the present invention. FIG. 4 is a block diagram showing an example of a device configuration for realizing exemplary embodiment 1 of the present invention. FIG. 5 is a block diagram showing the configuration of a reinforcement learning system according to exemplary embodiment 2 of the present invention. FIG. 6 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 2 of the present invention. FIG. 7 is a diagram showing an example of a game screen according to exemplary embodiment 3 of the present invention. FIG. 8 is a diagram illustrating a first state according to exemplary embodiment 3 of the present invention. FIGS. 9 to 12 are diagrams each showing an example of an evaluation result according to an application example of exemplary embodiment 3 of the present invention. FIG. 13 is a block diagram showing the configuration of a computer functioning as the reinforcement learning device, the terminal 20, and the server 30 according to exemplary embodiments 1 to 7 of the present invention.
[Exemplary embodiment 1]
 A first exemplary embodiment of the present invention will now be described in detail with reference to the drawings. This exemplary embodiment forms the basis of the exemplary embodiments described later.
<Configuration of reinforcement learning system>
 The configuration of a reinforcement learning system 1 according to this exemplary embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the reinforcement learning system 1. The reinforcement learning system 1 is a system that selects actions by reinforcement learning. The reinforcement learning system 1 is, for example, a system for controlling construction operations of a construction machine such as an excavator, a system for controlling transportation by a transport device, or a system for autonomous play of a computer game. However, the reinforcement learning performed by the reinforcement learning system 1 is not limited to the examples described above and can be applied to various systems. The action is the action of an agent in reinforcement learning; examples include excavating motion control of an excavator, transport motion control of a transport device, and autonomous play control of a computer game. However, the action is not limited to these examples and may be something other than the above.
 As shown in FIG. 1, the reinforcement learning system 1 includes an acquisition unit 11, a generation unit 12, a calculation unit 13, and a selection unit 14. The acquisition unit 11 is a configuration that implements acquisition means in this exemplary embodiment. The generation unit 12 is a configuration that implements generation means in this exemplary embodiment. The calculation unit 13 is a configuration that implements calculation means in this exemplary embodiment. The selection unit 14 is a configuration that implements selection means in this exemplary embodiment.
 The acquisition unit 11 acquires the first state. The first state is a state in the environment that is the target of reinforcement learning. For example, when the reinforcement learning system 1 is a system for selecting the excavating operation of an excavator, the first state includes, as an example, part or all of the posture and position of the excavator that excavates earth and sand, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of the excavator. When the reinforcement learning system 1 is a system for selecting the transport operation of a transport device, the first state includes, as an example, part or all of the position, moving direction, speed, and angular speed of the transport device, the position of a path, and the position and speed of static or dynamic obstacles. When the reinforcement learning system 1 is a system for autonomous play of a computer game, the first state includes, as an example, the state of an object that affects the progress of the game in the computer game. However, the first state is not limited to those described above and may be another state. The first state may include, for example, an environmental condition such as temperature or weather.
 The generation unit 12 generates the second state by adding noise to the first state. The noise is, for example, a random number such as a normal random number or a uniform random number. However, the noise that the generation unit 12 adds to the first state is not limited to these and may be noise other than the above. The generation unit 12 may add noise to all of the elements included in the first state, or may add noise to only some of the elements included in the first state.
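 As a purely illustrative sketch (not part of the embodiment itself), the noise addition by the generation unit 12 could be implemented as follows for a first state represented as a numeric vector; the use of NumPy, the Gaussian noise, and the parameter names (noise_scale, num_samples) are assumptions made for this example.

    import numpy as np

    def generate_second_states(first_state, num_samples=5, noise_scale=0.05, rng=None):
        """Generate second states by adding random noise to the first state.

        first_state: 1-D sequence of numeric state elements.
        Returns a list of noisy copies (the second states).
        """
        rng = np.random.default_rng() if rng is None else rng
        s = np.asarray(first_state, dtype=float)
        return [s + rng.normal(0.0, noise_scale, size=s.shape) for _ in range(num_samples)]

    # Example: a first state with four elements (e.g., position x, position y, speed, heading).
    second_states = generate_second_states([1.0, 2.0, 0.5, 0.1])

 A uniform distribution could be substituted for rng.normal, matching the description that the noise may be a normal or uniform random number.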
 The calculation unit 13 calculates the first action-value function according to the second state. As an example, the calculation unit 13 calculates the first action-value function using a state sequence including a plurality of second states. The calculation unit 13 may also calculate the first action-value function using a state sequence including the first state and one or more second states. In other words, the state sequence used by the calculation unit 13 to calculate the first action-value function includes one or more second states, and each state included in the state sequence is either the first state or a second state. In the following description, when there is no need to distinguish between the first state and the second state, they are also simply referred to as "states."
 The first action-value function is a function for evaluating an action in a state. The first action-value function is, for example, an action-value function used in Q-learning, and is updated, as an example, by the following equation (1). However, the first action-value function is not limited to the one given by equation (1) and may be another function.
 Q(s_t^(i), a) ← Q(s_t^(i), a) + α[ r_{t+1} + γ max_{a'∈A} Q(s_{t+1}^(i), a') − Q(s_t^(i), a) ]   ... (1)
 In equation (1), s_t^(i) (1 ≤ i ≤ n; i and n are natural numbers) is a state included in the state sequence (that is, the first state or a second state), a is an action, and Q(s_t^(i), a) is the first action-value function. α is the learning rate, s_{t+1}^(i) is the state after the transition, r_{t+1} is the reward the agent obtains when it transitions to the state s_{t+1}^(i), and γ (0 ≤ γ ≤ 1) is the discount rate. Also, a' ∈ A, where the set A is the set of actions available to the agent in the state s_t^(i).
 The reward is a reward obtained from the environment as a result of the agent's action. The reward is, as an example, a value that is added or subtracted according to the amount of excavation by the excavator, the time required for excavation, the time required for transportation, the presence or absence of contact with an obstacle during transportation, the win or loss of the game, or the score of the game. However, the reward is not limited to these examples and may be something other than the above.
 When equation (1) is used, the calculation unit 13 calculates the first action-value function for each state included in the state sequence. In other words, the calculation unit 13 calculates as many first action-value functions as there are states included in the state sequence.
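 A minimal tabular sketch of how equation (1) might be applied to every state in a state sequence is shown below; the discretization of states into table keys and the parameter values are illustrative assumptions, not a definitive implementation of the embodiment.

    import numpy as np
    from collections import defaultdict

    class TabularQ:
        """First action-value function Q(s, a) held as a table and updated by equation (1)."""

        def __init__(self, num_actions, alpha=0.1, gamma=0.99):
            self.q = defaultdict(lambda: np.zeros(num_actions))  # one m-dimensional vector per state
            self.alpha = alpha   # learning rate
            self.gamma = gamma   # discount rate

        @staticmethod
        def _key(state):
            # Round so that a continuous state can index the table (an illustrative choice).
            return tuple(np.round(np.asarray(state, dtype=float), 2))

        def update(self, state, action, reward, next_state):
            """Apply equation (1) once, for a single state in the state sequence."""
            s, s_next = self._key(state), self._key(next_state)
            td_target = reward + self.gamma * np.max(self.q[s_next])
            self.q[s][action] += self.alpha * (td_target - self.q[s][action])

        def update_sequence(self, states, action, reward, next_states):
            """Apply equation (1) to each state s_t^(i) in the state sequence."""
            for s, s_next in zip(states, next_states):
                self.update(s, action, reward, s_next)

 Here one first action-value function is updated per state in the sequence, matching the description that as many first action-value functions are calculated as there are states.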
 The selection unit 14 selects an action according to the first action-value function. As an example, the selection unit 14 selects the action that maximizes the first action-value function. The selection unit 14 may select an action by the ε-greedy method, roulette selection used in genetic algorithms, the softmax method using the Boltzmann distribution, or the like.
 When a plurality of first action-value functions are used, the selection unit 14 may, as an example, select an action using any one of the plurality of first action-value functions, or may calculate a second action-value function using the plurality of first action-value functions calculated by the calculation unit 13 and select an action using the calculated second action-value function. The second action-value function is a function for evaluating an action in a state. The second action-value function may be, as an example, the expected value of the plurality of first action-value functions, or may be, as an example, a function that takes a value smaller than that expected value as the variation of the plurality of first action-value functions increases. The second action-value function is given, as an example, by the following equation (2) or equation (3). However, the second action-value function is not limited to those given by equation (2) or (3) and may be another function.
 J(s_t, a) = −(1/θ) log E[ exp(−θ Q(s_t^(i), a)) ]   ... (2)
 J(s_t, a) = E[ Q(s_t^(i), a) ] − (θ/2)( E[ Q(s_t^(i), a)² ] − ( E[ Q(s_t^(i), a) ] )² )   ... (3)
 In equations (2) and (3), J(s_t, a) is the second action-value function, s_t is the first state, a is an action, θ is a hyperparameter, Q(s_t^(i), a) is the first action-value function, s_t^(i) is a state included in the state sequence, and E is the expected value. Equation (3) is obtained by Taylor-expanding equation (2), adopting the terms up to the second order, and discarding the third- and higher-order terms.
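 Under the reading above, in which equation (2) is an exponential, variation-sensitive aggregate over the first action-value functions and equation (3) is its second-order Taylor approximation, a sketch of the two calculations could look like this; the exact functional forms remain an assumption based on that description.

    import numpy as np

    def second_action_value_exp(q_values, theta=1.0):
        """Equation (2)-style aggregate of Q(s_t^(i), a) for one action a (assumed form)."""
        q = np.asarray(q_values, dtype=float)
        return -(1.0 / theta) * np.log(np.mean(np.exp(-theta * q)))

    def second_action_value_taylor(q_values, theta=1.0):
        """Equation (3)-style aggregate: expected value penalized by the variation (assumed form)."""
        q = np.asarray(q_values, dtype=float)
        return q.mean() - 0.5 * theta * q.var()

    # Q(s_t^(i), a) for one action over a state sequence of five states.
    qs = [1.0, 1.2, 0.9, 1.1, 1.0]
    print(second_action_value_exp(qs), second_action_value_taylor(qs))

 Both values fall below the plain mean as the spread of the Q values grows, which is the property the description attributes to the second action-value function.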
 When the selection unit 14 calculates the second action-value function, the selection unit 14 selects, as an example, the action that maximizes the second action-value function using the policy given by equation (4). Note that the policy for selecting an action is not limited to the policy given by equation (4), and another policy may be used. For example, the selection unit 14 may select an action by the ε-greedy method, roulette selection used in genetic algorithms, or the softmax method using the Boltzmann distribution. When the ε-greedy method is used, the policy is given, as an example, by the following equation (5).
 π = argmax_{a'∈A} J(s_t, a')   ... (4)
 π = argmax_{a'∈A} J(s_t, a')  (if v ≥ ε);  π = a randomly selected action in A  (if v < ε)   ... (5)
 In equations (4) and (5), π is the next action to be selected, and a' is an action available to the agent in the first state s_t. In equation (5), ε (0 < ε < 1) is a constant, and v is a random number satisfying 0 ≤ v ≤ 1.
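 The selection by equation (4), and its ε-greedy variant in the spirit of equation (5), might be sketched as follows; the convention of comparing the random number v with ε is an assumption for illustration.

    import numpy as np

    def select_action(j_values, epsilon=0.0, rng=None):
        """Select an action from J(s_t, a') computed for every available action a'.

        epsilon=0.0 reduces to the greedy policy of equation (4); a positive epsilon
        gives an epsilon-greedy policy in the spirit of equation (5).
        """
        rng = np.random.default_rng() if rng is None else rng
        j = np.asarray(j_values, dtype=float)
        if rng.random() < epsilon:               # exploration branch
            return int(rng.integers(len(j)))
        return int(np.argmax(j))                 # greedy branch: maximize J

    # Example: four candidate actions.
    action = select_action([0.2, 0.8, 0.5, 0.1], epsilon=0.1)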
<Effect of reinforcement learning system>
 According to the reinforcement learning system 1 according to this exemplary embodiment, by calculating the action-value function using the second state obtained by adding noise to the first state, a first action-value function that takes the variation of the state into account can be calculated. By selecting an action using this first action-value function, the reinforcement learning system 1 can select a more suitable action.
<Flow of reinforcement learning method>
 FIG. 2 is a flowchart showing the flow of the reinforcement learning method S1 executed by the reinforcement learning system 1. The reinforcement learning system 1 repeatedly selects an action by repeating the reinforcement learning method S1. Descriptions of content that has already been explained will not be repeated.
 The reinforcement learning method S1 includes steps S11 to S14. In step S11, the acquisition unit 11 acquires the first state. In step S12, the generation unit 12 generates the second state by adding noise to the first state.
 In step S13, the calculation unit 13 calculates the first action-value function according to the second state. Here, in the n-th iteration (n is a natural number), the calculation unit 13 refers to, as an example, the states, actions, and rewards accumulated up to the (n−1)-th iteration as the data for calculating the first action-value function. In step S14, the selection unit 14 selects an action according to the first action-value function.
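 Putting steps S11 to S14 together, one iteration of the reinforcement learning method S1 might be organized as in the following sketch; the callables get_first_state, q_values, and act, as well as the mean-minus-variation aggregation, are placeholders assumed only for illustration.

    import numpy as np

    def run_iteration_s1(get_first_state, q_values, act, num_samples=5,
                         noise_scale=0.05, theta=1.0, rng=None):
        """One pass through steps S11-S14: acquire, add noise, evaluate, select."""
        rng = np.random.default_rng() if rng is None else rng
        s_t = np.asarray(get_first_state(), dtype=float)                        # S11: acquire first state
        states = [s_t + rng.normal(0.0, noise_scale, s_t.shape)
                  for _ in range(num_samples)]                                   # S12: generate second states
        q_per_state = np.stack([q_values(s) for s in states])                    # S13: Q for each state
        j = q_per_state.mean(axis=0) - 0.5 * theta * q_per_state.var(axis=0)     # aggregate over states
        action = int(np.argmax(j))                                               # S14: select action
        act(action)
        return action

    # Toy usage with stand-in callables (three available actions).
    run_iteration_s1(get_first_state=lambda: [0.0, 1.0],
                     q_values=lambda s: np.array([s[0], s[1], 0.5]),
                     act=lambda a: None)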
<Effect of reinforcement learning method>
 According to the reinforcement learning method S1 according to this exemplary embodiment, by calculating the action-value function using the second state obtained by adding noise to the first state, an action-value function that takes the variation of the state into account can be calculated. By selecting an action using this action-value function, a more suitable action can be selected.
<Device configuration example of reinforcement learning system>
 Next, device configuration examples of the reinforcement learning system 1 according to this exemplary embodiment will be described with reference to the drawings. FIG. 3 is a block diagram showing an example of the configuration of the reinforcement learning system 1. In the example of FIG. 3, the reinforcement learning system 1 includes a reinforcement learning device 10. The reinforcement learning device 10 includes the acquisition unit 11, the generation unit 12, the calculation unit 13, and the selection unit 14. The reinforcement learning device 10 is, for example, a server device, a personal computer, or a game device, but is not limited to these and may be a device other than the above. As an example, the reinforcement learning device 10 may acquire the first state by receiving the first state via a communication interface.
 FIG. 4 is a block diagram showing another example of the configuration of the reinforcement learning system 1. In the example of FIG. 4, the reinforcement learning system 1 includes a terminal 20 and a server 30. The terminal 20 is, for example, a personal computer or a game device, but is not limited to these and may be a device other than the above. The terminal 20 includes the acquisition unit 11. The server 30 includes the generation unit 12, the calculation unit 13, and the selection unit 14. The terminal 20 acquires the first state and supplies the acquired first state to the server 30.
 Although FIGS. 3 and 4 are illustrated as configuration examples of the reinforcement learning system 1 in this exemplary embodiment, the configuration of the reinforcement learning system 1 is not limited to those illustrated in FIGS. 3 and 4, and various other configurations are applicable.
[Exemplary embodiment 2]
 A second exemplary embodiment of the present invention will now be described in detail with reference to the drawings. Components having the same functions as the components described in exemplary embodiment 1 are denoted by the same reference numerals, and their description will not be repeated.
<Configuration of reinforcement learning system>
 FIG. 5 is a block diagram showing the configuration of the reinforcement learning system 2. As shown in FIG. 5, the reinforcement learning system 2 includes a terminal 40 and a reinforcement learning device 50. The terminal 40 and the reinforcement learning device 50 are configured to be able to communicate with each other via a communication line N. The specific configuration of the communication line N does not limit this exemplary embodiment; as an example, a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, a mobile data communication network, or a combination of these networks can be used.
 The terminal 40 is, for example, a general-purpose computer, and more specifically is, for example, a control device that controls a construction machine such as an excavator, a management device that manages transportation by a transport device, or a game device for playing a computer game. Note that the terminal 40 is not limited to these and may be a device other than the above. The reinforcement learning device 50 is, for example, a server device.
<Configuration of terminal>
 The terminal 40 includes a communication unit 41, a control unit 42, and an input reception unit 43. The communication unit 41 transmits and receives information to and from the reinforcement learning device 50 via the communication line N under the control of the control unit 42. Hereinafter, the control unit 42 transmitting and receiving information to and from the reinforcement learning device 50 via the communication unit 41 is also simply described as the control unit 42 transmitting and receiving information to and from the reinforcement learning device 50.
 The control unit 42 includes a state providing unit 421, an action execution unit 422, and a reward providing unit 423. The state providing unit 421 acquires the first state and provides the acquired first state to the reinforcement learning device 50. In this exemplary embodiment, the first state acquired by the state providing unit 421 includes a plurality of elements accompanied by attributes. An attribute is information indicating the characteristics and/or type of an element, and includes, for example, information indicating whether the element is a dynamic element that moves within the environment or a static element that does not move within the environment. An attribute may also be information indicating the type of an element, such as a person, an automobile, a bicycle, or a building. However, the attribute is not limited to the examples described above and may be other information.
 As an example, the state providing unit 421 may acquire, as the first state, sensor information output by a sensor that detects the operation of a construction machine, a transport device, or the like. As another example, the state providing unit 421 may acquire the first state of an object that affects the progress of the game in a computer game. However, the first state acquired by the state providing unit 421 is not limited to the examples described above and may be a state other than the above.
 As an example, the state providing unit 421 receives input of the first state via the input reception unit 43 and provides the received first state to the reinforcement learning device 50. As another example, the state providing unit 421 may receive the first state from another device connected via the communication unit 41 and provide the received first state to the reinforcement learning device 50.
 The action execution unit 422 executes the action determined by the reinforcement learning device 50. As an example, the action execution unit 422 outputs control information for causing a construction machine, a transport device, or the like to perform the action determined by the reinforcement learning device 50. As another example, the action execution unit 422 controls the action of an object that is the target of a user operation in a computer game. However, the executed action is not limited to the examples described above and may be an action other than the above.
 The reward providing unit 423 provides the reinforcement learning device 50 with the reward obtained when the agent executes the action determined by the reinforcement learning device 50. As an example, the reward providing unit 423 provides, as the reward, information indicating the amount of excavation by the excavator, the time required for excavation, the time required for transportation by the transport device, the presence or absence of contact with an obstacle during transportation, the win or loss of the game, or the score of the game to the reinforcement learning device 50. However, the reward provided by the reward providing unit 423 is not limited to the examples described above and may be another reward.
 As an example, the reward providing unit 423 provides the reinforcement learning device 50 with the reward acquired via the input reception unit 43. The reward providing unit 423 may also receive a reward from another device connected via the communication unit 41 and provide the received reward to the reinforcement learning device 50.
 The input reception unit 43 receives various inputs to the terminal 40. The specific configuration of the input reception unit 43 does not limit this exemplary embodiment; as an example, the input reception unit 43 can be configured to include input devices such as a keyboard and a touch pad. The input reception unit 43 may also be configured to include a data scanner that reads data via electromagnetic waves such as infrared rays and radio waves, a sensor that senses the state of the environment, and the like. As an example, the reward providing unit 423 measures the time required for transportation by the transport device based on the sensing result acquired by the input reception unit 43 and provides the reinforcement learning device 50 with a reward indicating the measurement result.
 The input reception unit 43 supplies the information received via the above-described input devices, data scanner, sensor, and the like to the control unit 42. As an example, the input reception unit 43 acquires the above-described state and reward and supplies the acquired state and reward to the control unit 42.
<Configuration of reinforcement learning device>
 The reinforcement learning device 50 includes a communication unit 51, a control unit 52, and a storage unit 53. The communication unit 51 transmits and receives information to and from the terminal 40 via the communication line N under the control of the control unit 52. Hereinafter, the control unit 52 transmitting and receiving information to and from the terminal 40 via the communication unit 51 is also simply described as the control unit 52 transmitting and receiving information to and from the terminal 40.
 The control unit 52 includes a reward acquisition unit 521, a state observation unit 522, a state randomization unit 523, a learning unit 524, an estimation unit 525, and a selection unit 526. The state observation unit 522 is a configuration that implements acquisition means in this exemplary embodiment. The state randomization unit 523 is a configuration that implements generation means in this exemplary embodiment. The estimation unit 525 is a configuration that implements calculation means in this exemplary embodiment. The selection unit 526 is a configuration that implements selection means in this exemplary embodiment.
 The reward acquisition unit 521 acquires the reward provided by the terminal 40 via the communication unit 51. The state observation unit 522 acquires the first state provided by the terminal 40 via the communication unit 51. The state randomization unit 523 generates one or more second states by adding noise to the first state acquired by the state observation unit 522. The learning unit 524 trains an action-value function model 531 for updating the first action-value function. The action-value function model 531 is used to estimate the first action-value function.
 The estimation unit 525 calculates the first action-value function according to a state sequence including the first state and one or more second states, or a state sequence including a plurality of second states. The estimation unit 525 also calculates the second action-value function using the first action-value function.
 The selection unit 526 selects an action using the second action-value function, stores information indicating the selected action in the storage unit 53, and transmits the information indicating the selected action to the terminal 40.
 The storage unit 53 stores various data that the control unit 52 refers to. As an example, the storage unit 53 stores the action-value function model 531 and learning data 532. The action-value function model 531 is a learning model for updating the first action-value function. The learning data 532 is data used in the reinforcement learning performed by the reinforcement learning device 50. The learning data 532 includes, as an example, the first state, the second states, the action, and the reward.
<Flow of reinforcement learning method>
 FIG. 6 is a flowchart showing the flow of the reinforcement learning method S2 executed by the reinforcement learning system 2. The reinforcement learning system 2 repeatedly selects an action by repeating steps S21 to S29. Note that some steps may be performed in parallel or in a different order.
 In step S21, the state providing unit 421 acquires the first state s_t and provides the acquired first state s_t to the reinforcement learning device 50. In step S22, the state observation unit 522 acquires the first state s_t from the terminal 40.
 In step S23, the state randomization unit 523 generates one or more second states by adding noise to the first state s_t. The noise that the state randomization unit 523 adds to the first state s_t is, for example, a normal random number or a uniform random number. However, the noise that the state randomization unit 523 adds to the first state s_t is not limited to these and may be noise other than the above. A second state to which noise has been added represents a state in which the first state s_t is slightly perturbed.
 In this operation example, the state randomization unit 523 generates the second states by selectively adding noise to the plurality of elements included in the first state s_t according to their attributes. As an example, the state randomization unit 523 adds noise to the elements accompanied by attributes that satisfy a predetermined condition. The predetermined condition is, for example, that the attribute indicates a dynamic element, or that the attribute indicates a static element. However, the predetermined condition is not limited to the examples described above and may be another condition.
 The state randomization unit 523 also generates a state sequence {s_t^(i)} (1 ≤ i ≤ n; i is a natural number and n is a natural number of 2 or more) including the generated second states. The state sequence {s_t^(i)} is a state sequence including the first state s_t and one or more second states, or a state sequence including a plurality of second states. In other words, the state sequence {s_t^(i)} includes at least a second state, and may or may not include the first state s_t.
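 A minimal sketch of the selective noise addition of step S23 and of building the state sequence {s_t^(i)} is shown below; representing the first state as a dictionary of named elements with a 'dynamic' or 'static' attribute is an assumption made for this example.

    import numpy as np

    def randomize_state(first_state, attributes, num_samples=5, noise_scale=0.05,
                        include_first=True, rng=None):
        """Build a state sequence {s_t^(i)} by selectively adding noise.

        first_state: dict mapping element name -> numeric value.
        attributes:  dict mapping element name -> 'dynamic' or 'static'.
        Here noise is added only to dynamic elements, as one example of the
        predetermined condition on attributes.
        """
        rng = np.random.default_rng() if rng is None else rng
        sequence = [dict(first_state)] if include_first else []
        for _ in range(num_samples):
            noisy = {}
            for name, value in first_state.items():
                if attributes.get(name) == "dynamic":
                    noisy[name] = value + rng.normal(0.0, noise_scale)
                else:
                    noisy[name] = value  # static elements are left unchanged
            sequence.append(noisy)
        return sequence

    # Example: a transport-device state with dynamic and static elements.
    s_t = {"x": 2.0, "y": 3.5, "speed": 0.8, "path_x": 1.0}
    attrs = {"x": "dynamic", "y": "dynamic", "speed": "dynamic", "path_x": "static"}
    state_sequence = randomize_state(s_t, attrs)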
 In step S24, the estimation unit 525 calculates the first action-value function Q(s_t^(i), a) according to the state sequence {s_t^(i)}. As an example, the estimation unit 525 calculates the first action-value function Q(s_t^(i), a) for each of the plurality of states s_t^(i) included in the state sequence {s_t^(i)}. More specifically, as an example, the estimation unit 525 updates the first action-value function Q(s_t^(i), a) for the state s_t^(i) according to equation (1) above. In this operation example, the first action-value function Q(s_t^(i), a) is an m-dimensional vector (m is an integer of 2 or more), where m is the number of elements of the set A (that is, the number of types of the action a).
 In step S25, the estimation unit 525 calculates the second action-value function J(s_t, a) based on the plurality of calculated first action-value functions Q(s_t^(i), a). The second action-value function J(s_t, a) is given, as an example, by equation (2) or (3) above. In other words, the estimation unit 525 calculates the second action-value function given by equation (2) or (3) above. The second action-value function given by equation (2) or (3) above is a function whose value becomes lower than the expected value of the first action-value functions Q(s_t^(i), a) as the variation of the plurality of first action-value functions Q(s_t^(i), a) increases.
 In step S26, the selection unit 526 selects an action a according to the second action-value function J(s_t, a) calculated based on the first action-value functions Q(s_t^(i), a). As an example, the selection unit 526 selects the action a by the policy given by equation (4) above. Note that the policy for selecting the action a is not limited to the policy given by equation (4) above, and another policy such as the ε-greedy policy or the softmax method may be used. The selection unit 526 notifies the terminal 40 of the selected action a.
 In step S27, the action execution unit 422 executes the action a notified by the reinforcement learning device 50. In step S28, the reward providing unit 423 provides the reinforcement learning device 50 with the reward r_t obtained by executing the action selected by the reinforcement learning device 50. In step S29, the reward acquisition unit 521 accumulates learning data including the state sequence {s_t^(i)} and the reward r_t.
<Effect of reinforcement learning system>
 In reinforcement learning, the value of the action-value function may differ greatly even when the states differ only slightly. In other words, a small difference in state can have a large impact on the value of the action-value function. In this exemplary embodiment, by deliberately calculating the first action-value function Q using second states obtained by adding slight noise to the first state s_t, a first action-value function Q that takes the variation of the state into account can be calculated. By selecting the action a using this first action-value function Q, the action a can be selected more appropriately according to this exemplary embodiment.
 Furthermore, the reinforcement learning system 2 according to this exemplary embodiment adopts a configuration in which the first action-value function Q is calculated according to a plurality of states s_t^(i) including the second states to which noise has been added. Therefore, according to the reinforcement learning system 2 according to this exemplary embodiment, in addition to the effects of the reinforcement learning system 1 according to exemplary embodiment 1, the effect that a more appropriate action a can be selected is obtained.
 In this exemplary embodiment, when the reinforcement learning system 2 calculates the second action-value function J using equation (2) above, the second action-value function J is an index that is sensitive to risk (variation), including higher-order effects. By the reinforcement learning system 2 selecting the action a using this second action-value function J, an action a that is more risk-sensitive can be selected.
 In this exemplary embodiment, when the reinforcement learning system 2 calculates the second action-value function J using equation (3) above, no overflow occurs in the calculation process because equation (3) does not include an exponential operation. By the reinforcement learning system 2 selecting the action a using this second action-value function J, the action a can be selected more suitably and the processing load associated with selecting the action a can be reduced.
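 The overflow point can be illustrated numerically; the magnitudes below are arbitrary, and the exponential and second-order forms are the ones assumed in the earlier sketch of equations (2) and (3).

    import numpy as np

    # Large-magnitude first action values chosen only to force exp() past the floating-point range.
    q = np.array([-900.0, -1000.0, -1100.0])
    theta = 1.0

    # Exponential form: exp(-theta * Q) overflows to inf, so the result is no longer finite.
    with np.errstate(over="ignore"):
        j_exp = -(1.0 / theta) * np.log(np.mean(np.exp(-theta * q)))

    # Second-order form: uses only the mean and variation of Q, so no overflow occurs.
    j_taylor = q.mean() - 0.5 * theta * q.var()

    print(np.isfinite(j_exp), np.isfinite(j_taylor))  # False True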
[Exemplary embodiment 3]
 Exemplary embodiment 3 of the present invention will be described with reference to the drawings. Components having the same functions as those described in exemplary embodiments 1 and 2 are denoted by the same reference numerals, and their description will not be repeated.
<Configuration of reinforcement learning system>
 The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as the "reinforcement learning system 3") is obtained by applying the reinforcement learning system 2 according to exemplary embodiment 2 to autonomous play of a computer game. The reinforcement learning system 3 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in exemplary embodiment 2 described above. The components of the reinforcement learning system 3 are the same as those of the reinforcement learning system 2, and their description will not be repeated here.
In this exemplary embodiment, the first state s_t includes, as an example, the states of objects that affect the progress of the game in a computer game. The action a includes, as an example, a motion of an object operated by the player of the computer game. The reward r_t includes, as an example, a reward related to winning or losing the game or to the game score.
FIG. 7 is a diagram showing a screen SC1, which is an example of a game screen of the computer game to which the reinforcement learning system 3 relates. The screen SC1 includes a first dynamic object C11, second dynamic objects C21 to C23, first static objects C31 to C34, and a second static object C4. The first dynamic object C11, the second dynamic objects C21 to C23, the first static objects C31 to C34, and the second static object C4 are examples of objects that affect the progress of the game.
The computer game shown in FIG. 7 is a game in which the player designates the moving direction of the first dynamic object C11 moving in the maze, and the round is cleared when the first static objects C31 to C34 placed in the maze are all collected while evading pursuit by the second dynamic objects C21 to C23.
The first dynamic object C11 and the second dynamic objects C21 to C23 are objects that move on the screen while the game is in progress, and are examples of dynamic elements that move within the environment. On the other hand, the first static objects C31 to C34 and the second static object C4 are objects that do not move on the screen while the game is in progress, and are examples of static elements that do not move within the environment. The first dynamic object C11 is the object operated by the player. The first dynamic object C11 moves through the maze at a constant speed during the progress of the game and changes its moving direction according to the player's operation. The second dynamic objects C21 to C23 are objects that move so as to pursue the first dynamic object C11 during the progress of the game. Although three second dynamic objects C21 to C23 are illustrated in FIG. 7, the number of second dynamic objects is not limited to three, and may be more or fewer.
The first static objects C31 to C34 are objects that are placed in the maze and are collected by the first dynamic object C11. When the first dynamic object C11 collides with one of the first static objects C31 to C34, that first static object is collected by the first dynamic object C11. Although four first static objects C31 to C34 are illustrated in FIG. 7, the number of first static objects is not limited to four, and may be more or fewer. The second static object C4 is the walls that form the maze.
In the example of FIG. 7, the first state s_t includes states relating to the first dynamic object C11, the second dynamic objects C21 to C23, the first static objects C31 to C34, and the second static object C4. In other words, the first state includes states relating to dynamic elements that move within the environment and states relating to static elements that do not move within the environment. More specifically, the first state s_t includes the position of the first dynamic object C11, the positions of the second dynamic objects C21 to C23, the positions of the first static objects C31 to C34, and the position of the second static object C4.
In this exemplary embodiment, the first state s_t is an image representing the game play screen.
FIG. 8 is a diagram showing an image Img11, which is an example of the first state s_t. The image Img11 is a grayscale image in which the elements included in the game screen are represented by pixel values from 0 to 255. The image Img11 is divided into a predetermined number of squares, and each square is represented by a pixel value corresponding to the attribute of the element located in that square. As an example, the position of the first dynamic object C11 is represented by a pixel value of 255, the positions of the second dynamic objects C21 to C23 by a pixel value of 160, the positions of the first static objects C31 to C34 by a pixel value of 128, the positions of passages formed by the second static object C4 by a pixel value of 64, and locations that cannot be entered by a pixel value of 0.
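As a minimal sketch (not part of the original description; the helper name encode_state and its arguments are assumptions), the grayscale encoding described above could be produced as follows, using the pixel values given in the text and a square grid such as the 33×33 division mentioned in the flow of the reinforcement learning method below.

```python
import numpy as np

# Pixel values taken from the description above.
PLAYER, CHASER, ITEM, PASSAGE, BLOCKED = 255, 160, 128, 64, 0

def encode_state(grid_size, passages, items, chasers, player):
    """Build the grayscale image used as the first state s_t.

    grid_size: number of squares per side (for example, 33)
    passages:  iterable of (row, col) squares that can be traveled
    items:     iterable of (row, col) positions of first static objects
    chasers:   iterable of (row, col) positions of second dynamic objects
    player:    (row, col) position of the first dynamic object
    """
    img = np.full((grid_size, grid_size), BLOCKED, dtype=np.uint8)
    for rc in passages:
        img[rc] = PASSAGE
    for rc in items:
        img[rc] = ITEM
    for rc in chasers:
        img[rc] = CHASER
    img[player] = PLAYER
    return img
```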
In this exemplary embodiment, the action a is a movement of the first dynamic object C11, of which there are four types: move up, move down, move right, and move left. The reward r_t is, as an example, a predetermined added value (for example, +1) obtained when the score increases, and a predetermined subtracted value (for example, -10) obtained when the first dynamic object is captured by the second dynamic objects C21 to C23. The predetermined added value (for example, +1) may be obtained as the reward r_t whenever an action increases the score, regardless of how much the score increases in a single action.
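A minimal sketch of this reward assignment (the function name and boolean arguments are assumptions made for illustration) is:

```python
def game_reward(score_increased, captured):
    # Reward values from the description above: +1 whenever an action raises
    # the score (regardless of by how much), and -10 when the first dynamic
    # object is captured by a second dynamic object.
    reward = 0
    if score_increased:
        reward += 1
    if captured:
        reward -= 10
    return reward
```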
<Flow of reinforcement learning method>
The reinforcement learning system 3 executes the reinforcement learning method S2 of FIG. 6 according to the second exemplary embodiment described above. The following description mainly covers the operations characteristic of this exemplary embodiment, and the contents already described in the second exemplary embodiment will not be repeated.
In this exemplary embodiment, in step S23, the state randomization unit 523 generates the second state by adding noise to the states of the dynamic elements included in the first state s_t. As an example, the state randomization unit 523 generates the second state by randomizing the position of the first dynamic object C11 and the positions of the second dynamic objects C21 to C23 by a random walk.
More specifically, as an example, the state randomization unit 523 divides the game screen into a predetermined number of squares (for example, 33×33 squares) and, at each step, selects with equal probability whether to advance one square in one of the directions in which movement is possible (forward, backward, left, or right, where a passage exists) or not to advance. The state randomization unit 523 performs a random walk of σ² steps (where σ is an integer equal to or greater than 1) for the position of the first dynamic object C11 and the positions of the second dynamic objects C21 to C23. Performing σ² random-walk steps moves a dynamic element by σ squares on average.
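One reading of this equal-probability random walk is sketched below; the function names and the representation of the maze as a set of passable (row, col) squares are assumptions chosen for illustration.

```python
import random

def random_walk(pos, passable, steps):
    """Randomize one dynamic element's grid position by a random walk.

    pos:      (row, col) starting square
    passable: set of (row, col) squares that can be entered
    steps:    number of random-walk steps, for example sigma ** 2
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    r, c = pos
    for _ in range(steps):
        # Candidates: staying put plus every passable neighbouring square,
        # each chosen with equal probability.
        options = [(r, c)] + [(r + dr, c + dc) for dr, dc in moves
                              if (r + dr, c + dc) in passable]
        r, c = random.choice(options)
    return (r, c)

def randomize_positions(dynamic_positions, passable, sigma):
    # Apply sigma**2 random-walk steps to each dynamic element (the first
    # dynamic object and each second dynamic object).
    return [random_walk(p, passable, sigma ** 2) for p in dynamic_positions]
```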
The state sequence {s_t^(i)} generated by the state randomization unit 523 in step S23 includes the first state s_t and (n−1) second states obtained by randomizing the first state s_t, that is, n states in total. In addition, since there are four types of action a (move up, move down, move right, and move left), the first action-value function Q(s_t^(i), a) calculated by the estimation unit 525 in step S24 is a four-dimensional vector.
In step S26, the selection unit 526 selects, as the action a, one of the four movement directions (up, down, left, and right) for the first dynamic object C11 at an intersection or a corner (that is, a point where the moving direction can be changed). However, the selection unit 526 excludes directions in which the first dynamic object C11 cannot move.
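Putting the pieces together, one hedged sketch of the selection at such a decision point could look like the following; the q_network callable, the action names, and the mean-minus-variance aggregation standing in for the second action-value function J are all illustrative assumptions rather than the disclosed formulas.

```python
import numpy as np

ACTIONS = ["up", "down", "right", "left"]

def select_action(q_network, states, allowed, theta=0.01):
    """Pick a movement direction at an intersection or corner.

    q_network: callable mapping one encoded state to a length-4 array of
               first action values Q(s, a)
    states:    list of n encoded states (the first state plus its
               randomized copies)
    allowed:   action names in which the first dynamic object can move here
    """
    q = np.stack([q_network(s) for s in states])   # shape (n, 4)
    j = q.mean(axis=0) - theta * q.var(axis=0)     # illustrative J
    # Exclude directions in which the object cannot move.
    mask = np.array([a in allowed for a in ACTIONS])
    j = np.where(mask, j, -np.inf)
    return ACTIONS[int(np.argmax(j))]
```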
<Evaluation of this exemplary embodiment>
FIGS. 9 to 12 are diagrams each showing an example of the evaluation results of autonomous play of the computer game by the reinforcement learning system 3. In the computer game according to this exemplary embodiment, the first dynamic object was given a single life, and the game was over when the first dynamic object was captured by a second dynamic object. In addition, a single stage was used, and the game ended when it was cleared, that is, when all the first static objects had been collected.
In the examples of FIGS. 9 to 12, the results of autonomous play of the computer game by the reinforcement learning system 3 were evaluated under multiple conditions in which the values of σ and θ used in the reinforcement learning of the reinforcement learning system 3 were changed. The results of autonomous play by a conventional reinforcement learning method, rather than the reinforcement learning system 3, were also used for comparison. As the conventional reinforcement learning method, a DQN (deep Q-network) method with an improved action selection policy was used.
FIG. 9 is a graph showing the scores of autonomous play when σ=2. As described above, σ corresponds to the average number of squares moved in the random walk. In FIG. 9, the vertical axis indicates the score. A graph g91 indicates the average score of autonomous play by the conventional reinforcement learning. Graphs g11 to g14 represent the average scores of autonomous play by the reinforcement learning system 3. Graphs g11 to g14 differ in the value of the hyperparameter θ in the formula representing the second action-value function J (equation (2) or (3) above). Graphs g11 to g14 represent the average scores when the hyperparameter θ is set to 0, 0.001, 0.01, and 0.1, respectively.
Comparing the graph g91 with the graphs g11 to g14, the scores of the reinforcement learning system 3 according to the present exemplary embodiment are higher than the score of the conventional reinforcement learning, and the score is particularly high when the value of the hyperparameter θ is set to 0.01.
FIG. 10 is a graph showing the collection rate of the first static objects in autonomous play when σ=2. In FIG. 10, the vertical axis indicates the collection rate. A graph g92 indicates the average collection rate of autonomous play by the conventional reinforcement learning. Graphs g21 to g24 represent the average collection rates of autonomous play by the reinforcement learning system 3. Graphs g21 to g24 differ in the value of the hyperparameter θ in the formula representing the second action-value function J (equation (2) or (3) above). Graphs g21 to g24 represent the average collection rates when the hyperparameter θ is set to 0, 0.001, 0.01, and 0.1, respectively.
Comparing the graph g92 with the graphs g21 to g24, the collection rates of the reinforcement learning system 3 according to this exemplary embodiment tend to be higher than the collection rate of the conventional reinforcement learning, and the collection rate is particularly high when the value of the hyperparameter θ is set to 0.01.
FIG. 11 is a graph showing the relationship between the score of autonomous play and σ. In FIG. 11, the horizontal axis indicates σ and the vertical axis indicates the score. Graphs g31 to g34 respectively represent the average scores for σ values of 1 to 5 when the hyperparameter θ is 0, 0.001, 0.01, and 0.1. Note that the average score of autonomous play by the conventional reinforcement learning is 2009.
In the example of FIG. 11, the scores for σ values of 1 to 3 are in many cases higher than the score of the conventional reinforcement learning. In particular, the score for θ=0.01 and σ=2 is higher than the others.
FIG. 12 is a graph showing the relationship between the collection rate in autonomous play and σ. In FIG. 12, the horizontal axis indicates σ and the vertical axis indicates the collection rate. Graphs g41 to g44 respectively represent the average collection rate for each value of σ when the hyperparameter θ is 0, 0.001, 0.01, and 0.1. Note that the average collection rate of autonomous play by the conventional reinforcement learning is 67.5%.
In the example of FIG. 12, the collection rates for σ values of 1 to 3 are in many cases higher than the collection rate of the conventional reinforcement learning. In particular, the collection rate for θ=0.01 and σ=2 is higher than the others.
As described above, according to this exemplary embodiment, the reinforcement learning system 3 calculates the first action-value function using the second state obtained by adding noise to the first state, so that actions in autonomous play of the computer game can be selected more suitably.
[Exemplary embodiment 4]
An exemplary embodiment 4 of the present invention will now be described. Components having the same functions as the components described in exemplary embodiments 1 to 3 are denoted by the same reference numerals, and description thereof will not be repeated.
A reinforcement learning system according to the present exemplary embodiment (hereinafter referred to as "reinforcement learning system 4") is obtained by applying the reinforcement learning system 2 according to the second exemplary embodiment to the control of a construction machine such as an excavator that excavates earth and sand. The reinforcement learning system 4 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of the reinforcement learning system 4 are the same as those of the reinforcement learning system 2, and the description thereof will not be repeated here.
The reinforcement learning system 4 selects, by reinforcement learning, an operation of the construction machine such as an excavation operation performed when a hydraulic excavator excavates earth and sand. As an example, the purpose of the actions in the reinforcement learning is to excavate a full bucket of earth and sand while preventing the vehicle body from tilting or being dragged during excavation.
In this exemplary embodiment, the first state s_t includes, as an example, some or all of the posture and position of a construction machine such as a hydraulic excavator, the shape of the earth and sand to be excavated (3D data or the like), and the amount of earth and sand in the bucket of the excavator. The posture of the construction machine includes, as an example, the angles of the bucket, arm, boom, and rotating body of the construction machine. The position of the construction machine includes, as an example, the position and direction of the crawler of the construction machine.
The action a includes, as an example, posture control of the construction machine (angle control of the bucket, arm, boom, rotating body, and the like). The reward r_t includes, as an example, some or all of a positive reward whose absolute value increases as the amount of excavated material increases, and a negative reward whose absolute value increases as the degree of tilting of the vehicle body of the construction machine, the degree of dragging, or the time required for excavation increases.
The state randomization unit 523 may add noise to all of the plurality of elements included in the first state s_t, or may add noise to only some of the elements. When noise is added to only some of the elements, the elements to which noise is added may include, for example, the posture of the hydraulic excavator and the observed 3D data of the earth and sand.
According to this exemplary embodiment, the reinforcement learning system 4 calculates the first action-value function using the second state obtained by adding noise to the first state s_t, whereby the operation of the construction machine can be selected more suitably.
[Exemplary embodiment 5]
Exemplary Embodiment 5 of the present invention will now be described. Components having the same functions as the components described in exemplary embodiments 1 to 4 are denoted by the same reference numerals, and the description thereof will not be repeated.
A reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 5") is obtained by applying the reinforcement learning system 2 according to the second exemplary embodiment to the control of a transport device that transports loads. The transport device is, as an example, an automated guided vehicle (AGV) that travels autonomously. The reinforcement learning system 5 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of the reinforcement learning system 5 are the same as those of the reinforcement learning system 2, and the description thereof will not be repeated here.
When transporting a load from a predetermined position to another position, the reinforcement learning system 5 selects actions so that the transportation time is as short as possible (the transportation speed is as fast as possible) and there is no contact along the way with static obstacles (shelves, loads, and the like) or dynamic obstacles (people, other robots, and the like).
In this exemplary embodiment, the first state s_t includes, as an example, some or all of the position, moving direction, speed, and angular velocity of the transport device that transports the load, the positions of passages, the positions of static obstacles, and the positions and moving speeds of dynamic obstacles. The action a includes, as an example, speed control and angular velocity control of the transport device. The reward r_t includes, as an example, some or all of a positive reward obtained when transportation is completed, a negative reward obtained upon contact with an obstacle, and a negative reward whose absolute value increases as the transportation time increases.
The state randomization unit 523 may add noise to all of the plurality of elements included in the first state s_t, or may add noise to only some of the elements. When noise is added to only some of the elements, the elements to which noise is added may include, for example, the position, direction, speed, and angular velocity of the transport device, and may also include the positions of static obstacles or the positions and speeds of dynamic obstacles. Further, the state randomization unit 523 may, for example, add noise to obstacles located in the traveling direction of the transport device or on its travel route while refraining from adding noise to obstacles located outside the traveling direction or off the travel route.
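As a hedged sketch (the dictionary layout, the key names, and the use of Gaussian noise are assumptions chosen for illustration; the description above does not fix a particular noise distribution or data structure), selectively perturbing only the transport device's pose and the obstacles on its route could look like this:

```python
import random

def randomize_transport_state(state, sigma, route_only=True):
    """Add noise only to selected elements of the first state.

    state:      dict with 'agv_pose' = (x, y, heading) and 'obstacles',
                a list of dicts each holding 'pos' and an 'on_route' flag
    sigma:      noise scale
    route_only: if True, perturb only obstacles on the planned route
    """
    noisy = dict(state)
    x, y, h = state["agv_pose"]
    noisy["agv_pose"] = (x + random.gauss(0, sigma),
                         y + random.gauss(0, sigma),
                         h + random.gauss(0, sigma))
    noisy_obstacles = []
    for obs in state["obstacles"]:
        if route_only and not obs["on_route"]:
            noisy_obstacles.append(obs)  # leave off-route obstacles untouched
        else:
            ox, oy = obs["pos"]
            noisy_obstacles.append({**obs,
                                    "pos": (ox + random.gauss(0, sigma),
                                            oy + random.gauss(0, sigma))})
    noisy["obstacles"] = noisy_obstacles
    return noisy
```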
According to this exemplary embodiment, the reinforcement learning system 5 calculates the first action-value function using the second state obtained by adding noise to the first state s_t, whereby transport control of the transport device can be performed more suitably.
[Exemplary embodiment 6]
An exemplary embodiment 6 of the present invention will now be described. Components having the same functions as the components described in the exemplary embodiments 1 to 5 are denoted by the same reference numerals, and the description thereof will not be repeated.
A reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 6") is obtained by applying the reinforcement learning system 2 according to the second exemplary embodiment to the control of a forklift. The reinforcement learning system 6 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of the reinforcement learning system 6 are the same as those of the reinforcement learning system 2, and the description thereof will not be repeated here.
When transporting a pallet from a predetermined position to another position, the reinforcement learning system 6 selects actions so that the transportation time is as short as possible (the transportation speed is as fast as possible) and there is no contact along the way with static obstacles (shelves, loads, and the like) or dynamic obstacles (people, other robots, and the like).
In this exemplary embodiment, the first state s_t includes, as an example, some or all of the position, moving direction, speed, and angular velocity of the forklift, the positions of passages, the positions of static obstacles, and the positions and speeds of dynamic obstacles. The action a includes, as an example, speed control and angular velocity control of the forklift. The reward r_t includes, as an example, some or all of a positive reward obtained when transportation is completed, a negative reward obtained upon contact with an obstacle, and a negative reward whose absolute value increases as the transportation time increases.
The state randomization unit 523 may add noise to all of the plurality of elements included in the first state s_t, or may add noise to only some of the elements. When noise is added to only some of the elements, the elements to which noise is added may include, for example, the position, direction, speed, and angular velocity of the forklift, and may also include the positions of static obstacles or the positions and speeds of dynamic obstacles. Further, the state randomization unit 523 may, for example, add noise to obstacles located in the traveling direction of the forklift or on its travel route while refraining from adding noise to obstacles located outside the traveling direction or off the travel route.
According to this exemplary embodiment, the reinforcement learning system 6 calculates the first action-value function using the second state obtained by adding noise to the first state s_t, whereby forklift control can be performed more suitably.
[Exemplary Embodiment 7]
A seventh exemplary embodiment of the present invention will now be described. Components having the same functions as the components described in exemplary embodiments 1 to 6 are denoted by the same reference numerals, and the description thereof will not be repeated.
The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 7") has the same configuration as the reinforcement learning system 2 shown in FIG. 5 in the second exemplary embodiment described above. The components of the reinforcement learning system 7 are the same as those of the reinforcement learning system 2, and the description thereof will not be repeated here.
In this exemplary embodiment, the first state s_t includes a plurality of elements accompanied by attributes. When adding noise to the first state s_t, the state randomization unit 523 varies the weighting of the noise addition depending on the attributes. As an example, the state randomization unit 523 may increase the weighting for dynamic elements that move within the environment and decrease the weighting for static elements that do not move within the environment. As another example, among the dynamic elements that move within the environment, the state randomization unit 523 may give a larger weighting to the position of a person than to the other dynamic elements.
According to this exemplary embodiment, by calculating the first action-value function using the second state to which noise has been added with weighting according to the attributes of the elements, a first action-value function that takes into account attribute-dependent variations in the data can be calculated. By selecting an action using this first action-value function, it is possible to select an action in consideration of attribute-dependent variations in the data.
In addition, in this exemplary embodiment, the state randomization unit 523 may change the weighting of the noise addition while the reinforcement learning is being executed. As an example, the state randomization unit 523 may perform control such that the weighting is increased for a dynamic element that is currently moving in the environment and decreased for a dynamic element that is not currently moving in the environment.
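A minimal sketch of this attribute-dependent weighting, assuming that each element carries an attribute label and a 2D position and that the particular weights and the 1.5/0.5 runtime adjustment factors are arbitrary illustrative values, might be:

```python
import random

# Illustrative weights per attribute; a larger weight means a larger noise scale.
ATTRIBUTE_WEIGHTS = {"person": 2.0, "dynamic": 1.0, "static": 0.2}

def add_weighted_noise(elements, base_sigma, moving=None):
    """Perturb element positions with an attribute-dependent noise scale.

    elements: list of dicts with 'attribute' and 'pos' = (x, y) entries
    moving:   optional set of indices of elements currently moving; their
              weighting can be raised while learning is running
    """
    noisy = []
    for i, el in enumerate(elements):
        weight = ATTRIBUTE_WEIGHTS.get(el["attribute"], 1.0)
        if moving is not None:
            weight *= 1.5 if i in moving else 0.5  # runtime adjustment
        sigma = base_sigma * weight
        x, y = el["pos"]
        noisy.append({**el, "pos": (x + random.gauss(0, sigma),
                                    y + random.gauss(0, sigma))})
    return noisy
```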
[Example of realization by software]
Some or all of the functions of the reinforcement learning device 10, the terminal 20, the server 30, the terminal 40, and the reinforcement learning device 50 (hereinafter referred to as "the reinforcement learning device 10 and the like") may be realized by hardware such as an integrated circuit (IC chip), or may be realized by software.
In the latter case, the reinforcement learning device 10 and the like are realized by, for example, a computer that executes the instructions of a program, which is software for realizing each function. An example of such a computer (hereinafter referred to as computer C) is shown in FIG. 13. The computer C includes at least one processor C1 and at least one memory C2. A program P for operating the computer C as the reinforcement learning device 10 and the like is recorded in the memory C2. In the computer C, the processor C1 reads the program P from the memory C2 and executes it, thereby realizing each function of the reinforcement learning device 10 and the like.
As the processor C1, for example, a CPU (Central Processing Unit), a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used. As the memory C2, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination thereof can be used.
Note that the computer C may further include a RAM (Random Access Memory) into which the program P is loaded at the time of execution and in which various data are temporarily stored. The computer C may further include a communication interface for transmitting and receiving data to and from other devices. The computer C may further include an input/output interface for connecting input/output devices such as a keyboard, a mouse, a display, and a printer.
The program P can be recorded on a non-transitory tangible recording medium M readable by the computer C. As such a recording medium M, for example, a tape, a disk, a card, a semiconductor memory, a programmable logic circuit, or the like can be used. The computer C can acquire the program P via such a recording medium M. The program P can also be transmitted via a transmission medium. As such a transmission medium, for example, a communication network or broadcast waves can be used. The computer C can also acquire the program P via such a transmission medium.
[Appendix 1]
The present invention is not limited to the above-described embodiments, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the embodiments described above are also included in the technical scope of the present invention.
[Appendix 2]
Some or all of the above-described embodiments may also be described as follows. However, the present invention is not limited to the embodiments described below.
(Appendix 1)
A reinforcement learning system comprising:
acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state; and
selection means for selecting an action according to the first action-value function.
According to the above configuration, a more suitable action can be selected by calculating the first action-value function using the second state obtained by adding noise to the first state.
(Appendix 2)
The reinforcement learning system according to Appendix 1, wherein the calculating means calculates the first action-value function according to the first state and the second state.
According to the above configuration, a more suitable action can be selected by calculating the first action-value function using a plurality of states including the second state to which noise has been added.
(Appendix 3)
The reinforcement learning system according to Appendix 2, wherein the calculating means calculates the first action-value function for each of the first state and the second state, and the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
According to the above configuration, a more suitable action can be selected by using the second action-value function calculated from the plurality of first action-value functions.
(Appendix 4)
The reinforcement learning system according to any one of Appendices 1 to 3, wherein the first state includes at least one of the position, moving direction, speed, and angular velocity of a transport device that transports an object, the position of a passage, and the position and speed of a static or dynamic obstacle.
According to the above configuration, the transport operation of the transport device can be selected more suitably by reinforcement learning.
(Appendix 5)
The reinforcement learning system according to any one of Appendices 1 to 3, wherein the first state includes at least one of the posture and position of a construction machine, the shape of earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
According to the above configuration, the construction operation of the construction machine can be selected more suitably by reinforcement learning.
(Appendix 6)
The reinforcement learning system according to any one of Appendices 1 to 5, wherein the first state includes a plurality of elements accompanied by attributes, and the generating means generates the second state by selectively adding noise to the plurality of elements included in the first state according to the attributes.
According to the above configuration, a first action-value function that takes into account variations in the data of elements associated with attributes satisfying a predetermined condition can be calculated.
(Appendix 7)
The reinforcement learning system according to Appendix 6, wherein the first state includes a state relating to a dynamic element that moves within the environment, and the generating means generates the second state by adding noise to the state of the dynamic element included in the first state.
According to the above configuration, a first action-value function that takes into account variations in the data of the dynamic element can be calculated.
(Appendix 8)
A reinforcement learning device comprising:
acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state; and
selection means for selecting an action according to the first action-value function.
According to the above configuration, a more suitable action can be selected by calculating the first action-value function using the second state obtained by adding noise to the first state.
(Appendix 9)
The reinforcement learning device according to Appendix 8, wherein the calculating means calculates the first action-value function according to the first state and the second state.
According to the above configuration, a more suitable action can be selected by calculating the first action-value function using a plurality of states including the second state to which noise has been added.
(Appendix 10)
The reinforcement learning device according to Appendix 9, wherein the calculating means calculates the first action-value function for each of the plurality of states included in the state sequence, and the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
According to the above configuration, a more suitable action can be selected by using the second action-value function calculated from the plurality of first action-value functions.
(Appendix 11)
The reinforcement learning device according to any one of Appendices 8 to 10, wherein the first state includes at least one of the position, moving direction, speed, and angular velocity of a transport device that transports an object, the position of a passage, and the position and speed of a static or dynamic obstacle.
According to the above configuration, the transport operation of the transport device can be selected more suitably by reinforcement learning.
(Appendix 12)
The reinforcement learning device according to any one of Appendices 8 to 10, wherein the first state includes at least one of the posture and position of a construction machine, the shape of earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
According to the above configuration, the construction operation of the construction machine can be selected more suitably by reinforcement learning.
(Appendix 13)
The reinforcement learning device according to any one of Appendices 8 to 12, wherein the first state includes a plurality of elements accompanied by attributes, and the generating means generates the second state by selectively adding noise to the plurality of elements included in the first state according to the attributes.
According to the above configuration, a first action-value function that takes into account variations in the data of elements associated with attributes satisfying a predetermined condition can be calculated.
(Appendix 14)
The reinforcement learning device according to Appendix 13, wherein the first state includes a state relating to a dynamic element that moves within the environment, and the generating means generates the second state by adding noise to the state of the dynamic element included in the first state.
According to the above configuration, a first action-value function that takes into account variations in the data of the dynamic element can be calculated.
(Appendix 15)
A reinforcement learning method comprising:
acquiring a first state in an environment that is a target of reinforcement learning;
generating a second state by adding noise to the first state;
calculating a first action-value function according to the second state; and
selecting an action according to the first action-value function.
According to the above configuration, a more suitable action can be selected by calculating the first action-value function using the second state obtained by adding noise to the first state.
(Appendix 16)
The reinforcement learning method according to Appendix 15, wherein, in calculating the first action-value function, the first action-value function is calculated according to the first state and the second state.
According to the above configuration, a more suitable action can be selected by calculating the first action-value function using a plurality of states including the second state to which noise has been added.
(Appendix 17)
The reinforcement learning method according to Appendix 16, wherein, in calculating the first action-value function, the first action-value function is calculated for each of the plurality of states included in the state sequence, and, in selecting the action, the action is selected according to a second action-value function calculated based on a plurality of the first action-value functions.
According to the above configuration, a more suitable action can be selected by using the second action-value function calculated from the plurality of first action-value functions.
(Appendix 18)
The reinforcement learning method according to any one of Appendices 15 to 17, wherein the first state includes at least one of the position, moving direction, speed, and angular velocity of a transport device that transports an object, the position of a passage, and the position and speed of a static or dynamic obstacle.
According to the above configuration, the transport operation of the transport device can be selected more suitably by reinforcement learning.
(Appendix 19)
The reinforcement learning method according to any one of Appendices 15 to 17, wherein the first state includes at least one of the posture and position of a construction machine, the shape of earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
According to the above configuration, the construction operation of the construction machine can be selected more suitably by reinforcement learning.
(Appendix 20)
The reinforcement learning method according to any one of Appendices 15 to 19, wherein the first state includes a plurality of elements accompanied by attributes, and, in generating the second state, the second state is generated by selectively adding noise to the plurality of elements included in the first state according to the attributes.
According to the above configuration, a first action-value function that takes into account variations in the data of elements associated with attributes satisfying a predetermined condition can be calculated.
(Appendix 21)
The reinforcement learning method according to Appendix 20, wherein the first state includes a state relating to a dynamic element that moves within the environment, and, in generating the second state, the second state is generated by adding noise to the state of the dynamic element included in the first state.
According to the above configuration, a first action-value function that takes into account variations in the data of the dynamic element can be calculated.
(Appendix 22)
The reinforcement learning system according to any one of Appendices 1 to 5, wherein the first state includes a plurality of elements accompanied by attributes, and the generating means varies the weighting of the noise addition depending on the attributes.
According to the above configuration, by using the second state to which noise has been added with weighting according to the attributes of the elements, a first action-value function that takes into account attribute-dependent variations in the data can be calculated.
(Appendix 23)
The reinforcement learning system according to any one of Appendices 1 to 6 and Appendix 19, wherein the first state includes some or all of the posture and position of a construction machine, the shape of earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator, and the action includes posture control of the construction machine.
According to the above configuration, by calculating the first action-value function using the second state obtained by adding noise to the first state, the excavating operation of the excavator can be selected more suitably by reinforcement learning.
(Appendix 24)
The reinforcement learning system according to any one of Appendices 1 to 6 and Appendix 19, wherein the first state includes some or all of the position, moving direction, speed, and angular velocity of a transport device that transports an object, the position of a passage, and the position and speed of a static or dynamic obstacle, and the action includes speed control and angular velocity control of the transport device.
According to the above configuration, by calculating the first action-value function using the second state obtained by adding noise to the first state, the transport operation of the transport device can be selected more suitably by reinforcement learning.
(Appendix 25)
The reinforcement learning system according to any one of Appendices 1 to 6 and Appendix 19, wherein the first state includes a state of an object that affects the progress of a game in a computer game, and the action includes a motion of an object operated by a player of the computer game.
According to the above configuration, by calculating the first action-value function using the second state obtained by adding noise to the first state, the motions of the object in autonomous play of the computer game can be selected more suitably.
(Appendix 26)
A program for causing a computer to function as a reinforcement learning device, the program causing the computer to function as:
acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
generating means for generating a second state by adding noise to the first state;
calculating means for calculating a first action-value function according to the second state; and
selection means for selecting an action according to the first action-value function.
(Appendix 27)
The program according to Appendix 26, wherein the calculating means calculates the first action-value function according to the first state and the second state.
(Appendix 28)
The program according to Appendix 27, wherein the calculating means calculates the first action-value function for each of the first state and the second state, and the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
(Appendix 29)
The program according to any one of Appendices 26 to 28, wherein the first state includes at least one of the position, moving direction, speed, and angular velocity of a transport device that transports an object, the position of a passage, and the position and speed of a static or dynamic obstacle.
(Appendix 30)
The program according to any one of Appendices 26 to 28, wherein the first state includes at least one of the posture and position of a construction machine, the shape of earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
(Appendix 31)
The program according to any one of Appendices 26 to 30, wherein the first state includes a plurality of elements accompanied by attributes, and the generating means generates the second state by selectively adding noise to the plurality of elements included in the first state according to the attributes.
(Appendix 32)
The program according to Appendix 27, wherein the first state includes a state relating to a dynamic element that moves within the environment, and the generating means generates the second state by adding noise to the state of the dynamic element included in the first state.
[Appendix 3]
Some or all of the embodiments described above can also be expressed as follows.
A reinforcement learning device comprising at least one processor, the processor executing:
an acquisition process of acquiring a first state in an environment that is a target of reinforcement learning;
a generation process of generating a second state by adding noise to the first state;
a calculation process of calculating a first action-value function according to the second state; and
a selection process of selecting an action according to the first action-value function.
 The reinforcement learning device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process, the generation process, the calculation process, and the selection process. The program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
1, 2, 3, 4, 5, 6, 7 Reinforcement learning system
10, 50 Reinforcement learning device
11 Acquisition unit
12 Generation unit
13 Calculation unit
14, 526 Selection unit
20, 40 Terminal
30 Server
41, 51 Communication unit
42, 52 Control unit
43 Input reception unit
53 Storage unit
421 State provision unit
422 Action execution unit
423 Reward provision unit
521 Reward acquisition unit
522 State observation unit
523 State randomization unit
524 Learning unit
525 Estimation unit

Claims (21)

  1.  A reinforcement learning system comprising:
     acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
     generation means for generating a second state by adding noise to the first state;
     calculation means for calculating a first action-value function according to the second state; and
     selection means for selecting an action according to the first action-value function.
  2.  The reinforcement learning system according to claim 1, wherein the calculation means calculates the first action-value function according to the first state and the second state.
  3.  The reinforcement learning system according to claim 2, wherein the calculation means calculates the first action-value function for each of the first state and the second state, and the selection means selects the action according to a second action-value function calculated based on the plurality of first action-value functions.
  4.  The reinforcement learning system according to any one of claims 1 to 3, wherein the first state includes at least one of the position, moving direction, speed, and angular velocity of a conveying device that conveys a conveyed object, the position of a passage, and the position and speed of a static or dynamic obstacle.
  5.  The reinforcement learning system according to any one of claims 1 to 3, wherein the first state includes at least one of the posture and position of a construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
  6.  The reinforcement learning system according to any one of claims 1 to 5, wherein the first state includes a plurality of elements each accompanied by an attribute, and the generation means generates the second state by selectively adding noise, according to the attributes, to the plurality of elements included in the first state.
  7.  The reinforcement learning system according to claim 6, wherein the first state includes a state of a dynamic element that moves within the environment, and the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.
  8.  A reinforcement learning device comprising:
     acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
     generation means for generating a second state by adding noise to the first state;
     calculation means for calculating a first action-value function according to the second state; and
     selection means for selecting an action according to the first action-value function.
  9.  The reinforcement learning device according to claim 8, wherein the calculation means calculates the first action-value function according to the first state and the second state.
  10.  The reinforcement learning device according to claim 9, wherein the calculation means calculates the first action-value function for each of the first state and the second state, and the selection means selects the action according to a second action-value function calculated based on the plurality of first action-value functions.
  11.  The reinforcement learning device according to any one of claims 8 to 10, wherein the first state includes at least one of the position, moving direction, speed, and angular velocity of a conveying device that conveys a conveyed object, the position of a passage, and the position and speed of a static or dynamic obstacle.
  12.  The reinforcement learning device according to any one of claims 8 to 10, wherein the first state includes at least one of the posture and position of a construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
  13.  The reinforcement learning device according to any one of claims 8 to 12, wherein the first state includes a plurality of elements each accompanied by an attribute, and the generation means generates the second state by selectively adding noise, according to the attributes, to the plurality of elements included in the first state.
  14.  The reinforcement learning device according to claim 13, wherein the first state includes a state of a dynamic element that moves within the environment, and the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.
  15.  A reinforcement learning method comprising:
     acquiring a first state in an environment that is a target of reinforcement learning;
     generating a second state by adding noise to the first state;
     calculating a first action-value function according to the second state; and
     selecting an action according to the first action-value function.
  16.  The reinforcement learning method according to claim 15, wherein, in calculating the first action-value function, the first action-value function is calculated according to the first state and the second state.
  17.  The reinforcement learning method according to claim 16, wherein, in calculating the first action-value function, the first action-value function is calculated for each of the first state and the second state, and, in selecting the action, the action is selected according to a second action-value function calculated based on the plurality of first action-value functions.
  18.  The reinforcement learning method according to any one of claims 15 to 17, wherein the first state includes at least one of the position, moving direction, speed, and angular velocity of a conveying device that conveys a conveyed object, the position of a passage, and the position and speed of a static or dynamic obstacle.
  19.  The reinforcement learning method according to any one of claims 15 to 17, wherein the first state includes at least one of the posture and position of a construction machine, the shape of the earth and sand to be excavated, and the amount of earth and sand in the bucket of an excavator.
  20.  The reinforcement learning method according to any one of claims 15 to 19, wherein the first state includes a plurality of elements each accompanied by an attribute, and, in generating the second state, the second state is generated by selectively adding noise, according to the attributes, to the plurality of elements included in the first state.
  21.  The reinforcement learning method according to claim 20, wherein the first state includes a state of a dynamic element that moves within the environment, and, in generating the second state, the second state is generated by adding noise to the state of the dynamic element included in the first state.
PCT/JP2021/033360 2021-09-10 2021-09-10 Reinforced learning system, reinforced learning device, and reinforced learning method WO2023037504A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/033360 WO2023037504A1 (en) 2021-09-10 2021-09-10 Reinforced learning system, reinforced learning device, and reinforced learning method
JP2023546676A JPWO2023037504A5 (en) 2021-09-10 Reinforcement learning system, reinforcement learning device, reinforcement learning method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/033360 WO2023037504A1 (en) 2021-09-10 2021-09-10 Reinforced learning system, reinforced learning device, and reinforced learning method

Publications (1)

Publication Number Publication Date
WO2023037504A1 true WO2023037504A1 (en) 2023-03-16

Family

ID=85506183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/033360 WO2023037504A1 (en) 2021-09-10 2021-09-10 Reinforced learning system, reinforced learning device, and reinforced learning method

Country Status (1)

Country Link
WO (1) WO2023037504A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019143385A (en) * 2018-02-21 2019-08-29 清水建設株式会社 Estimation device and estimation method
JP2020052513A (en) * 2018-09-25 2020-04-02 本田技研工業株式会社 Model parameter learning device, control device and model parameter learning method
US20200159878A1 (en) * 2018-11-16 2020-05-21 Starkey Laboratories, Inc. Ear-wearable device shell modeling
JP2020091757A (en) * 2018-12-06 2020-06-11 富士通株式会社 Reinforcement learning program, reinforcement learning method, and reinforcement learning device
JP2020091611A (en) * 2018-12-04 2020-06-11 富士通株式会社 Action determination program, action determination method, and action determination device
JP2020177416A (en) * 2019-04-17 2020-10-29 株式会社日立製作所 Machine automatic operation control method and system
JP2021077326A (en) * 2019-11-07 2021-05-20 ネイバー コーポレーションNAVER Corporation Training system and method for visual navigation, and navigation robot

Also Published As

Publication number Publication date
JPWO2023037504A1 (en) 2023-03-16

Similar Documents

Publication Publication Date Title
KR102296507B1 (en) Method for tracking object by using convolutional neural network including tracking network and computing device using the same
CN110462628B (en) Method and system for estimating operation of working vehicle, method for manufacturing classification model, learning data, and method for manufacturing learning data
CN110779656B (en) Machine stability detection and control system
JP5172326B2 (en) System and method for adaptive path planning
US9361590B2 (en) Information processing apparatus, information processing method, and program
CN111971691A (en) Graph neural network representing a physical system
US11409287B2 (en) Neural task planner for autonomous vehicles
CN114859911B (en) Four-foot robot path planning method based on DRL
US11068787B2 (en) Training neural networks using evolution based strategies and novelty search
KR20220154785A (en) Learning options for action selection using meta-gradients in multi-task reinforcement learning
JP7297842B2 (en) Methods and systems that use trained models based on parameters indicative of risk measures to determine device behavior for given situations
CN115848365B (en) Vehicle controller, vehicle and vehicle control method
WO2019129355A1 (en) Method for predicting a motion of an object, method for calibrating a motion model, method for deriving a predefined quantity and method for generating a virtual reality view
Klein Data-driven meets navigation: Concepts, models, and experimental validation
Hasan et al. Automatic estimation of inertial navigation system errors for global positioning system outage recovery
CN116645396A (en) Track determination method, track determination device, computer-readable storage medium and electronic device
WO2023037504A1 (en) Reinforced learning system, reinforced learning device, and reinforced learning method
US20240019250A1 (en) Motion estimation apparatus, motion estimation method, path generation apparatus, path generation method, and computer-readable recording medium
Jaafra et al. Robust reinforcement learning for autonomous driving
Langaa et al. Expert initialized reinforcement learning with application to robotic assembly
EP4288905A1 (en) Neural network reinforcement learning with diverse policies
JP6640615B2 (en) Orbit calculation device and orbit calculation program
KR102261055B1 (en) Method and system for optimizing design parameter of image to maximize click through rate
Jaafra et al. Seeking for robustness in reinforcement learning: application on Carla simulator
JP3960286B2 (en) Adaptive controller, adaptive control method, and adaptive control program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21956801

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023546676

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21956801

Country of ref document: EP

Kind code of ref document: A1