WO2021186500A1 - Learning device, learning method, and recording medium - Google Patents
Learning device, learning method, and recording medium
- Publication number
- WO2021186500A1 (PCT/JP2020/011465)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- the present invention relates to a learning device and the like that learn control content for controlling a controlled object.
- Machine learning is used in various situations such as image recognition and machine control. Machine learning is attracting attention and is being enthusiastically developed as it has the potential to realize complex and advanced decision-making that is difficult to achieve with human design.
- Reinforcement learning, for example, realizes decision-making beyond the human level in systems that automatically determine the behavior of a computer player in a game. Reinforcement learning also realizes complicated movements that are difficult for humans to design in systems that automatically determine the movements of a robot system.
- the framework for executing reinforcement learning includes the target system itself (or an environment that simulates the target system) and an agent that determines the behavior of the target system.
- learning data is a set of action, observation, and reward.
- the reward is given, for example, according to the similarity between the state of the target system and the desired state. In this case, the higher the similarity between the state of the target system and the desired state, the higher the reward. The lower the similarity between the state of the target system and the desired state, the lower the reward.
- Observations and rewards are obtained from the environment each time an agent acts.
- in reinforcement learning, the agent acts by trial and error, searching over various behaviors so that the reward obtained by acting is high.
- the learning is to iteratively update the policy, which is a mathematical model that determines the behavior of the agent, using the learning data obtained by the search.
- the policy is updated so that the cumulative reward that can be earned by the series of actions, from the start of the actions to their completion, is maximized.
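The cumulative-reward objective can be illustrated with a minimal sketch; the discount factor and the episode rewards below are hypothetical, not taken from the patent:

```python
def cumulative_reward(rewards, gamma=0.99):
    """Discounted sum of the rewards earned from the start of the
    action sequence to its completion (hypothetical discount gamma)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# rewards observed at each step of one hypothetical episode
episode = [0.0, 0.5, 1.0]
total = cumulative_reward(episode)  # 0.0 + 0.99*0.5 + 0.99**2*1.0
```

The policy update searches for policy parameters that make this quantity large in expectation.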
- the system disclosed in Patent Document 1 has a user interface that allows parameters to be changed during learning calculation. More specifically, the system disclosed in Patent Document 1 has a user interface that can change the weighting coefficient of each evaluation index constituting the reward function in the middle of the learning calculation. When the system detects that learning has stagnated, it alerts the user to change the weighting factor.
- the system disclosed in Patent Document 2 includes a calculation process that changes parameters for the environment each time a learning calculation in reinforcement learning is executed. Specifically, the system determines whether or not to change the parameter based on the learning result, and when it is determined to change, adjusts the parameter of the environment by the update amount preset by the user.
- the system includes a teacher agent that changes the probability distribution of the parameters of the environment with respect to the reinforcement learning agent (referred to as a student agent here).
- the teacher agent performs a machine-learning calculation based on the learning status of the student agent and the corresponding environment parameters, and computes a probability distribution over the environment parameters from which a higher learning status can be obtained.
- the teacher agent performs a clustering calculation using a Gaussian mixture model.
- the teacher agent updates the probability distribution of the parameters of the environment by selecting one from the plurality of normal distributions obtained by clustering based on the bandit algorithm.
- One of the objects of the present invention is to provide a learning device or the like capable of efficient learning.
- the learning device learns a policy for determining the control content of a target system. It includes: a determination means that, according to the policy, determines from observation information about the target system the control to be applied to the target system and the difficulty level to be set for the target system, using a difficulty level corresponding to a high evaluation of the control content with respect to the state transition method of the target system; a learning progress calculation means that calculates the learning progress of the policy using a plurality of original evaluations of the states before and after a transition of the target system under the determined control and the determined difficulty level; a calculation means that calculates a revised evaluation using the original evaluation, the determined difficulty level, and the learning progress; and an update means that updates the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
- the learning method is a method by which a computer learns a policy for determining the control content of a target system. According to the policy, the computer determines, from observation information about the target system, the control to be applied to the target system and the difficulty level to be set for the target system, using a difficulty level corresponding to a high evaluation of the control content with respect to the state transition method of the target system. The computer then calculates the learning progress of the policy using a plurality of original evaluations of the states before and after a transition of the target system under the determined control and the determined difficulty level, calculates a revised evaluation using the original evaluation, the determined difficulty level, and the learning progress, and updates the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
- the learning program is a program for learning a policy that determines the control content of a target system. It causes a computer to execute: a process of determining, according to the policy and from observation information about the target system, the control to be applied to the target system and the difficulty level to be set for the target system, using a difficulty level corresponding to a high evaluation of the control content with respect to the state transition method of the target system; a process of calculating the learning progress of the policy using a plurality of original evaluations of the states before and after a transition of the target system under the determined control and the determined difficulty level; a process of calculating a revised evaluation using the original evaluation, the determined difficulty level, and the learning progress; and a process of updating the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
- the inventor of the present application has found a problem in the techniques described in Patent Document 1 and Patent Document 2, in which a user sets parameters in detail according to the learning situation.
- these techniques receive parameters from the user, for example, but the inventor found the problem that the user cannot always set the parameters appropriately.
- learning efficiency is lowered when parameters cannot be set appropriately.
- the inventor has found such a problem and has come to derive a means for solving the problem.
- Curriculum learning is a machine learning method, modeled on the human learning process of learning the easy before the difficult, in which a task with a high degree of difficulty is learned by starting from a task with a low degree of difficulty.
- a low difficulty task represents, for example, a task with a high probability of success or a high expected achievement.
- a task with a high degree of difficulty represents, for example, a task that realizes a desired state or desired control.
- FIG. 1 is a schematic block diagram showing a configuration of a learning system 1 including a learning device 100 according to a first embodiment of the present invention.
- the learning system 1 is roughly divided into a learning device 100, an environment device 200, and a user interface (hereinafter referred to as "user I/F") 300.
- the learning device 100 includes a learning unit 110, a learning data acquisition unit 120, and an input / output control unit 130.
- the learning unit 110 includes a policy updating unit 111, a learning setting storage unit 112, a learning data storage unit 113, and a policy storage unit 114.
- the learning data acquisition unit 120 includes an agent calculation unit 121, an agent setting storage unit 122, a conversion unit 123, and a conversion setting storage unit 124.
- the environment device 200 has an environment unit 210.
- the environment unit 210 executes the processing of the environment device 200.
- the learning device 100 is communicably connected to the environment device 200 and the user I / F 300 via a communication line.
- the communication line may be, for example, a dedicated line, the Internet, a VPN (Virtual Private Network), a LAN (Local Area Network), USB (Universal Serial Bus), Wi-Fi (registered trademark), Bluetooth (registered trademark), or the like, and may take any form regardless of whether the line is dedicated or shared and regardless of its physical form, such as wired or wireless.
- the learning device 100 generates a policy that is a model for determining the control content for operating the target system such as the controlled object as desired according to the learning process as described later.
- the learning device 100 generates a policy that realizes processing as a controller of the target system. That is, the learning device 100 also functions as a control device for controlling the controlled object. Therefore, the user can, for example, design and implement the controller of the target system by generating the policy using the learning device 100.
- the target system is a system to be controlled.
- the target system is a system that controls individual devices that make up the system, such as a robot system.
- the target system may be a system that controls an object or an instance in a program, such as a game system.
- the target system is not limited to these examples.
- the control in the robot system is, for example, angular velocity control or torque control of each joint of the arm-type robot.
- the control may be, for example, motor control of each module of the humanoid robot.
- the control may be, for example, rotor control of a flying robot.
- the control in the game system is, for example, automatic operation of a computer player, adjustment of the difficulty level of the game, and the like. Some examples of control have been given, but control is not limited to these examples.
- the environment device 200 is the target system itself or a simulation system that simulates the target system.
- the simulation system is, for example, a hardware emulator, a software emulator, a hardware simulator, a software simulator, or the like of the target system.
- the simulation system is not limited to these examples.
- the target system is an arm-type robot
- the control is pick-and-place: a series of control tasks in which an end effector attached to the tip of the arm-type robot approaches an object, grips it, transports it, and places it in a designated location.
- the simulation system is, for example, a system that executes software simulation in which CAD (Computer Aided Design) data of an arm-type robot and a physics engine, which is software capable of performing numerical calculation of dynamics, are combined.
- calculation processing is executed on a computer such as a personal computer (PC) or a workstation (WorkStation).
- the configuration of the learning system 1 is not limited to the configuration shown in FIG.
- the learning device 100 may have the environment unit 210. Specifically, when a system that simulates the target system is used and a software emulator or software simulator is employed, the learning device 100 may include the environment unit 210 that executes the processing related to that software emulator or software simulator.
- the user I / F 300 receives operations such as setting the learning device 100, executing the learning process, and writing the policy from the outside.
- the user I / F 300 is, for example, a computer such as a personal computer, a workstation, a tablet, or a smartphone.
- the user I / F 300 may be an input device such as a keyboard, a mouse, or a touch panel display.
- the user I / F 300 is not limited to these examples.
- the input / output control unit 130 receives, from the user I/F 300, operation commands such as setting the learning device 100, executing the learning process, and writing out the policy.
- the input / output control unit 130 issues an operation command to the learning setting storage unit 112, the policy storage unit 114, the agent setting storage unit 122, the conversion setting storage unit 124, and the like according to the operation command received from the user I / F 300.
- the learning setting storage unit 112 stores the settings related to policy learning in the policy update unit 111 according to the operation command received from the input / output control unit 130.
- the settings related to policy learning are, for example, hyperparameters related to learning.
- the policy update unit 111 reads the settings related to the policy learning from the learning setting storage unit 112 at the time of the policy update process.
- the agent setting storage unit 122 stores the settings related to the learning data acquisition process in the agent calculation unit 121 according to the operation command received from the input / output control unit 130.
- the settings related to the learning data acquisition process are, for example, hyperparameters related to the learning data acquisition process.
- the agent calculation unit 121 reads the settings related to the learning data acquisition process from the agent setting storage unit 122 during the learning data acquisition process.
- the conversion setting storage unit 124 stores the settings related to the conversion process in the conversion unit 123 according to the operation command received from the input / output control unit 130.
- the settings related to the conversion process are, for example, hyperparameters related to the conversion process.
- the conversion unit 123 reads the settings related to the conversion process from the conversion setting storage unit 124 during the learning data acquisition process.
- the learning device 100 communicates with the environment device 200 according to the settings input by the user via the user I / F 300, and executes the learning calculation process using the learning data acquired via the communication. As a result, the learning device 100 generates a policy.
- the learning device 100 is realized by, for example, a computer such as a personal computer or a workstation.
- a policy is a parameterized model with high approximation capability.
- the model parameters of the policy can be computed by a learning calculation.
- the policy is realized using a learnable model such as a neural network. The policy is not limited to this.
- the input to the policy is an observation that can be measured for the target system.
- the inputs to the policy are, for example, the angle of each joint of the robot, the angular velocity of each joint, the torque of each joint, image data from a camera attached for recognizing the surrounding environment, and point cloud data acquired by LIDAR (Laser Imaging Detection and Ranging).
- the input to the policy is not limited to these examples.
- the output from the policy is the behavior for the environment, that is, the control input value that can control the target system.
- the output from the policy is the target speed of each joint of the robot, the target angular velocity of each joint, the input torque of each joint, and the like.
- the output from the policy is not limited to these examples.
- the learning of the policy is executed according to the reinforcement learning algorithm.
- the reinforcement learning algorithm is, for example, the policy gradient method. More specifically, the reinforcement learning algorithm is an algorithm such as DDPG (Deep Deterministic Policy Gradient), PPO (Proximal Policy Optimization), or SAC (Soft Actor Critic).
- the reinforcement learning algorithm is not limited to these examples, and may be any algorithm that can learn the policy serving as the controller of the target system.
- FIG. 2 is a block diagram showing a functional configuration of reinforcement learning.
- the agent 401 inputs the observation o that can be acquired from the environment 402 into the policy, and calculates the output for the input observation o. In other words, the agent 401 calculates the action a with respect to the input observation o. The agent 401 inputs the calculated action a into the environment 402.
- the state of the environment 402 transitions through a predetermined time step according to the input action a.
- the environment 402 calculates the observation o and the reward r regarding the state after the transition, respectively, and outputs the calculated observation o and the reward r to a device such as the agent 401.
- the reward r is a numerical value indicating the goodness (or preference) of the control of the action a with respect to the state of the environment 402.
- the agent 401 stores a set of the observation o input to the policy, the action a input to the environment 402, and the reward r output from the environment 402 as learning data.
- the agent 401 stores the pair of the observation o, which is the basis for calculating the action a, the action a, and the reward r for the action a as learning data.
- the agent 401 uses the observation o received from the environment 402 to execute the same process as the above-described process, such as the process of calculating the action a.
- when as much learning data as the learning calculation requires has been acquired, the policy update unit 111 updates the policy using the learning data according to a reinforcement learning algorithm such as the policy gradient method.
- the agent 401 acquires the learning data according to the policy updated by the policy updating unit 111.
- the processing of the policy update unit 111 and the processing of the agent 401, which corresponds to the learning data acquisition unit 120 in FIG. 1, are executed alternately or in parallel.
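The interaction cycle described above (observe, act, receive a reward, store the triple) can be sketched as follows; the toy environment and policy are hypothetical stand-ins for the environment 402 and agent 401, not part of the patent:

```python
class ToyEnv:
    """Hypothetical stand-in for the environment 402: the state is a step
    counter, and the reward is 1.0 when the action matches the state parity."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1  # state transition over one time step
        reward = 1.0 if action == self.state % 2 else 0.0
        return self.state, reward  # observation o and reward r

def policy(observation):
    # trivial deterministic policy: predict the parity of the next state
    return (observation + 1) % 2

env = ToyEnv()
learning_data = []
obs = 0
for _ in range(3):
    act = policy(obs)
    next_obs, rew = env.step(act)
    # store the observation that was the basis for the action, together
    # with the action a and the reward r, as one learning-data sample
    learning_data.append((obs, act, rew))
    obs = next_obs
```

In the full system this collection loop alternates (or runs in parallel) with the policy update step.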
- FIG. 3 is a diagram conceptually representing the processing in the learning device 100 according to the first embodiment.
- the learning device 100 executes processing according to the reinforcement learning method while adjusting a parameter representing the difficulty level (hereinafter referred to as the "difficulty parameter").
- the difficulty level is a numerical value or a numerical value group related (or correlated) with the probability of obtaining a reward in the reinforcement learning method.
- the difficulty level may be a numerical value or a group of numerical values related to (or correlated with) the expected value of the reward obtained in the reinforcement learning method.
- the lower the difficulty level, the higher the probability of obtaining a reward, or the higher the expected value of the reward to be obtained.
- the higher the difficulty level, the lower the probability of obtaining a reward, or the lower the expected value of the reward.
- the lower the difficulty level, the farther the environmental conditions are from the desired environmental conditions.
- the higher the difficulty level, the closer the environmental conditions are to the desired environmental conditions.
- the difficulty parameter represents, for example, how low the probability is that the agent obtains a reward, or how low the expected value of the reward obtained by the agent is.
- the difficulty level parameter is a parameter related to the state transition method of the environment.
- the agent 501 can efficiently acquire learning data by calculating the action a and the difficulty level d in a single process (the calculation of the "extended action" described later) according to one common policy. The reason is that the agent 501 determines the combination of action and difficulty level so that the obtained reward is high, which prevents the situation where no reward is obtained because the difficulty level is set too high. In addition, compared with a method that fixes the difficulty level while acquiring learning data, the difficulty level is adjusted every time the agent 501 calculates an action, so an appropriate difficulty level can be set in a fine-grained manner according to the state of the environment 502.
- since the agent 501 adjusts the difficulty level according to the learning progress as described above, it can acquire the learning data efficiently.
- the learning progress represents a numerical value or a numerical value group related to the cumulative reward expected to be acquired by the agent 501 by the policy at the time of learning data acquisition.
- the larger the numerical value or the numerical value group the later the learning progress.
- the smaller the numerical value or the numerical value group the earlier the learning progress.
- the agent 501 can realize efficient reinforcement learning by setting the difficulty level lower as the learning progress is earlier and setting the difficulty level higher as the learning progress is later. That is, the agent 501 can realize efficient reinforcement learning by adjusting the difficulty level according to the learning progress.
- the learning progress is a numerical value or a group of numerical values related (or linked or correlated) with the probability that the agent 501 will obtain a reward.
- the learning progress is a numerical value or a group of numerical values related to (or linked to or correlated with) the expected value of the reward acquired by the agent 501.
- reinforcement learning with the difficulty adjustment function (see FIG. 3) differs from reinforcement learning without it (see FIG. 2) in that the actions, observations, and rewards exchanged between the agent and the environment are transformed by a series of calculation processes.
- this conversion process is performed in order to acquire the learning data used to train the policy so that the agent outputs an appropriate difficulty level and gradually outputs higher difficulty levels as learning progresses.
- it is a series of calculation processes centered on converting the difficulty level into numerical values that can be input to the environment, calculating a parameter corresponding to the learning progress, and adjusting the reward according to the difficulty level and the learning progress.
- the details of the conversion process in reinforcement learning with the difficulty adjustment function will be described below.
- Agent 501 outputs extended action a'.
- the extended action a' is represented using, for example, a column vector.
- the extended action a' has, as its elements, the action a that is the control input to the environment 502 and the difficulty level d of the control in the environment 502. The action a and the difficulty level d are each represented as column vectors. Each element of the action a corresponds to the control input of one control target in the environment 502, and each element of the difficulty level d corresponds to the numerical value of one element that determines the difficulty of control in the environment 502.
- each element of the action a corresponds to, for example, the torque input of each joint of the robot.
- the difficulty level d corresponds to, for example, each parameter related to the difficulty level of gripping, such as the friction coefficient and elastic modulus of the object to be gripped.
- the parameter corresponding to the difficulty level d is specified by the user, for example.
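As a sketch, the extended action a' can be viewed as the concatenation of a and d in one vector; the dimensions and values below are hypothetical:

```python
def split_extended_action(a_prime, action_dim):
    """Split the extended action a' into the control action a
    and the difficulty level d (both stored in one column vector)."""
    return a_prime[:action_dim], a_prime[action_dim:]

# hypothetical: two joint-torque inputs followed by two difficulty
# elements (e.g. friction coefficient and elastic modulus scaling)
a_prime = [0.3, -0.1, 0.8, 0.2]
a, d = split_extended_action(a_prime, action_dim=2)
```

The action part a is fed to the environment as a control input, while the difficulty part d goes through the conversion f_d described next.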
- the conversion f_d 503 converts the difficulty level d into the environment parameter η and the converted difficulty level δ.
- the environment parameter η is a parameter related to the state transition method (transition characteristics) of the environment 502; as described later with reference to equation (1), it is a parameter that can shift the state transition method of the environment 502 between the desired state transition method and state transition processes far from the desired one. The environment parameter η is represented as a column vector, and each element of η corresponds to an element of the difficulty level d.
- the environment parameter η is input to the environment 502 to change its characteristics.
- the characteristics are the state transition process of the environment 502 in response to the input action a.
- each element of the environment parameter η corresponds to a parameter that determines a characteristic of the environment 502.
- for example, the characteristics of the environment 502 are changed by setting numerical values for user-specified parameters, such as the friction coefficient and the elastic modulus of the object to be gripped, and inputting the resulting environment parameter η into the environment 502.
- as a specific example of the conversion, equation (1) can be used. The conversion is not limited to the example of equation (1) and may be a non-linear conversion; for example, d in equation (1) may be replaced with (d∘d).
- the symbol " ⁇ " is the Hadamard product, which represents the product of each element of the column vector.
- each element of the difficulty level d takes a value between 0 and 1, and the larger the value, the higher the difficulty of control in the environment 502 with respect to the corresponding element of the environment parameter η.
- I is a column vector of the same dimension as the difficulty level d whose every element is 1.
- η_start and η_target are column vectors of the same dimension as the difficulty level d.
- the numerical value of each element of η_start and η_target is set, for example by the user, as a parameter that can control the corresponding characteristic of the environment 502.
- η_start is the environment parameter of the environment 502 when the difficulty level d is the lowest that can be specified (for example, when d is the zero vector).
- η_target is the environment parameter of the environment 502 when the difficulty level d is the highest that can be specified (for example, when d is I).
- η_target is set by the user to match, or be as close as possible to, the environment parameter under which the policy will ultimately be used as the controller.
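The body of equation (1) does not survive in this text. A form consistent with the surrounding description (the result equals the lowest-difficulty parameters when d is the zero vector and the highest-difficulty parameters when d is I), writing the two boundary vectors as eta_start and eta_target, would be the elementwise interpolation below; this is a sketch under that assumption, not a reproduction of the original equation:

```python
def env_param_from_difficulty(d, eta_start, eta_target):
    """Elementwise interpolation consistent with the description of
    equation (1): eta = (I - d) ∘ eta_start + d ∘ eta_target,
    where ∘ is the Hadamard (elementwise) product."""
    return [(1.0 - di) * s + di * t
            for di, s, t in zip(d, eta_start, eta_target)]

# hypothetical friction / elasticity parameters
eta_start = [1.0, 5.0]   # easiest environment (d = zero vector)
eta_target = [0.2, 1.0]  # hardest, desired environment (d = I)
eta = env_param_from_difficulty([0.0, 1.0], eta_start, eta_target)
```

Replacing d with (d∘d), as the text suggests, simply makes the interpolation non-linear in d.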
- the converted difficulty level δ is a column vector or scalar value input to the conversion f_r 504; it is a feature quantity representing the difficulty level, computed by the conversion f_d 503.
- an example in which the converted difficulty level δ is a scalar value is described below.
- as a specific example of converting the difficulty level d into the converted difficulty level δ with the conversion f_d 503, equation (2) can be used.
- the converted difficulty level δ represents the average of the absolute values of the elements of the difficulty level d.
- the process of calculating the converted difficulty level δ may be any process that computes, from a plurality of numerical values such as a vector, one numerical value representing their characteristics, and is not limited to equation (2).
- the process of calculating the converted difficulty level δ may be realized, for example, by replacing the L1 norm in equation (2) with an L2 norm or the like, by using another non-linear transformation, or by converting d into a vector of lower dimension than d.
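The body of equation (2) is likewise not reproduced here; a scalar feature matching the description "average of the absolute values of each element of d" can be sketched as:

```python
def converted_difficulty(d):
    """Converted difficulty: the average of the absolute values of the
    elements of d, i.e. the L1 norm of d divided by its dimension."""
    return sum(abs(x) for x in d) / len(d)

delta = converted_difficulty([0.2, 0.4])  # (0.2 + 0.4) / 2
```

Swapping the L1 norm for an L2 norm, as the text mentions, would replace the sum of absolute values with the square root of the sum of squares.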
- the environment 502 receives the action a and the environment parameter η as input, and outputs the observation o and the reward after the processing step progresses and the state transitions.
- in the following, the reward output by the environment 502 is described as the pre-adjustment reward r.
- the pre-adjustment reward r represents the reward in reinforcement learning without the difficulty adjustment function.
- Observation o is represented by a column vector. In this case, each element of observation o represents a numerical value of an observable parameter in the state of environment 502.
- the conversion f_r 504 calculates the adjusted reward r' by discounting or increasing the pre-adjustment reward r according to the difficulty level and the learning progress.
- the conversion f_r 504 calculates the adjusted reward r' so that, when the learning progress is low, the lower the difficulty level, the smaller the discount (or the larger the premium).
- the conversion f_r 504 calculates the adjusted reward r' so that, when the learning progress is high, the higher the difficulty level, the smaller the discount (or the larger the premium).
- the conversion f_r 504 calculates the adjusted reward r' from the pre-adjustment reward r, the converted difficulty level δ, and the moving average μ of the cumulative pre-adjustment reward R.
- the moving average μ of the cumulative pre-adjustment reward R corresponds to the learning progress.
- Equation (3) can be used as an example of the conversion f_r 504.
- the function f_c is a function that outputs the ratio by which the pre-adjustment reward r is discounted according to the difficulty level and the learning progress.
- the function f_c is preferably differentiable so that the policy can be learned efficiently.
- FIG. 6 is a diagram showing an example of the function f_c as a graph in which part of its contour lines are drawn.
- the function f_c may have any shape set by the user. For example, the discount can be set to zero regardless of the difficulty level in the region where the learning progress is low. It is also possible to set the discount rate to be larger for lower difficulty levels in the region where the learning progress is high.
- in FIG. 6, the horizontal axis represents the moving average μ of the cumulative pre-adjustment reward R (the learning progress); the right side indicates a high average and the left side a low average.
- the vertical axis represents the converted difficulty level δ; the higher the position, the higher the difficulty level, and the lower the position, the lower the difficulty level.
- the numbers in FIG. 6 represent the value of f_c(δ, μ). The closer f_c(δ, μ) is to 1, the smaller the discount (or the larger the premium). The closer f_c(δ, μ) is to 0, the larger the discount (or the smaller the premium).
- the conversion f_r 504 is not limited to the example of Equation (3); for example, it may be a function expressed in the form f_c(r, δ, μ).
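The qualitative behavior described for the conversion f_r 504 can be sketched as below. Equation (3) itself is not reproduced in this text, so the sigmoid-based adjustment function f_c and its gain k are hypothetical stand-ins that merely reproduce the stated behavior (small discount for low difficulty at low progress, small discount for high difficulty at high progress):

```python
import math

def f_c(delta, mu, k=5.0):
    """Hypothetical differentiable adjustment function: returns a value
    close to 1 (small discount) when the converted difficulty delta is
    near a 'target' difficulty that rises with the learning progress mu.
    The sigmoid form and gain k are assumptions, not Equation (3)."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    target = sigmoid(k * (mu - 0.5))   # preferred difficulty grows with progress
    return 1.0 - abs(delta - target)   # in [0, 1] when delta, target are in [0, 1]

def f_r(r, delta, mu):
    """Adjusted reward r' as a discount of the pre-adjustment reward r."""
    return f_c(delta, mu) * r
```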
- the cumulative calculation f_R 505 calculates the cumulative pre-adjustment reward R from the pre-adjustment reward r.
- the cumulative pre-adjustment reward R represents the cumulative reward in reinforcement learning without the difficulty adjustment function.
- the cumulative calculation f_R 505 calculates the cumulative pre-adjustment reward R for each episode. At the start of an episode, the initial value of the cumulative pre-adjustment reward R is set to, for example, 0.
- the cumulative calculation f_R 505 adds the pre-adjustment reward r to the cumulative pre-adjustment reward R each time the pre-adjustment reward r is input. That is, the cumulative calculation f_R 505 calculates the total value of the pre-adjustment rewards r (the cumulative pre-adjustment reward R) for each episode.
- an episode represents one process in which the agent 501 acquires learning data through trial and error.
- an episode represents, for example, the process from the initial state of the environment 502, at which the agent 501 starts acquiring learning data, until a predetermined end condition is satisfied.
- the episode ends when the predetermined end condition is met.
- when the episode ends, the environment 502 is reset to its initial state and a new episode begins.
- the predetermined end condition may be, for example, a condition that the number of steps from the start of the episode of the agent 501 exceeds a preset threshold value. Further, the predetermined end condition may be a condition that the state of the environment 502 violates a preset constraint condition due to the action a of the agent 501.
- the predetermined end conditions are not limited to these examples.
- the predetermined end condition may be a combination of a plurality of the above-mentioned conditions. An example of the constraint condition is that an arm-type robot enters a preset prohibited area.
- the reward history buffer 506 stores a plurality of cumulative pre-adjustment rewards R calculated for each episode. The reward history buffer 506 is assumed to have a built-in calculation function that calculates feature amounts corresponding to the learning progress from these rewards. Examples of the feature amounts are the moving average μ and the moving standard deviation σ of the cumulative pre-adjustment reward R. The feature amounts corresponding to the learning progress are not limited to these examples.
- the reward history buffer 506 samples the latest of the stored cumulative pre-adjustment rewards R up to a window size preset by the user (that is, for a predetermined number of steps), and calculates the moving average μ and the moving standard deviation σ from them.
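A minimal sketch of the reward history buffer 506 and its built-in feature calculation might look like this; the class layout and window handling are illustrative assumptions:

```python
from collections import deque
import statistics

class RewardHistoryBuffer:
    """Illustrative sketch of the reward history buffer 506: stores the
    cumulative pre-adjustment reward R of each episode and derives the
    learning-progress features (moving average mu, moving standard
    deviation sigma) over a user-set window size."""
    def __init__(self, window_size):
        # deque with maxlen keeps only the latest window_size episodes
        self.R_history = deque(maxlen=window_size)

    def add_episode(self, R):
        self.R_history.append(R)

    def features(self):
        mu = statistics.fmean(self.R_history)
        sigma = statistics.pstdev(self.R_history) if len(self.R_history) > 1 else 0.0
        return mu, sigma
```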
- the combination f_o 507 is a process that takes the observation o, the difficulty level d, and the moving average μ and moving standard deviation σ of the cumulative pre-adjustment reward R, and outputs the extended observation o', a column vector in which these are concatenated in the column direction. The extended observation o' therefore includes the observation o of reinforcement learning without the difficulty adjustment function, the difficulty level d of reinforcement learning with the difficulty adjustment function, and the moving average μ and moving standard deviation σ of the cumulative pre-adjustment reward R of reinforcement learning without the difficulty adjustment function. That is, the extended observation o' extends the observation o of reinforcement learning without the difficulty adjustment function by adding the difficulty level and the learning progress so that the policy can output an appropriate difficulty level d.
- the policy can thus output the difficulty level d in consideration of the balance between the current learning progress of the policy and the reward to be acquired.
- the output of the policy may also be determined without explicitly considering the learning progress; in that case, the learning progress need not be included in the extended observation o'.
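The combination performed by f_o 507 amounts to a simple concatenation in the column direction, which can be sketched as follows (the function name and array layout are illustrative):

```python
import numpy as np

def f_o(o, d, mu, sigma):
    """Sketch of the combination f_o 507: concatenate the observation o,
    the difficulty level d, and the learning-progress features mu and
    sigma into one vector (the extended observation o')."""
    return np.concatenate([np.ravel(o), np.ravel(d), [mu], [sigma]])
```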
- the above is the series of conversion processes in reinforcement learning with the difficulty adjustment function.
- the agent 501 transmits sets of the extended action a', the extended observation o', and the adjusted reward r' obtained by the conversion processes to the learning unit 110 as learning data. The learning unit 110 then updates the policy using this learning data. In contrast, in reinforcement learning without the difficulty adjustment function, the policy is updated using learning data representing sets of the action a, the observation o, and the reward r.
- the learning unit 110 executes the calculation according to the procedure shown in FIG.
- FIG. 4 is a flowchart showing an example of a procedure in which the learning unit 110 updates the policy using the learning data acquired by the learning data acquisition unit 120.
- the policy update unit 111 reads the learning data group acquired by the agent 501 stored in the learning data storage unit 113 (step S101).
- the policy update unit 111 updates the policy using the read learning data group (step S102).
- the update is calculated using algorithms such as DDPG, PPO, and SAC mentioned above.
- the algorithm for updating is not limited to these examples.
- the policy update unit 111 determines the learning end condition (step S103).
- an example of the learning end condition is a condition that learning ends when the number of policy updates exceeds a threshold preset by the user.
- when the condition is not satisfied (step S103: No), the process returns to step S101.
- when the condition is satisfied (step S103: Yes), the learning process ends, and the pair of the updated policy, the moving average μ of the cumulative pre-adjustment reward R output from the reward history buffer 506, and the moving standard deviation σ is transmitted to the policy storage unit 114 and stored (step S104).
- after the process of step S104 is executed, the learning device 100 ends the process of FIG. 4.
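The FIG. 4 procedure (steps S101 to S104) can be sketched as the following loop. The callables standing in for the learning data storage unit 113, the update algorithm (for example DDPG, PPO, or SAC), and the policy storage unit 114 are hypothetical placeholders:

```python
def learning_loop(read_batch, update_policy, save, max_updates):
    """Sketch of the FIG. 4 procedure: read learning data, update the
    policy, and end when the update-count threshold is exceeded (the end
    condition mentioned above). save() stands in for storing the policy
    together with mu and sigma in the policy storage unit 114."""
    n_updates = 0
    while True:
        batch = read_batch()          # step S101: read learning data group
        update_policy(batch)          # step S102: update the policy
        n_updates += 1
        if n_updates >= max_updates:  # step S103: learning end condition
            break
    save()                            # step S104: store policy with mu, sigma
```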
- the learning data acquisition unit 120 executes the calculation according to the procedure shown in FIG.
- FIG. 5 is a flowchart showing an example of a procedure in which the learning data acquisition unit 120 cooperates with the environment device 200 and the environment unit 210 to acquire the learning data used by the learning unit 110 for the policy calculation.
- the procedure shown in FIG. 5 is an example. Since the flow shown in FIG. 5 includes steps that can be processed in parallel and steps whose execution order can be exchanged, the calculation procedure of the learning data acquisition unit 120 is not limited to the procedure shown in FIG. 5.
- the conversion unit 123 initializes the cumulative pre-adjustment reward R to 0.
- the agent calculation unit 121 resets the environment unit 210 to the initial state and starts the episode (step S201).
- the conversion unit 123 calculates the initial value of the extended observation o' and transmits it to the agent calculation unit 121 (step S202).
- an example of the method for calculating the initial value of the extended observation o' is to calculate it according to the process of the combination f_o, using a predetermined difficulty level d, the observation o from the environment unit 210, and the moving average μ and moving standard deviation σ of the cumulative pre-adjustment reward R.
- the agent calculation unit 121 inputs the extended observation o' into the policy and calculates the extended action a' (step S203). As the extended observation o' to be input to the policy, the one acquired in the step immediately before step S203 (step S202 or step S211) is used.
- the conversion unit 123 decomposes the extended action a' calculated in step S203 into the action a and the difficulty level d (step S204).
- the conversion unit 123 inputs the difficulty level d into the conversion f_d and calculates the environment parameter λ and the converted difficulty level δ (step S205).
- the conversion unit 123 inputs the action a and the environment parameter λ to the environment unit 210, and advances the environment unit 210 to the next time step (step S206).
- the conversion unit 123 acquires the observation o and the pre-adjustment reward r output from the environment unit 210 (step S207).
- the cumulative calculation f_R 505 adds the pre-adjustment reward r to the cumulative pre-adjustment reward R (step S208).
- the combination f_o 507 acquires the moving average μ and moving standard deviation σ of the cumulative pre-adjustment reward R from the reward history buffer 506 (step S209).
- the conversion f_r 504 calculates the adjusted reward r' from the pre-adjustment reward r, the converted difficulty level δ, and the moving average μ of the cumulative pre-adjustment reward R (step S210).
- the combination f_o 507 combines the observation o, the difficulty level d, and the moving average μ and moving standard deviation σ of the cumulative pre-adjustment reward R into the extended observation o' (step S211).
- the agent calculation unit 121 transmits the set of the extended action a', the extended observation o', and the adjusted reward r' as learning data to the learning data storage unit 113, where it is stored (step S212).
- the agent calculation unit 121 determines whether the episode has ended using the episode end condition (step S213). When the agent calculation unit 121 determines that the episode has not ended (step S213: No), the process returns to step S203. When the agent calculation unit 121 determines that the episode has ended (step S213: Yes), the conversion unit 123 stores the cumulative pre-adjustment reward R in the reward history buffer 506, and calculates and updates the moving average μ and the moving standard deviation σ of the cumulative pre-adjustment reward R using the plurality of cumulative pre-adjustment rewards R stored in the reward history buffer 506 (step S214). When step S214 is complete, the episode ends and the process returns to step S201.
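The FIG. 5 procedure for one episode (steps S201 to S213) can be sketched as follows. All interfaces (reset_env, step_env, policy, and the conversion helpers) are illustrative stand-ins for the environment unit 210 and the conversions described above, not the publication's actual APIs:

```python
def collect_episode(reset_env, step_env, policy, f_d, f_r, f_o, mu, sigma, max_steps):
    """Sketch of one episode of the FIG. 5 procedure. policy returns the
    extended action already split into (a, d); mu and sigma are the
    learning-progress features from the reward history buffer."""
    R = 0.0                                   # cumulative pre-adjustment reward
    o = reset_env()                           # step S201: reset to initial state
    data = []
    d = [0.0]                                 # predetermined difficulty for the initial o'
    o_ext = f_o(o, d, mu, sigma)              # step S202: initial extended observation
    for _ in range(max_steps):                # end condition: step-count threshold
        a, d = policy(o_ext)                  # steps S203-S204: extended action, decomposed
        lam, delta = f_d(d)                   # step S205: environment parameter, delta
        o, r = step_env(a, lam)               # steps S206-S207: advance, observe
        R += r                                # step S208: cumulative calculation f_R
        r_adj = f_r(r, delta, mu)             # steps S209-S210: adjusted reward r'
        o_ext = f_o(o, d, mu, sigma)          # step S211: new extended observation
        data.append(((a, d), o_ext, r_adj))   # step S212: store learning data
    return data, R                            # R goes to the reward history buffer (step S214)
```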
- the series of processes of the learning data acquisition unit 120 shown in FIG. 5 is interrupted and ends when the series of processes of the learning unit 110 shown in FIG. 4 is completed.
- as described above, the learning device of the present embodiment is a learning device that learns a policy for determining the control content of a target system. It includes determination means for determining, according to the policy, the control to be applied to the target system and the difficulty level to be set for the target system, using the observation information about the target system and a difficulty level associated with the manner of state transition of the target system and the likelihood that the evaluation of the control content becomes high; learning progress calculation means for calculating the learning progress of the policy using a plurality of original evaluations of the states before and after the target system transitions according to the determined control and the determined difficulty level and of the determined control; calculation means for calculating a revised evaluation using the original evaluations, the determined difficulty level, and the calculated learning progress; and policy update means for updating the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
- the learning device of the present embodiment can therefore perform efficient learning.
- a control system including the learning device 100 according to the second embodiment of the present invention will be described.
- the control system is an example of the target system.
- the configuration of the control system is the same as that of the learning system 1.
- the environmental device 200 may have a policy storage unit 114 and a learning data acquisition unit 120.
- the environmental device 200 is a control system.
- the policy storage unit 114 stores the policy learned by the learning system 1, the moving average μ of the cumulative pre-adjustment reward R, and the moving standard deviation σ of the cumulative pre-adjustment reward R.
- the agent calculation unit 121 performs inference calculation processing according to the policy stored in the policy storage unit 114, using the moving average μ and the moving standard deviation σ of the cumulative pre-adjustment reward R stored in the policy storage unit 114 as inputs.
- the agent calculation unit 121 and the conversion unit 123 perform the series of calculation processes described above, and input the action a and the environment parameter λ to the environment unit 210.
- the environment unit 210 changes its state according to the input action a and environment parameter λ, and outputs, for example, the observation o about the state after the transition.
- the conversion unit 123 converts the observation o into the extended observation o'.
- the calculated extended observation o' is input again, and the agent calculation unit 121, the conversion unit 123, and the environment unit 210 repeat the above-mentioned series of processes.
- this series of processes constitutes the desired control of the control system. That is, the agent calculation unit 121 and the conversion unit 123 determine the operation of the control system according to the policy stored in the policy storage unit 114, and control the control system so that it performs the determined operation. As a result, the control system performs the desired operation.
- for parameters that cannot be changed by the environment parameter λ, the setting via the environment parameter λ may be ignored.
- examples of parameters whose setting via the environment parameter λ can be ignored are parameters such as the friction coefficient and elastic modulus of an object, which can easily be changed in simulation or emulation but cannot be changed in the actual system.
- instead of the moving average μ and moving standard deviation σ of the cumulative pre-adjustment reward R output from the reward history buffer 506, the combination f_o 507 takes as inputs the moving average μ and moving standard deviation σ of the cumulative pre-adjustment reward R stored in the policy storage unit 114. The conversion unit 123 therefore does not have to perform the calculation process of the reward history buffer 506.
- further, the conversion unit 123 does not have to perform the calculation processes of the conversion f_r 504 and the cumulative calculation f_R 505. This is because the agent calculation unit 121 does not need to transmit learning data to the learning data storage unit 113 for storage.
- the above is the calculation process of the learning device 100 in the control system.
- the learning device 100 according to the second embodiment can make the learned policy function as a control controller as a part of the control system.
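One inference step of this controller use can be sketched as follows; note that μ and σ are taken from the policy storage unit 114 rather than the reward history buffer, and no reward adjustment or learning data storage occurs. The interfaces are illustrative stand-ins:

```python
def control_step(policy, f_d, f_o, o, mu, sigma, d_prev):
    """Sketch of one inference step when the learned policy is used as a
    controller: build the extended observation from the stored mu and
    sigma, query the policy, and derive the environment parameter."""
    o_ext = f_o(o, d_prev, mu, sigma)   # extended observation from stored mu, sigma
    a, d = policy(o_ext)                # determine action and difficulty level
    lam, _ = f_d(d)                     # environment parameter for the target system
    return a, lam, d                    # a and lam are input to the environment unit 210
```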
- the control system includes, for example, a pick-and-place control system for an arm-type robot, a walking control system for a humanoid robot, a flight attitude control system for a flight-type robot, and the like.
- the control system is not limited to these examples.
- the configuration of the learning device 100 is not limited to the configuration using a computer.
- the learning device 100 may be configured using dedicated hardware, such as an ASIC (Application Specific Integrated Circuit).
- the present invention can also realize arbitrary processing by causing a CPU (Central Processing Unit) to execute a computer program. The program may also be executed by an auxiliary arithmetic unit such as a GPU (Graphics Processing Unit) together with the CPU.
- the program can be stored and supplied to the computer using various types of non-transitory computer-readable media.
- non-transitory computer-readable media include various types of tangible storage media. Examples of non-transitory computer-readable media include magnetic recording media (e.g., flexible disks and magnetic tapes), optical and magneto-optical recording media (e.g., optical discs and magneto-optical disks), CD-ROMs (Read Only Memory), CD-Rs, CDs, DVDs (Digital Versatile Disc), BDs (Blu-ray (registered trademark) Disc), and semiconductor memories (e.g., mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory)).
- FIG. 7 is a block diagram showing the main part of the learning device.
- the learning device 800 includes a determination unit (determination means) 801 (realized in the embodiment by the agent calculation unit 121) that, according to the policy, determines the control (for example, the action a) to be applied to the target system and the difficulty level (for example, the difficulty level d) to be set for the target system, using the observation information about the target system (for example, the observation o) and a difficulty level associated with the manner of state transition of the target system and the likelihood that the evaluation of the control becomes high.
- the learning device 800 further includes a learning progress calculation unit (learning progress calculation means) 802 (realized in the embodiment by the conversion unit 123, in particular the cumulative calculation f_R 505 and the reward history buffer 506) that calculates the learning progress of the policy (for example, the moving average μ of the cumulative pre-adjustment reward R) using a plurality of original evaluations (for example, pre-adjustment rewards r) of the states before and after the target system transitions according to the determined control and the determined difficulty level and of the determined control.
- the learning device 800 further includes a calculation unit (calculation means) 803 (realized in the embodiment by the conversion unit 123, in particular the conversion f_r 504) that calculates a revised evaluation (for example, the adjusted reward r') using the original evaluations, the determined difficulty level, and the calculated learning progress, and a policy update unit (policy update means) 804 (realized in the embodiment by the policy update unit 111) that updates the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Feedback Control In General (AREA)
Abstract
Description
<First Embodiment>
110 learning unit
111 policy update unit
112 learning setting storage unit
113 learning data storage unit
114 policy storage unit
120 learning data acquisition unit
121 agent calculation unit
122 agent setting storage unit
123 conversion unit
124 conversion setting storage unit
130 input/output control unit
200 environment device
210 environment unit
300 user I/F
401 agent
402 environment
501 agent
502 environment
503 conversion f_d
504 conversion calculation unit f_r
505 cumulative calculation unit f_R
506 reward history buffer
507 combination calculation unit f_o
601 adjustment function
800 learning device
801 determination unit
802 learning progress calculation unit
803 calculation unit
804 policy update unit
Claims (6)
- A learning device that learns a policy for determining control content of a target system, the learning device comprising:
determination means for determining, in accordance with the policy, a control to be applied to the target system and a difficulty level to be set for the target system, using observation information about the target system and a difficulty level associated with a manner of state transition of the target system and a likelihood that an evaluation of the control content becomes high;
learning progress calculation means for calculating a learning progress of the policy using a plurality of original evaluations of states before and after the target system transitions in accordance with the determined control and the determined difficulty level and of the determined control;
calculation means for calculating a revised evaluation using the original evaluations, the determined difficulty level, and the calculated learning progress; and
policy update means for updating the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
- The learning device according to claim 1, wherein the determination means further determines, using the learning progress, the control to be applied to the target system and the difficulty level to be set for the target system.
- The learning device according to claim 1 or 2, wherein, for cases where the value of the original evaluation is the same, the calculation means calculates the revised evaluation as a smaller value as the learning progress is higher and the determined difficulty level is lower.
- The learning device according to any one of claims 1 to 3, wherein, for cases where the value of the original evaluation is the same, the calculation means calculates the revised evaluation as a smaller value as the learning progress is lower and the determined difficulty level is higher.
- A learning method for learning a policy that determines control content of a target system, the learning method comprising:
determining, in accordance with the policy, a control to be applied to the target system and a difficulty level to be set for the target system, using observation information about the target system and a difficulty level associated with a manner of state transition of the target system and a likelihood that an evaluation of the control content becomes high;
calculating a learning progress of the policy using a plurality of original evaluations of states before and after the target system transitions in accordance with the determined control and the determined difficulty level and of the determined control;
calculating a revised evaluation using the original evaluations, the determined difficulty level, and the calculated learning progress; and
updating the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
- A computer-readable recording medium storing a learning program for learning a policy that determines control content of a target system, the learning program causing a computer to execute:
a process of determining, in accordance with the policy, a control to be applied to the target system and a difficulty level to be set for the target system, using observation information about the target system and a difficulty level associated with a manner of state transition of the target system and a likelihood that an evaluation of the control content becomes high;
a process of calculating a learning progress of the policy using a plurality of original evaluations of states before and after the target system transitions in accordance with the determined control and the determined difficulty level and of the determined control;
a process of calculating a revised evaluation using the original evaluations, the determined difficulty level, and the calculated learning progress; and
a process of updating the policy using the observation information, the determined control, the determined difficulty level, and the revised evaluation.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/909,835 US20240202569A1 (en) | 2020-03-16 | 2020-03-16 | Learning device, learning method, and recording medium |
JP2022508616A JP7468619B2 (ja) | 2020-03-16 | 2020-03-16 | 学習装置、学習方法、及び、記録媒体 |
PCT/JP2020/011465 WO2021186500A1 (ja) | 2020-03-16 | 2020-03-16 | 学習装置、学習方法、及び、記録媒体 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2020/011465 WO2021186500A1 (ja) | 2020-03-16 | 2020-03-16 | 学習装置、学習方法、及び、記録媒体 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021186500A1 true WO2021186500A1 (ja) | 2021-09-23 |
Family
ID=77770726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/011465 WO2021186500A1 (ja) | 2020-03-16 | 2020-03-16 | 学習装置、学習方法、及び、記録媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240202569A1 (ja) |
JP (1) | JP7468619B2 (ja) |
WO (1) | WO2021186500A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114357884A (zh) * | 2022-01-05 | 2022-04-15 | 厦门宇昊软件有限公司 | 一种基于深度强化学习的反应温度控制方法和系统 |
CN114404977A (zh) * | 2022-01-25 | 2022-04-29 | 腾讯科技(深圳)有限公司 | 行为模型的训练方法、结构扩容模型的训练方法 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017183587A1 (ja) * | 2016-04-18 | 2017-10-26 | 日本電信電話株式会社 | 学習装置、学習方法および学習プログラム |
JP2019219741A (ja) * | 2018-06-15 | 2019-12-26 | 株式会社日立製作所 | 学習制御方法及び計算機システム |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017183587A1 (ja) * | 2016-04-18 | 2017-10-26 | 日本電信電話株式会社 | 学習装置、学習方法および学習プログラム |
JP2019219741A (ja) * | 2018-06-15 | 2019-12-26 | 株式会社日立製作所 | 学習制御方法及び計算機システム |
Non-Patent Citations (1)
Title |
---|
JIANG, LU ET AL.: "Self-Paced Learning with Diversity", vol. 27, 2014, pages 1 - 9, XP055706933, Retrieved from the Internet <URL:https://papers.nips.cc/paper/5568-self-paced-learning-with-diversity.pdf> [retrieved on 20200727] * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114357884A (zh) * | 2022-01-05 | 2022-04-15 | 厦门宇昊软件有限公司 | 一种基于深度强化学习的反应温度控制方法和系统 |
CN114404977A (zh) * | 2022-01-25 | 2022-04-29 | 腾讯科技(深圳)有限公司 | 行为模型的训练方法、结构扩容模型的训练方法 |
CN114404977B (zh) * | 2022-01-25 | 2024-04-16 | 腾讯科技(深圳)有限公司 | 行为模型的训练方法、结构扩容模型的训练方法 |
Also Published As
Publication number | Publication date |
---|---|
US20240202569A1 (en) | 2024-06-20 |
JPWO2021186500A1 (ja) | 2021-09-23 |
JP7468619B2 (ja) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pham et al. | Optlayer-practical constrained optimization for deep reinforcement learning in the real world | |
Fu et al. | One-shot learning of manipulation skills with online dynamics adaptation and neural network priors | |
EP3924884B1 (en) | System and method for robust optimization for trajectory-centric model-based reinforcement learning | |
Qazani et al. | A model predictive control-based motion cueing algorithm with consideration of joints’ limitations for hexapod motion platform | |
US20210107144A1 (en) | Learning method, learning apparatus, and learning system | |
WO2021186500A1 (ja) | 学習装置、学習方法、及び、記録媒体 | |
Qazani et al. | Optimising control and prediction horizons of a model predictive control-based motion cueing algorithm using butterfly optimization algorithm | |
KR101912918B1 (ko) | 학습 로봇, 그리고 이를 이용한 작업 솜씨 학습 방법 | |
EP3704550B1 (en) | Generation of a control system for a target system | |
KR20220137732A (ko) | 적응형 리턴 계산 방식을 사용한 강화 학습 | |
CN112016678B (zh) | 用于增强学习的策略生成网络的训练方法、装置和电子设备 | |
Yang et al. | Online adaptive teleoperation via motion primitives for mobile robots | |
Qazani et al. | Whale optimization algorithm for weight tuning of a model predictive control-based motion cueing algorithm | |
CN114529010A (zh) | 一种机器人自主学习方法、装置、设备及存储介质 | |
Kolaric et al. | Local policy optimization for trajectory-centric reinforcement learning | |
Ng et al. | Model predictive control and transfer learning of hybrid systems using lifting linearization applied to cable suspension systems | |
Ganai et al. | Learning stabilization control from observations by learning lyapunov-like proxy models | |
CN114378820B (zh) | 一种基于安全强化学习的机器人阻抗学习方法 | |
KR102570962B1 (ko) | 로봇 제어 장치 및 이의 동작 방법 | |
CN115421387A (zh) | 一种基于逆强化学习的可变阻抗控制系统及控制方法 | |
García et al. | Incremental reinforcement learning for multi-objective robotic tasks | |
WO2022180785A1 (ja) | 学習装置、学習方法及び記憶媒体 | |
Mainampati et al. | Implementation of human in the loop on the TurtleBot using reinforced learning methods and robot operating system (ROS) | |
CN114397817A (zh) | 网络训练、机器人控制方法及装置、设备及存储介质 | |
CN110298449B (zh) | 计算机进行通用学习的方法、装置和计算机可读存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20925507 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2022508616 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 17909835 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20925507 Country of ref document: EP Kind code of ref document: A1 |