CN108415254B - Waste recycling robot control method based on deep Q network - Google Patents


Info

Publication number
CN108415254B
Authority
CN
China
Prior art keywords
information
robot
action
network
state
Prior art date
Legal status
Active
Application number
CN201810199112.2A
Other languages
Chinese (zh)
Other versions
CN108415254A (en)
Inventor
朱斐
吴文
伏玉琛
周小科
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201810199112.2A priority Critical patent/CN108415254B/en
Publication of CN108415254A publication Critical patent/CN108415254A/en
Application granted granted Critical
Publication of CN108415254B publication Critical patent/CN108415254B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a waste recycling robot control method and device based on a deep Q network. The sensing system senses the position of an object in front of the robot and expresses this position as image information. The control system controls the robot's grabbing arm so that it grabs the object and places it in the receiving mechanism. The operating system receives commands from the control system and executes the corresponding actions. The driving system supplies the power the operating system needs to execute the actions issued by the control system. The sensing system collects environmental information and driving-system information and transmits them to the control system; the control system computes and processes the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions. The invention applies a reinforcement learning algorithm from the field of artificial intelligence and can autonomously learn and update the parameters of the neural network, so that the robot achieves the control effect of recycling articles.

Description

Waste recycling robot control method based on deep Q network
Technical Field
The invention belongs to the technical field of artificial intelligence and control, and particularly relates to a waste recycling robot control method based on a deep Q network that can learn autonomously and accomplish the robot's grasping control of articles.
Background
In recent years, artificial intelligence has been applied to household life more and more widely, giving rise to the concept of the smart home. The sweeping robot is a small, automatically controlled robot with artificial intelligence that is used to clean the home. Sweeping robots are now well established on the market; their use has partially freed people from household chores and has been well received.
However, current sweeping robots mainly target dust on the floor and clean only by vacuuming, so they are suited only to cleaning the floor in a single household environment. Most sweeping robots cannot handle larger waste such as discarded bottles and cans, and can only mark such waste as obstacles and drive around it.
Clearly, a sweeping robot that can only sweep dust off the floor cannot fully meet the requirements of larger settings and more complex environments (such as road surfaces), which limits its range of application.
Disclosure of Invention
The aim of the invention is to provide an improved control method that adapts to a new environment more quickly through self-learning, guarantees the effectiveness of policy updates, adapts more quickly to the requirements of different environments and different cleaning objects, and thereby greatly expands the range of application.
The technical scheme of the invention is as follows: a waste recycling robot device based on a deep Q network comprises a sensing system, a control system, an operating system and a driving system, and is characterized in that:
the sensing system comprises a camera and image acquisition equipment, and is used for sensing the position of an object in front of the robot and expressing this position as image information;
the control system is used for controlling the robot's grabbing arm to grab an object and place it in the receiving mechanism, and for controlling the rotation angle of the rotating mechanism;
the operating system comprises the robot's grabbing arm, the rotating mechanism and the receiving mechanism, and receives information from the control system and executes the various actions;
the driving system comprises a motor and a storage battery, and provides the power the operating system needs to execute the actions issued by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system computes and processes the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions.
Another technical scheme of the invention is as follows: a control method of a waste recycling robot device based on a deep Q network, comprising the following steps:
(1) acquiring environmental information, including visual environmental information and non-visual information, through the sensing system;
(2) initializing the neural network parameters, including the environment state information and reward information, according to the environmental information acquired in step (1), and initializing the parameters of the reinforcement learning algorithm;
(3) processing the image information fed back by the surrounding environment: converting it into a grayscale image by digital processing, performing feature extraction and training with a deep convolutional network so that the high-dimensional visual information is converted into low-dimensional feature information, and taking the low-dimensional feature information together with the non-visual information as the input state s_t of the current value network and the target value network;
(4) controlling the robot's action through the output of the current value network: in state s_t, the action a_t is computed from the current value network using the action value function Q(s, a) of the reinforcement learning algorithm; after the robot performs action a_t, a new environment state s_{t+1} and an immediate reward r_t are obtained;
(5) updating the current value network parameters and the target value network parameters, using random mini-batch gradient descent updates;
the current value network loss function is computed as

L_i(θ_i) = E[ ( r + γ max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i) )^2 ]

where Q(s', a'; θ_i^-) is the state-action value in the next state, Q(s, a; θ_i) is the state-action value in the current state, γ (0 ≤ γ ≤ 1) is the discount factor of the return function, E[·] is the expectation in the loss minimized by gradient descent, r is the immediate reward, and θ denotes the network parameters;
the target value network is obtained by copying the current value network once every N thousand steps;
(6) checking whether the learning termination condition is met; if not, returning to step (4) to continue the loop, otherwise ending; the learning termination condition is that the article has been dropped or the set number of steps has been completed.
In the above technical scheme, in step (4), an experience pool E is set. After the robot interacts with the environment, the state information and reward information fed back by the environment are stored in the experience pool E, specifically: an action is selected and executed according to the action value function Q(s, a); the current state s, the action a, the immediate reward r obtained by executing the action and the next state s' reached are saved in the experience pool E as a tuple; steps (3) to (5) are repeated for fifty thousand steps and the results are stored in the experience pool E; and the samples used to update the current value network and the target value network in step (5) are drawn from the experience pool E.
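As an illustration of the experience pool described above (not part of the claimed method), a minimal sketch in Python might store each interaction as an (s, a, r, s') tuple; the class name ExperiencePool and the 50,000-tuple capacity are assumptions chosen to match the fill length mentioned in the embodiment:

```python
import random
from collections import deque

class ExperiencePool:
    """Minimal experience pool E: stores (s, a, r, s_next) tuples up to a fixed capacity."""
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are discarded first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling; the prioritized selection described below replaces this.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```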
In the above technical scheme, the samples drawn from the experience pool E in step (5) are selected preferentially according to their priority levels, which are set as follows: each time new content is stored in the experience pool E, the sample priorities are updated according to the formula
P_i = 1 / t^β

where t is the number of times the sample has been selected, β is the degree of influence of this usage-based priority, and p_i is the probability that the i-th sample is selected; after the priority of a sample has been computed, it is normalized to obtain p_i:

p_i = P_i / Σ_k P_k
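The priority formulas above are rendered as images in the original publication; the sketch below assumes the reading P_i = 1/t^β (priority falls as a sample is selected more often) followed by the stated normalization, so both the functional form and the default β value are assumptions:

```python
import numpy as np

def selection_probabilities(times_selected, beta=0.5):
    """Compute normalized selection probabilities p_i from usage counts.

    Assumed priority form: P_i = 1 / t_i**beta, where t_i is the number of
    times sample i has been selected (t_i >= 1) and beta sets how strongly
    usage lowers the priority. The normalization matches the description:
    p_i = P_i / sum_k P_k.
    """
    t = np.asarray(times_selected, dtype=np.float64)
    priorities = 1.0 / np.power(np.maximum(t, 1.0), beta)
    return priorities / priorities.sum()

def sample_indices(times_selected, batch_size, beta=0.5, rng=None):
    """Draw a batch of sample indices according to the normalized priorities."""
    rng = rng or np.random.default_rng()
    p = selection_probabilities(times_selected, beta)
    return rng.choice(len(p), size=batch_size, p=p)

# Example: samples that have been selected fewer times are drawn more often.
print(selection_probabilities([1, 1, 5, 20]))
```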
In the above technical scheme, the current value network consists of three convolutional layers and one fully connected layer, with relu as the activation function. It processes the image information produced by the sensing system: after the convolutional layers extract the image features, the network outputs the action value function Q(s, a) through the relu activation, and an action is selected from Q(s, a) by a greedy strategy.
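A sketch of such a current value network in Python/PyTorch is given below; the 84x84x4 input size and the specific channel and kernel sizes are assumptions borrowed from common DQN practice, since the text only fixes the three-convolutional-layer plus one-fully-connected-layer structure with relu activations:

```python
import torch
import torch.nn as nn

class CurrentValueNetwork(nn.Module):
    """Three convolutional layers + one fully connected output layer, relu activations."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # With an assumed 84x84 input, the feature map is 64 x 7 x 7.
        self.head = nn.Linear(64 * 7 * 7, n_actions)  # Q(s, a) for every action a

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(start_dim=1))

def greedy_action(net, state):
    """Pick the action with the largest Q(s, a) for a single state tensor."""
    with torch.no_grad():
        q = net(state.unsqueeze(0))
        return int(q.argmax(dim=1).item())
```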
In a further technical scheme, controlling the robot's action through the output of the current value network proceeds as follows: several samples are drawn at random from the experience pool E; their states s are used as the input to the first hidden layer of the current value network; the current value network outputs the action value function Q(s, a); the action a_t to be taken is selected according to this action value function; after the robot performs action a_t, a new environment state s_{t+1} and an immediate reward r_t are obtained; and the parameters of the current value network are adjusted through the current value network loss function.
In the above technical scheme, in step (3):
the state s is represented as: the environment state sensed by the sensing system, namely the position information of the article in the robot's current field of view, presented as an image;
the action a is represented as: the set of operations that can be executed in the current state, including the angle and direction with which the robot grabs the article;
the immediate reward r is: the evaluation of the action taken by the robot in the current state; if the article does not fall after the robot grabs it, a reward of +1 is given; if the article is successfully placed in the receiving mechanism, a reward of +1000 is given; if the article is dropped, a reward of -1000 is given; otherwise the reward is 0.
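For illustration only, the reward scheme above maps directly onto a small Python function; the boolean event flags are assumed to be reported by the sensing system rather than defined in the text:

```python
def immediate_reward(grasp_holding, placed_in_container, dropped):
    """Reward scheme described above: +1 for holding the article without dropping it,
    +1000 for placing it in the receiving mechanism, -1000 for dropping it, 0 otherwise."""
    if dropped:
        return -1000
    if placed_in_container:
        return 1000
    if grasp_holding:
        return 1
    return 0
```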
The invention has the advantages that:
1. The invention uses the sensing system and the interaction between the robot and articles of different sizes to compute, by the reinforcement learning method, a grasping strategy for different articles, so that the robot can handle articles of different shapes and place them smoothly in the receiving mechanism, realizing the recovery of various kinds of waste;
2. The robot's control system adopts a prioritized deep reinforcement learning control method: the environmental information acquired by the sensing system is processed, a suitable action is selected, and the control signal is transmitted to the operating and driving systems, so that the robot grabs articles of different shapes more accurately;
3. The robot can be trained on articles of different shapes, which greatly improves its applicability; after sufficient training it can be used widely in scenarios such as road surfaces and commercial complexes;
4. The method can effectively handle control problems with a continuous action space;
5. Loss of articles during training and application is effectively avoided, and the training process is accelerated;
6. The convolutional neural network extracts image features effectively, so that the system can better find suitable actions.
Drawings
The invention is further described with reference to the following figures and examples:
fig. 1 is a block diagram of an information transfer structure of a robot apparatus according to an embodiment of the present invention;
FIG. 2 is a block diagram of a prioritized deep reinforcement learning controller according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a deep Q network structure according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples:
Embodiment: referring to figs. 1-3, a waste recycling robot device based on a deep Q network comprises a sensing system, a control system, an operating system and a driving system, characterized in that:
the sensing system comprises a camera and image acquisition equipment, and is used for sensing the position of an object in front of the robot and expressing this position as image information;
the control system is used for controlling the robot's grabbing arm to grab an object and place it in the receiving mechanism, and for controlling the rotation angle of the rotating mechanism;
the operating system comprises the robot's grabbing arm, the rotating mechanism and the receiving mechanism, and receives information from the control system and executes the various actions;
the driving system comprises a motor and a storage battery, and provides the power the operating system needs to execute the actions issued by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system computes and processes the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions, thereby completing the grabbing and placing of articles of various sizes.
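A minimal sketch of this information flow in Python, with all object and method names as illustrative placeholders rather than interfaces defined by the invention:

```python
def control_cycle(sensing, controller, operator, driver):
    """One sense-decide-act cycle mirroring the information flow described above.
    The four objects and their methods are hypothetical placeholders."""
    image = sensing.capture_image()            # position of the item in front of the robot
    drive_state = driver.read_status()         # non-visual information from the driving system
    action = controller.decide(image, drive_state)  # deep Q network selects an action
    operator.execute(action)                   # grabbing arm / rotating mechanism / receiving mechanism
    driver.power(action)                       # motor and battery supply the power for the action
```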
The specific implementation process is as follows:
In this embodiment, the overall control framework of the control system is a Deep Q Network (DQN) from deep reinforcement learning, and the Q-Learning algorithm from reinforcement learning is adopted for control. Assume that at each time step t = 1, 2, …, the sensing system of the robot observes the state s_t of a Markov decision process, the control system selects an action a_t, obtains the immediate reward r_t fed back by the environment, and the system transitions to the next state s_{t+1} with transition probability p(s_t, a_t, s_{t+1}). The goal of the agent in the reinforcement learning system is to learn a policy π that maximizes the cumulative discounted reward obtained over future time steps,

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}

where γ (0 ≤ γ ≤ 1) is the discount factor; this policy is the optimal policy. In a real environment, however, the state transition probability function p and the reward function r of the environment are unknown. The agent has only the immediate reward r_t available for learning the optimal policy, so the loss function can be optimized directly by a gradient method. The current value network selects the action, the loss is computed from the TD (temporal difference) error, the current value network parameters are updated by stochastic gradient descent, and the optimal policy is searched for. The control structure is shown in fig. 2.
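As a worked illustration of the objective and the TD error mentioned above (assuming the standard one-step Q-Learning target), in Python:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward R_t = sum_k gamma**k * r_{t+k+1}
    for a finite list of future rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_error(r, gamma, q_next_max, q_current):
    """One-step temporal-difference error used to build the loss:
    delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    return r + gamma * q_next_max - q_current

# Example: rewards from a short grasp episode (hold, hold, successful placement).
print(discounted_return([1, 1, 1000]))
```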
In different environments the control system uses the same network structure and the same set of algorithm parameters. The discount factor γ of the return function is 0.99; a 3-layer convolutional neural network is used to extract features from the image information collected by the sensing system, with the parameters of the convolutional network fixed; and the value network and the policy network each consist of 3 hidden layers and one output layer. In each experiment the environment the robot is in starts from a random initial state, the robot begins learning from that state, and if control fails the robot learns again until it can successfully grab the article.
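The parameter values stated in this paragraph and in the following steps can be collected into a single configuration; the entries marked as assumptions are not stated in the text:

```python
CONFIG = {
    "gamma": 0.99,               # discount factor stated in the embodiment
    "conv_layers": 3,            # convolutional layers used for image features
    "hidden_layers": 3,          # hidden layers of the value/policy networks
    "rmsprop_momentum": 0.95,    # momentum parameter stated for RMSProp (step 7)
    "target_sync_steps": 10000,  # copy current -> target value network (step 7)
    "replay_fill_steps": 50000,  # steps used to fill the experience pool (step 4)
    "batch_size": 4,             # four samples drawn per update (step 5)
    "learning_rate": 2.5e-4,     # assumption: not stated in the text
}
```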
Step 1: acquire the environment information of the robot.
The sensing system of the robot collects information through a camera and various image acquisition devices. The robot obtains image information of its surroundings through these sensors, and its behavior is controlled on the basis of this sensed information.
Step 2: acquire the robot's initial environment state information, reward information and so on, and initialize the algorithm parameters.
Initialize the neural network parameters and the reinforcement learning algorithm parameters in the control system; the neural network parameters include the weights and biases of the feedforward network.
Step 3: process the visual information fed back by the environment.
The state of the robot is sensed through the sensing system. The image information is converted into a grayscale image by digital processing, and the high-dimensional visual information of the environment is converted into low-dimensional feature information. The low-dimensional feature information and the non-visual information perceived by the sensors serve as the input state s_t of the policy network and the value network.
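A minimal preprocessing sketch in Python using OpenCV is shown below; the 84x84 output size and the scaling to [0, 1] are assumptions, since the text only requires a grayscale, lower-dimensional representation:

```python
import cv2
import numpy as np

def preprocess(frame_bgr, size=(84, 84)):
    """Convert a camera frame to grayscale and downscale it, turning the
    high-dimensional visual input into a compact array."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, size, interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
```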
Step 4: fill the experience pool.
After the robot interacts with the environment, the state information, reward information and so on fed back by the environment are obtained. The high-dimensional visual information fed back by the environment is processed as in step 3; this operation is repeated four times and the resulting output is used as the input of the current value network. An action is selected and executed according to the action value function, and the current state s, the action a, the immediate reward r obtained by executing the action and the next state s' reached are stored in the experience pool E as a tuple. Step 4 is repeated for fifty thousand steps. After each step the priority of the experience sample is updated, the priority update formula being:
P_i = 1 / t^β

where t is the number of times the sample has been selected, β is the degree of influence of this usage-based priority, and p_i is the probability that the i-th sample is selected. After the priority of a sample has been computed, it is normalized:

p_i = P_i / Σ_k P_k
and 5: and controlling the action of the robot by the current value network.
Randomly extracting four samples from an experience pool E, taking the state s of the four samples as the input of a first hidden layer of a current value network, outputting an action value function Q (s, a) by the current value network, and selecting an action a to be taken according to the action value functiontThe robot performing action atThereafter, a new environmental state s is obtainedt+1And immediate reward rt. And passes an error function (current value network loss function L)ii) Adjust parameters of the current value network.
The current value network consists of three layers of convolutional neural networks and a full connection layer, and the activation function is a relu function. For processing the image information processed by the sensing system. After extracting image features, the convolutional neural network outputs an action value function through an activation function, and an action is selected by a Greedy strategy according to the action value function.
Step 6: the current state s, action a, the immediate reward r obtained by performing the action, and the next state s' reached are saved as a tuple in the experience pool E.
Step 7: update the current value network parameters and the target value network parameters of the control system.
The robot continuously interacts with the environment as in step 4, and a batch of samples is drawn to update the current value network and the target value network. The specific update method is as follows:
the current value network loss function L_i(θ_i) is computed as

L_i(θ_i) = E[ ( r + γ max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i) )^2 ]

where Q(s', a'; θ_i^-) is the state-action value in the next state and Q(s, a; θ_i) is the state-action value in the current state. The method uses the Q-Learning algorithm from reinforcement learning and updates the current value network parameters with RMSProp gradient descent (with the momentum parameter set to 0.95).
The target value network is copied from the current value network once every ten thousand steps.
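A sketch of one such update step in Python/PyTorch, assuming the standard DQN target; the learning rate is an assumption, while the 0.95 momentum and the ten-thousand-step target synchronization follow the values stated above:

```python
import torch
import torch.nn.functional as F

def dqn_update(current_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on
    L_i(theta_i) = E[(r + gamma * max_a' Q(s', a'; theta_i^-) - Q(s, a; theta_i))^2].
    `batch` is assumed to be tensors (states, actions, rewards, next_states)
    sampled from the experience pool, with actions as an int64 tensor."""
    states, actions, rewards, next_states = batch
    q_sa = current_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer as described in the embodiment (learning rate is an assumption):
# optimizer = torch.optim.RMSprop(current_net.parameters(), lr=2.5e-4, momentum=0.95)
# Target network synchronization every 10,000 steps:
# if step % 10000 == 0:
#     target_net.load_state_dict(current_net.state_dict())
```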
Step 8: check the control result.
Check whether the learning termination condition is met; if not, return to step 5 and continue the loop; otherwise, the algorithm ends.
In a real environment, the robot's initial state is the environment state given by the position of the object in front of the robot, and this position is random. The robot's control system decides the action the robot should take at each step by processing the collected environment state and feedback information, and uses these data to update the current value network and the target value network until the robot reaches a termination state, after which it learns again. The robot executes 100 episodes in the environment (each episode has a finite length), and learning is judged successful if the robot can successfully grab the article.
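A sketch of this episode loop in Python, with env and agent (and their methods) as illustrative placeholders:

```python
def run_training(env, agent, episodes=100, max_steps=1000):
    """Outline of the episode loop described above: start from a random initial
    state, learn until the item is dropped or the step limit is reached, and
    count an episode as successful when the item is grasped and placed."""
    successes = 0
    for episode in range(episodes):
        state = env.reset()                       # random initial position of the item
        for step in range(max_steps):
            action = agent.act(state)             # current value network picks the action
            next_state, reward, done = env.step(action)
            agent.remember(state, action, reward, next_state)
            agent.learn()                         # update current/target value networks
            state = next_state
            if done:                              # item dropped or successfully placed
                break
        if env.item_recovered():
            successes += 1
    return successes / episodes
```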
The states, actions, and immediate rewards presented in this embodiment are respectively expressed as:
the state is as follows: the environmental state sensed by the sensing system is the position information of the object in the current view field of the robot, and is presented in an image mode.
And (4) action: the action is a set of operations that can be executed in the current state, and in this example, the action is divided into an angle and a direction operation for the robot to grab the article.
Immediate reward: an immediate reward is an assessment of the environment of the action taken by the robot in the current state. The reward function in the present invention is defined as: if the object does not fall off after the object is grabbed, giving a +1 reward; if the item is successfully placed in the designated location, a prize is given +1000, if the item is dropped a prize of-1000, otherwise a prize of 0 is given.
In the simulation process, after fifty million steps of simulation operation, if meeting the situation of ending the plot, a new initial environment state is executed at random, and the simulated steps are accumulated. The average of the recent 100 episodes of reward is used as a criterion to gauge whether the vehicle control is successful.

Claims (5)

1. A waste recycling robot control method based on a deep Q network, comprising the following steps:
(1) acquiring environmental information, including visual environmental information and non-visual information, through a sensing system;
(2) initializing the neural network parameters, including the environment state information and reward information, according to the environmental information acquired in step (1), and initializing the parameters of the reinforcement learning algorithm;
(3) processing the image information fed back by the surrounding environment: converting it into a grayscale image by digital processing, performing feature extraction and training with a deep convolutional network so that the high-dimensional visual information is converted into low-dimensional feature information, and taking the low-dimensional feature information together with the non-visual information as the input state s_t of the current value network and the target value network;
(4) controlling the robot's action through the output of the current value network: in state s_t, the action a_t is computed from the current value network using the action value function Q(s, a) of the reinforcement learning algorithm; after the robot performs action a_t, a new environment state s_{t+1} and an immediate reward r_t are obtained;
(5) updating the current value network parameters and the target value network parameters, using random mini-batch gradient descent updates;
the current value network loss function is computed as

L_i(θ_i) = E[ ( r + γ max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i) )^2 ]

where Q(s', a'; θ_i^-) is the state-action value in the next state, Q(s, a; θ_i) is the state-action value in the current state, γ is the discount factor of the return function with 0 ≤ γ ≤ 1, E[·] is the expectation in the loss minimized by gradient descent, r is the immediate reward, and θ denotes the network parameters;
the target value network is obtained by copying the current value network once every N thousand steps;
(6) checking whether the learning termination condition is met; if not, returning to step (4) to continue the loop, otherwise ending; the learning termination condition is that the article has been dropped or the set number of steps has been completed;
in step (4), an experience pool E is set, and after the robot interacts with the environment the state information and reward information fed back by the environment are stored in the experience pool E, specifically: an action is selected and executed according to the action value function Q(s, a); the current state s, the action a, the immediate reward r obtained by executing the action and the next state s' reached are saved in the experience pool E as a tuple; steps (3) to (5) are repeated for fifty thousand steps and the results are stored in the experience pool E; and the samples used to update the current value network and the target value network in step (5) are drawn from the experience pool E;
the samples drawn from the experience pool E in step (5) are selected preferentially according to their priority levels, which are set as follows: each time new content is stored in the experience pool E, the sample priorities are updated according to the formula
P_i = 1 / t^β

where t is the number of times the sample has been selected, β is the degree of influence of this usage-based priority, and p_i is the probability that the i-th sample is selected; after the priority of a sample has been computed, it is normalized to obtain p_i:

p_i = P_i / Σ_k P_k
2. The control method according to claim 1, characterized in that the device comprises a sensing system, a control system, an operating system and a driving system, wherein:
the sensing system comprises a camera and image acquisition equipment, and is used for sensing the position of an object in front of the robot and expressing this position as image information;
the control system is used for controlling the robot's grabbing arm to grab an object and place it in the receiving mechanism, and for controlling the rotation angle of the rotating mechanism;
the operating system comprises the robot's grabbing arm, the rotating mechanism and the receiving mechanism, and receives information from the control system and executes the various actions;
the driving system comprises a motor and a storage battery, and provides the power the operating system needs to execute the actions issued by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system computes and processes the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions.
3. The control method according to claim 1, characterized in that the current value network consists of three convolutional layers and one fully connected layer, with relu as the activation function; it processes the image information produced by the sensing system, wherein after the convolutional layers extract the image features the network outputs the action value function Q(s, a) through the relu activation, and an action is selected from Q(s, a) by a greedy strategy.
4. The control method according to claim 3, characterized in that controlling the robot's action through the output of the current value network proceeds as follows: several samples are drawn at random from the experience pool E; their states s are used as the input to the first hidden layer of the current value network; the current value network outputs the action value function Q(s, a); the action a_t to be taken is selected according to this action value function; after the robot performs action a_t, a new environment state s_{t+1} and an immediate reward r_t are obtained; and the parameters of the current value network are adjusted through the current value network loss function.
5. The control method according to claim 1, characterized in that, in step (3):
the state s is represented as: the environment state sensed by the sensing system, namely the position information of the article in the robot's current field of view, presented as an image;
the action a is represented as: the set of operations that can be executed in the current state, including the angle and direction with which the robot grabs the article;
the immediate reward r is: the evaluation of the action taken by the robot in the current state; if the article does not fall after the robot grabs it, a reward of +1 is given; if the article is successfully placed in the receiving mechanism, a reward of +1000 is given; if the article is dropped, a reward of -1000 is given; otherwise the reward is 0.
CN201810199112.2A 2018-03-12 2018-03-12 Waste recycling robot control method based on deep Q network Active CN108415254B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810199112.2A CN108415254B (en) 2018-03-12 2018-03-12 Waste recycling robot control method based on deep Q network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810199112.2A CN108415254B (en) 2018-03-12 2018-03-12 Waste recycling robot control method based on deep Q network

Publications (2)

Publication Number Publication Date
CN108415254A CN108415254A (en) 2018-08-17
CN108415254B (en) 2020-12-11

Family

ID=63131025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810199112.2A Active CN108415254B (en) 2018-03-12 2018-03-12 Waste recycling robot control method based on deep Q network

Country Status (1)

Country Link
CN (1) CN108415254B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109634719A (en) * 2018-12-13 2019-04-16 国网上海市电力公司 A kind of dispatching method of virtual machine, device and electronic equipment
CN109693239A (en) * 2018-12-29 2019-04-30 深圳市越疆科技有限公司 A kind of robot grasping means based on deeply study
WO2020154542A1 (en) * 2019-01-23 2020-07-30 Google Llc Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning
CN110238840B (en) * 2019-04-24 2021-01-29 中山大学 Mechanical arm autonomous grabbing method based on vision
CN111251294A (en) * 2020-01-14 2020-06-09 北京航空航天大学 Robot grabbing method based on visual pose perception and deep reinforcement learning
CN112327821A (en) * 2020-07-08 2021-02-05 东莞市均谊视觉科技有限公司 Intelligent cleaning robot path planning method based on deep reinforcement learning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2360629A3 (en) * 2005-05-07 2012-04-11 Stephen L. Thaler Device for the autonomous bootstrapping of useful information

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN204287968U (en) * 2014-12-03 2015-04-22 韩烁 The scavenge dolly of Based Intelligent Control
CN105137967A (en) * 2015-07-16 2015-12-09 北京工业大学 Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm
CN105740644A (en) * 2016-03-24 2016-07-06 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN106113059A (en) * 2016-08-12 2016-11-16 桂林电子科技大学 A kind of waste recovery robot
CN205852816U (en) * 2016-08-12 2017-01-04 桂林电子科技大学 A kind of waste recovery robot

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Volodymyr Mnih et al., "Human-level control through deep reinforcement learning", Nature, vol. 518, 26 Feb. 2015, pp. 529-541. *

Also Published As

Publication number Publication date
CN108415254A (en) 2018-08-17

Similar Documents

Publication Publication Date Title
CN108415254B (en) Waste recycling robot control method based on deep Q network
US20220212342A1 (en) Predictive robotic controller apparatus and methods
Wu et al. Daydreamer: World models for physical robot learning
US11062617B2 (en) Training system for autonomous driving control policy
US20200316773A1 (en) Adaptive predictor apparatus and methods
US9384443B2 (en) Robotic training apparatus and methods
US20180260685A1 (en) Hierarchical robotic controller apparatus and methods
WO2020154542A1 (en) Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning
CN112362066B (en) Path planning method based on improved deep reinforcement learning
Kiatos et al. Robust object grasping in clutter via singulation
CN108594804B (en) Automatic driving control method for distribution trolley based on deep Q network
Bakker et al. A robot that reinforcement-learns to identify and memorize important previous observations
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN112313043B (en) Self-supervising robot object interactions
CN113826051A (en) Generating digital twins of interactions between solid system parts
CN111300390B (en) Intelligent mechanical arm control system based on reservoir sampling and double-channel inspection pool
Zhao et al. Meld: Meta-reinforcement learning from images via latent state models
CN115917564A (en) System and method for learning reusable options to transfer knowledge between tasks
CN111783994A (en) Training method and device for reinforcement learning
CN115812180A (en) Robot-controlled offline learning using reward prediction model
JP7448683B2 (en) Learning options for action selection using meta-gradient in multi-task reinforcement learning
WO2014201422A2 (en) Apparatus and methods for hierarchical robotic control and robotic training
US20220410380A1 (en) Learning robotic skills with imitation and reinforcement at scale
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
EP4143745A1 (en) Training an action selection system using relative entropy q-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant