CN108415254B - Waste recycling robot control method based on deep Q network - Google Patents
Waste recycling robot control method based on deep Q network
- Publication number
- CN108415254B (application CN201810199112.2A)
- Authority
- CN
- China
- Prior art keywords
- information
- robot
- action
- network
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Manipulator (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses a waste recycling robot control method and device based on a deep Q network. The sensing system senses the position of an object in front of the robot and expresses that position as image information. The control system controls the robot's grasping arm to grab an object and place it in the receiving mechanism. The operating system receives commands from the control system and executes the corresponding actions. The driving system supplies the power needed for the operating system to carry out those actions. The sensing system collects environmental information and driving-system information and passes them to the control system; the control system computes on the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions. The invention applies a reinforcement learning algorithm from the field of artificial intelligence and can autonomously learn and update the parameters of the neural network, so that the robot achieves effective control for recycling articles.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and control, and particularly relates to a waste recycling robot control method based on a deep Q network, which is capable of self-learning and accomplishes robotic grasping control of articles.
Background
In recent years, artificial intelligence has been applied ever more widely in family life, giving rise to the concept of the smart home. The sweeping robot is a small, automatically controlled robot with artificial intelligence used for household cleaning. Sweeping robots are now well established on the market; their use has partly freed people from household chores and has been well received.
However, current sweeping robots mainly target dust on the floor and clean only by vacuuming, so they are suited only to single-household floor environments. Most sweeping robots cannot handle larger waste such as discarded bottles and cans and can only mark such waste as obstacles and drive around it.
Clearly, a sweeping robot that can only sweep dust off the floor cannot fully meet the demands of larger venues and more complex environments (such as road surfaces), which limits its range of application.
Disclosure of Invention
The aim of the invention is an improved control method that adapts to a new environment more quickly through self-learning, guarantees the effectiveness of policy updates, accommodates the requirements of different environments and different cleaning targets more rapidly, and thereby greatly expands the range of application.
The technical scheme of the invention is as follows: a waste recycling robot device based on a deep Q network comprises a sensing system, a control system, an operating system and a driving system, characterized in that:
the sensing system comprises a camera and image acquisition equipment, and is used for sensing the position of an object in front of the robot and expressing that position as image information;
the control system is used for controlling the robot's grasping arm to grab an object, place it in the receiving mechanism, and control the rotation angle of the rotating mechanism;
the operating system comprises the robot's grasping arm, the rotating mechanism and the receiving mechanism, and receives information from the control system and executes the various actions;
the driving system comprises a motor and a storage battery, and provides the power for the operating system to execute the actions commanded by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system computes on the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions.
The other technical scheme of the invention is as follows: a control method of a waste recycling robot device based on a deep Q network, comprising the following steps:
step one: acquiring environmental information, including visual environmental information and non-visual information, through the sensing system;
step two: initializing neural network parameters, including environment state information and reward information, according to the environmental information acquired in step one, and initializing the parameters of the reinforcement learning algorithm;
step three: processing the image information fed back by the surrounding environment: converting it into a grayscale image through digital processing, performing feature extraction and training with a deep convolutional network to convert the high-dimensional visual environment information into low-dimensional feature information, and taking the low-dimensional feature information together with the non-visual information as the input state s_t of the current value network and the target value network;
step four: controlling the action of the robot through the output of the current value network; in state s_t, the action a_t is computed from the current value network using the action value function Q(s, a) of the reinforcement learning algorithm; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained;
step five: updating the current value network parameters and the target value network parameters, using random mini-batch gradient descent;
the current value network loss function is computed as L_i(θ_i) = E[(r + γ·max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i))^2], where Q(s', a'; θ_i^-) is the state-action value in the next state, Q(s, a; θ_i) is the state-action value in the current state, γ is the discount factor of the return function (0 ≤ γ ≤ 1), E[·] denotes the expectation in the gradient-descent loss function, r is the immediate reward value, and θ represents the network parameters (a code sketch of this update is given after step six below);
the target value network is obtained by copying the current value network after every N thousand executed steps;
step six: checking whether the learning termination condition is met; if not, returning to step four to continue the loop, otherwise ending; the learning termination condition is that the article falls off or the set number of steps has been completed.
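A minimal sketch of the update in step five is given below, under the assumption of PyTorch; `q_net`, `target_net` and the batch layout are illustrative names rather than part of the patent, and the loss is the standard deep-Q-network form written above.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor of the return function (0 <= gamma <= 1)

def dqn_loss(q_net, target_net, batch):
    """Squared TD-error loss L_i(theta_i) for one sampled mini-batch.

    `batch` holds tensors (s, a, r, s_next) drawn from the experience pool E;
    `a` is a LongTensor of action indices; `q_net` (current value network) and
    `target_net` (target value network) map states to one Q-value per action.
    """
    s, a, r, s_next = batch
    # Q(s, a; theta_i): value of the action actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # max_a' Q(s', a'; theta_i^-): bootstrap value from the frozen target network
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
    target = r + GAMMA * q_next      # terminal-state masking omitted for brevity
    return F.mse_loss(q_sa, target)  # batch mean plays the role of E[.]
```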
In the above technical scheme, in step four an experience pool E is set up to store the state information and reward information fed back by the environment after the robot interacts with it. Specifically: an action is selected and executed according to the action value function Q(s, a); the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s' reached are saved as a tuple in the experience pool E; steps three to five are repeated for fifty thousand steps and the results are stored in the experience pool E; the parameters used to update the current value network and the target value network in step five are sampled from the experience pool E.
In the above technical scheme, the samples drawn from the experience pool E in step five are selected preferentially according to their priority level. The priority level is set as follows: every time content is stored in the experience pool E, the priority of the samples is updated by an update formula
in which t is the number of times the sample has been selected, β is the degree to which the priority is used, and p_i is the probability that the i-th sample is selected; after the sample priority is computed, the probabilities are normalized so that they sum to one.
In this technical scheme, the current value network consists of three convolutional layers and a fully connected layer, with relu as the activation function. It processes the image information produced by the sensing system: the convolutional neural network extracts image features, outputs the action value function Q(s, a) through the relu activation, and an action is selected from Q(s, a) by a greedy strategy.
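A sketch of such a value network is shown below, assuming PyTorch; the kernel sizes, channel counts and the 84×84 four-frame input are assumptions that the patent does not specify.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Three convolutional layers plus a fully connected head, relu activations."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(              # kernel sizes/strides are assumptions
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # assumes an 84x84 input image
            nn.Linear(512, n_actions),              # one Q(s, a) value per action
        )

    def forward(self, s):
        return self.head(self.features(s))
```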
In a further technical scheme, the action of the robot is controlled by the output of the current value network as follows: several samples are randomly drawn from the experience pool E; their states s are used as the input of the first hidden layer of the current value network; the current value network outputs the action value function Q(s, a); the action a_t to be taken is selected according to this action value function; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained; and the parameters of the current value network are adjusted through the current value network loss function.
In the above technical scheme, in the step three:
the state S is represented as: the environmental state sensed by the sensing system is position information of an article in the current view of the robot and is presented in an image mode;
action a is represented as: the operation set which can be executed in the current state comprises the angle and direction operation of grabbing articles by the robot;
the immediate reward r is: evaluating the action taken by the robot in the current state, and giving a reward of +1 if the object does not fall off after the robot grabs the object; if the item is successfully placed in the receiving mechanism, a prize is given +1000, if the item is dropped, a prize is given-1000, otherwise a prize is 0.
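The reward definition above can be written directly as a small function; the boolean event flags are hypothetical names introduced only for this illustration.

```python
def immediate_reward(grasped: bool, placed: bool, dropped: bool) -> float:
    """Reward scheme from the text: +1 for holding the item without dropping it,
    +1000 for placing it in the receiving mechanism, -1000 for dropping it, else 0."""
    if dropped:
        return -1000.0
    if placed:
        return 1000.0
    if grasped:
        return 1.0
    return 0.0
```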
The invention has the following advantages:
1. Using the sensing system, the robot's grasping strategy for different articles is computed by the reinforcement learning method from the interaction between the robot and articles of various sizes, so that articles of different shapes can be grasped and smoothly placed in the receiving mechanism, realizing the recycling of many kinds of waste;
2. The robot's control system adopts a prioritized deep reinforcement learning control method: the environmental information acquired by the sensing system is processed, a suitable action is selected, and the control signal is transmitted to the operating and driving systems, so that the robot grasps articles of different shapes more accurately;
3. The robot can be trained on articles of different shapes, which greatly improves its applicability; after sufficient training it can be widely applied in scenarios such as road surfaces and commercial complexes;
4. The method can effectively handle control problems with a continuous action space;
5. Loss of articles during training and application can be effectively avoided, and the training process is accelerated;
6. The convolutional neural network effectively extracts image features, so that the system can better find suitable actions.
Drawings
The invention is further described with reference to the following figures and examples:
fig. 1 is a block diagram of an information transfer structure of a robot apparatus according to an embodiment of the present invention;
FIG. 2 is a block diagram of a prioritized deep reinforcement learning controller according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a deep Q network structure according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples:
Example: referring to figs. 1-3, a waste recycling robot device based on a deep Q network includes a sensing system, a control system, an operating system and a driving system, wherein:
the sensing system comprises a camera and image acquisition equipment, senses the position of an object in front of the robot, and expresses that position as image information;
the control system controls the robot's grasping arm to grab an object, place it in the receiving mechanism, and control the rotation angle of the rotating mechanism;
the operating system comprises the robot's grasping arm, the rotating mechanism and the receiving mechanism, receives information from the control system and executes the various actions;
the driving system comprises a motor and a storage battery, and provides the power for the operating system to execute the actions commanded by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system computes on the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions, thereby completing the grabbing and releasing of articles of various sizes.
The specific implementation process is as follows:
In this embodiment, the overall control framework of the control system is a deep Q network (DQN) from deep reinforcement learning, controlled with the Q-Learning algorithm from the field of reinforcement learning. Assume that at each time step t = 1, 2, …, the robot's sensor system observes the state s_t of a Markov decision process, the control system selects an action a_t, obtains the immediate reward r_t fed back by the environment, and the system transitions to the next state s_{t+1} with transition probability p(s_t, a_t, s_{t+1}). The goal of the agent in the reinforcement learning system is to learn a policy π that maximizes the cumulative discounted reward obtained over future time steps, R_t = Σ_{k≥0} γ^k r_{t+k}, where 0 ≤ γ ≤ 1 is the discount factor; such a policy is the optimal policy. In a real environment, however, the state transition probability function p and the reward function r of the environment are unknown; only the immediate reward r_t is available to the agent for learning the optimal policy, so the loss function can be optimized directly with a policy-gradient-style method. The current value network selects actions, the loss is computed from the TD (temporal difference) error, the current value network parameters are updated by stochastic gradient descent, and the optimal policy is sought. The control structure is shown in fig. 2.
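For illustration, the cumulative discounted reward R_t that the policy maximizes can be evaluated over a finished episode as in the sketch below (not part of the patent text).

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward R_t = sum_k gamma^k * r_{t+k} for one episode."""
    ret = 0.0
    for r in reversed(rewards):   # fold from the last step backwards
        ret = r + gamma * ret
    return ret

# e.g. discounted_return([0, 1, 1000]) gives the return of a short grasp-and-place episode
```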
In different environments, the network structure of the control system is the same and the same set of algorithm parameters is used. The discount factor γ of the return function is 0.99; a 3-layer convolutional neural network extracts features from the image information collected by the sensing system, and its network parameters are fixed; the value network and the policy network consist of 3 hidden layers and one output layer. In each experiment the environment of the robot starts from a random initial state, the robot begins learning from that state, and if control fails the robot learns again until it can successfully grab the article.
Step 1: acquire the environmental information of the robot.
The robot's sensor system collects information through a camera and various image acquisition devices. The robot obtains image information of its surroundings through these sensors, and the robot's behavior is controlled on the basis of the sensed information.
Step 2: acquire the robot's initial environment state information, reward information and so on, and initialize the parameters of the algorithm.
Initialize the neural network parameters and the reinforcement learning algorithm parameters in the control system; the neural network parameters include the weights and biases of the feedforward network.
Step 3: process the visual information fed back by the environment.
The state of the robot is sensed through the sensing system. The image information is converted into a grayscale image through digital processing, and the high-dimensional visual environment information is converted into low-dimensional feature information. The low-dimensional feature information and the non-visual information perceived by the sensors serve as the input state s_t of the policy network and the value network.
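A sketch of this preprocessing is given below; OpenCV and the 84×84 output size are assumptions, since the patent names neither a library nor a resolution.

```python
import cv2
import numpy as np

def preprocess(frame_bgr, size=(84, 84)):
    """Convert a raw camera frame to a normalized grayscale image of fixed size."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)    # digital processing to gray
    small = cv2.resize(gray, size, interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0               # scale pixel values to [0, 1]
```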
Step 4: fill the experience pool.
After the robot interacts with the environment, the state information, reward information and so on fed back by the environment are obtained. The high-dimensional visual information fed back by the environment is processed as in step 3; this operation is repeated four times and the resulting output is used as the current value network input. An action is selected and executed according to the action value function; the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s' reached are saved as a tuple in the experience pool E; step 4 is repeated for fifty thousand steps. After each step the priority of the experience sample is updated by an update formula
in which t is the number of times the sample has been selected, β is the degree to which the priority is used, and p_i is the probability that the i-th sample is chosen. After the sample priority is computed, the probabilities are normalized so that they sum to one (the sketch below illustrates the mechanics).
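The exact priority-update formula appears only as an image in the original and is not reproduced here; the following sketch illustrates only the surrounding mechanics (storing (s, a, r, s') tuples, keeping one priority per sample, normalizing priorities into selection probabilities, and sampling) under the assumption of a simple proportional scheme.

```python
import random

class ExperiencePool:
    """Experience pool E holding (s, a, r, s_next) tuples with sampling priorities."""

    def __init__(self):
        self.samples = []     # stored transitions
        self.priorities = []  # one priority value per transition

    def add(self, s, a, r, s_next, priority=1.0):
        self.samples.append((s, a, r, s_next))
        self.priorities.append(priority)

    def sample(self, batch_size):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]   # normalization step
        idx = random.choices(range(len(self.samples)), weights=probs, k=batch_size)
        return [self.samples[i] for i in idx]
```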
and 5: and controlling the action of the robot by the current value network.
Randomly extracting four samples from an experience pool E, taking the state s of the four samples as the input of a first hidden layer of a current value network, outputting an action value function Q (s, a) by the current value network, and selecting an action a to be taken according to the action value functiontThe robot performing action atThereafter, a new environmental state s is obtainedt+1And immediate reward rt. And passes an error function (current value network loss function L)i(θi) Adjust parameters of the current value network.
The current value network consists of three convolutional layers and a fully connected layer; the activation function is relu. It processes the image information produced by the sensing system: after the convolutional neural network extracts image features, the action value function is output through the activation function, and an action is selected from the action value function by a greedy strategy.
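A sketch of this action selection is given below; the patent states a greedy strategy, and the optional ε term for random exploration is an added assumption.

```python
import random
import torch

def select_action(q_net, state, n_actions, epsilon=0.0):
    """Pick the action with the largest Q(s, a); with probability epsilon act randomly
    (epsilon is an assumption -- the patent only states a greedy strategy)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))   # add a batch dimension
    return int(q_values.argmax(dim=1).item())
```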
Step 6: the current state s, action a, the immediate reward r obtained by performing the action, and the next state s' reached are saved as a tuple in the experience pool E.
Step 7: update the current value network parameters and the target value network parameters of the control system.
The robot continuously interacts with the environment in the manner of step 4, and a batch of samples is sampled to update the current value network and the target value network. The specific updating method is as follows:
The current value network loss function L_i(θ_i) is computed as L_i(θ_i) = E[(r + γ·max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i))^2], where Q(s', a'; θ_i^-) is the state-action value in the next state and Q(s, a; θ_i) is the state-action value in the current state. The method uses the Q-Learning algorithm of reinforcement learning and updates the current value network parameters with the RMSProp gradient descent method (momentum parameter set to 0.95).
The target value network is copied from the current value network after every ten thousand steps.
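A sketch of this update loop is given below, assuming PyTorch; `QNetwork`, `pool` and `dqn_loss` are the illustrative objects from the earlier sketches, `collate` is a hypothetical helper that stacks sampled tuples into tensors, and only the 0.95 momentum and the ten-thousand-step copy interval come from the text.

```python
import torch

N_ACTIONS = 8   # assumed size of the discrete action set (grasp angles/directions)
q_net, target_net = QNetwork(N_ACTIONS), QNetwork(N_ACTIONS)   # QNetwork from the earlier sketch
target_net.load_state_dict(q_net.state_dict())                 # start from identical parameters

# learning rate is an assumption; the 0.95 momentum value comes from the text
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4, momentum=0.95)

def train_step(step, pool, batch_size=4, copy_every=10_000):
    batch = collate(pool.sample(batch_size))    # collate(): hypothetical helper stacking tuples into tensors
    loss = dqn_loss(q_net, target_net, batch)   # squared TD-error loss from the sketch above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # RMSProp gradient step on theta_i
    if step % copy_every == 0:                  # copy the current value net into the target net
        target_net.load_state_dict(q_net.state_dict())
```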
Step 8: check the control result.
Check whether the learning termination condition is met; if not, return to step 5 and continue the loop; otherwise the algorithm ends.
In a real environment, the robot's initial state is initialized to the environmental state with the object at a random position in front of the robot. At each step, the control system decides which action the robot should take by processing the collected environmental state and feedback information, and updates the current value network and the target value network with these data until the robot reaches a termination state, after which it learns again. The robot executes 100 episodes in the environment (each episode limited to a finite length), and learning is judged successful if the robot can successfully grab the article.
The states, actions, and immediate rewards in this embodiment are defined as follows:
State: the environmental state sensed by the sensing system, i.e. the position of the object in the robot's current field of view, presented as an image.
Action: the set of operations that can be executed in the current state; in this example the action comprises the angle and direction with which the robot grabs the article.
Immediate reward: the environment's evaluation of the action taken by the robot in the current state. The reward function is defined as: if the object does not fall after being grabbed, a reward of +1 is given; if the item is successfully placed in the designated location, a reward of +1000 is given; if the item is dropped, a reward of -1000 is given; otherwise the reward is 0.
During simulation, after fifty million simulation steps, whenever an episode ends a new initial environment state is chosen at random and the simulated steps continue to accumulate. The average reward of the most recent 100 episodes is used as the criterion for judging whether the robot control is successful.
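The success criterion, the average reward over the most recent 100 episodes, can be tracked with a simple rolling buffer, for example:

```python
from collections import deque

recent_returns = deque(maxlen=100)   # keeps only the last 100 episode returns

def record_episode(episode_return):
    recent_returns.append(episode_return)
    return sum(recent_returns) / len(recent_returns)   # rolling average used as the criterion
```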
Claims (5)
1. A waste recycling robot control method based on a deep Q network, comprising the following steps:
step one: acquiring environmental information, including visual environmental information and non-visual information, through a sensing system;
step two: initializing neural network parameters, including environment state information and reward information, according to the environmental information acquired in step one, and initializing the parameters of a reinforcement learning algorithm;
step three: processing the image information fed back by the surrounding environment: converting it into a grayscale image through digital processing, performing feature extraction and training with a deep convolutional network to convert the high-dimensional visual environment information into low-dimensional feature information, and taking the low-dimensional feature information together with the non-visual information as the input state s_t of a current value network and a target value network;
step four: controlling the action of the robot through the output of the current value network; in state s_t, the action a_t is computed from the current value network using the action value function Q(s, a) of the reinforcement learning algorithm; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained;
step five: updating the current value network parameters and the target value network parameters, using random mini-batch gradient descent;
the current value network loss function is computed as L_i(θ_i) = E[(r + γ·max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i))^2], where Q(s', a'; θ_i^-) represents the state-action value in the next state, Q(s, a; θ_i) is the state-action value in the current state, γ is the discount factor of the return function with 0 ≤ γ ≤ 1, E[·] denotes the expectation in the gradient-descent loss function, r is the immediate reward value, and θ represents the network parameters;
the target value network is obtained by copying the current value network after every N thousand executed steps;
step six: checking whether the learning termination condition is met; if not, returning to step four to continue the loop, otherwise ending; the learning termination condition is that the article falls off or the set number of steps has been completed;
in step four, an experience pool E is set up to store the state information and reward information fed back by the environment after the robot interacts with it, specifically: an action is selected and executed according to the action value function Q(s, a); the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s' reached are saved as a tuple in the experience pool E; steps three to five are repeated for fifty thousand steps and the results are stored in the experience pool E; the parameters used to update the current value network and the target value network in step five are sampled from the experience pool E;
the samples drawn from the experience pool E in step five are selected preferentially according to their priority level, the priority level being set as follows: every time content is stored in the experience pool E, the priority of the samples is updated according to the priority update formula.
2. The control method according to claim 1, characterized in that the robot comprises a sensing system, a control system, an operating system and a driving system, wherein:
the sensing system comprises a camera and image acquisition equipment, and is used for sensing the position of an object in front of the robot and expressing that position as image information;
the control system is used for controlling the robot's grasping arm to grab an object, place it in the receiving mechanism, and control the rotation angle of the rotating mechanism;
the operating system comprises the robot's grasping arm, the rotating mechanism and the receiving mechanism, and is used for receiving information from the control system and executing the various actions;
the driving system comprises a motor and a storage battery, and is used for providing the power for the operating system to execute the actions commanded by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system computes on the received information and sends commands to the operating and driving systems, which drive the robot to execute the corresponding actions.
3. The control method according to claim 1, characterized in that: the current value network consists of three convolutional layers and a fully connected layer, with relu as the activation function; it is used for processing the image information produced by the sensing system, wherein the convolutional neural network extracts image features, outputs the action value function Q(s, a) through the relu activation, and an action is selected from Q(s, a) by a greedy strategy.
4. The control method according to claim 3, characterized in that the action of the robot is controlled by the output of the current value network as follows: several samples are randomly drawn from the experience pool E; their states s are used as the input of the first hidden layer of the current value network; the current value network outputs the action value function Q(s, a); the action a_t to be taken is selected according to this action value function; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained; and the parameters of the current value network are adjusted through the current value network loss function.
5. The control method according to claim 1, characterized in that in step three:
the state s is the environmental state sensed by the sensing system, namely the position of the article in the robot's current field of view, presented as an image;
the action a is the set of operations that can be executed in the current state, including the angle and direction with which the robot grabs the article;
the immediate reward r evaluates the action taken by the robot in the current state: if the article does not fall after the robot grabs it, a reward of +1 is given; if the article is successfully placed in the receiving mechanism, a reward of +1000 is given; if the article is dropped, a reward of -1000 is given; otherwise the reward is 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810199112.2A CN108415254B (en) | 2018-03-12 | 2018-03-12 | Waste recycling robot control method based on deep Q network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810199112.2A CN108415254B (en) | 2018-03-12 | 2018-03-12 | Waste recycling robot control method based on deep Q network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108415254A CN108415254A (en) | 2018-08-17 |
CN108415254B true CN108415254B (en) | 2020-12-11 |
Family
ID=63131025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810199112.2A Active CN108415254B (en) | 2018-03-12 | 2018-03-12 | Waste recycling robot control method based on deep Q network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108415254B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109634719A (en) * | 2018-12-13 | 2019-04-16 | 国网上海市电力公司 | A kind of dispatching method of virtual machine, device and electronic equipment |
CN109693239A (en) * | 2018-12-29 | 2019-04-30 | 深圳市越疆科技有限公司 | A kind of robot grasping means based on deeply study |
WO2020154542A1 (en) * | 2019-01-23 | 2020-07-30 | Google Llc | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning |
US11440183B2 (en) * | 2019-03-27 | 2022-09-13 | Abb Schweiz Ag | Hybrid machine learning-based systems and methods for training an object picking robot with real and simulated performance data |
CN110238840B (en) * | 2019-04-24 | 2021-01-29 | 中山大学 | Mechanical arm autonomous grabbing method based on vision |
CN111251294A (en) * | 2020-01-14 | 2020-06-09 | 北京航空航天大学 | Robot grabbing method based on visual pose perception and deep reinforcement learning |
CN111152227A (en) * | 2020-01-19 | 2020-05-15 | 聊城鑫泰机床有限公司 | Mechanical arm control method based on guided DQN control |
CN112327821A (en) * | 2020-07-08 | 2021-02-05 | 东莞市均谊视觉科技有限公司 | Intelligent cleaning robot path planning method based on deep reinforcement learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN204287968U (en) * | 2014-12-03 | 2015-04-22 | 韩烁 | The scavenge dolly of Based Intelligent Control |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN105740644A (en) * | 2016-03-24 | 2016-07-06 | 苏州大学 | Cleaning robot optimal target path planning method based on model learning |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN106113059A (en) * | 2016-08-12 | 2016-11-16 | 桂林电子科技大学 | A kind of waste recovery robot |
CN205852816U (en) * | 2016-08-12 | 2017-01-04 | 桂林电子科技大学 | A kind of waste recovery robot |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1894150A2 (en) * | 2005-05-07 | 2008-03-05 | Stephen L. Thaler | Device for the autonomous bootstrapping of useful information |
- 2018-03-12: application CN201810199112.2A filed in China; granted as CN108415254B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN204287968U (en) * | 2014-12-03 | 2015-04-22 | 韩烁 | The scavenge dolly of Based Intelligent Control |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN105740644A (en) * | 2016-03-24 | 2016-07-06 | 苏州大学 | Cleaning robot optimal target path planning method based on model learning |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN106113059A (en) * | 2016-08-12 | 2016-11-16 | 桂林电子科技大学 | A kind of waste recovery robot |
CN205852816U (en) * | 2016-08-12 | 2017-01-04 | 桂林电子科技大学 | A kind of waste recovery robot |
Non-Patent Citations (1)
Title |
---|
Human-level control through deep reinforcement learning; Volodymyr Mnih et al.; Nature; 2015-02-26; Vol. 518; pp. 529-541 *
Also Published As
Publication number | Publication date |
---|---|
CN108415254A (en) | 2018-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108415254B (en) | Waste recycling robot control method based on deep Q network | |
US20220212342A1 (en) | Predictive robotic controller apparatus and methods | |
Wu et al. | Daydreamer: World models for physical robot learning | |
US11062617B2 (en) | Training system for autonomous driving control policy | |
US20200316773A1 (en) | Adaptive predictor apparatus and methods | |
US9384443B2 (en) | Robotic training apparatus and methods | |
CN112362066B (en) | Path planning method based on improved deep reinforcement learning | |
US20180260685A1 (en) | Hierarchical robotic controller apparatus and methods | |
Kiatos et al. | Robust object grasping in clutter via singulation | |
WO2020154542A1 (en) | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning | |
CN112313043B (en) | Self-supervising robot object interactions | |
CN111260027B (en) | Intelligent agent automatic decision-making method based on reinforcement learning | |
CN113826051A (en) | Generating digital twins of interactions between solid system parts | |
CN108594804B (en) | Automatic driving control method for distribution trolley based on deep Q network | |
CN111300390A (en) | Intelligent mechanical arm control system based on reservoir sampling and double-channel inspection pool | |
Zhao et al. | Meld: Meta-reinforcement learning from images via latent state models | |
CN112884131A (en) | Deep reinforcement learning strategy optimization defense method and device based on simulation learning | |
CN115917564A (en) | System and method for learning reusable options to transfer knowledge between tasks | |
CN110400345A (en) | Radioactive waste based on deeply study, which pushes away, grabs collaboration method for sorting | |
JP7448683B2 (en) | Learning options for action selection using meta-gradient in multi-task reinforcement learning | |
CN111783994A (en) | Training method and device for reinforcement learning | |
CN115812180A (en) | Robot-controlled offline learning using reward prediction model | |
WO2014201422A2 (en) | Apparatus and methods for hierarchical robotic control and robotic training | |
US20220410380A1 (en) | Learning robotic skills with imitation and reinforcement at scale | |
CN116360435A (en) | Training method and system for multi-agent collaborative strategy based on plot memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |