CN108415254B - Waste recycling robot control method based on deep Q network
- Publication number: CN108415254B (application CN201810199112.2A)
- Authority: CN (China)
- Prior art keywords: information, robot, action, network, state
- Prior art date: 2018-03-12
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Abstract
The invention discloses a waste recycling robot control method and device based on a deep Q network. The sensing system senses the position information of an object in front of the robot and expresses it as image information. The control system controls the grabbing arm of the robot to grab the object and place it in the receiving mechanism. The operating system receives the information of the control system and executes the various actions. The driving system provides the power for the operating system to execute the actions issued by the control system. The sensing system collects environmental information and driving-system information and transmits them to the control system; the control system calculates and processes the received information and sends commands to the operating and driving systems, driving the robot to execute the corresponding actions. The invention applies a reinforcement learning algorithm from the field of artificial intelligence and can autonomously learn and update the parameters of the neural network, so that the robot achieves the desired article-recycling control behaviour.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and control, and particularly relates to a waste recycling robot control method based on a deep Q network which can learn by itself and accomplish the robot's grabbing control of articles.
Background
In recent years, artificial intelligence has been applied more and more widely in family life, giving rise to the concept of the smart home. The sweeping robot is a small, automatically controlled robot with artificial intelligence used for household cleaning. Sweeping robots are now well established in the market; they free people from part of their chores and have been well received.
However, current sweeping robots mainly target dust on the ground and clean only by vacuuming, so they are suited only to single-household cleaning of floor environments. Most of them cannot handle larger wastes such as discarded bottles and cans and can only mark such wastes as obstacles and bypass them.
Obviously, a sweeping robot that can only sweep dust off the floor cannot fully meet the requirements of larger venues and more complex environments (such as road surfaces), which limits its range of application.
Disclosure of Invention
The invention aims to provide an improved control method which, through self-learning, can adapt to a new environment more quickly, guarantees the effectiveness of strategy updating, meets the requirements of different environments and different cleaning objects faster, and greatly expands the application range.
The technical solution of the invention is as follows: a waste recycling robot device based on a deep Q network comprises a sensing system, a control system, an operating system and a driving system, and is characterized in that:
the sensing system comprises a camera and image acquisition equipment and is used for sensing the position information of the object in front of the robot and expressing that position information as image information;
the control system is used for controlling the grabbing arm of the robot to grab the object, place it in the receiving mechanism, and control the rotation angle of the rotating mechanism;
the operating system comprises the robot grabbing arm, the rotating mechanism and the receiving mechanism, which receive information from the control system and execute the various actions;
the driving system comprises a motor and a storage battery and is used for providing the power for the operating system to execute the actions commanded by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system calculates and processes the received information and sends commands to the operating and driving systems to drive the robot to execute the corresponding actions.
The other technical solution of the invention is as follows: a control method of a waste recycling robot device based on a deep Q network comprises the following steps:
Step one: acquiring environmental information, including visual environmental information and non-visual information, through the sensing system;
Step two: initializing the neural network parameters, including the environment state information and reward information, according to the environmental information acquired in step one, and initializing the parameters of the reinforcement learning algorithm;
Step three: processing the image information fed back by the surrounding environment: the image is converted into a gray image by digital processing, a deep convolutional network performs feature extraction and training, the high-dimensional environmental visual information is converted into low-dimensional feature information, and the low-dimensional feature information together with the non-visual information is taken as the input state s_t of the current value network and the target value network;
Step four: controlling the action of the robot by the output of the current value network; in state s_t, the action a_t is calculated from the current value network using the action value function Q(s, a) of the reinforcement learning algorithm; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained;
Step five: updating the current value network parameters and the target value network parameters by random mini-batch gradient descent;
the loss function of the current value network is calculated as L_i(θ_i) = E[(r + γ·max_{a'} Q(s', a'; θ_i⁻) − Q(s, a; θ_i))²], where Q(s', a'; θ_i⁻) is the state-action value in the next state, Q(s, a; θ_i) is the state-action value in the current state, γ is the discount factor of the return function (0 ≤ γ ≤ 1), E[·] denotes the expectation of the loss in the gradient descent algorithm, r is the immediate reward value, and θ denotes the network parameters;
the target value network is obtained by copying the current value network after every N thousand steps of execution;
Step six: checking whether the learning termination condition is met; if not, returning to step four to continue the loop, otherwise ending; the learning termination condition is that the article falls off or the set number of steps has been completed.
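For illustration only, a minimal, self-contained sketch of how steps one to six fit together is given below on a toy problem. Every name in it (toy_env_step, N_TARGET_SYNC, the reward rule, the linear Q-function) is an assumption introduced for the example; it is not the claimed method or its parameters.

```python
# Toy sketch of the six-step loop: random 4-dimensional states, 3 discrete
# actions, a linear Q-function, a plain replay list and a target network copy.
import random

import numpy as np

STATE_DIM, N_ACTIONS = 4, 3
GAMMA = 0.99          # discount factor of the return function
EPSILON = 0.1         # exploration rate of the greedy action selection (assumed)
LR = 0.01
N_TARGET_SYNC = 1000  # copy the current value network into the target network

def toy_env_step(state, action):
    """Stand-in for the real environment: random next state, simple reward."""
    next_state = np.random.randn(STATE_DIM)
    reward = 1.0 if action == int(np.argmax(state)) % N_ACTIONS else 0.0
    done = random.random() < 0.05          # end of an episode / article dropped
    return next_state, reward, done

def q_values(weights, state):
    return weights @ state                 # linear Q(s, a) for every action a

current_w = np.zeros((N_ACTIONS, STATE_DIM))   # current value network
target_w = current_w.copy()                    # target value network
replay = []                                    # experience pool E

state = np.random.randn(STATE_DIM)
for step in range(1, 5001):
    # step four: choose the action from the current value network (epsilon-greedy)
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)
    else:
        action = int(np.argmax(q_values(current_w, state)))
    next_state, reward, done = toy_env_step(state, action)
    replay.append((state, action, reward, next_state, done))

    # step five: random mini-batch update of the current value network
    if len(replay) >= 32:
        for s, a, r, s2, d in random.sample(replay, 32):
            target = r if d else r + GAMMA * np.max(q_values(target_w, s2))
            td_error = target - q_values(current_w, s)[a]
            current_w[a] += LR * td_error * s
    if step % N_TARGET_SYNC == 0:
        target_w = current_w.copy()            # refresh the target value network

    # step six: check the termination condition and restart the episode if needed
    state = np.random.randn(STATE_DIM) if done else next_state
```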
In the above technical solution, in step four, an experience pool E is set; after the robot interacts with the environment, the experience pool E receives the state information and reward information fed back by the environment, specifically: an action is selected and executed according to the action value function Q(s, a); the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s' reached are saved as one tuple in the experience pool E; steps three to five are repeated for fifty thousand steps and the results are stored in the experience pool E. The parameters used to update the current value network and the target value network in step five must be sampled from the experience pool E.
In the above technical solution, the samples drawn from the experience pool E in step five are selected preferentially according to their priority levels. The priority is set as follows: each time new content is stored in the experience pool E, the priority level of the sample is updated, where t is the number of times the sample has been selected, β is the degree of influence of the priority, and p_i is the probability that the i-th sample is selected; after the priority of a sample is calculated, its selection probability is normalized.
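The priority-update and normalization formulas themselves appear only as figures in the source. As an illustration, the sketch below assumes the common proportional prioritized-replay scheme P(i) = p_i^β / Σ_k p_k^β (in the spirit of Schaul et al.), with β playing the role of the degree of influence of the priority; the class name and all details are assumptions, not the patented formula.

```python
# Sketch of a prioritized experience pool E under the stated assumption.
import random
from dataclasses import dataclass, field

@dataclass
class PrioritizedExperiencePool:
    capacity: int = 50_000
    beta: float = 0.6                     # degree of influence of the priority
    buffer: list = field(default_factory=list)
    priorities: list = field(default_factory=list)

    def store(self, state, action, reward, next_state):
        """Save (s, a, r, s') as one tuple; a new sample gets the current
        maximum priority so it is sampled at least once."""
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append((state, action, reward, next_state))
        self.priorities.append(max(self.priorities, default=1.0))

    def sample(self, batch_size):
        """Draw indices according to the normalized probabilities P(i)."""
        scaled = [p ** self.beta for p in self.priorities]
        total = sum(scaled)
        probs = [p / total for p in scaled]            # normalization step
        idxs = random.choices(range(len(self.buffer)), weights=probs, k=batch_size)
        return idxs, [self.buffer[i] for i in idxs]

    def update_priorities(self, idxs, td_errors):
        """Refresh the priorities of the sampled tuples, e.g. from the TD error."""
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + 1e-6
```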
In this technical solution, the current value network consists of three convolutional neural network layers and a fully connected layer, and the activation function is the relu function; it is used to process the image information produced by the sensing system. After extracting the image features, the convolutional neural network outputs the action value function Q(s, a) through the relu activation function, and an action is selected by a greedy strategy according to the action value function Q(s, a).
A further technical solution is that the action of the robot is controlled by the output of the current value network as follows: several samples are randomly drawn from the experience pool E, the states s of the samples are taken as the input of the first hidden layer of the current value network, the current value network outputs the action value function Q(s, a), and the action a_t to be taken is selected according to the action value function; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained, and the parameters of the current value network are adjusted through the current value network loss function.
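A sketch of such a current value network and the greedy action selection is shown below. The 84×84 four-frame grayscale input and the kernel sizes and strides follow the usual DQN layout and are assumptions; the patent specifies only three convolutional layers, one fully connected layer and relu activations.

```python
import torch
import torch.nn as nn

class CurrentValueNetwork(nn.Module):
    """Three convolutional layers plus a fully connected head, relu activations."""
    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # Q(s, a) for every action a
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames))

def greedy_action(net: CurrentValueNetwork, state: torch.Tensor) -> int:
    """Select the action with the largest action value Q(s, a)."""
    with torch.no_grad():
        return int(net(state.unsqueeze(0)).argmax(dim=1).item())

# Example: one 4-frame 84x84 grayscale state, eight assumed grasp actions.
net = CurrentValueNetwork(n_actions=8)
print(greedy_action(net, torch.rand(4, 84, 84)))
```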
In the above technical scheme, in the step three:
the state S is represented as: the environmental state sensed by the sensing system is position information of an article in the current view of the robot and is presented in an image mode;
action a is represented as: the operation set which can be executed in the current state comprises the angle and direction operation of grabbing articles by the robot;
the immediate reward r is: evaluating the action taken by the robot in the current state, and giving a reward of +1 if the object does not fall off after the robot grabs the object; if the item is successfully placed in the receiving mechanism, a prize is given +1000, if the item is dropped, a prize is given-1000, otherwise a prize is 0.
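A minimal sketch of this reward function follows, under the assumption that the sensing system reports the grasp outcome as three boolean flags (the flags are introduced only for the example).

```python
def immediate_reward(holding: bool, placed: bool, dropped: bool) -> int:
    """+1 while a grabbed article has not fallen, +1000 when it is placed in
    the receiving mechanism, -1000 when it is dropped, otherwise 0."""
    if placed:
        return 1000
    if dropped:
        return -1000
    if holding:
        return 1
    return 0
```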
The invention has the advantages that:
1. according to the invention, using the sensing system and the interaction between the robot and articles of different sizes, the grabbing strategy of the robot for different articles is obtained through the reinforcement learning calculation, so that the robot can face articles of different shapes, place them smoothly in the receiving mechanism, and thus recycle various kinds of waste;
2. the control system of the robot adopts a prioritized deep reinforcement learning control method: the environmental information collected by the sensing system is processed, a suitable action is then selected, and the control signal of the control system is transmitted to the operating and driving systems, so that the robot grabs articles of different shapes more accurately;
3. the robot can be trained on articles of different shapes, which greatly improves its applicability; after sufficient training it can be widely applied in various scenes such as road surfaces and commercial complexes;
4. the method can effectively process the control problem with continuous action space;
5. the article loss in the training and application process can be effectively avoided, and the training process is accelerated;
6. the convolutional neural network can be used for effectively extracting image features, so that the system can better find out proper actions.
Drawings
The invention is further described with reference to the following figures and examples:
fig. 1 is a block diagram of an information transfer structure of a robot apparatus according to an embodiment of the present invention;
FIG. 2 is a block diagram of a prioritized deep reinforcement learning controller according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a deep Q network structure according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples:
Embodiment: referring to fig. 1-3, a waste recycling robot device based on a deep Q network includes a sensing system, a control system, an operating system and a driving system, characterized in that:
the sensing system comprises a camera and image acquisition equipment and is used for sensing the position information of the object in front of the robot and expressing that position information as image information;
the control system is used for controlling the grabbing arm of the robot to grab the object, place it in the receiving mechanism, and control the rotation angle of the rotating mechanism;
the operating system comprises the robot grabbing arm, the rotating mechanism and the receiving mechanism, and receives information from the control system and executes the various actions;
the driving system comprises a motor and a storage battery and is used for providing the power for the operating system to execute the actions commanded by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system calculates and processes the received information and sends commands to the operating and driving systems to drive the robot to execute the corresponding actions, thereby completing the grabbing and releasing of articles of various sizes.
The specific implementation process is as follows:
In this embodiment, the overall control framework of the control system is the Deep Q Network (DQN) from deep reinforcement learning, and the Q-Learning algorithm from the field of reinforcement learning is adopted for control. Assume that at each time step t = 1, 2, …, the robot's sensing system observes the state s_t of a Markov decision process, the control system selects an action a_t, obtains the immediate reward r_t fed back by the environment, and the system transitions to the next state s_{t+1} with transition probability p(s_t, a_t, s_{t+1}). The goal of the agent in a reinforcement learning system is to learn a policy π that maximizes the cumulative discounted reward obtained over future time steps (0 ≤ γ ≤ 1 is the discount factor); this policy is the optimal policy. In a real environment, however, the state transition probability function p and the reward function r of the environment are unknown. To learn the optimal policy the agent only has the immediate reward r_t available, so the loss function can be optimized directly by gradient methods. The current value network selects the action, the loss is calculated using the TD (temporal difference) error, the current value network parameters are updated by stochastic gradient descent, and the optimal policy is searched for. The control structure is shown in fig. 2.
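Written out, the discounted return that the policy maximizes and the TD target and TD error used by the Q-Learning update take the standard form (the patent states them only in prose):

```latex
% Discounted return maximized by the optimal policy:
R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad 0 \le \gamma \le 1 .
% One-step TD target and TD error with a target network (parameters \theta^{-}):
y_t = r_t + \gamma \max_{a'} Q\!\left(s_{t+1}, a'; \theta^{-}\right), \qquad
\delta_t = y_t - Q\!\left(s_t, a_t; \theta\right).
```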
Under different environments the network structure of the control system is the same, and the same set of parameters is adopted as the algorithm parameters. The discount factor γ of the return function is 0.99; a 3-layer convolutional neural network is adopted to extract the image information collected by the sensing system, and its network parameters are fixed; the value network and the policy network consist of 3 hidden layers and one output layer. In each experiment the environment of the robot starts from a random initial state, the robot begins to learn from that state, and if the control fails the robot learns again until it can successfully grab the article.
Step 1: acquiring the environmental information of the robot.
The sensing system of the robot collects information through a camera and various image acquisition devices. The robot obtains image information of its surroundings through these sensors, and this information is used to control the behaviour of the robot.
Step 2: acquiring initial environment state information, reward information and the like of the robot, and initializing parameters of an algorithm.
The neural network parameters and the reinforcement learning algorithm parameters in the control system are initialized; the neural network parameters comprise the weights and biases of the feedforward network.
Step 3: processing the visual information fed back by the environment.
The state of the robot is sensed through the sensing system. The image information is converted into a gray image by digital processing, and the high-dimensional environmental visual information is converted into low-dimensional feature information. The low-dimensional feature information together with the non-visual information perceived by the sensors serves as the input state s_t of the policy network and the value network.
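A minimal sketch of this preprocessing step is given below; the use of OpenCV and the 84×84 output size are assumptions, since the source only says the image is turned into a gray image by digital processing.

```python
import cv2
import numpy as np

def preprocess_frame(bgr_frame: np.ndarray, size: int = 84) -> np.ndarray:
    """Return a (size, size) uint8 grayscale image from a BGR camera frame."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)

# Example on a dummy 480x640 color frame:
frame = np.zeros((480, 640, 3), dtype=np.uint8)
print(preprocess_frame(frame).shape)   # (84, 84)
```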
Step 4: filling the experience pool.
After the robot interacts with the environment, the state information, reward information and so on fed back by the environment are obtained. The high-dimensional visual information fed back by the environment is processed through step 3 to generate a processed output; this operation is repeated four times and the resulting output is taken as the input of the current value network. An action is selected and executed according to the action value function, and the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s' reached are stored as one tuple in the experience pool E. Step 4 is repeated for fifty thousand steps.
After each step the priority of the experience sample is updated, where t is the number of times the sample has been selected, β is the degree of influence of the priority, and p_i is the probability that the i-th sample is selected; after the sample priority is calculated, it is normalized.
and 5: and controlling the action of the robot by the current value network.
Four samples are randomly drawn from the experience pool E, their states s are taken as the input of the first hidden layer of the current value network, the current value network outputs the action value function Q(s, a), and the action a_t to be taken is selected according to the action value function. After the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained, and the parameters of the current value network are adjusted through the error function (the current value network loss function L_i(θ_i)).
The current value network consists of three layers of convolutional neural networks and a full connection layer, and the activation function is a relu function. For processing the image information processed by the sensing system. After extracting image features, the convolutional neural network outputs an action value function through an activation function, and an action is selected by a Greedy strategy according to the action value function.
Step 6: the current state s, action a, the immediate reward r obtained by performing the action, and the next state s' reached are saved as a tuple in the experience pool E.
Step 7: updating the current value network parameters and the target value network parameters of the control system.
The robot continuously interacts with the environment in the manner of step 4, and a batch of samples is drawn to update the current value network and the target value network. The specific update is as follows:
The current value network loss function L_i(θ_i) is calculated as L_i(θ_i) = E[(r + γ·max_{a'} Q(s', a'; θ_i⁻) − Q(s, a; θ_i))²], where Q(s', a'; θ_i⁻) denotes the state-action value in the next state and Q(s, a; θ_i) the state-action value in the current state. The method uses the Q-Learning algorithm from reinforcement learning, and the current value network parameters are updated with the RMSProp gradient descent method (with the momentum parameter set to 0.95).
The target value network is copied from the current value network after each ten thousand steps.
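A sketch of one such update is given below: the mini-batch loss L_i(θ_i) = E[(r + γ·max_{a'} Q(s', a'; θ_i⁻) − Q(s, a; θ_i))²] followed by an RMSProp step with the 0.95 decay stated above, and the periodic copy into the target network. A small fully connected network stands in for the convolutional current value network; the sizes and learning rate are assumptions.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.99
STATE_DIM, N_ACTIONS = 16, 8   # assumed sizes for the placeholder network

net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = copy.deepcopy(net)                       # target value network
optimizer = torch.optim.RMSprop(net.parameters(), lr=2.5e-4, alpha=0.95)

def update_step(states, actions, rewards, next_states, dones):
    """One gradient step on a sampled mini-batch from the experience pool E."""
    q_sa = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values          # max_a' Q(s', a'; theta^-)
        targets = rewards + GAMMA * q_next * (1.0 - dones)
    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target_network():
    """Copy the current value network into the target value network
    (every ten thousand steps in this embodiment)."""
    target_net.load_state_dict(net.state_dict())

# Example call with random tensors standing in for a sampled mini-batch:
batch = 32
update_step(
    torch.randn(batch, STATE_DIM),
    torch.randint(N_ACTIONS, (batch,)),
    torch.randn(batch),
    torch.randn(batch, STATE_DIM),
    torch.zeros(batch),
)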
Step 8: viewing the control results.
Check whether the learning termination condition is met; if not, return to step 5 and continue the loop; otherwise, the algorithm ends.
In a real environment, the initial state of the robot is initialized to the environmental state given by the position of the object in front of the robot, which is a random position. By processing the collected environmental state and feedback information, the control system decides the action the robot needs to take at each step, and updates the current value network and the target value network with these data until the robot reaches a termination state, after which the robot learns again. The robot executes 100 episodes in the environment (each episode is set to a finite length), and if the robot can successfully grab an article, the learning is judged successful.
The states, actions, and immediate rewards in this embodiment are expressed as follows:
The state: the environmental state sensed by the sensing system, namely the position information of the object in the current field of view of the robot, presented as an image.
The action: the set of operations that can be executed in the current state; in this example the actions are the angle and direction operations for the robot to grab the article.
The immediate reward: the environment's evaluation of the action taken by the robot in the current state. The reward function of the invention is defined as follows: if the object does not fall off after it has been grabbed, a reward of +1 is given; if the article is successfully placed in the designated location, a reward of +1000 is given; if the article is dropped, a reward of -1000 is given; otherwise the reward is 0.
In the simulation, fifty million simulation steps are run; whenever an episode ends, a new random initial environment state is generated and the simulated steps are accumulated. The average reward over the most recent 100 episodes is used as the criterion for judging whether the robot control is successful.
Claims (5)
1. A waste recycling robot control method based on a deep Q network, comprising the following steps:
step one: acquiring environmental information, including visual environmental information and non-visual information, through a sensing system;
step two: initializing neural network parameters, including environment state information and reward information, according to the environmental information acquired in step one, and initializing the parameters of the reinforcement learning algorithm;
step three: processing the image information fed back by the surrounding environment: converting it into a gray image by digital processing, performing feature extraction and training with a deep convolutional network, converting the high-dimensional environmental visual information into low-dimensional feature information, and taking the low-dimensional feature information together with the non-visual information as the input state s_t of the current value network and the target value network;
step four: controlling the action of the robot by the output of the current value network; in state s_t, the action a_t is calculated from the current value network using the action value function Q(s, a) of the reinforcement learning algorithm; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained;
step five: updating the current value network parameters and the target value network parameters by random mini-batch gradient descent;
the loss function of the current value network is calculated as L_i(θ_i) = E[(r + γ·max_{a'} Q(s', a'; θ_i⁻) − Q(s, a; θ_i))²], where Q(s', a'; θ_i⁻) denotes the state-action value in the next state, Q(s, a; θ_i) the state-action value in the current state, γ is the discount factor of the return function (0 ≤ γ ≤ 1), E[·] denotes the expectation of the loss in the gradient descent algorithm, r is the immediate reward value, and θ denotes the network parameters;
the target value network is obtained by copying the current value network after every N thousand steps of execution;
step six: checking whether the learning termination condition is met; if not, returning to step four to continue the loop, otherwise ending; the learning termination condition is that the article falls off or the set number of steps has been completed;
in step four, an experience pool E is set; after the robot interacts with the environment, the experience pool E obtains the state information and reward information fed back by the environment, specifically: an action is selected and executed according to the action value function Q(s, a); the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s' reached are stored as one tuple in the experience pool E; steps three to five are repeated for fifty thousand steps and the results are stored in the experience pool E; the parameters used to update the current value network and the target value network in step five must be sampled from the experience pool E;
the samples drawn from the experience pool E in step five are selected preferentially according to their priority levels, the priority being set as follows: each time new content is stored in the experience pool E, the priority level of the sample is updated.
2. The control method according to claim 1, characterized in that it is carried out by a device comprising a sensing system, a control system, an operating system and a driving system, wherein:
the sensing system comprises a camera and image acquisition equipment and is used for sensing the position information of the object in front of the robot and expressing that position information as image information;
the control system is used for controlling the grabbing arm of the robot to grab the object, place it in the receiving mechanism, and control the rotation angle of the rotating mechanism;
the operating system comprises the robot grabbing arm, the rotating mechanism and the receiving mechanism, which receive information from the control system and execute the various actions;
the driving system comprises a motor and a storage battery and is used for providing the power for the operating system to execute the actions commanded by the control system;
the sensing system collects environmental information and driving-system information and transmits them to the control system; the control system calculates and processes the received information and sends commands to the operating and driving systems to drive the robot to execute the corresponding actions.
3. The control method according to claim 1, characterized in that: the current value network consists of three convolutional neural network layers and a fully connected layer, the activation function being the relu function; it is used to process the image information produced by the sensing system, wherein the convolutional neural network, after extracting the image features, outputs the action value function Q(s, a) through the relu activation function, and an action is selected by a greedy strategy according to the action value function Q(s, a).
4. The control method according to claim 3, characterized in that: the action of the robot is controlled by the output of the current value network as follows: several samples are randomly drawn from the experience pool E, the states s of the samples are taken as the input of the first hidden layer of the current value network, the current value network outputs the action value function Q(s, a), and the action a_t to be taken is selected according to the action value function; after the robot performs action a_t, a new environmental state s_{t+1} and an immediate reward r_t are obtained, and the parameters of the current value network are adjusted through the current value network loss function.
5. The control method according to claim 1, characterized in that, in step three:
the state s is represented as: the environmental state sensed by the sensing system, namely the position information of the article in the current view of the robot, presented as an image;
the action a is represented as: the set of operations that can be executed in the current state, including the angle and direction operations of the robot grabbing the article;
the immediate reward r is: an evaluation of the action taken by the robot in the current state; if the article does not fall off after the robot grabs it, a reward of +1 is given; if the article is successfully placed in the receiving mechanism, a reward of +1000 is given; if the article is dropped, a reward of -1000 is given; otherwise the reward is 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810199112.2A CN108415254B (en) | 2018-03-12 | 2018-03-12 | Waste recycling robot control method based on deep Q network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108415254A CN108415254A (en) | 2018-08-17 |
CN108415254B true CN108415254B (en) | 2020-12-11 |
Family
ID=63131025
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109634719A (en) * | 2018-12-13 | 2019-04-16 | 国网上海市电力公司 | A kind of dispatching method of virtual machine, device and electronic equipment |
CN109693239A (en) * | 2018-12-29 | 2019-04-30 | 深圳市越疆科技有限公司 | A kind of robot grasping means based on deeply study |
WO2020154542A1 (en) * | 2019-01-23 | 2020-07-30 | Google Llc | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning |
US11440183B2 (en) * | 2019-03-27 | 2022-09-13 | Abb Schweiz Ag | Hybrid machine learning-based systems and methods for training an object picking robot with real and simulated performance data |
CN110238840B (en) * | 2019-04-24 | 2021-01-29 | 中山大学 | Mechanical arm autonomous grabbing method based on vision |
CN111251294A (en) * | 2020-01-14 | 2020-06-09 | 北京航空航天大学 | Robot grabbing method based on visual pose perception and deep reinforcement learning |
CN111152227A (en) * | 2020-01-19 | 2020-05-15 | 聊城鑫泰机床有限公司 | Mechanical arm control method based on guided DQN control |
CN112327821A (en) * | 2020-07-08 | 2021-02-05 | 东莞市均谊视觉科技有限公司 | Intelligent cleaning robot path planning method based on deep reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008542859A (en) * | 2005-05-07 | 2008-11-27 | エル ターラー、ステフエン | Device for autonomous bootstrapping of useful information |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN204287968U (en) * | 2014-12-03 | 2015-04-22 | 韩烁 | The scavenge dolly of Based Intelligent Control |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN105740644A (en) * | 2016-03-24 | 2016-07-06 | 苏州大学 | Cleaning robot optimal target path planning method based on model learning |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN106113059A (en) * | 2016-08-12 | 2016-11-16 | 桂林电子科技大学 | A kind of waste recovery robot |
CN205852816U (en) * | 2016-08-12 | 2017-01-04 | 桂林电子科技大学 | A kind of waste recovery robot |
Non-Patent Citations (1)
Title |
---|
Human-level control through deep reinforcement learning; Volodymyr Mnih et al.; Nature; 2015-02-26; vol. 518; pp. 529-541 *
Similar Documents
Publication | Title |
---|---|
CN108415254B (en) | Waste recycling robot control method based on deep Q network | |
US20220212342A1 (en) | Predictive robotic controller apparatus and methods | |
Wu et al. | Daydreamer: World models for physical robot learning | |
US11062617B2 (en) | Training system for autonomous driving control policy | |
US20200316773A1 (en) | Adaptive predictor apparatus and methods | |
US9384443B2 (en) | Robotic training apparatus and methods | |
CN112362066B (en) | Path planning method based on improved deep reinforcement learning | |
US20180260685A1 (en) | Hierarchical robotic controller apparatus and methods | |
Kiatos et al. | Robust object grasping in clutter via singulation | |
EP3914424A1 (en) | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning | |
CN111260027B (en) | Intelligent agent automatic decision-making method based on reinforcement learning | |
CN113826051A (en) | Generating digital twins of interactions between solid system parts | |
CN111300390B (en) | Intelligent mechanical arm control system based on reservoir sampling and double-channel inspection pool | |
CN108594804B (en) | Automatic driving control method for distribution trolley based on deep Q network | |
Zhao et al. | Meld: Meta-reinforcement learning from images via latent state models | |
CN112884131A (en) | Deep reinforcement learning strategy optimization defense method and device based on simulation learning | |
CN115917564A (en) | System and method for learning reusable options to transfer knowledge between tasks | |
CN110400345A (en) | Push-grasp cooperative sorting method for radioactive waste based on deep reinforcement learning |
JP7448683B2 (en) | Learning options for action selection using meta-gradient in multi-task reinforcement learning | |
CN111783994A (en) | Training method and device for reinforcement learning | |
CN115812180A (en) | Robot-controlled offline learning using reward prediction model | |
WO2014201422A2 (en) | Apparatus and methods for hierarchical robotic control and robotic training | |
US20220410380A1 (en) | Learning robotic skills with imitation and reinforcement at scale | |
CN108523768B (en) | Household cleaning robot control system based on self-adaptive strategy optimization | |
WO2022023384A1 (en) | Training an action selection system using relative entropy q-learning |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant