CN108415254B - Control method of waste recycling robot based on deep Q network - Google Patents
Control method of waste recycling robot based on deep Q network
- Publication number
- CN108415254B (granted publication of application CN201810199112.2A / CN201810199112A)
- Authority
- CN
- China
- Prior art keywords
- information
- robot
- action
- state
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 35
- 239000002699 waste material Substances 0.000 title claims abstract description 10
- 238000004064 recycling Methods 0.000 title claims abstract description 9
- 230000009471 action Effects 0.000 claims abstract description 78
- 230000007613 environmental effect Effects 0.000 claims abstract description 30
- 230000007246 mechanism Effects 0.000 claims abstract description 16
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 13
- 230000008569 process Effects 0.000 claims abstract description 13
- 230000002787 reinforcement Effects 0.000 claims abstract description 13
- 238000013528 artificial neural network Methods 0.000 claims abstract description 5
- 230000006870 function Effects 0.000 claims description 41
- 230000000007 visual effect Effects 0.000 claims description 12
- 238000013527 convolutional neural network Methods 0.000 claims description 9
- 230000004913 activation Effects 0.000 claims description 6
- 239000000284 extract Substances 0.000 claims description 5
- 238000012545 processing Methods 0.000 claims description 5
- 238000012549 training Methods 0.000 claims description 5
- 238000004364 calculation method Methods 0.000 claims description 3
- 238000011156 evaluation Methods 0.000 claims description 3
- 238000000605 extraction Methods 0.000 claims description 2
- 238000013473 artificial intelligence Methods 0.000 abstract description 4
- 230000000875 corresponding effect Effects 0.000 abstract 1
- 230000000694 effects Effects 0.000 abstract 1
- 238000010408 sweeping Methods 0.000 description 7
- 238000004140 cleaning Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 239000003795 chemical substances by application Substances 0.000 description 2
- 239000000428 dust Substances 0.000 description 2
- 238000011478 gradient descent method Methods 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 1
- 230000001186 cumulative effect Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Images
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Manipulator (AREA)
- Feedback Control In General (AREA)
Abstract
The present invention discloses a control method and device for a waste recycling robot based on a deep Q network. The sensing system senses the position of objects in front of the robot and represents it as image information; the control system controls the robot's grasping arm to grasp objects and place them in a storage mechanism; the operating system receives information from the control system and performs the various actions; the drive system provides the power the operating system needs to carry out the actions commanded by the control system. The sensing system collects environmental information and drive-system information and passes them to the control system, which computes and processes the received information and sends commands to the operating and drive systems so that the robot performs the corresponding actions. The invention applies a reinforcement learning algorithm from the field of artificial intelligence and can autonomously learn and update the parameters of the neural network, so that the robot achieves the control objective of recycling items.
Description
Technical Field
The invention belongs to the fields of artificial intelligence and control technology, and in particular relates to a control method for a waste recycling robot based on a deep Q network, which is capable of self-learning and enables the robot to grasp items under its own control.
Background Art
In recent years, artificial intelligence has been applied ever more widely in home life, giving rise to the concept of the smart home. The sweeping robot is one example: a small, automatically controlled robot with artificial intelligence that is used to clean the home. Sweeping robots have been adopted widely in the market; their use has partially freed people from household chores and has been well received.
However, current sweeping robots mainly target dust on the floor and can only clean by vacuuming, so they are suitable only for household cleaning in simple floor environments. Faced with larger waste items such as discarded bottles and cans, most sweeping robots are helpless: they can only mark them as obstacles and drive around them.
Clearly, a sweeping robot that can only vacuum floor dust cannot fully meet the needs of larger and more complex environments (such as road surfaces), which limits its range of use.
Summary of the Invention
The purpose of the present invention is to provide a control method and device for a waste recycling robot based on a deep Q network. Through the improved control method and self-learning, the robot can adapt to new environments more quickly, keep its policy updates effective, and better meet the needs of different environments and different cleaning targets, greatly expanding its range of application.
The technical scheme of the present invention is a waste recycling robot device based on a deep Q network, comprising a sensing system, a control system, an operating system and a drive system, characterized in that:
The sensing system comprises a camera and image acquisition equipment, and is used to sense the position of objects in front of the robot, represented as image information;
The control system is used to control the robot's grasping arm to grasp objects and place them in the storage mechanism, and to control the rotation angle of the rotating mechanism;
The operating system comprises the robot's grasping arm, the rotating mechanism and the storage mechanism, and is used to receive information from the control system and perform the various actions;
The drive system comprises a motor and a battery, and is used to provide the power for the operating system to carry out the actions commanded by the control system;
The sensing system collects environmental information and drive-system information and passes them to the control system; the control system computes and processes the received information and sends commands to the operating and drive systems, which drive the robot to perform the corresponding actions.
Another technical solution of the present invention is a control method for the waste recycling robot device based on a deep Q network, the steps of which are:
(1) Obtain environmental information through the sensing system, including visual environmental information and non-visual information;
(2) Based on the environmental information obtained in step (1), initialize the neural network parameters, including the environmental state information and the reward information, and initialize the parameters of the reinforcement learning algorithm;
(3) Process the image information fed back from the surrounding environment: convert it into grayscale images through digital processing, use a deep convolutional network for feature extraction and training, and convert the high-dimensional visual information of the environment into low-dimensional feature information; the low-dimensional feature information, together with the non-visual information, forms the input state st of the current value network and the target value network;
(4) The output of the current value network controls the robot's action: in state st, the action at is obtained from the current value network by computing the action value function Q(s,a) of the reinforcement learning algorithm; after the robot executes action at, it obtains a new environmental state st+1 and an immediate reward rt;
(5) Update the parameters of the current value network and the target value network, using a stochastic mini-batch gradient descent update;
The current value network loss function is computed as Li(θi) = E[(r + γ·maxa′ Q(s′,a′;θi−) − Q(s,a;θi))²], where Q(s′,a′;θi−) is the state-action value in the next state, Q(s,a;θi) is the state-action value in the current state, γ (0 ≤ γ ≤ 1) is the discount factor of the reward function, E[·] denotes the expectation used as the loss in the gradient descent algorithm, r is the immediate reward, and θ denotes the network parameters;
The target value network is copied from the current value network after every N × 10,000 executed steps;
(6) Check whether the learning termination condition is satisfied; if not, return to step (4) and continue the loop, otherwise terminate. The learning termination condition is that the item falls off, or that the set number of steps has been completed.
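Steps (1)–(6) amount to a standard value-based control loop with a periodically copied target network. The following is a minimal, self-contained Python sketch of how the pieces fit together; the toy environment, the tabular stand-in for the value networks, and every name in it are illustrative assumptions rather than the patent's implementation.

```python
# Minimal sketch of the control loop in steps (1)-(6); all names are assumptions.
import random
import numpy as np

class ToyGraspEnv:
    """Stand-in for the sensing/operating systems: 10 object positions, 4 grasp actions."""
    def reset(self):
        self.pos = random.randrange(10)
        return self.pos                          # step (1): the "sensed" state
    def step(self, action):
        done = (action == self.pos % 4)          # pretend the matching grasp angle succeeds
        reward = 1000 if done else -1            # toy reward; the patent's reward scheme is given below
        self.pos = random.randrange(10)
        return self.pos, reward, done

env = ToyGraspEnv()
q_current = np.zeros((10, 4))                    # current value network (tabular stand-in)
q_target = q_current.copy()                      # target value network
gamma, lr, epsilon, N = 0.99, 0.1, 0.1, 1        # step (2): initialize algorithm parameters

state = env.reset()
for step in range(1, 100_001):
    # step (4): choose an action from Q(s, a), epsilon-greedy
    if random.random() < epsilon:
        action = random.randrange(4)
    else:
        action = int(q_current[state].argmax())
    next_state, reward, done = env.step(action)
    # step (5): move Q(s, a) toward r + gamma * max_a' Q_target(s', a')
    td_target = reward + (0.0 if done else gamma * q_target[next_state].max())
    q_current[state, action] += lr * (td_target - q_current[state, action])
    if step % (N * 10_000) == 0:
        q_target = q_current.copy()              # target network copied every N x 10,000 steps
    # step (6): restart from a fresh state when the termination condition is met
    state = env.reset() if done else next_state
```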
In the above technical solution, in step (4), an experience pool E is set up; after the robot interacts with the environment, the experience pool E stores the state information and reward information fed back by the environment. Specifically: an action is selected according to the action value function Q(s,a) and executed, and the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s′ reached are saved in the experience pool E as a tuple; the above procedure is repeated for 30,000 to 50,000 steps, with all tuples stored in the experience pool E. When the parameters of the current value network and the target value network are updated in step (5), samples must be drawn from the experience pool E.
In the above technical solution, the samples drawn from the experience pool E in step (5) are selected preferentially according to their priority level. The priority level is set as follows: whenever content is stored in the experience pool E, the priority of the samples is updated once, using the update formula:
where t is the number of times the sample has been selected, β is the degree to which the priority influences selection, and pi is the probability that the i-th sample is selected; after the sample priority is computed, it is normalized according to the formula:
In the above technical solution, the current value network consists of three convolutional layers and one fully connected layer, with the relu activation function, and is used to process the image information produced by the sensing system. After the convolutional layers extract the image features, the network outputs the action value function Q(s,a) through the relu activation, and an action is selected with the ε-greedy policy according to the action value function Q(s,a).
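As a hedged sketch, a value network of this shape could be written in PyTorch as follows; the input resolution (a stack of four 84×84 grayscale frames), the filter sizes, and the number of actions are assumptions, not values from the patent.

```python
# Sketch of the current value network (three conv layers + one fully connected
# layer, relu activations) with epsilon-greedy action selection.
# Input shape, layer sizes and the number of actions are assumptions.
import random
import torch
import torch.nn as nn

class CurrentValueNetwork(nn.Module):
    def __init__(self, n_actions: int = 8, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(               # three convolutional layers, relu
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(                   # one fully connected layer -> Q(s, a)
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, n_actions), nn.ReLU(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(state))       # action value function Q(s, a)

def epsilon_greedy(q_net: CurrentValueNetwork, state: torch.Tensor, epsilon: float = 0.1) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(q_net.head[-2].out_features)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

# Usage with a dummy stack of four 84x84 grayscale frames:
net = CurrentValueNetwork()
action = epsilon_greedy(net, torch.zeros(4, 84, 84))
```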
In a further technical solution, "the output of the current value network controls the robot's action" means: several samples are drawn at random from the experience pool E and their states s are used as the input to the first hidden layer of the current value network; the current value network outputs the action value function Q(s,a), and the action at to be taken is selected according to it; after the robot executes action at, it obtains a new environmental state st+1 and an immediate reward rt, and the parameters of the current value network are adjusted through the current value network loss function.
In the above technical solution, in step (3):
The state s is: the environmental state perceived by the sensing system, namely the position information of the items within the robot's current field of view, presented as an image;
The action a is: the set of operations that can be performed in the current state, including the angle and direction with which the robot grasps the item;
The immediate reward r is: the evaluation of the action the robot takes in the current state. If the item does not fall off after the robot grasps it, a reward of +1 is given; if the item is successfully placed in the storage mechanism, a reward of +1000 is given; if the item is dropped, a reward of −1000 is given; in all other cases the reward is 0.
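The reward definition above maps directly onto a small function. In the sketch below the boolean event flags are assumed inputs supplied by the sensing system, not names taken from the patent:

```python
# Sketch of the immediate reward r described above; the event flags are assumed inputs.
def immediate_reward(grasped_and_held: bool, placed_in_storage: bool, dropped: bool) -> int:
    if placed_in_storage:
        return 1000      # item successfully placed in the storage mechanism
    if dropped:
        return -1000     # item fell / was dropped
    if grasped_and_held:
        return 1         # item grasped and did not fall off
    return 0             # all other cases
```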
The advantages of the present invention are:
1. Based on the interaction between the robot's sensing system and items of different sizes, and through the computation of the reinforcement learning method, the invention obtains grasping strategies for the robot when facing different items, so that the robot can smoothly place items of various shapes into the storage mechanism and recycle all kinds of waste;
2. By adopting a prioritized deep reinforcement learning control method in the robot's control system, the environmental information obtained from the sensing system is processed, an appropriate action is selected, and the control signals of the control system are conveyed to the operating and drive systems, enabling the robot to grasp items of different shapes more accurately;
3. The robot can be trained on items of various shapes, which greatly improves its applicability; after sufficient training it can be widely used in a variety of scenarios, such as road surfaces and commercial complexes;
4. The method can effectively handle control problems with continuous action spaces;
5. It can effectively avoid loss of items during training and application and speed up the training process;
6. Using a convolutional neural network to extract image features effectively allows the system to find suitable actions more easily.
Brief Description of the Drawings
The present invention is further described below with reference to the accompanying drawings and embodiments:
Figure 1 is a block diagram of the information transfer structure of the robot device in Embodiment 1 of the present invention;
Figure 2 is a block diagram of the prioritized deep reinforcement learning controller in Embodiment 1 of the present invention;
Figure 3 is a schematic diagram of the deep Q network structure in Embodiment 1 of the present invention.
Detailed Description of the Embodiments
The present invention is further described below with reference to the accompanying drawings and embodiments:
Embodiment: Referring to Figures 1 to 3, a waste recycling robot device based on a deep Q network comprises a sensing system, a control system, an operating system and a drive system, characterized in that:
The sensing system comprises a camera and image acquisition equipment, and is used to sense the position of objects in front of the robot, represented as image information;
The control system is used to control the robot's grasping arm to grasp objects and place them in the storage mechanism, and to control the rotation angle of the rotating mechanism;
The operating system comprises the robot's grasping arm, the rotating mechanism and the storage mechanism; it receives information from the control system and performs the various actions;
The drive system comprises a motor and a battery, and is used to provide the power for the operating system to carry out the actions commanded by the control system;
The sensing system collects environmental information and drive-system information and passes them to the control system; the control system computes and processes the received information and sends commands to the operating and drive systems, which drive the robot to perform the corresponding actions, completing the grasping and placement of items of various sizes.
The specific implementation process is as follows:
In this embodiment, the overall control framework of the control system is the Deep Q-Network (DQN) from deep reinforcement learning, and the Q-Learning algorithm from reinforcement learning is used for control. Assume that at each time step t = 1, 2, …, the robot's sensor system observes the state st of a Markov decision process, the control system selects an action at and obtains the immediate reward rt fed back by the environment, and the system transitions to the next state st+1 with transition probability p(st, at, st+1). The goal of the agent in the reinforcement learning system is to learn a policy π that maximizes the cumulative discounted reward obtained over future time steps, Rt = Σk≥0 γ^k · rt+k (where 0 ≤ γ ≤ 1 is the discount factor); that policy is the optimal policy. In a real environment, however, the state transition probability function p and the reward function r are unknown. For the agent to learn the optimal policy, only the immediate reward rt is available, so the policy gradient method can be used directly to optimize the loss function. The current value network selects actions, the TD (Temporal Difference) error is used to compute the loss, and the parameters of the current value network are updated by stochastic gradient descent to find the optimal policy. The control structure is shown in Figure 2.
In different environments, the network structure of the control system is the same and the algorithm uses the same set of parameters. The discount factor of the reward function is γ = 0.99. A three-layer convolutional neural network is used to extract the image information collected by the sensing system, and its network parameters are fixed; the value network and the policy network consist of three hidden layers and one output layer. In each experiment, the environment the robot is placed in starts from a random initial state, from which the robot begins learning; if control fails, the robot learns again until it can successfully grasp the item.
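Gathered into one place, the shared settings described in this paragraph and in the steps that follow might be expressed as the configuration sketch below; entries marked as assumed are illustrative values not stated in the text.

```python
# Shared settings used across environments, as described above; entries marked
# "assumed" are illustrative and not stated in the patent text.
CONFIG = {
    "discount_factor": 0.99,       # gamma for the reward function
    "conv_layers": 3,              # CNN used to extract image features (parameters fixed)
    "hidden_layers": 3,            # value / policy network hidden layers
    "output_layers": 1,
    "rmsprop_momentum": 0.95,      # momentum used with RMSProp (see Step 7)
    "target_sync_steps": 10_000,   # target network copied every ten thousand steps
    "minibatch_size": 4,           # samples drawn per update (see Step 5)
    "learning_rate": 2.5e-4,       # assumed; not specified in the text
    "epsilon": 0.1,                # assumed exploration rate for the epsilon-greedy policy
}
```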
Step 1: Obtain information about the environment the robot is in.
The robot's sensor system collects information through cameras and various image acquisition devices. The robot obtains image information about its surroundings through the sensors, and the robot's actions are controlled on the basis of this sensor information.
Step 2: Obtain the robot's initial environmental state information, reward information, etc., and initialize the parameters of the algorithm.
Initialize the neural network parameters and the reinforcement learning algorithm parameters in the control system, where the neural network parameters include the weights and biases of the feedforward network.
Step 3: Process the visual information fed back by the environment.
The state of the robot is perceived through the sensing system. The image information is converted into grayscale images through digital processing, and the high-dimensional visual information of the environment is converted into low-dimensional feature information. The low-dimensional feature information, together with the non-visual information perceived by the sensors, forms the input state st of the policy network and the value network.
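A sketch of the Step 3 digitization using plain NumPy; the 84×84 output size and the luminance weights are assumptions, and a real deployment would more likely use a dedicated image library.

```python
# Sketch of the Step-3 preprocessing: digitize an RGB frame into a grayscale
# image and scale it down so the network sees low-dimensional input.
# The 84x84 target size and the luminance weights are assumptions.
import numpy as np

def preprocess(rgb_frame: np.ndarray, out_size: int = 84) -> np.ndarray:
    """rgb_frame: H x W x 3 uint8 array from the camera."""
    gray = rgb_frame @ np.array([0.299, 0.587, 0.114])          # RGB -> grayscale luminance
    h, w = gray.shape
    rows = np.linspace(0, h - 1, out_size).astype(int)           # nearest-neighbour downscale
    cols = np.linspace(0, w - 1, out_size).astype(int)
    small = gray[np.ix_(rows, cols)]
    return (small / 255.0).astype(np.float32)                    # normalized grayscale image

# Usage: combine with non-visual sensor readings to form the input state s_t.
frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
state_visual = preprocess(frame)    # shape (84, 84)
```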
Step 4: Fill the experience pool.
After interacting with the environment, the robot obtains the state information, reward information, etc. fed back by the environment. The high-dimensional visual information fed back by the environment is processed as in Step 3 to produce a processed output; this operation is repeated four times and the result is fed into the current value network to obtain its output. An action is selected according to the action value function and executed, and the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s′ reached are saved in the experience pool E as a tuple; Step 4 is repeated for fifty thousand steps. After each step the priority of the experience samples is updated; the priority update formula is:
where t is the number of times the sample has been selected, β is the degree to which the priority influences selection, and pi is the probability that the i-th sample is selected. After the sample priority is computed, it is normalized according to the formula:
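Since the priority-update and normalization formulas themselves are not reproduced in this text, the sketch below substitutes a generic proportional prioritization purely to illustrate the mechanism: per-sample priorities are normalized into selection probabilities pi and samples are drawn accordingly. The class, the decay rule, and the formula are assumptions, not the patent's.

```python
# Illustrative prioritized experience pool; the priority formula is an assumed
# stand-in, since the patent's own formulas are not reproduced in this text.
import numpy as np

class PrioritizedPool:
    def __init__(self, beta: float = 0.5):
        self.beta = beta
        self.samples, self.priorities = [], []

    def add(self, transition, priority: float = 1.0):
        """Store a (s, a, r, s') tuple with an initial priority."""
        self.samples.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size: int):
        p = np.asarray(self.priorities) ** self.beta
        p = p / p.sum()                                   # normalize into probabilities p_i
        idx = np.random.choice(len(self.samples), size=batch_size, replace=False, p=p)
        return [self.samples[i] for i in idx], idx

    def decay(self, idx):
        """Lower the priority of samples that have just been selected (assumed rule)."""
        for i in idx:
            self.priorities[i] *= 0.9
```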
Step 5: The robot's actions are controlled by the current value network.
Four samples are drawn at random from the experience pool E, and their states s are used as the input to the first hidden layer of the current value network, which outputs the action value function Q(s,a); the action at to be taken is selected according to the action value function. After the robot executes action at, it obtains a new environmental state st+1 and an immediate reward rt, and the parameters of the current value network are adjusted through the error function (the current value network loss function Li(θi)).
The current value network consists of three convolutional layers and one fully connected layer, with the relu activation function. It is used to process the image information obtained from the sensing system. After the convolutional layers extract the image features, the action value function is output through the activation function, and an action is selected with the ε-greedy policy according to the action value function.
Step 6: Save the current state s, the action a, the immediate reward r obtained by executing the action, and the next state s′ reached in the experience pool E as a tuple.
Step 7: Update the current value network parameters and the target value network parameters of the control system.
The robot continually interacts with the environment in the manner of Step 4, and a batch of samples is drawn to update the current value network and the target value network. The specific update procedure is as follows:
The current value network loss function Li(θi) is computed as Li(θi) = E[(r + γ·maxa′ Q(s′,a′;θi−) − Q(s,a;θi))²], where Q(s′,a′;θi−) is the state-action value in the next state and Q(s,a;θi) is the state-action value in the current state. The method uses the Q-Learning algorithm from reinforcement learning and the RMSProp gradient descent method (with the momentum parameter set to 0.95) to update the parameters of the current value network.
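One way to realize this update in PyTorch, as a hedged sketch on placeholder data: a tiny linear Q network stands in for the convolutional one described above, and the mini-batch of four transitions is random.

```python
# Hedged sketch of the Step-7 update: squared TD error
#   L_i(theta_i) = E[(r + gamma * max_a' Q(s', a'; theta_i^-) - Q(s, a; theta_i))^2]
# minimized with RMSProp (momentum 0.95). The linear network and the random
# batch are placeholders, not the patent's architecture or data.
import torch
import torch.nn as nn

n_states, n_actions, gamma = 16, 4, 0.99
q_current = nn.Linear(n_states, n_actions)               # current value network (stand-in)
q_target = nn.Linear(n_states, n_actions)                 # target value network
q_target.load_state_dict(q_current.state_dict())          # target copied from current network
optimizer = torch.optim.RMSprop(q_current.parameters(), lr=2.5e-4, momentum=0.95)

# A random mini-batch of four transitions (s, a, r, s'), as in Step 5.
s = torch.randn(4, n_states)
a = torch.randint(0, n_actions, (4,))
r = torch.tensor([1.0, 0.0, 1000.0, -1000.0])
s_next = torch.randn(4, n_states)

with torch.no_grad():
    td_target = r + gamma * q_target(s_next).max(dim=1).values   # r + gamma * max_a' Q(s',a';theta^-)
q_sa = q_current(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta)
loss = nn.functional.mse_loss(q_sa, td_target)                     # squared TD error

optimizer.zero_grad()
loss.backward()
optimizer.step()                                                    # RMSProp step on the current network
```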
The target value network is copied from the current value network after every ten thousand executed steps.
Step 8: Check the control result.
Check whether the learning termination condition is satisfied; if not, return to Step 5 and continue the loop. Otherwise, terminate the algorithm.
In the real environment, the robot's initial state is initialized to the environmental state given by the position of the item in front of the robot, and that position is random. The robot's control system decides the next action the robot should take by processing the collected environmental state and feedback information, and uses these data to update the current value network and the target value network; when the robot reaches a termination state, it starts learning again. The robot executes 100 episodes in the environment (each episode has a finite length); if it can successfully grasp the item, learning is judged successful.
The states, actions, and immediate rewards in this embodiment are defined as follows:
State: the environmental state perceived by the sensing system is the position information of the items within the robot's current field of view, presented as an image.
Action: an action is drawn from the set of operations that can be performed in the current state; in this embodiment the actions are the angle and direction with which the robot grasps the item.
Immediate reward: the immediate reward is the environment's evaluation of the action taken by the robot in the current state. The reward function in the present invention is defined as: if the item does not fall off after being grasped, a reward of +1 is given; if the item is successfully placed at the designated position, a reward of +1000 is given; if the item is dropped, a reward of −1000 is given; in all other cases the reward is 0.
During simulation, the run proceeds for fifty million steps; whenever an episode ends, a new random initial environmental state is chosen and execution continues, with the number of simulated steps accumulated. The average reward over the most recent 100 episodes is used as the criterion for whether the control is successful.
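The success criterion — the average reward over the most recent 100 episodes — can be tracked with a simple rolling buffer; a minimal sketch:

```python
# Sketch of the evaluation criterion: average reward over the last 100 episodes.
from collections import deque

recent_rewards = deque(maxlen=100)          # keeps only the last 100 episode returns

def record_episode(total_reward: float) -> float:
    """Add one episode's cumulative reward and return the rolling average."""
    recent_rewards.append(total_reward)
    return sum(recent_rewards) / len(recent_rewards)

# Example: an average near +1000 would indicate the item is being placed in the
# storage mechanism in almost every episode (assumed interpretation of the criterion).
for episode_return in [1000, -1000, 1000, 1000]:
    avg = record_episode(episode_return)
```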
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810199112.2A CN108415254B (en) | 2018-03-12 | 2018-03-12 | Control method of waste recycling robot based on deep Q network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810199112.2A CN108415254B (en) | 2018-03-12 | 2018-03-12 | Control method of waste recycling robot based on deep Q network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108415254A CN108415254A (en) | 2018-08-17 |
CN108415254B true CN108415254B (en) | 2020-12-11 |
Family
ID=63131025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810199112.2A Active CN108415254B (en) | 2018-03-12 | 2018-03-12 | Control method of waste recycling robot based on deep Q network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108415254B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109634719A (en) * | 2018-12-13 | 2019-04-16 | 国网上海市电力公司 | A kind of dispatching method of virtual machine, device and electronic equipment |
CN109693239A (en) * | 2018-12-29 | 2019-04-30 | 深圳市越疆科技有限公司 | A kind of robot grasping means based on deeply study |
US12083678B2 (en) | 2019-01-23 | 2024-09-10 | Google Llc | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning |
US11440183B2 (en) * | 2019-03-27 | 2022-09-13 | Abb Schweiz Ag | Hybrid machine learning-based systems and methods for training an object picking robot with real and simulated performance data |
CN110238840B (en) * | 2019-04-24 | 2021-01-29 | 中山大学 | Mechanical arm autonomous grabbing method based on vision |
CN111251294A (en) * | 2020-01-14 | 2020-06-09 | 北京航空航天大学 | A Robot Grasping Method Based on Visual Pose Perception and Deep Reinforcement Learning |
CN111152227B (en) * | 2020-01-19 | 2024-12-17 | 聊城鑫泰机床有限公司 | Mechanical arm control method based on guiding DQN control |
CN112327821A (en) * | 2020-07-08 | 2021-02-05 | 东莞市均谊视觉科技有限公司 | Intelligent cleaning robot path planning method based on deep reinforcement learning |
CN119036478B (en) * | 2024-11-01 | 2025-02-11 | 成都建工第三建筑工程有限公司 | Construction site waste management method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN204287968U (en) * | 2014-12-03 | 2015-04-22 | 韩烁 | The scavenge dolly of Based Intelligent Control |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN105740644A (en) * | 2016-03-24 | 2016-07-06 | 苏州大学 | Cleaning robot optimal target path planning method based on model learning |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN106113059A (en) * | 2016-08-12 | 2016-11-16 | 桂林电子科技大学 | A kind of waste recovery robot |
CN205852816U (en) * | 2016-08-12 | 2017-01-04 | 桂林电子科技大学 | A kind of waste recovery robot |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7454388B2 (en) * | 2005-05-07 | 2008-11-18 | Thaler Stephen L | Device for the autonomous bootstrapping of useful information |
- 2018-03-12: CN CN201810199112.2A patent/CN108415254B/en (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN204287968U (en) * | 2014-12-03 | 2015-04-22 | 韩烁 | The scavenge dolly of Based Intelligent Control |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN105740644A (en) * | 2016-03-24 | 2016-07-06 | 苏州大学 | Cleaning robot optimal target path planning method based on model learning |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN106113059A (en) * | 2016-08-12 | 2016-11-16 | 桂林电子科技大学 | A kind of waste recovery robot |
CN205852816U (en) * | 2016-08-12 | 2017-01-04 | 桂林电子科技大学 | A kind of waste recovery robot |
Non-Patent Citations (1)
Title |
---|
Human-level control through deep reinforcement learning; Volodymyr Mnih et al.; NATURE; 2015-02-26; Vol. 518; pp. 529-541 *
Also Published As
Publication number | Publication date |
---|---|
CN108415254A (en) | 2018-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108415254B (en) | Control method of waste recycling robot based on deep Q network | |
CN111618847B (en) | Autonomous grasping method of robotic arm based on deep reinforcement learning and dynamic motion primitives | |
Ruan et al. | Mobile robot navigation based on deep reinforcement learning | |
US11062617B2 (en) | Training system for autonomous driving control policy | |
Finn et al. | Guided cost learning: Deep inverse optimal control via policy optimization | |
CN108594804B (en) | Automatic driving control method of delivery car based on deep Q network | |
CN112313043B (en) | Self-supervising robot object interactions | |
CN107253195B (en) | A kind of carrying machine human arm manipulation ADAPTIVE MIXED study mapping intelligent control method and system | |
CN111079561A (en) | A robot intelligent grasping method based on virtual training | |
CN108427985A (en) | A kind of plug-in hybrid vehicle energy management method based on deeply study | |
CN110210320A (en) | The unmarked Attitude estimation method of multiple target based on depth convolutional neural networks | |
CN113821041B (en) | Multi-robot collaborative navigation and obstacle avoidance method | |
US20230144995A1 (en) | Learning options for action selection with meta-gradients in multi-task reinforcement learning | |
CN114888801B (en) | Mechanical arm control method and system based on offline strategy reinforcement learning | |
CN110238840B (en) | Mechanical arm autonomous grabbing method based on vision | |
CN111260027A (en) | Intelligent agent automatic decision-making method based on reinforcement learning | |
CN107253194B (en) | A kind of carrying machine human arm manipulation multiple spot mapping intelligent control method and system | |
CN113341706B (en) | Man-machine cooperation assembly line system based on deep reinforcement learning | |
CN111582395A (en) | A Product Quality Classification System Based on Convolutional Neural Networks | |
CN108523768B (en) | Household cleaning robot control system based on self-adaptive strategy optimization | |
Liu et al. | Sim-and-real reinforcement learning for manipulation: A consensus-based approach | |
CN108927806A (en) | A kind of industrial robot learning method applied to high-volume repeatability processing | |
Paudel | Learning for robot decision making under distribution shift: A survey | |
CN118493381A (en) | Offline-to-online generalizable reinforcement learning method and device based on continuous strategy re-vibration | |
CN118254170A (en) | Mechanical arm progressive training method based on deep reinforcement learning, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||