CN112975967A - Service robot quantitative water pouring method based on simulation learning and storage medium - Google Patents

Service robot quantitative water pouring method based on simulation learning and storage medium

Info

Publication number
CN112975967A
CN112975967A (application CN202110217089.7A)
Authority
CN
China
Prior art keywords
network
water pouring
quantitative water
demonstration
target
Prior art date
Legal status
Granted
Application number
CN202110217089.7A
Other languages
Chinese (zh)
Other versions
CN112975967B (en)
Inventor
尤鸣宇
徐炫辉
周洪钧
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202110217089.7A
Publication of CN112975967A
Application granted
Publication of CN112975967B
Legal status: Active
Anticipated expiration

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/008Manipulators for service tasks
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention relates to a service robot quantitative water pouring method based on simulation learning and a storage medium. The quantitative water pouring method comprises the following steps: Step 1: acquiring quantitative water pouring demonstration data from human experts; Step 2: training a reward function output network with the demonstration data obtained in Step 1; Step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain a target decision network; Step 4: using the trained target decision network to drive the service robot to complete the quantitative water pouring service. Compared with the prior art, the method has the advantages of low deployment complexity, good robustness and good generalization performance.

Description

Service robot quantitative water pouring method based on simulation learning and storage medium
Technical Field
The invention relates to the technical field of quantitative water pouring methods for service robots, in particular to a quantitative water pouring method for a service robot based on simulation learning and a storage medium.
Background
Robots serve the daily life of human beings, and handling everyday chores is a constantly pursued goal in service robot research. Pouring water is one of the most common daily activities, so enabling a service robot to robustly complete quantitative water pouring tasks in complex unstructured scenes has extremely high application value. At present, many mechanical arm water pouring applications are realized by programming or by drag teaching. Programming-based methods are very unfriendly to ordinary users without professional knowledge and hinder the popularization of service robots; drag-teaching-based methods generalize poorly and can only repeat the previously demonstrated trajectory. Neither approach can conveniently and efficiently endow a service robot with the ability to complete quantitative water pouring tasks in a complex unstructured scene.
For example, Chinese patent CN108762101A discloses a water pouring service robot regulation and control system based on sensor monitoring, which monitors the water quantity and water temperature in a user's cup in real time through water level sensing and water temperature sensing, and performs the corresponding water pouring operation after voice confirmation by the user. Although this can drive a robot to provide a water pouring service, multiple sensors such as a water level sensor and a water temperature sensor must be deployed before the robot can be used; the system therefore has many limiting factors, is not intelligent enough, and cannot effectively complete quantitative water pouring tasks in a complex unstructured environment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a service robot quantitative water pouring method and a storage medium based on simulation learning, which have low deployment complexity, good robustness and good generalization performance.
The purpose of the invention can be realized by the following technical scheme:
a service robot quantitative water pouring method based on simulation learning comprises the following steps:
step 1: acquiring quantitative water pouring demonstration data of human experts;
step 2: training a reward function output network by using the demonstration data obtained in the step 1;
step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain a target decision network;
step 4: using the trained target decision network to drive the service robot to complete the quantitative water pouring service.
Preferably, the constraint conditions for collecting the water pouring demonstration data in the step 1 include:
when collecting the demonstration data D_expert, the target container and the desktop background need to be replaced, and the quantitative water pouring demonstration needs to be repeated under different illumination conditions;
dividing the water quantity in the target container into three levels: low, medium and full, where the low level is 20±5% of the target container's capacity, the medium level is 50±5% of the capacity and the full level is 90±5% of the capacity, the three levels being represented by the binary codes 001, 010 and 100 respectively;
in the process of collecting demonstration data, the same target container, the same desktop background and the same illumination condition are taken as a group of demonstration conditions, and the water pouring demonstration of three levels of water is required to be completed under each group of demonstration conditions.
More preferably, the number of target containers is N_cup, the number of desktop backgrounds is N_wallpaper, the number of illumination types is N_light, and each target water level is repeated λ times under every set of demonstration conditions, so that the total number of demonstration trajectories is:
N_total = 3 · λ · N_cup · N_wallpaper · N_light
preferably, the step 2 specifically comprises:
the input of the reward function output network R is an RGB image and a target water level, the output of the network R is discrete values 0 and 1, wherein 0 represents that the task is not finished, and 1 represents that the task is finished;
when training the network R, firstly, downsampling the demonstration data acquired in the step 1, setting the reward label of the last frame in the downsampled demonstration track data as 1, setting the reward labels of the other frames as 0, and simultaneously recording the target water level of each group of demonstration data;
the labelled data are denoted D_expert′, and the reward function output network R is trained with D_expert′ by means of supervised learning.
Preferably, the step 3 specifically comprises:
adopting a PPO strategy optimization algorithm to perform reinforcement learning in a complex unstructured scene, wherein the reward function required in the training process is provided by the reward function output network R obtained in the step 2;
and training the quantitative water pouring decision network by using the demonstration data, wherein the environment faced by the service robot in the training process needs to be the same as the demonstration conditions of the current demonstration data, and finally obtaining the quantitative water pouring target decision network.
More preferably, the quantitative water pouring decision network specifically comprises:
discretizing the action of the service robot into a plurality of sub-actions, wherein the minimum motion amplitude of each joint of the service robot is 1 degree, and the motion angle range of a robot base is 0-90 degrees;
after the action is discretized, the robot judges whether the number of the discretized sub-actions exceeds 100, if so, the robot is regarded as that the water pouring action fails, and if not, each sub-action is executed in sequence until the whole action track is completed.
More preferably, the quantitative water pouring decision network is provided with a collision avoidance method, specifically:
after the quantitative water pouring decision network makes a decision, calculating the TCP position and posture of the robot end effector;
and judging whether the action of the robot collides with the desktop or not according to the calculated pose of the end effector of the robot, if so, re-making a decision, and if not, directly executing the decision.
More preferably, the optimization objective function of the PPO policy optimization algorithm is:
J_PPO^{θ^k}(θ) = J^{θ^k}(θ) − β · KL(θ, θ^k)
J^{θ^k}(θ) ≈ Σ_{(s_t, a_t)} [ p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t) ] · A^{θ^k}(s_t, a_t)
where p_θ is the policy model, θ is the model parameter, β is a hyperparameter, KL is the Kullback-Leibler divergence, A is the advantage function, p is the probability distribution, s is a state, a is an action, t is the step index, and k is the index of the model training iteration.
More preferably, the PPO policy algorithm includes:
a Critic network C for predicting the value function of a state s_t;
and two Actor networks A_1 and A_2 for generating actions;
the inputs of the Critic network C and of the Actor networks A_1 and A_2 are RGB images and the target water level; during training, within one episode, A_1 interacts with the environment to generate a series of data, the trained reward function output network R generates a reward for each step, and the Critic network C predicts the value function of the current state s_t; the RGB images, the target water level and the per-step reward are put into a memory pool, and when the amount of data in the memory pool reaches a certain quantity, the Actor network is optimized with the Loss function, which is specifically:
L(θ) = Ê_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
r_t(θ) = p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t)
where r_t(θ) is the probability ratio between the new and old policies, ε is a hyperparameter,
A_t = −V(s_t) + Σ_{t′=t}^{T} γ^{t′−t} · r_{t′}
γ is the discount factor, V is the value function, r_{t′} is the per-step reward, Ê_t denotes the expectation, t is the step index, T is the total number of steps, θ is the model parameter, and clip means that values of r_t(θ) smaller than 1−ε are replaced by 1−ε and values larger than 1+ε are replaced by 1+ε;
A_2 is trained N times; during this training, A_1 continues to interact with the environment to collect data and update the memory pool; after A_2 has been trained N times, the parameters of A_2 are handed over to A_1, and A_2 then continues to be trained with the data in the memory pool; these steps are cycled until the decision network meets the performance requirement, yielding the quantitative water pouring target decision network.
A storage medium is provided, wherein the service robot quantitative water pouring method based on simulation learning in any one of the above aspects is stored in the storage medium.
Compared with the prior art, the invention has the following beneficial effects:
firstly, the deployment complexity is low: the quantitative water pouring method of the service robot is based on the simulation learning algorithm, the used expert demonstration data only comprise RGB images, complex marking on the images is not needed manually, and compared with the traditional simulation learning algorithm needing a large number of labels, the method is simpler, more convenient and more effective, improves the universality of the simulation learning algorithm and reduces the complexity of deployment.
Secondly, the robustness is good: the quantitative water pouring method of the service robot enables the decision network to complete the quantitative water pouring task in a complex unstructured environment through reinforcement learning, and the robustness of the algorithm is good.
Thirdly, the generalization performance is good: the service robot quantitative water pouring method randomly selects the conditions of tablecloth texture, target container texture, shape, target water level, illumination and the like during track initialization, takes more environmental conditions into consideration during training of a decision network, and greatly improves the generalization performance of the algorithm.
Drawings
FIG. 1 is a schematic diagram illustrating a scenario for exemplary data acquisition according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a quantitative water pouring method of a service robot according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a reward function generation network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
A service robot quantitative water pouring method based on simulation learning is disclosed, and the flow of the method is shown in FIG. 2, and comprises the following steps:
step 1: acquiring quantitative water pouring demonstration data of human experts;
constraints for collecting the pouring demonstration data include:
when collecting the demonstration data D_expert, the target container and the desktop background need to be replaced, and the quantitative water pouring demonstration needs to be repeated under different illumination conditions;
dividing the water quantity in the target container into three levels: low, medium and full, where the low level is 20±5% of the target container's capacity, the medium level is 50±5% of the capacity and the full level is 90±5% of the capacity, the three levels being represented by the binary codes 001, 010 and 100 respectively;
in the process of collecting demonstration data, the same target container, the same desktop background and the same illumination condition are taken as a group of demonstration conditions, and the water pouring demonstration of three levels of water is required to be completed under each group of demonstration conditions.
The number of target containers is N_cup, the number of desktop backgrounds is N_wallpaper, the number of illumination types is N_light, and each target water level is repeated λ times under every set of demonstration conditions, so the total number of demonstration trajectories is:
N_total = 3 · λ · N_cup · N_wallpaper · N_light
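For concreteness, a minimal sketch of how the demonstration conditions and the trajectory count can be enumerated; the numeric values, variable names and the use of Python are illustrative assumptions and not part of the patent:

```python
from itertools import product

# Assumed example values; the embodiment below uses 15 cups, 5 tablecloths and 3 lighting levels.
N_cup, N_wallpaper, N_light = 15, 5, 3
water_levels = ["low", "medium", "full"]      # three target water levels (codes 001/010/100)
lam = 2                                       # repetitions per target water level (assumed)

# One set of demonstration conditions = (target container, desktop background, illumination).
conditions = list(product(range(N_cup), range(N_wallpaper), range(N_light)))

N_total = 3 * lam * N_cup * N_wallpaper * N_light
assert N_total == len(conditions) * len(water_levels) * lam   # same count, condition by condition
print(N_total)                                # 1350 with the values above
```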
when exemplary data is collected, an exemplary operation subject is human, the demonstration is performed in a complex unstructured environment constructed by human, and the environment constructed in the embodiment as shown in fig. 1 comprises an original container (cup), a target container (cup), a piece of table cloth, a mechanical arm and two RGB cameras. The exemplary variation includes four aspects, respectively: texture, shape, lighting, and target water level. The texture of the original container, the texture of the target container and the texture of the tablecloth can be changed at will, the shape of the target container can be changed, the target container is a common cup which can be bought in the market, and the distribution of the original container, the target container and the tablecloth when the mechanical arm is used for reinforcement learning is approximately the same as that in the demonstration, and the distribution should not be greatly different. The target water level is divided into three grades of low, medium and full, and the illumination is also divided into three kinds of brightness of dark, general and bright.
During the demonstration process, the expert needs to repeat the quantitative water pouring demonstration under various lighting conditions after replacing the target container and the tablecloth, and the water pouring demonstrations for the three different target water quantities must be completed under each set of demonstration conditions (quantitative pouring demonstrations with the same target container, the same background and the same lighting are regarded as one set of demonstration conditions). One of the two cameras is responsible for capturing global information, and its images contain the expert's arm and the complete workbench; the other camera is responsible for capturing the water level change in the target container, so that the water level change can be clearly seen in its images. The capture frame rate is 30 Hz.
Step 2: training a reward function output network by using the demonstration data obtained in the step 1;
the reward function output network can distinguish whether the water level in the target container is the same as the target water level and whether the motion trail of the service robot is similar to that of the expert.
The input of the reward function output network R is an RGB image and a target water level, the output of the network R is discrete values 0 and 1, wherein 0 represents that the task is not finished, and 1 represents that the task is finished;
When training the network R, the demonstration data obtained in Step 1 are first down-sampled to a frequency of 3 Hz. In each group of down-sampled demonstration trajectory data (one group of demonstration data comprises the two sets of images shot by the two cameras at the same time), the reward label of the last frame is set to 1, the reward labels of the remaining frames are set to 0, and the target water level of each group of demonstration data is recorded at the same time. The labelled data are denoted D_expert′, and the reward function output network R is trained with D_expert′ by means of supervised learning.
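A minimal sketch of this down-sampling and automatic labelling step; the function name and data layout are illustrative assumptions, and only the 30 Hz → 3 Hz reduction and the last-frame label of 1 come from the text above:

```python
def label_demo_trajectory(frames, src_hz=30, dst_hz=3):
    """frames: list of (image_cam1, image_cam2) pairs recorded at src_hz.
    Returns (frame_pair, reward_label) tuples at dst_hz; only the final frame is labelled 1."""
    stride = src_hz // dst_hz          # 30 Hz -> 3 Hz: keep every 10th frame
    sampled = frames[::stride]
    labels = [0] * len(sampled)
    labels[-1] = 1                     # last frame of the trajectory marks task completion
    return list(zip(sampled, labels))
```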
The input to the deep convolutional neural network R consists of the images taken by the two cameras at the same time and the corresponding target water level. The two images are stacked along the channel dimension, passed through five CNN convolution layers, flattened into a one-dimensional vector, fused with the binary-coded target water level, and then passed through one fully connected layer to output the reward; the labels used here are the automatically annotated labels described above.
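The patent does not give layer widths or kernel sizes, so the following PyTorch sketch only mirrors the described structure (channel-wise stacking of the two images, five convolution layers, flattening, fusion with the 3-bit water level code, one fully connected output); all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, in_channels=6, img_size=128):        # two stacked RGB images -> 6 channels
        super().__init__()
        widths = [in_channels, 16, 32, 64, 64, 64]           # assumed channel widths
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers)                    # five convolution layers
        feat_dim = 64 * (img_size // 2 ** 5) ** 2
        self.head = nn.Linear(feat_dim + 3, 1)               # +3 for the binary water level code

    def forward(self, stacked_images, water_code):
        x = self.cnn(stacked_images)                         # (B, 6, H, W) -> feature map
        x = torch.flatten(x, 1)                              # pull into a one-dimensional vector
        x = torch.cat([x, water_code], dim=1)                # fuse with the target water level code
        return torch.sigmoid(self.head(x))                   # reward in [0, 1]
```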
Step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain the target decision network;
reinforced learning is carried out by adopting a PPO (Proximal Policy Optimization Algorithms) algorithm, and the reward functions required in the training process are provided by a reward function generation network R. In the training process, the environment facing the robot, including the shape texture of the target container, the texture of the tablecloth, the target water level, the illumination and the like, needs to be changed continuously, but the distribution of the environments is the same as the distribution of the environment corresponding to each group of demonstration data, and the distribution is close to the distribution of the environments corresponding to each group of demonstration data at the lowest degree.
In the reinforcement learning process, the action of the service robot is discretized into a plurality of sub-actions, the minimum motion amplitude of each joint of the service robot is 1 degree (0.01744rad), the motion angle range of the robot base is 0-90 degrees, and the angle of the workbench is 90 degrees.
After the action is discretized, the robot judges whether the number of the discretized sub-actions exceeds 100, if so, the robot is regarded as that the water pouring action fails, and if not, each sub-action is executed in sequence until the whole action track is completed.
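A sketch of this execution rule; the robot interface is a hypothetical placeholder used only to make the 100-sub-action budget explicit:

```python
MAX_SUB_ACTIONS = 100    # beyond this the water pouring action is regarded as failed

def run_discretised_trajectory(sub_actions, execute):
    """sub_actions: discretised 1-degree joint moves proposed by the decision network.
    execute(action) is a hypothetical callable that sends one sub-action to the robot."""
    if len(sub_actions) > MAX_SUB_ACTIONS:
        return False                   # too many sub-actions: the pouring attempt fails
    for action in sub_actions:
        execute(action)                # execute each sub-action in sequence
    return True                        # the whole action trajectory was completed
```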
A collision avoidance method is arranged in a quantitative water pouring decision network, and specifically comprises the following steps:
after the quantitative water pouring decision network makes a decision, calculating the TCP position and posture of the robot end effector;
and judging whether the action of the robot collides with the desktop or not according to the calculated pose of the end effector of the robot, if so, re-making a decision, and if not, directly executing the decision.
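A sketch of the check-and-redecide loop; the decision and forward-kinematics routines, the table height and the retry budget are hypothetical stand-ins, since the patent does not specify them:

```python
def safe_decision(decide, forward_kinematics, state, table_z=0.0, max_retries=50):
    """decide(state) -> action and forward_kinematics(state, action) -> (x, y, z, ...) are
    hypothetical callables for the decision network and the robot's kinematic model;
    table_z and max_retries are assumed values."""
    for _ in range(max_retries):
        action = decide(state)
        x, y, z = forward_kinematics(state, action)[:3]    # predicted TCP position of the end effector
        if z > table_z:                                    # no collision with the desktop predicted
            return action                                  # the decision is executed directly
        # otherwise the decision is re-made on the next loop iteration
    raise RuntimeError("no collision-free action found within the retry budget")
```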
The optimization objective function of the PPO strategy optimization algorithm is as follows:
J_PPO^{θ^k}(θ) = J^{θ^k}(θ) − β · KL(θ, θ^k)
J^{θ^k}(θ) ≈ Σ_{(s_t, a_t)} [ p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t) ] · A^{θ^k}(s_t, a_t)
where p_θ is the policy model, θ is the model parameter, β is a hyperparameter, KL is the Kullback-Leibler divergence, A is the advantage function, p is the probability distribution, s is a state, a is an action, t is the step index, and k is the index of the model training iteration.
The PPO policy algorithm in this embodiment includes:
a Critic network C for predicting the value function of a state s_t;
and two Actor networks A_1 and A_2 for generating actions;
the inputs of the Critic network C and of the Actor networks A_1 and A_2 are RGB images and the target water level. During training, within one episode, A_1 interacts with the environment to generate a series of data, the trained reward function output network R generates a reward for each step, and the Critic network C predicts the value function of the current state s_t. The RGB images, the target water level and the per-step reward are put into a memory pool, and when the amount of data in the memory pool reaches a certain quantity, the Actor network is optimized with the Loss function, which is specifically as follows:
L(θ) = Ê_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
r_t(θ) = p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t)
where r_t(θ) is the probability ratio between the new and old policies, ε is a hyperparameter,
A_t = −V(s_t) + Σ_{t′=t}^{T} γ^{t′−t} · r_{t′}
γ is the discount factor, V is the value function, r_{t′} is the per-step reward, Ê_t denotes the expectation, t is the step index, T is the total number of steps, θ is the model parameter, and clip means that values of r_t(θ) smaller than 1−ε are replaced by 1−ε and values larger than 1+ε are replaced by 1+ε;
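As an illustration of the loss above, a minimal PyTorch sketch of the clipped surrogate term and a simple discounted-return advantage; the ε and γ values and the tensor layout are assumptions:

```python
import torch

def discounted_advantage(rewards, values, gamma=0.99):
    """A_t = sum_{t'>=t} gamma^(t'-t) * r_t' - V(s_t); gamma = 0.99 is an assumed value."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns) - values

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped PPO surrogate (to be minimised); eps = 0.2 is an assumed value."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)      # clip(r_t, 1-eps, 1+eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```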
A_2 is trained N times; during this training, A_1 continues to interact with the environment to collect data and update the memory pool. After A_2 has been trained N times, the parameters of A_2 are handed over to A_1, and A_2 then continues to be trained with the data in the memory pool. These steps are cycled until the decision network meets the performance requirement, yielding the quantitative water pouring target decision network.
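A schematic training loop for the two-Actor scheme described above; the environment, the network objects and all numeric settings are illustrative assumptions rather than the patent's exact procedure:

```python
def train_decision_network(env, A1, A2, critic, reward_net, compute_loss, optimiser,
                           episodes=1000, N=10, pool_min=256):
    """A1 collects data, A2 is updated, and A2's parameters are periodically copied to A1.
    env, A1, A2, critic, reward_net and compute_loss are hypothetical interfaces;
    compute_loss(A2, critic, pool) should build e.g. the clipped loss sketched above."""
    memory_pool = []
    for _ in range(episodes):
        state, done = env.reset(), False            # state: RGB images + target water level
        while not done:
            action, log_p_old = A1.act(state)       # A1 interacts with the environment
            next_state, done = env.step(action)
            reward = reward_net(state)              # per-step reward generated by R
            value = critic(state)                   # value of the current state s_t
            memory_pool.append((state, action, log_p_old, reward, value))
            state = next_state
        if len(memory_pool) >= pool_min:            # enough data in the memory pool
            for _ in range(N):                      # train A2 N times on the pooled data
                loss = compute_loss(A2, critic, memory_pool)
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
            A1.load_state_dict(A2.state_dict())     # hand A2's parameters over to A1
```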
In order to ensure the robustness and generalization capability of the model in a complex unstructured environment in the training process, the initial state of the environment where the mechanical arm is located, including the shape, the spatial position, the tablecloth texture, the illumination condition and the like of the cup, needs to be changed continuously during data collection, and the target of pouring water at each time is randomly selected from three states, namely low state, medium state and full state by an algorithm. After a period of training, the model can robustly pour water at a specified target water level in an unstructured complex environment.
Step 4: the trained target decision network is used to drive the service robot to complete the quantitative water pouring service.
A specific application example is provided below:
In this application example, a Universal Robots UR5 mechanical arm is used. The control workstation runs Ubuntu 16.04 and is equipped with an Intel Core i7-10700K (8 cores, 16 threads, turbo frequency 5.1 GHz), two NVIDIA GTX 1080 GPUs and 32 GB of DDR4 memory. Two ordinary RGB cameras are also needed to observe the overall state and the water level. Based on this equipment, a specific implementation is introduced:
An environment as shown in fig. 1 is built, comprising a workbench, a mechanical arm, a workstation, two cups and two RGB cameras. The mechanical arm grasps the cup serving as the original container with its two-finger gripper, and the cup serving as the target container can be placed at any position on the workbench surface; camera 1 shoots vertically from top to bottom, camera 2 shoots the global state obliquely, and the state at each moment in the reinforcement learning of the invention consists of these two images.
After the construction is finished, the quantitative water pouring method is executed, namely:
Step 1: collect expert demonstrations of quantitative water pouring. The expert is a human who holds the original container and pours the water, and the target container can be placed at any position on the workbench shown in fig. 1. To guarantee the later generalization capability of the reward function network R, richness of the table texture, target container shape, target container texture and illumination in the demonstration data must be ensured when collecting the expert demonstration trajectories. This exemplary embodiment uses a total of 15 cups and 5 different tablecloths. Each tablecloth, cup and illumination combination is demonstrated several times for the different target water level requirements, so the total number of demonstration trajectories in this exemplary embodiment is N_total = 3 · λ · 15 · 5 · 3. The expert demonstration trajectory at each moment comprises 2 images, taken by camera 1 and camera 2 respectively, where camera 1 is primarily responsible for observing the water level in the target container and camera 2 is responsible for capturing the global information. The collected expert demonstration data are down-sampled and labels are assigned automatically: the last frame of each trajectory is labelled 1 and the remaining frames are labelled 0.
Step 2: train the reward function generation network R with the demonstration data collected in Step 1, where R is a deep convolutional neural network. Its input is the two images taken by camera 1 and camera 2 at the same time together with the target water level, and its output corresponds to the label. The two images are stacked along the channel dimension, passed through five CNN convolution layers, flattened into a one-dimensional vector, concatenated with the encoded target water level, and a reward value is output after one fully connected layer.
Step 3: reinforcement learning. In this embodiment, the robot performs reinforcement learning directly in the real environment. Since the decision network selects actions randomly when learning starts, the robot could make dangerous movements, such as hitting the table, if the action strategy were not constrained to some degree. Therefore, in this embodiment the robot motion is discretized, the minimum motion amplitude of each joint is 1° (0.01744 rad), and the motion of the robot base is limited to 0-90° to increase the search efficiency; since the table spans 90°, this cuts the search range by three quarters while still allowing the robot to reach the entire table surface. On this basis, after each selection the decision network calculates the TCP pose of the robot end effector; if it would collide with the desktop, the action is reselected. Each trajectory of the robot is at most 100 steps, and if a trajectory exceeds 100 steps the pouring is regarded as failed.
Based on the above limitation, a PPO algorithm is adopted for reinforcement learning, and in order to ensure the generalization performance and robustness of the algorithm in a complex unstructured environment, the tablecloth texture, the target container texture, the shape, the target water level and the illumination are randomly selected during track initialization each time. The inputs to the policy network are again two images, which are images taken by camera 1 and camera 2 at the same time, and the target water level amount. And in the process of interacting with the real environment, generating the reward function in real time by using the reward function generation network R obtained in the step 2. After a period of training, the model can achieve robust completion of quantitative water pouring tasks in complex unstructured environments.
The structure of the reward function generation network R is shown in fig. 3. R is a deep convolutional neural network; its inputs are the two stacked images captured by camera 1 and camera 2 at the same time and a vector representing the target water level, and its output is a real number within [0, 1] representing the reward of the current state. During training, the loss is the Euclidean distance between the output and the label.
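A minimal supervised training loop for R matching this description; the optimiser, learning rate and epoch count are assumptions, and the Euclidean distance to the scalar label is implemented in its squared form (MSE):

```python
import torch

def train_reward_net(reward_net, loader, epochs=20, lr=1e-4):
    """loader yields (stacked_images, water_code, label) batches built from D_expert';
    epochs and lr are assumed values."""
    optimiser = torch.optim.Adam(reward_net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()                     # squared Euclidean distance to the 0/1 label
    for _ in range(epochs):
        for images, code, label in loader:
            pred = reward_net(images, code).squeeze(1)
            loss = loss_fn(pred, label.float())
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
```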
Step 4: the trained target decision network is used to drive the service robot to complete the quantitative water pouring service.
The embodiment also relates to a storage medium which stores any one of the quantitative water pouring methods.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A service robot quantitative water pouring method based on simulation learning is characterized by comprising the following steps:
step 1: acquiring quantitative water pouring demonstration data of human experts;
step 2: training a reward function output network by using the demonstration data obtained in the step 1;
step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain a target decision network;
step 4: using the trained target decision network to drive the service robot to complete the quantitative water pouring service.
2. The service robot quantitative pouring method based on simulation learning of claim 1, wherein the constraints of collecting pouring demonstration data in step 1 comprise:
when collecting the demonstration data D_expert, the target container and the desktop background need to be replaced, and the quantitative water pouring demonstration needs to be repeated under different illumination conditions;
dividing the water quantity in the target container into three levels: low, medium and full, where the low level is 20±5% of the target container's capacity, the medium level is 50±5% of the capacity and the full level is 90±5% of the capacity, the three levels being represented by the binary codes 001, 010 and 100 respectively;
in the process of collecting demonstration data, the same target container, the same desktop background and the same illumination condition are taken as a group of demonstration conditions, and the water pouring demonstration of three levels of water is required to be completed under each group of demonstration conditions.
3. The service robot quantitative water pouring method based on simulation learning as claimed in claim 2, wherein the number of target containers is N_cup, the number of desktop backgrounds is N_wallpaper, the number of illumination types is N_light, and each target water level is repeated λ times under every set of demonstration conditions, so that the total number of demonstration trajectories is:
N_total = 3 · λ · N_cup · N_wallpaper · N_light
4. the service robot quantitative water pouring method based on simulation learning as claimed in claim 1, wherein the step 2 is specifically as follows:
the input of the reward function output network R is an RGB image and a target water level, the output of the network R is discrete values 0 and 1, wherein 0 represents that the task is not finished, and 1 represents that the task is finished;
when training the network R, firstly, downsampling the demonstration data acquired in the step 1, setting the reward label of the last frame in the downsampled demonstration track data as 1, setting the reward labels of the other frames as 0, and simultaneously recording the target water level of each group of demonstration data;
the labelled data are denoted D_expert′, and the reward function output network R is trained with D_expert′ by means of supervised learning.
5. The service robot quantitative water pouring method based on simulation learning of claim 1, wherein the step 3 is specifically as follows:
adopting a PPO strategy optimization algorithm to perform reinforcement learning in a complex unstructured scene, wherein the reward function required in the training process is provided by the reward function output network R obtained in the step 2;
and training the quantitative water pouring decision network by using the demonstration data, wherein the environment faced by the service robot in the training process needs to be the same as the demonstration conditions of the current demonstration data, and finally obtaining the quantitative water pouring target decision network.
6. The service robot quantitative water pouring method based on simulation learning of claim 5, wherein the quantitative water pouring decision network specifically comprises:
discretizing the action of the service robot into a plurality of sub-actions, wherein the minimum motion amplitude of each joint of the service robot is 1 degree, and the motion angle range of a robot base is 0-90 degrees;
after the action is discretized, the robot judges whether the number of the discretized sub-actions exceeds 100, if so, the robot is regarded as that the water pouring action fails, and if not, each sub-action is executed in sequence until the whole action track is completed.
7. The service robot quantitative water pouring method based on simulation learning of claim 6, wherein the quantitative water pouring decision network is provided with a collision avoidance method, specifically comprising:
after the quantitative water pouring decision network makes a decision, calculating the TCP position and posture of the robot end effector;
and judging whether the action of the robot collides with the desktop or not according to the calculated pose of the end effector of the robot, if so, re-making a decision, and if not, directly executing the decision.
8. The simulation learning-based service robot quantitative water pouring method as claimed in claim 5, wherein the optimization objective function of the PPO strategy optimization algorithm is as follows:
J_PPO^{θ^k}(θ) = J^{θ^k}(θ) − β · KL(θ, θ^k)
J^{θ^k}(θ) ≈ Σ_{(s_t, a_t)} [ p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t) ] · A^{θ^k}(s_t, a_t)
where p_θ is the policy model, θ is the model parameter, β is a hyperparameter, KL is the Kullback-Leibler divergence, A is the advantage function, p is the probability distribution, s is a state, a is an action, t is the step index, and k is the index of the model training iteration.
9. The simulation learning-based service robot quantitative water pouring method as claimed in claim 5, wherein the PPO strategy algorithm comprises:
a Critic network C for predicting the value function of a state s_t;
and two Actor networks A_1 and A_2 for generating actions;
the inputs of the Critic network C and of the Actor networks A_1 and A_2 are RGB images and the target water level; during training, within one episode, A_1 interacts with the environment to generate a series of data, the trained reward function output network R generates a reward for each step, and the Critic network C predicts the value function of the current state s_t; the RGB images, the target water level and the per-step reward are put into a memory pool, and when the amount of data in the memory pool reaches a certain quantity, the Actor network is optimized with the Loss function, which is specifically:
L(θ) = Ê_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
r_t(θ) = p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t)
where r_t(θ) is the probability ratio between the new and old policies, ε is a hyperparameter,
A_t = −V(s_t) + Σ_{t′=t}^{T} γ^{t′−t} · r_{t′}
γ is the discount factor, V is the value function, r_{t′} is the per-step reward, Ê_t denotes the expectation, t is the step index, T is the total number of steps, θ is the model parameter, and clip means that values of r_t(θ) smaller than 1−ε are replaced by 1−ε and values larger than 1+ε are replaced by 1+ε;
A_2 is trained N times; during this training, A_1 continues to interact with the environment to collect data and update the memory pool; after A_2 has been trained N times, the parameters of A_2 are handed over to A_1, and A_2 then continues to be trained with the data in the memory pool; these steps are cycled until the decision network meets the performance requirement, yielding the quantitative water pouring target decision network.
10. A storage medium, wherein the storage medium stores the service robot quantitative water pouring method based on the imitation learning of any one of claims 1 to 9.
CN202110217089.7A 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium Active CN112975967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217089.7A CN112975967B (en) 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110217089.7A CN112975967B (en) 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium

Publications (2)

Publication Number Publication Date
CN112975967A true CN112975967A (en) 2021-06-18
CN112975967B CN112975967B (en) 2022-06-28

Family

ID=76351116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217089.7A Active CN112975967B (en) 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium

Country Status (1)

Country Link
CN (1) CN112975967B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016083678A (en) * 2014-10-24 2016-05-19 三明機工株式会社 Robot hand for injecting molten metal
CN111241952A (en) * 2020-01-03 2020-06-05 广东工业大学 Reinforced learning reward self-learning method in discrete manufacturing scene
CN111488988A (en) * 2020-04-16 2020-08-04 清华大学 Control strategy simulation learning method and device based on counterstudy
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN111872934A (en) * 2020-06-19 2020-11-03 南京邮电大学 Mechanical arm control method and system based on hidden semi-Markov model
CN111766782A (en) * 2020-06-28 2020-10-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pastor, P. et al., "Learning and Generalization of Motor Skills by Learning from Demonstration," 2009 IEEE International Conference on Robotics and Automation. *

Also Published As

Publication number Publication date
CN112975967B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN111515961B (en) Reinforcement learning reward method suitable for mobile mechanical arm
Mandlekar et al. What matters in learning from offline human demonstrations for robot manipulation
Rahmatizadeh et al. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration
CN109800864B (en) Robot active learning method based on image input
CN111203878B (en) Robot sequence task learning method based on visual simulation
Suzuki et al. Visual servoing to catch fish using global/local GA search
CN108819948A (en) Driving behavior modeling method based on reverse intensified learning
CN108415254B (en) Waste recycling robot control method based on deep Q network
Zhang et al. Modular deep q networks for sim-to-real transfer of visuo-motor policies
Ghadirzadeh et al. Bayesian meta-learning for few-shot policy adaptation across robotic platforms
CN108791302A (en) Driving behavior modeling
Inoue et al. Transfer learning from synthetic to real images using variational autoencoders for robotic applications
Zhang et al. One-shot domain-adaptive imitation learning via progressive learning applied to robotic pouring
CN112975968B (en) Mechanical arm imitation learning method based on third visual angle variable main body demonstration video
CN112975967B (en) Service robot quantitative water pouring method based on simulation learning and storage medium
Hwang et al. Achieving" synergy" in cognitive behavior of humanoids via deep learning of dynamic visuo-motor-attentional coordination
Gómez et al. Simulating development in a real robot: on the concurrent increase of sensory, motor, and neural complexity
Yokota et al. A multi-task learning framework for grasping-position detection and few-shot classification
CN113119073A (en) Mechanical arm system based on computer vision and machine learning and oriented to 3C assembly scene
Yoshimoto et al. Object recognition system using deep learning with depth images for service robots
Chang et al. Reducing the deployment-time inference control costs of deep reinforcement learning agents via an asymmetric architecture
Jeong et al. Developmental learning of integrating visual attention shifts and bimanual object grasping and manipulation tasks
Goncalves et al. Neural mechanisms for learning of attention control and pattern categorization as basis for robot cognition
Ushida et al. Fuzzy-associative-memory-based knowledge construction with an application to a human-machine interface
Zhang et al. One-shot domain-adaptive imitation learning via progressive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant