CN112975967A - Service robot quantitative water pouring method based on simulation learning and storage medium - Google Patents

Service robot quantitative water pouring method based on simulation learning and storage medium

Info

Publication number
CN112975967A
CN112975967A (application CN202110217089.7A)
Authority
CN
China
Prior art keywords
network
water pouring
quantitative water
demonstration
target
Prior art date
Legal status
Granted
Application number
CN202110217089.7A
Other languages
Chinese (zh)
Other versions
CN112975967B (en)
Inventor
尤鸣宇
徐炫辉
周洪钧
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202110217089.7A
Publication of CN112975967A
Application granted
Publication of CN112975967B
Legal status: Active
Anticipated expiration

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/008Manipulators for service tasks
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention relates to a service robot quantitative water pouring method based on simulation learning and a storage medium. The quantitative water pouring method comprises the following steps: Step 1: acquiring quantitative water pouring demonstration data from human experts; Step 2: training a reward function output network with the demonstration data obtained in Step 1; Step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain a target decision network; Step 4: using the trained target decision network to drive the service robot to complete the quantitative water pouring service. Compared with the prior art, the method has the advantages of low deployment complexity, good robustness and good generalization performance.

Description

Service robot quantitative water pouring method based on simulation learning and storage medium
Technical Field
The invention relates to the technical field of quantitative water pouring methods for service robots, in particular to a quantitative water pouring method for a service robot based on simulation learning and a storage medium.
Background
Robots serve the daily life of human beings, and handling everyday chores is a constantly pursued goal in service robot research. Pouring water is one of the most common daily activities, so enabling a service robot to robustly complete quantitative water pouring tasks in complex unstructured scenes has extremely high application value. At present, many mechanical arm water pouring applications are realized by programming or by drag teaching. Programming-based methods are very unfriendly to ordinary users without professional knowledge and hinder the popularization of service robots; drag-teaching-based methods generalize poorly and can only repeat the previously demonstrated trajectory. Neither approach can conveniently and efficiently endow a service robot with the ability to complete quantitative water pouring tasks in a complex unstructured scene.
For example, Chinese patent CN108762101A discloses a water pouring service robot regulation and control system based on sensor monitoring, which monitors the water quantity and water temperature in a user's cup in real time through water level sensing and water temperature sensing, and performs the corresponding water pouring operation after voice confirmation by the user. Although this can drive a robot to provide a water pouring service, multiple sensors such as a water level sensor and a water temperature sensor must be deployed before the robot can be used; the system therefore has many limiting factors, is not intelligent enough, and cannot effectively complete quantitative water pouring tasks in a complex unstructured environment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a service robot quantitative water pouring method and a storage medium based on simulation learning, which have low deployment complexity, good robustness and good generalization performance.
The purpose of the invention can be realized by the following technical scheme:
a service robot quantitative water pouring method based on simulation learning comprises the following steps:
step 1: acquiring quantitative water pouring demonstration data of human experts;
step 2: training a reward function output network by using the demonstration data obtained in the step 1;
step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain a target decision network;
step 4: using the trained target decision network to drive the service robot to complete the quantitative water pouring service.
Preferably, the constraint conditions for collecting the water pouring demonstration data in the step 1 include:
when collecting the demonstration data D_expert, the target container and the desktop background need to be replaced, and the quantitative water pouring demonstration needs to be repeated under different illumination conditions;
dividing the water quantity in the target container into three levels: low, medium and full, where the low level is 20±5% of the target container's capacity, the medium level is 50±5% of the capacity and the full level is 90±5% of the capacity, the three levels being represented by the binary codes 001, 010 and 100 respectively;
in the process of collecting demonstration data, the same target container, the same desktop background and the same illumination condition are taken as a group of demonstration conditions, and the water pouring demonstration of three levels of water is required to be completed under each group of demonstration conditions.
More preferably, the number of target containers is N_cup, the number of desktop backgrounds is N_wallpaper, the number of illumination types is N_light, and each target water level is repeated λ times under every set of demonstration conditions, so that the total number of demonstration trajectories is:
N_total = 3 · λ · N_cup · N_wallpaper · N_light
preferably, the step 2 specifically comprises:
the input of the reward function output network R is an RGB image and a target water level, the output of the network R is discrete values 0 and 1, wherein 0 represents that the task is not finished, and 1 represents that the task is finished;
when training the network R, firstly, downsampling the demonstration data acquired in the step 1, setting the reward label of the last frame in the downsampled demonstration track data as 1, setting the reward labels of the other frames as 0, and simultaneously recording the target water level of each group of demonstration data;
the labelled data are denoted D_expert′, and the reward function output network R is trained with D_expert′ by means of supervised learning.
Preferably, the step 3 specifically comprises:
adopting a PPO strategy optimization algorithm to perform reinforcement learning in a complex unstructured scene, wherein the reward function required in the training process is provided by the reward function output network R obtained in the step 2;
and training the quantitative water pouring decision network by using the demonstration data, wherein the environment faced by the service robot in the training process needs to be the same as the demonstration conditions of the current demonstration data, and finally obtaining the quantitative water pouring target decision network.
More preferably, the quantitative water pouring decision network specifically comprises:
discretizing the action of the service robot into a plurality of sub-actions, wherein the minimum motion amplitude of each joint of the service robot is 1 degree, and the motion angle range of a robot base is 0-90 degrees;
after the action is discretized, the robot judges whether the number of the discretized sub-actions exceeds 100, if so, the robot is regarded as that the water pouring action fails, and if not, each sub-action is executed in sequence until the whole action track is completed.
More preferably, the quantitative water pouring decision network is provided with a collision avoidance method, specifically:
after the quantitative water pouring decision network makes a decision, calculating the TCP position and posture of the robot end effector;
and judging whether the action of the robot collides with the desktop or not according to the calculated pose of the end effector of the robot, if so, re-making a decision, and if not, directly executing the decision.
More preferably, the optimization objective function of the PPO policy optimization algorithm is:
J_PPO^{θ^k}(θ) = J^{θ^k}(θ) − β · KL(θ, θ^k)
J^{θ^k}(θ) ≈ Σ_{(s_t, a_t)} [ p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t) ] · A^{θ^k}(s_t, a_t)
where p_θ is the policy model, θ is the model parameter, β is a hyperparameter, KL is the Kullback-Leibler divergence, A is the advantage function, p is the probability distribution, s is a state, a is an action, t is the step index, and k is the index of the model training iteration.
More preferably, the PPO policy algorithm includes:
a Critic network C for predicting the value function of a state s_t;
and two Actor networks A_1 and A_2 for generating actions;
the inputs of the Critic network C and of the Actor networks A_1 and A_2 are RGB images and the target water level; during training, within one episode, A_1 interacts with the environment to generate a series of data, the trained reward function output network R generates a reward for each step, and the Critic network C predicts the value function of the current state s_t; the RGB images, the target water level and the per-step reward are put into a memory pool, and when the amount of data in the memory pool reaches a certain quantity, the Actor network is optimized with the Loss function, which is specifically:
L(θ) = Ê_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
r_t(θ) = p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t)
where r_t(θ) is the probability ratio between the new and old policies, ε is a hyperparameter,
A_t = −V(s_t) + Σ_{t′=t}^{T} γ^{t′−t} · r_{t′}
γ is the discount factor, V is the value function, r_{t′} is the per-step reward, Ê_t denotes the expectation, t is the step index, T is the total number of steps, θ is the model parameter, and clip means that values of r_t(θ) smaller than 1−ε are replaced by 1−ε and values larger than 1+ε are replaced by 1+ε;
A_2 is trained N times; during this training, A_1 continues to interact with the environment to collect data and update the memory pool; after A_2 has been trained N times, the parameters of A_2 are handed over to A_1, and A_2 then continues to be trained with the data in the memory pool; these steps are cycled until the decision network meets the performance requirement, yielding the quantitative water pouring target decision network.
A storage medium is provided, wherein the service robot quantitative water pouring method based on simulation learning in any one of the above aspects is stored in the storage medium.
Compared with the prior art, the invention has the following beneficial effects:
firstly, the deployment complexity is low: the quantitative water pouring method of the service robot is based on the simulation learning algorithm, the used expert demonstration data only comprise RGB images, complex marking on the images is not needed manually, and compared with the traditional simulation learning algorithm needing a large number of labels, the method is simpler, more convenient and more effective, improves the universality of the simulation learning algorithm and reduces the complexity of deployment.
Secondly, the robustness is good: the quantitative water pouring method of the service robot enables the decision network to complete the quantitative water pouring task in a complex unstructured environment through reinforcement learning, and the robustness of the algorithm is good.
Thirdly, the generalization performance is good: the service robot quantitative water pouring method randomly selects the conditions of tablecloth texture, target container texture, shape, target water level, illumination and the like during track initialization, takes more environmental conditions into consideration during training of a decision network, and greatly improves the generalization performance of the algorithm.
Drawings
FIG. 1 is a schematic diagram illustrating a scenario for exemplary data acquisition according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a quantitative water pouring method of a service robot according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a reward function generation network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
A service robot quantitative water pouring method based on simulation learning is disclosed, and the flow of the method is shown in FIG. 2, and comprises the following steps:
step 1: acquiring quantitative water pouring demonstration data of human experts;
constraints for collecting the pouring demonstration data include:
when collecting the demonstration data D_expert, the target container and the desktop background need to be replaced, and the quantitative water pouring demonstration needs to be repeated under different illumination conditions;
dividing the water quantity in the target container into three levels: low, medium and full, where the low level is 20±5% of the target container's capacity, the medium level is 50±5% of the capacity and the full level is 90±5% of the capacity, the three levels being represented by the binary codes 001, 010 and 100 respectively;
in the process of collecting demonstration data, the same target container, the same desktop background and the same illumination condition are taken as a group of demonstration conditions, and the water pouring demonstration of three levels of water is required to be completed under each group of demonstration conditions.
The number of target containers is N_cup, the number of desktop backgrounds is N_wallpaper, the number of illumination types is N_light, and each target water level is repeated λ times under every set of demonstration conditions, so the total number of demonstration trajectories is:
N_total = 3 · λ · N_cup · N_wallpaper · N_light
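For concreteness, a minimal sketch of how the demonstration conditions and the trajectory count can be enumerated; the numeric values, variable names and the use of Python are illustrative assumptions and not part of the patent:

```python
from itertools import product

# Assumed example values; the embodiment below uses 15 cups, 5 tablecloths and 3 lighting levels.
N_cup, N_wallpaper, N_light = 15, 5, 3
water_levels = ["low", "medium", "full"]      # three target water levels (codes 001/010/100)
lam = 2                                       # repetitions per target water level (assumed)

# One set of demonstration conditions = (target container, desktop background, illumination).
conditions = list(product(range(N_cup), range(N_wallpaper), range(N_light)))

N_total = 3 * lam * N_cup * N_wallpaper * N_light
assert N_total == len(conditions) * len(water_levels) * lam   # same count, condition by condition
print(N_total)                                # 1350 with the values above
```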
when exemplary data is collected, an exemplary operation subject is human, the demonstration is performed in a complex unstructured environment constructed by human, and the environment constructed in the embodiment as shown in fig. 1 comprises an original container (cup), a target container (cup), a piece of table cloth, a mechanical arm and two RGB cameras. The exemplary variation includes four aspects, respectively: texture, shape, lighting, and target water level. The texture of the original container, the texture of the target container and the texture of the tablecloth can be changed at will, the shape of the target container can be changed, the target container is a common cup which can be bought in the market, and the distribution of the original container, the target container and the tablecloth when the mechanical arm is used for reinforcement learning is approximately the same as that in the demonstration, and the distribution should not be greatly different. The target water level is divided into three grades of low, medium and full, and the illumination is also divided into three kinds of brightness of dark, general and bright.
During the demonstration process, the expert needs to repeat the quantitative water pouring demonstration under various lighting conditions after replacing the target container and the tablecloth, and the water pouring demonstrations for the three different target water quantities must be completed under each set of demonstration conditions (quantitative pouring demonstrations with the same target container, the same background and the same lighting are regarded as one set of demonstration conditions). One of the two cameras is responsible for capturing global information, and its images contain the expert's arm and the complete workbench; the other camera is responsible for capturing the water level change in the target container, so that the water level change can be clearly seen in its images. The capture frame rate is 30 Hz.
Step 2: training a reward function output network by using the demonstration data obtained in the step 1;
the reward function output network can distinguish whether the water level in the target container is the same as the target water level and whether the motion trail of the service robot is similar to that of the expert.
The input of the reward function output network R is an RGB image and a target water level, the output of the network R is discrete values 0 and 1, wherein 0 represents that the task is not finished, and 1 represents that the task is finished;
When training the network R, the demonstration data obtained in Step 1 are first down-sampled to a frequency of 3 Hz. In each group of down-sampled demonstration trajectory data (one group of demonstration data comprises the two sets of images shot by the two cameras at the same time), the reward label of the last frame is set to 1, the reward labels of the remaining frames are set to 0, and the target water level of each group of demonstration data is recorded at the same time. The labelled data are denoted D_expert′, and the reward function output network R is trained with D_expert′ by means of supervised learning.
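A minimal sketch of this down-sampling and automatic labelling step; the function name and data layout are illustrative assumptions, and only the 30 Hz → 3 Hz reduction and the last-frame label of 1 come from the text above:

```python
def label_demo_trajectory(frames, src_hz=30, dst_hz=3):
    """frames: list of (image_cam1, image_cam2) pairs recorded at src_hz.
    Returns (frame_pair, reward_label) tuples at dst_hz; only the final frame is labelled 1."""
    stride = src_hz // dst_hz          # 30 Hz -> 3 Hz: keep every 10th frame
    sampled = frames[::stride]
    labels = [0] * len(sampled)
    labels[-1] = 1                     # last frame of the trajectory marks task completion
    return list(zip(sampled, labels))
```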
The input to the deep convolutional neural network R consists of the images taken by the two cameras at the same time and the corresponding target water level. The two images are stacked along the channel dimension, passed through five CNN convolution layers, flattened into a one-dimensional vector, fused with the binary-coded target water level, and then passed through one fully connected layer to output the reward; the labels used here are the automatically annotated labels described above.
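The patent does not give layer widths or kernel sizes, so the following PyTorch sketch only mirrors the described structure (channel-wise stacking of the two images, five convolution layers, flattening, fusion with the 3-bit water level code, one fully connected output); all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, in_channels=6, img_size=128):        # two stacked RGB images -> 6 channels
        super().__init__()
        widths = [in_channels, 16, 32, 64, 64, 64]           # assumed channel widths
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers)                    # five convolution layers
        feat_dim = 64 * (img_size // 2 ** 5) ** 2
        self.head = nn.Linear(feat_dim + 3, 1)               # +3 for the binary water level code

    def forward(self, stacked_images, water_code):
        x = self.cnn(stacked_images)                         # (B, 6, H, W) -> feature map
        x = torch.flatten(x, 1)                              # pull into a one-dimensional vector
        x = torch.cat([x, water_code], dim=1)                # fuse with the target water level code
        return torch.sigmoid(self.head(x))                   # reward in [0, 1]
```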
Step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain the target decision network;
reinforced learning is carried out by adopting a PPO (Proximal Policy Optimization Algorithms) algorithm, and the reward functions required in the training process are provided by a reward function generation network R. In the training process, the environment facing the robot, including the shape texture of the target container, the texture of the tablecloth, the target water level, the illumination and the like, needs to be changed continuously, but the distribution of the environments is the same as the distribution of the environment corresponding to each group of demonstration data, and the distribution is close to the distribution of the environments corresponding to each group of demonstration data at the lowest degree.
In the reinforcement learning process, the action of the service robot is discretized into a plurality of sub-actions, the minimum motion amplitude of each joint of the service robot is 1 degree (0.01744rad), the motion angle range of the robot base is 0-90 degrees, and the angle of the workbench is 90 degrees.
After the action is discretized, the robot judges whether the number of the discretized sub-actions exceeds 100, if so, the robot is regarded as that the water pouring action fails, and if not, each sub-action is executed in sequence until the whole action track is completed.
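A sketch of this execution rule; the robot interface is a hypothetical placeholder used only to make the 100-sub-action budget explicit:

```python
MAX_SUB_ACTIONS = 100    # beyond this the water pouring action is regarded as failed

def run_discretised_trajectory(sub_actions, execute):
    """sub_actions: discretised 1-degree joint moves proposed by the decision network.
    execute(action) is a hypothetical callable that sends one sub-action to the robot."""
    if len(sub_actions) > MAX_SUB_ACTIONS:
        return False                   # too many sub-actions: the pouring attempt fails
    for action in sub_actions:
        execute(action)                # execute each sub-action in sequence
    return True                        # the whole action trajectory was completed
```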
A collision avoidance method is arranged in a quantitative water pouring decision network, and specifically comprises the following steps:
after the quantitative water pouring decision network makes a decision, calculating the TCP position and posture of the robot end effector;
and judging whether the action of the robot collides with the desktop or not according to the calculated pose of the end effector of the robot, if so, re-making a decision, and if not, directly executing the decision.
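A sketch of the check-and-redecide loop; the decision and forward-kinematics routines, the table height and the retry budget are hypothetical stand-ins, since the patent does not specify them:

```python
def safe_decision(decide, forward_kinematics, state, table_z=0.0, max_retries=50):
    """decide(state) -> action and forward_kinematics(state, action) -> (x, y, z, ...) are
    hypothetical callables for the decision network and the robot's kinematic model;
    table_z and max_retries are assumed values."""
    for _ in range(max_retries):
        action = decide(state)
        x, y, z = forward_kinematics(state, action)[:3]    # predicted TCP position of the end effector
        if z > table_z:                                    # no collision with the desktop predicted
            return action                                  # the decision is executed directly
        # otherwise the decision is re-made on the next loop iteration
    raise RuntimeError("no collision-free action found within the retry budget")
```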
The optimization objective function of the PPO strategy optimization algorithm is as follows:
J_PPO^{θ^k}(θ) = J^{θ^k}(θ) − β · KL(θ, θ^k)
J^{θ^k}(θ) ≈ Σ_{(s_t, a_t)} [ p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t) ] · A^{θ^k}(s_t, a_t)
where p_θ is the policy model, θ is the model parameter, β is a hyperparameter, KL is the Kullback-Leibler divergence, A is the advantage function, p is the probability distribution, s is a state, a is an action, t is the step index, and k is the index of the model training iteration.
The PPO policy algorithm in this embodiment includes:
a Critic network C for predicting the value function of a state s_t;
and two Actor networks A_1 and A_2 for generating actions;
the inputs of the Critic network C and of the Actor networks A_1 and A_2 are RGB images and the target water level. During training, within one episode, A_1 interacts with the environment to generate a series of data, the trained reward function output network R generates a reward for each step, and the Critic network C predicts the value function of the current state s_t. The RGB images, the target water level and the per-step reward are put into a memory pool, and when the amount of data in the memory pool reaches a certain quantity, the Actor network is optimized with the Loss function, which is specifically as follows:
L(θ) = Ê_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
r_t(θ) = p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t)
where r_t(θ) is the probability ratio between the new and old policies, ε is a hyperparameter,
A_t = −V(s_t) + Σ_{t′=t}^{T} γ^{t′−t} · r_{t′}
γ is the discount factor, V is the value function, r_{t′} is the per-step reward, Ê_t denotes the expectation, t is the step index, T is the total number of steps, θ is the model parameter, and clip means that values of r_t(θ) smaller than 1−ε are replaced by 1−ε and values larger than 1+ε are replaced by 1+ε;
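As an illustration of the loss above, a minimal PyTorch sketch of the clipped surrogate term and a simple discounted-return advantage; the ε and γ values and the tensor layout are assumptions:

```python
import torch

def discounted_advantage(rewards, values, gamma=0.99):
    """A_t = sum_{t'>=t} gamma^(t'-t) * r_t' - V(s_t); gamma = 0.99 is an assumed value."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns) - values

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped PPO surrogate (to be minimised); eps = 0.2 is an assumed value."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)      # clip(r_t, 1-eps, 1+eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```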
A_2 is trained N times; during this training, A_1 continues to interact with the environment to collect data and update the memory pool. After A_2 has been trained N times, the parameters of A_2 are handed over to A_1, and A_2 then continues to be trained with the data in the memory pool. These steps are cycled until the decision network meets the performance requirement, yielding the quantitative water pouring target decision network.
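A schematic training loop for the two-Actor scheme described above; the environment, the network objects and all numeric settings are illustrative assumptions rather than the patent's exact procedure:

```python
def train_decision_network(env, A1, A2, critic, reward_net, compute_loss, optimiser,
                           episodes=1000, N=10, pool_min=256):
    """A1 collects data, A2 is updated, and A2's parameters are periodically copied to A1.
    env, A1, A2, critic, reward_net and compute_loss are hypothetical interfaces;
    compute_loss(A2, critic, pool) should build e.g. the clipped loss sketched above."""
    memory_pool = []
    for _ in range(episodes):
        state, done = env.reset(), False            # state: RGB images + target water level
        while not done:
            action, log_p_old = A1.act(state)       # A1 interacts with the environment
            next_state, done = env.step(action)
            reward = reward_net(state)              # per-step reward generated by R
            value = critic(state)                   # value of the current state s_t
            memory_pool.append((state, action, log_p_old, reward, value))
            state = next_state
        if len(memory_pool) >= pool_min:            # enough data in the memory pool
            for _ in range(N):                      # train A2 N times on the pooled data
                loss = compute_loss(A2, critic, memory_pool)
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
            A1.load_state_dict(A2.state_dict())     # hand A2's parameters over to A1
```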
In order to ensure the robustness and generalization capability of the model in a complex unstructured environment in the training process, the initial state of the environment where the mechanical arm is located, including the shape, the spatial position, the tablecloth texture, the illumination condition and the like of the cup, needs to be changed continuously during data collection, and the target of pouring water at each time is randomly selected from three states, namely low state, medium state and full state by an algorithm. After a period of training, the model can robustly pour water at a specified target water level in an unstructured complex environment.
Step 4: the trained target decision network is used to drive the service robot to complete the quantitative water pouring service.
A specific application example is provided below:
In this application example, a Universal Robots UR5 mechanical arm is used. The control workstation runs Ubuntu 16.04 and is equipped with an Intel Core i7-10700K (8 cores, 16 threads, turbo frequency 5.1 GHz), two NVIDIA GTX 1080 GPUs and 32 GB of DDR4 memory. Two ordinary RGB cameras are also needed to observe the overall state and the water level. Based on this equipment, a specific implementation is introduced:
An environment as shown in fig. 1 is built, comprising a workbench, a mechanical arm, a workstation, two cups and two RGB cameras. The mechanical arm grasps the cup serving as the original container with its two-finger gripper, and the cup serving as the target container can be placed at any position on the workbench surface; camera 1 shoots vertically from top to bottom, camera 2 shoots the global state obliquely, and the state at each moment in the reinforcement learning of the invention consists of these two images.
After the construction is finished, the quantitative water pouring method is executed, namely:
Step 1: collect expert demonstrations of quantitative water pouring. The expert is a human who holds the original container and pours the water, and the target container can be placed at any position on the workbench shown in fig. 1. To guarantee the later generalization capability of the reward function network R, richness of the table texture, target container shape, target container texture and illumination in the demonstration data must be ensured when collecting the expert demonstration trajectories. This exemplary embodiment uses a total of 15 cups and 5 different tablecloths. Each tablecloth, cup and illumination combination is demonstrated several times for the different target water level requirements, so the total number of demonstration trajectories in this exemplary embodiment is N_total = 3 · λ · 15 · 5 · 3. The expert demonstration trajectory at each moment comprises 2 images, taken by camera 1 and camera 2 respectively, where camera 1 is primarily responsible for observing the water level in the target container and camera 2 is responsible for capturing the global information. The collected expert demonstration data are down-sampled and labels are assigned automatically: the last frame of each trajectory is labelled 1 and the remaining frames are labelled 0.
Step 2: train the reward function generation network R with the demonstration data collected in Step 1, where R is a deep convolutional neural network. Its input is the two images taken by camera 1 and camera 2 at the same time together with the target water level, and its output corresponds to the label. The two images are stacked along the channel dimension, passed through five CNN convolution layers, flattened into a one-dimensional vector, concatenated with the encoded target water level, and a reward value is output after one fully connected layer.
Step 3: reinforcement learning. In this embodiment, the robot performs reinforcement learning directly in the real environment. Since the decision network selects actions randomly when learning starts, the robot could make dangerous movements, such as hitting the table, if the action strategy were not constrained to some degree. Therefore, in this embodiment the robot motion is discretized, the minimum motion amplitude of each joint is 1° (0.01744 rad), and the motion of the robot base is limited to 0-90° to increase the search efficiency; since the table spans 90°, this cuts the search range by three quarters while still allowing the robot to reach the entire table surface. On this basis, after each selection the decision network calculates the TCP pose of the robot end effector; if it would collide with the desktop, the action is reselected. Each trajectory of the robot is at most 100 steps, and if a trajectory exceeds 100 steps the pouring is regarded as failed.
Based on the above limitation, a PPO algorithm is adopted for reinforcement learning, and in order to ensure the generalization performance and robustness of the algorithm in a complex unstructured environment, the tablecloth texture, the target container texture, the shape, the target water level and the illumination are randomly selected during track initialization each time. The inputs to the policy network are again two images, which are images taken by camera 1 and camera 2 at the same time, and the target water level amount. And in the process of interacting with the real environment, generating the reward function in real time by using the reward function generation network R obtained in the step 2. After a period of training, the model can achieve robust completion of quantitative water pouring tasks in complex unstructured environments.
The structure of the reward function generation network R is shown in fig. 3. R is a deep convolutional neural network; its inputs are the two stacked images captured by camera 1 and camera 2 at the same time and a vector representing the target water level, and its output is a real number within [0, 1] representing the reward of the current state. During training, the loss is the Euclidean distance between the output and the label.
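A minimal supervised training loop for R matching this description; the optimiser, learning rate and epoch count are assumptions, and the Euclidean distance to the scalar label is implemented in its squared form (MSE):

```python
import torch

def train_reward_net(reward_net, loader, epochs=20, lr=1e-4):
    """loader yields (stacked_images, water_code, label) batches built from D_expert';
    epochs and lr are assumed values."""
    optimiser = torch.optim.Adam(reward_net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()                     # squared Euclidean distance to the 0/1 label
    for _ in range(epochs):
        for images, code, label in loader:
            pred = reward_net(images, code).squeeze(1)
            loss = loss_fn(pred, label.float())
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
```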
Step 4: the trained target decision network is used to drive the service robot to complete the quantitative water pouring service.
The embodiment also relates to a storage medium which stores any one of the quantitative water pouring methods.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A service robot quantitative water pouring method based on simulation learning is characterized by comprising the following steps:
step 1: acquiring quantitative water pouring demonstration data of human experts;
step 2: training a reward function output network by using the demonstration data obtained in the step 1;
step 3: establishing a quantitative water pouring decision network and, based on the reward function output network, learning quantitative water pouring actions with a reinforcement learning algorithm in a complex unstructured scene to obtain a target decision network;
step 4: using the trained target decision network to drive the service robot to complete the quantitative water pouring service.
2. The service robot quantitative pouring method based on simulation learning of claim 1, wherein the constraints of collecting pouring demonstration data in step 1 comprise:
when collecting the demonstration data D_expert, the target container and the desktop background need to be replaced, and the quantitative water pouring demonstration needs to be repeated under different illumination conditions;
dividing the water quantity in the target container into three levels: low, medium and full, where the low level is 20±5% of the target container's capacity, the medium level is 50±5% of the capacity and the full level is 90±5% of the capacity, the three levels being represented by the binary codes 001, 010 and 100 respectively;
in the process of collecting demonstration data, the same target container, the same desktop background and the same illumination condition are taken as a group of demonstration conditions, and the water pouring demonstration of three levels of water is required to be completed under each group of demonstration conditions.
3. The service robot quantitative water pouring method based on simulation learning as claimed in claim 2, wherein the number of target containers is N_cup, the number of desktop backgrounds is N_wallpaper, the number of illumination types is N_light, and each target water level is repeated λ times under every set of demonstration conditions, so that the total number of demonstration trajectories is:
N_total = 3 · λ · N_cup · N_wallpaper · N_light
4. the service robot quantitative water pouring method based on simulation learning as claimed in claim 1, wherein the step 2 is specifically as follows:
the input of the reward function output network R is an RGB image and a target water level, the output of the network R is discrete values 0 and 1, wherein 0 represents that the task is not finished, and 1 represents that the task is finished;
when training the network R, firstly, downsampling the demonstration data acquired in the step 1, setting the reward label of the last frame in the downsampled demonstration track data as 1, setting the reward labels of the other frames as 0, and simultaneously recording the target water level of each group of demonstration data;
the labelled data are denoted D_expert′, and the reward function output network R is trained with D_expert′ by means of supervised learning.
5. The service robot quantitative water pouring method based on simulation learning of claim 1, wherein the step 3 is specifically as follows:
adopting a PPO strategy optimization algorithm to perform reinforcement learning in a complex unstructured scene, wherein the reward function required in the training process is provided by the reward function output network R obtained in the step 2;
and training the quantitative water pouring decision network by using the demonstration data, wherein the environment faced by the service robot in the training process needs to be the same as the demonstration conditions of the current demonstration data, and finally obtaining the quantitative water pouring target decision network.
6. The service robot quantitative water pouring method based on simulation learning of claim 5, wherein the quantitative water pouring decision network specifically comprises:
discretizing the action of the service robot into a plurality of sub-actions, wherein the minimum motion amplitude of each joint of the service robot is 1 degree, and the motion angle range of a robot base is 0-90 degrees;
after the action is discretized, the robot judges whether the number of the discretized sub-actions exceeds 100, if so, the robot is regarded as that the water pouring action fails, and if not, each sub-action is executed in sequence until the whole action track is completed.
7. The service robot quantitative water pouring method based on simulation learning of claim 6, wherein the quantitative water pouring decision network is provided with a collision avoidance method, specifically comprising:
after the quantitative water pouring decision network makes a decision, calculating the TCP position and posture of the robot end effector;
and judging whether the action of the robot collides with the desktop or not according to the calculated pose of the end effector of the robot, if so, re-making a decision, and if not, directly executing the decision.
8. The simulation learning-based service robot quantitative water pouring method as claimed in claim 5, wherein the optimization objective function of the PPO strategy optimization algorithm is as follows:
J_PPO^{θ^k}(θ) = J^{θ^k}(θ) − β · KL(θ, θ^k)
J^{θ^k}(θ) ≈ Σ_{(s_t, a_t)} [ p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t) ] · A^{θ^k}(s_t, a_t)
where p_θ is the policy model, θ is the model parameter, β is a hyperparameter, KL is the Kullback-Leibler divergence, A is the advantage function, p is the probability distribution, s is a state, a is an action, t is the step index, and k is the index of the model training iteration.
9. The simulation learning-based service robot quantitative water pouring method as claimed in claim 5, wherein the PPO strategy algorithm comprises:
a Critic network C for predicting the value function of a state s_t;
and two Actor networks A_1 and A_2 for generating actions;
the inputs of the Critic network C and of the Actor networks A_1 and A_2 are RGB images and the target water level; during training, within one episode, A_1 interacts with the environment to generate a series of data, the trained reward function output network R generates a reward for each step, and the Critic network C predicts the value function of the current state s_t; the RGB images, the target water level and the per-step reward are put into a memory pool, and when the amount of data in the memory pool reaches a certain quantity, the Actor network is optimized with the Loss function, which is specifically:
L(θ) = Ê_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
r_t(θ) = p_θ(a_t | s_t) / p_{θ^k}(a_t | s_t)
where r_t(θ) is the probability ratio between the new and old policies, ε is a hyperparameter,
A_t = −V(s_t) + Σ_{t′=t}^{T} γ^{t′−t} · r_{t′}
γ is the discount factor, V is the value function, r_{t′} is the per-step reward, Ê_t denotes the expectation, t is the step index, T is the total number of steps, θ is the model parameter, and clip means that values of r_t(θ) smaller than 1−ε are replaced by 1−ε and values larger than 1+ε are replaced by 1+ε;
A_2 is trained N times; during this training, A_1 continues to interact with the environment to collect data and update the memory pool; after A_2 has been trained N times, the parameters of A_2 are handed over to A_1, and A_2 then continues to be trained with the data in the memory pool; these steps are cycled until the decision network meets the performance requirement, yielding the quantitative water pouring target decision network.
10. A storage medium, wherein the storage medium stores the service robot quantitative water pouring method based on the imitation learning of any one of claims 1 to 9.
CN202110217089.7A 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium Active CN112975967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217089.7A CN112975967B (en) 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110217089.7A CN112975967B (en) 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium

Publications (2)

Publication Number Publication Date
CN112975967A true CN112975967A (en) 2021-06-18
CN112975967B CN112975967B (en) 2022-06-28

Family

ID=76351116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217089.7A Active CN112975967B (en) 2021-02-26 2021-02-26 Service robot quantitative water pouring method based on simulation learning and storage medium

Country Status (1)

Country Link
CN (1) CN112975967B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016083678A (en) * 2014-10-24 2016-05-19 三明機工株式会社 Robot hand for injecting molten metal
CN111241952A (en) * 2020-01-03 2020-06-05 广东工业大学 Reinforced learning reward self-learning method in discrete manufacturing scene
CN111488988A (en) * 2020-04-16 2020-08-04 清华大学 Control strategy simulation learning method and device based on counterstudy
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN111872934A (en) * 2020-06-19 2020-11-03 南京邮电大学 Mechanical arm control method and system based on hidden semi-Markov model
CN111766782A (en) * 2020-06-28 2020-10-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pastor, P. et al., "Learning and Generalization of Motor Skills by Learning from Demonstration," 2009 IEEE International Conference on Robotics and Automation. *

Also Published As

Publication number Publication date
CN112975967B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN111515961B (en) Reinforcement learning reward method suitable for mobile mechanical arm
Mandlekar et al. What matters in learning from offline human demonstrations for robot manipulation
Rahmatizadeh et al. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration
CN109800864B (en) Robot active learning method based on image input
CN111203878B (en) Robot sequence task learning method based on visual simulation
Suzuki et al. Visual servoing to catch fish using global/local GA search
CN108819948A (en) Driving behavior modeling method based on reverse intensified learning
CN108415254B (en) Waste recycling robot control method based on deep Q network
Zhang et al. Modular deep q networks for sim-to-real transfer of visuo-motor policies
Ghadirzadeh et al. Bayesian meta-learning for few-shot policy adaptation across robotic platforms
CN108791302A (en) Driving behavior modeling
Inoue et al. Transfer learning from synthetic to real images using variational autoencoders for robotic applications
Zhang et al. One-shot domain-adaptive imitation learning via progressive learning applied to robotic pouring
CN112975968B (en) Mechanical arm imitation learning method based on third visual angle variable main body demonstration video
CN112975967B (en) Service robot quantitative water pouring method based on simulation learning and storage medium
Hwang et al. Achieving" synergy" in cognitive behavior of humanoids via deep learning of dynamic visuo-motor-attentional coordination
Gómez et al. Simulating development in a real robot: on the concurrent increase of sensory, motor, and neural complexity
Yokota et al. A multi-task learning framework for grasping-position detection and few-shot classification
CN113119073A (en) Mechanical arm system based on computer vision and machine learning and oriented to 3C assembly scene
Yoshimoto et al. Object recognition system using deep learning with depth images for service robots
Chang et al. Reducing the deployment-time inference control costs of deep reinforcement learning agents via an asymmetric architecture
Jeong et al. Developmental learning of integrating visual attention shifts and bimanual object grasping and manipulation tasks
Goncalves et al. Neural mechanisms for learning of attention control and pattern categorization as basis for robot cognition
Ushida et al. Fuzzy-associative-memory-based knowledge construction with an application to a human-machine interface
Zhang et al. One-shot domain-adaptive imitation learning via progressive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant