CN113050433A - Robot control strategy migration method, device and system - Google Patents
- Publication number
- CN113050433A (application number CN202110603540.9A)
- Authority
- CN
- China
- Prior art keywords
- strategy
- sample
- actual
- state
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Abstract
The invention provides a robot control strategy migration method, device and system. A difference strategy is introduced when the difference value between the actual state and the reference state determined based on the task strategy falls outside a preset range. Through the dual-strategy collaborative migration of the task strategy and the difference strategy, the application effect of the task strategy in the actual control system can be ensured, so that the actual control system controls the target robot more accurately and the task to be executed is carried out smoothly.
Description
Technical Field
The invention relates to the technical field of reinforcement learning and robot control, in particular to a robot control strategy migration method, device and system.
Background
In recent years, research on applying reinforcement learning to robot control has become a focus. However, the core mechanism of reinforcement learning is trial and error over a large number of samples, from which a qualified control strategy is trained; training directly on the actual control system of a robot faces a series of practical problems of high cost, high risk and low efficiency, such as hardware wear, potential safety hazards and long time consumption, which forces a large amount of research to concentrate on the simulation level. It is therefore a natural idea for researchers to transfer a control strategy trained under simulation to the actual control system.
In fact, the difference between the simulation source domain and the real target domain gives rise to a simulation-to-reality gap: with high probability, the application effect of the migrated control strategy on the actual control system of the robot falls far short of its effect at the simulation level. How to solve this difference problem has become a major difficulty for researchers. Currently, the various solutions can be divided into three major categories: methods based on system identification, methods based on domain adaptation, and methods based on domain randomization.
Although these three kinds of methods have obvious effects in alleviating the difference problem, how to better reproduce the application effect of the control strategy in the actual control system still needs to be studied intensively.
Disclosure of Invention
The invention provides a robot control strategy migration method, device and system, which are used for overcoming the defects in the prior art.
The invention provides a robot control strategy migration method, which comprises the following steps:
migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system;
if the difference value between the actual state and the reference state determined based on the task strategy is judged to be out of the preset range, transferring the difference strategy of the target robot to the actual control system so that the actual control system executes the task strategy and the coupling action under the difference strategy, and further determining the actual state of the target robot at the next moment of the current moment;
and determining the difference strategy based on a state deviation set between a sample actual state set obtained after the task strategy is transferred to the actual control system for multiple times and a reference state set output by the task strategy and a sample correction action corresponding to each transfer.
According to the robot control strategy migration method provided by the invention, the difference strategy is specifically determined by the following method:
transferring the task strategy to the actual control system for multiple times, and determining a sample actual state set of the target robot based on sample actions obtained by the actual control system executing the task strategy each time;
for any transition, determining a sample correction action corresponding to the any transition based on a state deviation set between a reference state set of the target robot and a sample actual state set corresponding to the any transition;
and determining the difference strategy based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action.
According to the robot control strategy migration method provided by the invention, the method for determining the sample correction action corresponding to any migration based on the reference state set of the target robot and the state deviation set between the sample actual state sets corresponding to any migration specifically comprises the following steps:
selecting a first state deviation exceeding a threshold value from the state deviation set according to a time sequence, and determining an alternative sample correction action set corresponding to the state deviation;
and determining the sample correcting action based on the state deviation, a sample estimated state obtained by correcting the actual state of the sample corresponding to the state deviation through each candidate sample correcting action in the candidate sample correcting action set and a reference state corresponding to the state deviation.
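The selection procedure above can be sketched as follows, for illustration only. Here `predict_state` is a hypothetical model that estimates the state reached after applying a candidate correction action to a sample actual state, and the deviation is measured as a Euclidean distance; both are assumptions, not fixed by the embodiment.

```python
import numpy as np

def select_correction_action(deviations, actual_states, reference_states,
                             candidates, predict_state, threshold):
    """Pick the sample correction action for one migration.

    `deviations`, `actual_states` and `reference_states` are time-ordered;
    `candidates` is the candidate sample correction action set, and
    `predict_state` (hypothetical) estimates the sample estimated state
    obtained by correcting an actual state with a candidate action.
    """
    for t, dev in enumerate(deviations):      # scan in time sequence
        if dev > threshold:                   # first state deviation over threshold
            # sample estimated state for each candidate correction action
            estimates = [predict_state(actual_states[t], a) for a in candidates]
            # choose the candidate whose estimate is closest to the reference state
            errors = [np.linalg.norm(e - reference_states[t]) for e in estimates]
            return candidates[int(np.argmin(errors))]
    return None  # no deviation exceeded the threshold
```

A candidate is thus scored by how close its predicted corrected state lands to the reference state at the first moment where the trajectory has drifted too far.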
According to the robot control strategy migration method provided by the invention, the determining the difference strategy based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action specifically comprises the following steps:
and constructing a training target based on the sample correcting actions corresponding to the multiple times of migration and the sample actual state corresponding to each sample correcting action, and training the sample correcting actions corresponding to the multiple times of migration and the sample actual states corresponding to each sample correcting action based on the training target to obtain the difference strategy.
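One natural form of such a training target (an assumption; the embodiment does not fix the loss) is a regression objective over the collected pairs of sample actual states $\hat{s}_i$ and sample correction actions $a^G_i$:

$$\min_{\theta}\; \sum_{i} \big\| \pi_G(\hat{s}_i;\theta) - a^G_i \big\|^2$$

so that the trained difference strategy $\pi_G$ reproduces the correction action deduced for each visited actual state.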
According to the robot control strategy migration method provided by the invention, the difference strategy is determined based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action, and the method also comprises the following steps:
and eliminating repeated sample correction actions from the sample correction actions corresponding to the multiple times of migration, together with the sample actual states corresponding to those repeated sample correction actions.
According to the robot control strategy migration method provided by the invention, the task strategy is obtained by pre-training based on a reinforcement learning method, and a reward function adopted during training is determined based on a distance function between the actual position and the target position of a target object involved in a task to be executed of the target robot.
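As one possible concrete form (the particular distance function and its sign are assumptions; the embodiment only states that the reward is determined by a distance function between the actual position and the target position of the target object), the reward could be the negative Euclidean distance, so that reward increases as the object approaches the target:

```python
import numpy as np

def reward(object_position, target_position):
    """Reward for the pushing task: negative Euclidean distance between
    the target object's actual position and the target position."""
    diff = np.asarray(object_position, dtype=float) - np.asarray(target_position, dtype=float)
    return -float(np.linalg.norm(diff))
```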
The invention also provides a robot control strategy migration device, which comprises:
the task strategy migration module is used for migrating a task strategy of the target robot to an actual control system of the target robot and determining the actual state of the target robot at the current moment based on the actual control system;
a difference strategy migration module, configured to migrate the difference strategy of the target robot to the actual control system if it is determined that a difference between the actual state and a reference state determined based on the task strategy is outside a preset range, so that the actual control system executes the task strategy and a coupling action under the difference strategy, and further determines an actual state of the target robot at a next time of the current time;
and determining the difference strategy based on a state deviation set between a sample actual state set obtained after the task strategy is transferred to the actual control system for multiple times and a reference state set output by the task strategy and a sample correction action corresponding to each transfer.
The invention also provides a robot control strategy migration system, which comprises: the robot control strategy migration device is connected with the camera device;
the camera device is used for acquiring the actual state of the target robot.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the robot control strategy migration method.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the robot control strategy migration method according to any of the above-described methods.
The invention provides a robot control strategy migration method, device and system. First, a task strategy of a target robot is migrated to an actual control system of the target robot, and the actual state of the target robot at the current moment is determined based on the actual control system. Then, when the difference value between the actual state and the reference state determined based on the task strategy is outside the preset range, the difference strategy of the target robot is migrated to the actual control system, so that the actual control system executes the coupling action under the task strategy and the difference strategy, and the actual state of the target robot at the next moment is determined. Because the difference strategy is introduced when the difference value between the actual state and the reference state is outside the preset range, the dual-strategy collaborative migration of the task strategy and the difference strategy ensures the application effect of the task strategy in the actual control system, so that the actual control system can control the target robot more accurately and the task to be executed can be carried out smoothly.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a robot control strategy migration method provided by the present invention;
FIG. 2 is a second schematic flow chart of a robot control strategy migration method according to the present invention;
FIG. 3 is a third schematic flowchart of a robot control strategy migration method according to the present invention;
fig. 4 is a schematic diagram of a trajectory of a target object on a motion plane during task strategy migration in the robot control strategy migration method provided by the present invention;
FIG. 5 is a schematic diagram of a target object deflected clockwise by the difference effect in the robot control strategy migration method provided by the present invention;
FIG. 6 is a schematic diagram of a target object deflected counterclockwise due to the difference effect in the robot control strategy migration method provided by the present invention;
FIG. 7 is a schematic top view of a coupling action in the robot control strategy migration method according to the present invention;
FIG. 8 is a schematic top view of the difference strategy training task when the target object initially undergoes clockwise deflection in the robot control strategy migration method provided by the present invention;
FIG. 9 is a schematic top view of the difference strategy training task when the target object initially undergoes counterclockwise deflection in the robot control strategy migration method provided by the present invention;
FIG. 10 is a schematic diagram of the movement locus of the iron box on the motion plane under the single-strategy migration method in the prior art;
FIG. 11 is a schematic diagram of the movement locus of the iron box on the motion plane under the robot control strategy migration method provided by the invention;
FIG. 12 is a schematic diagram of the movement locus, on the motion plane, of a paper box pushed with an iron block placed above it, under the single-strategy migration method in the prior art;
FIG. 13 is a schematic diagram of the movement locus, on the motion plane, of a paper box pushed with an iron block placed above it, under the robot control strategy migration method provided by the invention;
FIG. 14 is a schematic diagram of the movement locus, on the motion plane, of a paper box pushed with an iron block placed below it, under the single-strategy migration method in the prior art;
FIG. 15 is a schematic diagram of the movement locus, on the motion plane, of a paper box pushed with an iron block placed below it, under the robot control strategy migration method provided by the invention;
FIG. 16 is a schematic structural diagram of a robot control strategy migration apparatus provided by the present invention;
FIG. 17 is a schematic structural diagram of a robot control strategy migration system provided by the present invention;
fig. 18 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Because the difference between the simulation source domain and the real target domain causes the difference problem from simulation to reality, the application effect of the actual control system of the robot migrated by the obtained control strategy is far from the effect of the simulation level with high probability. In the prior art, when solving the above difference problem, three general solutions are generally adopted, including: a system identification based method, a domain adaptation based method, and a domain randomization based method. Although the three methods have obvious effects in solving the difference problem, it needs to be studied intensively how to better reproduce the application effect of the control strategy in the actual control system. For example, how to reproduce the application effect of the control strategy in the actual control system as much as possible, how to shorten the training time of the control strategy, and the like. Therefore, the embodiment of the invention provides a robot control strategy migration method.
Fig. 1 is a schematic flowchart of a robot control policy migration method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
s1, transferring the task strategy of the target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system;
s2, if the difference value between the actual state and the reference state determined based on the task strategy is judged to be out of the preset range, transferring the difference strategy of the target robot to the actual control system, so that the actual control system executes the task strategy and the coupling action under the difference strategy, and further determining the actual state of the target robot at the next moment of the current moment;
and determining the difference strategy based on a state deviation set between a sample actual state set obtained after the task strategy is transferred to the actual control system for multiple times and a reference state set output by the task strategy and a sample correction action corresponding to each transfer.
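Steps S1 and S2 can be sketched as the rollout loop below. Here `env`, `pi_task`, `pi_diff` and the additive coupling of the two actions are hypothetical stand-ins for the actual control system, the task strategy, the difference strategy and the coupling action; the embodiment does not prescribe this exact API.

```python
import numpy as np

def run_dual_policy_migration(env, pi_task, pi_diff, reference_states,
                              preset_range, horizon):
    """One rollout of dual-strategy collaborative migration (hypothetical API):
    the task policy acts alone while the actual state stays near the
    reference trajectory; once the deviation leaves the preset range, the
    difference policy is migrated in and the coupled action is executed."""
    state = env.reset()                       # S1: actual state at the current moment
    use_diff = False
    for t in range(horizon):
        action = pi_task(state)               # task action from the task strategy
        deviation = np.linalg.norm(state - reference_states[t])
        if deviation > preset_range:          # S2: difference value outside preset range
            use_diff = True                   # difference strategy stays migrated
        if use_diff:
            action = action + pi_diff(state)  # coupling action of both strategies
        state = env.step(action)              # actual state at the next moment
    return state
```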
Specifically, in the robot control policy migration method provided in the embodiment of the present invention, the execution subject is a robot control policy migration apparatus. The apparatus may be configured in a server; the server may be a local server or a cloud server, and the local server may specifically be a computer, which is not specifically limited in the embodiment of the present invention.
In general, the difference problem is given by the following form:

$$s_{t+1} \sim P(\cdot \mid s_t, a_t), \qquad \hat{s}_{t+1} \sim \hat{P}(\cdot \mid s_t, a_t)$$

wherein $P$ and $\hat{P}$ respectively represent the transition probability distributions describing the dynamic characteristics of the virtual simulation end and the actual control system, $O$ and $\hat{O}$ respectively represent the generation probability distributions of the virtual simulation end and the actual control system with respect to image information, and the hat marks quantities of the actual control system. $a_t$ represents the task action at time $t$, and $s_t$ represents the reference state at time $t$. $s_{t+1}$ represents the reference state of the virtual simulation end at time $t+1$, and $\hat{s}_{t+1}$ represents the actual state at time $t+1$ in the actual control system. $\sim$ indicates that the left-side quantity is sampled from the right-side distribution.
Because the probability distributions on the virtual and real sides differ, when the control strategy trained at the virtual simulation end is migrated to the actual control system, $\hat{s}_{t+1}$ deviates from $s_{t+1}$ and a difference is created. The embodiment of the invention focuses on the virtual-real difference of the dynamic characteristics, namely the difference between $P$ and $\hat{P}$. Since image information is not used in the learning and migration of the strategy, the method and the device of the invention omit the influence of $O$ and $\hat{O}$.
At present, strategy migration from the virtual simulation end to the actual control system is basically the migration of a single strategy, and in the face of the resulting difference problem much research is devoted to making that single strategy more robust and more general, thereby compensating the difference. Unlike this idea, the embodiment of the present invention first decouples a general policy $\pi_\theta$ into a task policy $\pi_T$ and a difference policy $\pi_G$: the former is dedicated to task skills, the latter is dedicated to overcoming differences, and the two are then migrated to the actual control system in a coordinated manner, achieving a low-difference and strongly robust migration effect. Here $\theta$ denotes the strategy parameters obtained by training, $T$ denotes the task, and $G$ denotes the difference.
The difference strategy requires the experimental feedback obtained after the task strategy is migrated to the actual control system as prior knowledge for the simulation training setting; from it a learning target of the difference strategy is constructed, and training is carried out back at the virtual simulation end, reflecting the virtual-real interaction characteristic of going from "virtual" to "real" and back to "virtual". Migration starting points of single strategies all revolve around $\hat{P}$: they either try to obtain, under virtual simulation, a $P$ close to the real $\hat{P}$, or set a parameter space at the virtual simulation end whose coverage includes the dynamic distribution of the actual control system so as to strengthen the robustness of the strategy. In contrast, the present invention focuses on how to make $\hat{s}_{t+1}$ approach $s_{t+1}$. With $\hat{P}$ determined and $\hat{s}_t$ known, if $\hat{s}_{t+1}$ is to be changed, the key lies in the action $a_t$; and since the task policy $\pi_T$ has already been trained, its task action $a^T_t = \pi_T(\hat{s}_t)$, although it cannot make the actual state approach the reference state, does guarantee the movement tendency. Therefore, in the embodiment of the invention the transition should further be:

$$\hat{s}_{t+1} \sim \hat{P}\big(\cdot \mid \hat{s}_t,\; a^T_t + a^G_t\big)$$
wherein $a^G_t$ is the correction action, which can be coupled with the task action $a^T_t$ to obtain the executed action $a_t = a^T_t + a^G_t$. Assuming further that the correction action is given by a policy, $a^G_t = \pi_G(\hat{s}_t)$, the actions in the above formula are rewritten in strategy form as:

$$\hat{s}_{t+1} \sim \hat{P}\big(\cdot \mid \hat{s}_t,\; \pi_T(\hat{s}_t) + \pi_G(\hat{s}_t)\big)$$
in the above formulaAlthough not readily available, it is notHas been already provided withMigratory from deficiency or excess、、Andit is deduced thatThen it can be obtained, and then based on the action and state adjustment simulation setting training. Further, the compound of the above formulaIs defined asWill beIs defined as. To this end, general strategies with difference overcoming capabilities are decoupled into task strategies and difference strategies.
As shown in fig. 2, the general strategy is divided into a task strategy and a difference strategy through decoupling, the task strategy and the difference strategy interact with each other through virtual and real, and the task strategy and the difference strategy act on the actual control system of the target robot in a cooperative manner.
First, step S1 is executed to migrate the task strategy of the target robot to the actual control system of the target robot. The target robot is the robot to be controlled, which executes a task to be executed; the task to be executed may be pushing a target object to a target position. The target robot may include a robot clamping jaw in its structure; the clamping jaw grips the target object and pushes it to the target position, thereby carrying out the task to be executed.
The task strategy can be a pre-trained preset task execution model, the input of the task strategy can be the actual state of the target robot at each moment in the process of executing the task to be executed, and the output can be the task action required to be executed by the target robot at each moment in the process of executing the task to be executed. Wherein, the basic model of the task strategy can be a neural network model.
The actual control system of the target robot means a control system for controlling an actual state of the target robot and a task action to be performed for performing a task to be performed. After the actual control system of the target robot is migrated into the task strategy, the actual state of the target robot at the previous moment is controlled to change through the task action output by the task strategy at the previous moment, and the actual state at the current moment is obtained.
The state of the target robot can be represented by information such as Cartesian position coordinates of the tail end of the clamping jaw of the robot, pose of each joint of the robot and the like. The task action required to be executed by the target robot to execute the task to be executed can be represented by information such as the coordinate change of the Cartesian position of the tail end of the clamping jaw of the robot, the pose change of each joint of the robot and the like.
Then, step S2 is executed to determine whether the difference between the actual state of the target robot at the current moment and the reference state determined based on the task strategy is outside the preset range. The reference state refers to the state of each moment in the process that the target robot executes the task to be executed under the ideal condition that the task strategy is not combined with an actual control system. At each moment, the target robot corresponds to an actual state and a reference state, and then for the current moment, a difference value between the actual state and the reference state of the target robot at the current moment can be calculated, and the difference value can be a difference value between two position coordinates, namely a distance.
The preset range may be a state deviation range which can be allowed and is given in advance, and if the difference value is within the preset range, it indicates that the actual state at the current moment is consistent with the reference state determined by the task strategy to some extent, so that the actual control system may continue to implement control over the target robot through the task strategy until the task to be executed is completed. If the difference value is out of the preset range, the difference value between the actual state at the current moment and the reference state determined by the task strategy is large and cannot be ignored, so that the difference value needs to be corrected by introducing a difference strategy, namely the difference strategy of the target robot is transferred to an actual control system.
The difference strategy can be a difference correction model trained in advance, the input of the difference strategy can be the actual state of the target robot at the current moment, and the output can be the correction action required to be executed by the target robot to enable the actual state at the current moment to be consistent with the reference state or enable the state difference value of the actual state and the reference state to be within a preset range. Wherein the basic model of the difference strategy may be a neural network model.
After the difference strategy is migrated into the actual control system, the correction action output by the difference strategy is coupled with the task action output by the task strategy to obtain a coupling action. The coupling action may be a combination of a correction action and a task action, for example, the correction action is in a north direction, the task action is in a west direction, and the coupling action is in the north-west direction. After determining the coupling action, the actual control system of the target robot can control the target robot to execute the coupling action, and further obtain the actual state of the target robot at the next moment of the current moment.
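The combination described above can be sketched as plain vector addition of the two actions; this additive form is one possible combination, not the only one the embodiment would admit.

```python
import numpy as np

def couple(task_action, correction_action):
    """Coupling action: combine the task action output by the task strategy
    with the correction action output by the difference strategy, here by
    vector addition."""
    return (np.asarray(task_action, dtype=float)
            + np.asarray(correction_action, dtype=float))

# e.g. a task action due west combined with a correction action due north
# yields a coupled action toward the north-west
```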
In the embodiment of the invention, the difference strategy can be determined by a state deviation set between a sample actual state set obtained after the task strategy is transferred to the actual control system for multiple times and a reference state set output by the task strategy. The number of times of migration may be set according to the circumstances, such as once, three times, ten times, one hundred times, and the like, and this is not particularly limited in the embodiment of the present invention. Each time of migration, a sample task action at each time of the migration is obtained, and the actual control system can obtain a sample actual state set including the sample actual state at each time of the migration by executing the sample task action. The reference state set is a set formed by reference states at all times in the process of executing a task to be executed by the target robot, and the state deviation set is a set of state differences between the reference states and the actual states at the same time in the sample actual state set and the reference state set. Through the set, the difference strategy can be determined by combining the sample correction action corresponding to each migration.
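The state deviation set described above can be sketched as a per-moment distance between the sample actual state and the reference state at the same time; the Euclidean distance used here is an assumption, since the embodiment only requires a state difference at corresponding moments.

```python
import numpy as np

def state_deviation_set(sample_actual_states, reference_states):
    """State deviation set: for each moment, the distance between the
    sample actual state and the reference state at the same moment."""
    return [float(np.linalg.norm(np.asarray(s, dtype=float)
                                 - np.asarray(r, dtype=float)))
            for s, r in zip(sample_actual_states, reference_states)]
```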
The robot control strategy migration method provided by the embodiment of the invention first migrates a task strategy of a target robot to an actual control system of the target robot, and determines the actual state of the target robot at the current moment based on the actual control system; then, when the difference between the actual state and the reference state determined based on the task strategy is outside the preset range, the difference strategy of the target robot is migrated to the actual control system, so that the actual control system executes the coupling action under the task strategy and the difference strategy, and the actual state of the target robot at the next moment is thereby determined. Because the difference strategy is introduced when the difference between the actual state and the reference state determined based on the task strategy is outside the preset range, the dual-strategy collaborative migration of the task strategy and the difference strategy can guarantee the application effect of the task strategy in the actual control system, so that the actual control system controls the target robot more accurately and the task to be executed is carried out smoothly.
On the basis of the foregoing embodiment, in the robot control policy migration method provided in the embodiment of the present invention, the difference policy is specifically determined by the following method:
transferring the task strategy to the actual control system for multiple times, and determining a sample actual state set of the target robot based on sample actions obtained by the actual control system executing the task strategy each time;
for any one migration, determining a sample correction action corresponding to that migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration;
and determining the difference strategy based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action.
Specifically, in the embodiment of the present invention, when determining the difference policy, the task policy may be migrated to the actual control system multiple times. Each time it is migrated, the actual control system executes the sample actions output by the task policy at each moment of that migration, and a sample actual state set of the target robot is thereby determined, comprising the sample actual states of the target robot at the different moments of that migration.
Since the process of each migration is the same, take any one of the multiple migrations as an example. For that migration, the reference state set can be expressed as {s_1^r, s_2^r, …, s_n^r}, the sample actual state set can be expressed as {s_1^a, s_2^a, …, s_n^a}, and the state deviation set can be expressed as {Δs_1, Δs_2, …, Δs_n}, where Δs_t = s_t^r − s_t^a and n is the number of states involved in the execution of the task to be executed. The sample correction action corresponding to the migration is determined according to the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to the migration. By coupling the sample correction action with the sample task action, the state deviation can be reduced, that is, the sample actual state is made to tend toward the reference state.
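As a sketch of computing the state deviation set for one migration (the state representation as coordinate tuples and the symbols are assumptions for illustration):

```python
def state_deviation_set(reference_states, actual_states):
    """Compute the state deviation set {Δs_t = s_t^r - s_t^a} for one migration.

    Each state is a tuple of coordinates; the deviation is taken per dimension.
    """
    assert len(reference_states) == len(actual_states)
    return [tuple(r - a for r, a in zip(s_ref, s_act))
            for s_ref, s_act in zip(reference_states, actual_states)]

# Example with n = 2 planar states (positions in meters).
ref = [(0.10, 0.00), (0.20, 0.00)]
act = [(0.10, 0.01), (0.19, 0.03)]
deviations = state_deviation_set(ref, act)   # second deviation ~ (0.01, -0.03)
```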
And finally, the difference strategy can be obtained by taking the sample correction action corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action as training samples.
In the embodiment of the invention, the difference strategy is obtained by training with the sample actual state sets obtained from the multiple migrations of the task strategy and the sample correction action corresponding to each sample actual state as training samples, so that the difference strategy better corrects the task strategy after it is migrated to the actual control system.
On the basis of the foregoing embodiment, in the robot control strategy migration method provided in the embodiment of the present invention, determining the sample correction action corresponding to any one migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration specifically includes:
selecting a first state deviation exceeding a threshold value from the state deviation set according to a time sequence, and determining an alternative sample correction action set corresponding to the state deviation;
and determining the sample correcting action based on the state deviation, a sample estimated state obtained by correcting the actual state of the sample corresponding to the state deviation through each candidate sample correcting action in the candidate sample correcting action set and a reference state corresponding to the state deviation.
Specifically, in the embodiment of the present invention, when determining the sample correction action, the first state deviation whose magnitude exceeds the threshold c may be selected from the state deviation set in time order. The threshold may be a distance threshold and may be set as needed, which is not specifically limited in the embodiment of the present invention. Combining the reference state s^r and the sample actual state s^a at the same moment, the deviation Δs = s^r − s^a can be obtained. The state-deviation trajectory obtained after k migrations can be expressed as {Δs_1, Δs_2, …, Δs_k}, where Δs_h is the selected state deviation at the h-th migration.
According to the state deviation, a candidate sample correction action set corresponding to the state deviation can be determined; the candidate sample correction actions can be manually selected actions that may possibly achieve the correction.
Then, the sample correction action is determined from the state deviation, the sample estimated states obtained by correcting the sample actual state corresponding to the state deviation with each candidate sample correction action in the candidate sample correction action set, and the reference state corresponding to the state deviation. For example, the absolute value of the state deviation between the sample estimated state obtained by correcting the sample actual state with each candidate sample correction action and the corresponding reference state may be calculated first, and then compared with the absolute value of the original state deviation. If there is:

|ŝ_h − s_h^r| < |Δs_h|, for h = 1, 2, …, k,

where h takes the values 1 to k in natural order, ŝ_h represents the sample estimated state at the h-th migration, and s_h^r represents the reference state at the h-th migration, then the candidate sample correction action a^M satisfies the condition, where the superscript M denotes correction, and that candidate sample correction action a^M is taken as the sample correction action.
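The selection rule can be sketched as follows (symbols and the 1-D state model are assumptions): a candidate is accepted if, at every migration h, the deviation of its corrected (estimated) state from the reference state is smaller in magnitude than the original deviation.

```python
def satisfies_correction(estimated_states, reference_states, deviations):
    """True iff |s_hat_h - s_h^r| < |Δs_h| holds at every migration h."""
    return all(abs(s_hat - s_ref) < abs(ds)
               for s_hat, s_ref, ds in zip(estimated_states, reference_states,
                                           deviations))

def pick_correction_action(candidates, actual_states, reference_states,
                           deviations, apply_action):
    """Return the first candidate correction action that shrinks every deviation.

    `apply_action(state, action)` is a hypothetical model of how a correction
    action turns a sample actual state into a sample estimated state.
    """
    for action in candidates:
        estimated = [apply_action(s, action) for s in actual_states]
        if satisfies_correction(estimated, reference_states, deviations):
            return action
    return None

# Toy 1-D example over k = 2 migrations: only the +0.05 candidate helps.
def apply_action(state, action):
    return state + action

best = pick_correction_action([-0.05, 0.05], [0.9, 0.8], [1.0, 1.0],
                              [0.1, 0.2], apply_action)
```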
The embodiment of the invention provides a method for determining the sample correcting action, so that the accuracy of the difference strategy obtained by training the sample correcting action can be ensured.
On the basis of the foregoing embodiment, the robot control policy migration method provided in the embodiment of the present invention determines the difference policy based on sample correction actions corresponding to multiple migrations and a sample actual state corresponding to each sample correction action, and specifically includes:
and constructing a training target based on the sample correcting actions corresponding to the multiple times of migration and the sample actual state corresponding to each sample correcting action, and training the sample correcting actions corresponding to the multiple times of migration and the sample actual states corresponding to each sample correcting action based on the training target to obtain the difference strategy.
Specifically, when determining the difference policy in the embodiment of the present invention, a training target may be first constructed according to sample correction actions corresponding to multiple migrations and sample actual states corresponding to the sample correction actions, where a form of the training target is related to task content of a task to be executed, and this is not specifically limited in the embodiment of the present invention. For example, the training target may be such that the difference between the actual state of the target robot and the reference state is minimal.
Then, according to the training target, the difference strategy is obtained by training on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action.
In the embodiment of the invention, training against an explicit training target makes the obtained difference strategy more reliable.
On the basis of the foregoing embodiment, in the robot control strategy migration method provided in the embodiment of the present invention, before determining the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual states corresponding to the sample correction actions, the method further includes:
and removing duplicate sample correction actions and the sample actual states corresponding to those duplicate sample correction actions.
Specifically, in the embodiment of the invention, after k migrations, each pair (s_h^a, a_h^M) can be extracted to obtain a plurality of representative pairs, that is, the duplicate sample correction actions and the sample actual states corresponding to them are removed, where s_h^a indicates the sample actual state at the h-th migration. Then, when determining the difference policy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action, the difference policy may be determined based on the obtained plurality of representative pairs.
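The de-duplication step, reducing repeated (sample actual state, sample correction action) pairs to representative pairs, can be sketched as follows (the pair representation is an assumption):

```python
def representative_pairs(pairs):
    """Remove duplicate (sample_actual_state, correction_action) pairs,
    keeping the first occurrence of each and preserving order."""
    seen = set()
    result = []
    for state, action in pairs:
        key = (state, action)          # states/actions assumed hashable tuples
        if key not in seen:
            seen.add(key)
            result.append((state, action))
    return result

pairs = [((0.1, 0.2), (0.0, 0.05)),
         ((0.1, 0.2), (0.0, 0.05)),   # duplicate, removed
         ((0.3, 0.1), (0.0, -0.05))]
reps = representative_pairs(pairs)    # two representative pairs remain
```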
In the embodiment of the invention, the calculation amount in the process of determining the difference strategy can be reduced by eliminating the repeated sample correction action and the actual sample state corresponding to the repeated sample correction action, so that the process of determining the difference strategy is simplified.
On the basis of the above embodiment, the robot control strategy migration method provided in the embodiment of the present invention is obtained by pre-training the task strategy based on a reinforcement learning method, and the reward function adopted during training is determined based on the distance function between the actual position and the target position of the target object involved in the task to be executed of the target robot.
Specifically, in the embodiment of the present invention, the task policy may be obtained by training with a reinforcement learning method, and the reinforcement learning method may include Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and the like.
In the process of training the task strategy, the reward function used can be determined according to a distance function between the actual position and the target position of the target object involved in the task to be performed of the target robot. For example:
r_t = −d_t / d_0

wherein r_t is the reward function, d_t is the distance between the actual position of the target object and the target position at time t, and d_0 is the distance between the actual position of the target object and the target position at the initial moment.
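As a sketch of such a dense, distance-based reward (the exact functional form is a reconstruction; any reward that grows as the distance shrinks fits the description in the text):

```python
import math

def distance(p, q):
    """Euclidean distance between two planar positions (meters)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dense_reward(actual_pos, target_pos, initial_pos):
    """Dense reward r_t = -d_t / d_0: increases toward 0 as the object
    approaches the target, unlike a sparse 0/1 reward."""
    d0 = distance(initial_pos, target_pos)
    return -distance(actual_pos, target_pos) / d0

target = (0.5, 0.5)
start = (0.0, 0.0)
r_start = dense_reward(start, target, start)        # -1.0 at the initial moment
r_mid = dense_reward((0.25, 0.25), target, start)   # -0.5 halfway to the target
```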
In the embodiment of the invention, the reward function involved is a dense reward function; compared with a sparse 0/1 reward function, it trains faster, and the reward gradually increases as the distance decreases. In addition, unlike common pushing tasks of the same type, the embodiment of the invention does not randomly re-sample the initial position and the target position of the target object in each training episode; the two positions are fixed according to the actual application scenario. This sacrifices some robustness but reduces training time. The principle followed in training the task strategy is therefore to shorten the training time as much as possible while the strategy still meets the task execution requirement, simplifying the simulation setup and deliberately weakening the robustness and generalization of the strategy trained in virtual simulation, which in turn highlights the effectiveness and robustness of the migration method.
As shown in fig. 3, on the basis of the above embodiment, when the target robot is controlled in the embodiment of the present invention, the task strategy is first obtained and migrated to the actual control system. The actual control system then determines whether the difference between the actual state at the current moment and the reference state is within the preset range. If it is, execution continues until the task to be executed is completed. If it is not, the difference strategy is introduced, its output is action-coupled with the output of the task strategy, and the resulting coupling action is applied to the actual control system; it is then determined whether the difference between the actual state of the target robot at the next moment and the reference state is within the preset range.
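The control flow of fig. 3 can be sketched as a loop, with hypothetical stand-ins for the task strategy, the difference strategy, and the actual control system:

```python
def run_episode(task_policy, diff_policy, control_system, reference_states,
                within_range, max_steps=50):
    """Dual-policy loop: execute the task action alone while the actual state
    tracks the reference; couple in the correction action when it drifts."""
    state = control_system.current_state()
    for t in range(max_steps):
        action = task_policy(state)
        if not within_range(state, reference_states[t]):
            correction = diff_policy(state)                    # difference strategy
            action = tuple(a + c for a, c in zip(action, correction))  # coupling
        state = control_system.execute(action)   # actual state at the next moment
    return state

class ToySystem:
    """Stand-in for the actual control system: the state integrates actions."""
    def __init__(self):
        self.state = (0.0, 0.0)
    def current_state(self):
        return self.state
    def execute(self, action):
        self.state = tuple(s + a for s, a in zip(self.state, action))
        return self.state

# Task action drifts +0.02 in y per step; the correction pushes y back down.
refs = [(0.1 * t, 0.0) for t in range(5)]
final = run_episode(lambda s: (0.1, 0.02), lambda s: (0.0, -0.04), ToySystem(),
                    refs, lambda s, r: abs(s[1] - r[1]) < 0.05, max_steps=5)
```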
On the basis of the above embodiments, the effectiveness and robustness of the robot control strategy migration method provided by the embodiment of the invention are verified based on the adaptive object pushing experiment of the UR3 robot.
Experimental setup of the virtual simulation: the simulation environment is composed of the MuJoCo physics engine and OpenAI's Gym library; specifically, a UR3 robot with 6 joint degrees of freedom is used, with a two-finger gripper at its end that is kept closed to push the experimental object during the experiment. The simulation step size is 0.002 s, and each episode comprises 50 steps. The output of the task strategy is the Cartesian coordinates that the tip of the robot gripper should reach, with the Z coordinate fixed to keep the tip height constant. The inputs of the task strategy comprise the Cartesian position coordinates of the experimental object, the target and the tip of the robot gripper, as well as the deflection posture of the experimental object. The simulation coordinate system takes the center of the base of the UR3 robot as the origin of the world coordinates.
Experimental setup of the actual control system: the actual control system is the Robot Operating System (ROS), and the target robot is a UR3 robot. Communication among a computer, a Kinect2 camera and the UR3 robot is built on ROS, and the end of the robot is fitted with a finger gripper; since gripping is not involved, the gripper is kept in a closed state. The camera acquires information such as the pose of the target object and the target position, while the various state data of the target robot are acquired through the MoveIt package under ROS. The tip of the robot gripper is kept perpendicular to the motion plane at a fixed height to avoid collisions at the end. In addition, the embodiment of the invention assumes that, given the same action output command, the actual robot and the simulated robot execute identically. In fact a slight error does exist, but compared with the difference influence of the experimental object, the execution error of the robot itself is negligible and does not significantly affect the experimental results of the embodiment of the present invention.
In order to ensure the rigor of the experiment, the data types of strategy input and strategy output during simulation training are consistent with those during actual migration, and the coordinates of the two environments are correspondingly the same.
Simulation training of the task strategy: for the task strategy training of the UR3 robot pushing experiment, the corresponding simulation environment is first designed as described above. The reinforcement learning algorithm used for training is Soft Actor-Critic (SAC), combined with Hindsight Experience Replay (HER). The neural network settings and hyper-parameters of the algorithm program use the corresponding defaults in the stable-baselines library.
Analyzing and summarizing the real-world migration experiment of the task strategy: the trained task strategy is first tested in simulation, and the position of the pushed target object in the motion plane at each step is recorded and collected as a reference trajectory, as shown in fig. 4, where the initial point 42 refers to the actual position of the target object at the initial moment and the end point 41 refers to the target position to be reached by the target object. The task strategy is then migrated to the UR3 robot system. Because physical characteristics such as the actual object's mass, center of gravity and friction are unknown, a difference influence arises; across a number of pushing experiments, many object states differing from the reference trajectory are observed, in which the position and deflection of the object change, i.e. a deviation arises. It can further be generalized that, even if the physical parameters of the target object are not known, the difference influence at the beginning falls almost entirely into two types, namely deflection to one side or the other of the motion direction of the reference trajectory, differing only in degree, as shown in fig. 5 and fig. 6. The squares in fig. 5 and 6 each represent the target object, the circles each represent the contact point of the target robot with the target object, and the arrow direction is the moving direction of the target object at time t. Fig. 5 shows the target object deflecting clockwise at successive moments starting from time t, and fig. 6 shows the target object deflecting counterclockwise at successive moments starting from time t.
As the task actions proceed, the deflection gradually increases, the task strategy cannot provide a corresponding action adjustment, the position deviates as well, and finally the task fails. This leads to the hypothesis that giving a timely correction action when deflection occurs could improve the task success rate. Summarizing the deviations and how to correct them turns out to be exactly the prior knowledge required for the next step, the simulation training of the difference strategy, which is why the method first migrates the task strategy to the real system.
Analyzing the difference strategy and performing its simulation training: after a period of exploration, it became clear that exploring only at the virtual simulation level offers no guarantee that existing methods can achieve strongly robust migration of a single strategy. The idea of migrating two strategies simultaneously, one mainly for the "task" and one mainly for "correction", combining them to overcome the difference and complete the task, was therefore gradually adopted. Focusing on the pushing experiment of the UR3 robot, after the training and migration of the task strategy are completed, the concrete embodiment of the difference influence is found to be a deviation in the form of deflection to either side; clearly, the corresponding correction adjustment is an action in an oblique direction relative to the motion direction of the pushing. Judging from the motion direction of the migrated task strategy relative to the reference trajectory, the task actions given by the task strategy are basically consistent with the motion direction and satisfy the motion requirement under the reward function. The simulation setup for the difference strategy is then clear: when the object deflects clockwise, the difference strategy should give the robot end a correction action vertically upward relative to the motion direction; when the object deflects counterclockwise, the correction action given by the difference strategy should be vertically downward. Therefore, after the actions given by the two strategies are coupled, the coupling action will point obliquely upward or obliquely downward relative to the motion direction; a corresponding schematic top view is shown in fig. 7, which briefly shows the change of the action execution and the deflection state of the target object from one moment to the next after the start of the experiment. The action outputs of the two strategies are both Cartesian coordinates for the next moment, so the action coupling in the embodiment of the invention is vector addition of the coordinates:

a_couple = a_task + a_corr
in order to make the output action meet the requirement, the training of the difference strategy is designed as the task that the tail end of the clamping jaw of the robot reaches the random point of the designated area, and the top view schematic diagram is shown in fig. 8 and 9. Fig. 8 shows that the target object is deflected clockwise at the beginning of the migration, the end of the robot gripper at the beginning of the migration randomly appears at point a on the fitting line segment from the initial point 81 to the end point 82, and the training task is that the end moves to point B which is perpendicular to the line connecting point a and is at a constant distance from point a in the migration. Fig. 9 shows that the target object has been deflected counterclockwise at the beginning of the migration, the end of the robot gripper at the beginning of the migration randomly appears at point a on the fitting line segment from the initial point 91 to the end point 92, and the training task at this time is that the end moves to point B which is perpendicular to the line connecting point a and is at a constant distance from point a within the migration. The AB distance in the embodiment of the present invention may be set to 0.10 m.
Therefore, it can be seen that the essential reason for designing the difference strategy of the robot control strategy migration method provided in the embodiment of the present invention is to obtain the corrective action, and how to design the corrective action depends on the concrete representation of the real migration difference to be overcome, and the difference is caused because the task strategy is only obtained by virtual simulation. This also reflects the interactive nature of the strategy learning from "virtual" to "real" to "virtual".
Collaborative migration of task policies and difference policies: and after the difference strategy is obtained through training, the difference strategy is combined with the task strategy and is migrated to an actual system. The writing of the program can follow a double-process mode, so that the two strategies can run in parallel at the same time. The execution of the action is adjusted in real time according to specific conditions, and if no difference deflection occurs, only the action given by the task strategy is executed; if deflection occurs, the coupling action of the double strategies is executed. Specifically, in the strategy migration starting stage, the target object starting angle is 135 degrees around the Z axis, and the deflection around the X axis and the Y axis is ignored. Further setting that when the deflection is less than or equal to 130 degrees, the clockwise deflection is considered to occur, a difference strategy is required to give a vertically upward action, the action is coupled with the task action before being executed, the coupling action faces to the obliquely upper direction of the movement direction, and then an object is pushed along the task direction while the deflection is attempted to be corrected until the deflection angle returns to 135 degrees, and then the single execution of the task strategy is recovered. When the deflection is equal to or greater than 140 degrees, the counter-clockwise deflection is considered to occur, the coupling action is directed obliquely downward in the motion direction, and the execution logic is the same as that described above.
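The deflection thresholds described above (initial angle 135° about the Z axis, clockwise deflection if the angle is 130° or less, counterclockwise if 140° or more) can be sketched as a simple decision function:

```python
def deflection_state(angle_deg, cw_threshold=130.0, ccw_threshold=140.0):
    """Classify the target object's deflection about the Z axis.

    Returns 'clockwise' (difference strategy gives an upward correction),
    'counterclockwise' (downward correction), or 'none' (task action only).
    """
    if angle_deg <= cw_threshold:
        return "clockwise"
    if angle_deg >= ccw_threshold:
        return "counterclockwise"
    return "none"

# At the 135-degree starting angle no correction is needed; past a threshold it is.
states = [deflection_state(a) for a in (135.0, 128.0, 142.0)]
```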
In the embodiment of the invention, single migration of the task strategy serves as the baseline experiment, against which dual-strategy migration is compared. The pushed experimental object is a carton with a two-dimensional code attached, of size 0.15 m × 0.05 m and a dead weight of about 60 g. To verify the effectiveness and robustness of the method, in addition to the experiments on the carton itself, an iron block weighing 1000 g is further placed at a non-geometric-center position in either the upper or the lower part of the carton's interior, so that the mass, friction and center-of-gravity position of the carton as a whole increase and change significantly, and the experiments are carried out.
The criterion for a successful experiment is that the distance between the center of the carton's two-dimensional code and the center of the target position's two-dimensional code is less than 0.02 m, and the difference between the deflection angle and the initial angle is no more than ±5°. The end point and the starting point of the experiment are given in meters; the starting position of the carton is not precisely measured in each experiment but is placed at the approximate position of the starting point. Each set of experiments was performed 50 times and the corresponding success rates recorded, with the specific results shown in table 1.
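The success criterion above can be sketched as a check (positions in meters, angles in degrees; function names are hypothetical):

```python
import math

def task_succeeded(box_center, target_center, deflection_deg, initial_deg=135.0,
                   dist_tol=0.02, angle_tol=5.0):
    """Success: the two-dimensional-code centers are closer than 0.02 m and the
    deflection is within ±5 degrees of the initial angle."""
    dist = math.hypot(box_center[0] - target_center[0],
                      box_center[1] - target_center[1])
    return dist < dist_tol and abs(deflection_deg - initial_deg) <= angle_tol

ok = task_succeeded((0.50, 0.30), (0.51, 0.30), 137.0)    # within both tolerances
fail = task_succeeded((0.50, 0.30), (0.55, 0.30), 137.0)  # 0.05 m away: too far
```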
Table 1 success rate of experiments on three types of cartons using two migration methods
Experimental results show that the robot control strategy migration method provided by the embodiment of the invention is strongly robust to changes in the cartons' physical attributes and completes the pushing task with a high success rate; single-strategy migration has only limited ability to complete the task with the iron-free carton and cannot complete it once an iron block is loaded. In addition, the in-plane motion trajectories of the three types of cartons being pushed under the two migration methods are shown in fig. 10 to 15; the solid points in fig. 10 to 15 all represent target positions. Fig. 10, 12 and 14 respectively show the task completion for the three types of cartons, namely the plain carton, the carton with the iron block placed above and the carton with the iron block placed below, under single-strategy migration, while fig. 11, 13 and 15 show the corresponding results under dual-strategy migration. Comparing fig. 10 and fig. 11 shows that the task is completed better with dual-strategy migration; comparing fig. 12 with fig. 13, and fig. 14 with fig. 15, shows that the task cannot be completed with single-strategy migration but can be completed with dual-strategy migration. It follows that dual-strategy migration can effectively make real-time adjustments to the difference influence, whereas single migration cannot. The broken lines appearing in fig. 11, 13 and 15 are trajectory changes caused by the introduction of the difference strategy.
It should be noted that, for the single-migration experiments, the task strategy is not specifically retrained; the same strategy and the same program are used for the three types of cartons. Likewise, when the dual-strategy migration method provided by the embodiment of the invention is used for the pushing experiments on the three types of cartons, no manual adjustment is made to the task strategy or the difference strategy, and the same program runs on the same strategies. This reflects the strong robustness of the method: the strategies need not be retrained when the physical attributes of the experimental object change.
At present, researchers have systematically studied training robot pushing strategies with domain randomization, but the variation of the pushed object's physical properties there was limited: only a sheet of paper was added at the bottom to increase friction. In addition, that method has high equipment requirements, needing 8 hours of training on a 100-core machine, whereas the method here needs only 3 hours on a conventional 4-core computer with an 8 GB graphics card (the training time being the sum of the training times of the two strategies).
The virtual-real interactive dual-strategy migration pushing experiments based on the UR3 robot reflect a logic of learning task skills in the "virtual", summarizing and correcting deviations in the "real", training for the differences and their compensation back in the "virtual", and finally executing the strongly robust, low-difference dual strategies in the "real". Under the synergy of the two strategies, the uncertain influence caused by the difference problem is effectively overcome, so that the task success rate is significantly improved.
In conclusion, the robot control strategy migration method for solving the problem of difference from simulation to reality is strong in robustness, high in efficiency and capable of achieving virtual-real interaction.
As shown in fig. 16, on the basis of the above embodiment, an embodiment of the present invention provides a robot control policy migration apparatus, including:
a task strategy migration module 161, configured to migrate a task strategy of a target robot to an actual control system of the target robot, and determine an actual state of the target robot at a current time based on the actual control system;
a difference policy migration module 162, configured to migrate the difference policy of the target robot to the actual control system if it is determined that the difference between the actual state and the reference state determined based on the task policy is outside a preset range, so that the actual control system executes the coupling action under the task policy and the difference policy, and thereby determines the actual state of the target robot at the next moment after the current moment;
and determining the difference strategy based on a state deviation set between a sample actual state set obtained after the task strategy is transferred to the actual control system for multiple times and a reference state set output by the task strategy and a sample correction action corresponding to each transfer.
On the basis of the foregoing embodiment, the robot control policy migration apparatus provided in the embodiment of the present invention further includes a difference policy determination module, configured to:
migrating the task strategy to the actual control system multiple times, and determining a sample actual state set of the target robot based on the sample actions obtained each time the actual control system executes the task strategy;
for any one migration, determining the sample correction action corresponding to that migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration;
and determining the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action.
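The data-collection steps above can be sketched as follows. The toy real system with a constant +0.1 actuation bias and the `find_correction` helper (which here simply negates the deviation) are hypothetical; the embodiment determines the correction action by the candidate-action search it describes:

```python
import numpy as np

def collect_difference_dataset(task_policy, real_step, reference_states,
                               n_transfers, find_correction):
    """Migrate the task strategy multiple times, record the sample actual
    states, form the state deviation set against the reference states, and
    pair each sample actual state with a sample correction action."""
    states_X, actions_y = [], []
    for _ in range(n_transfers):                       # one pass per migration
        state = np.zeros(1)
        actual = []
        for _ in reference_states:                     # execute the task strategy
            state = real_step(state, task_policy(state))
            actual.append(state)
        for s, ref in zip(actual, reference_states):
            deviation = s - ref                        # element of the state deviation set
            states_X.append(s)
            actions_y.append(find_correction(s, deviation))
    return np.array(states_X), np.array(actions_y)

# Illustrative setup: the real system has a +0.1 actuation bias, so the
# actual states drift away from the reference states 1, 2, 3.
refs = [np.array([float(t + 1)]) for t in range(3)]
X, y = collect_difference_dataset(lambda s: np.array([1.0]),
                                  lambda s, a: s + a + 0.1,
                                  refs, n_transfers=2,
                                  find_correction=lambda s, d: -d)
```

The resulting (sample actual state, sample correction action) pairs are the training data for the difference strategy.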
On the basis of the foregoing embodiment, the robot control policy migration apparatus provided in the embodiment of the present invention includes a difference policy determination module, which is specifically configured to:
selecting, in time sequence, the first state deviation in the state deviation set that exceeds a threshold, and determining a candidate sample correction action set corresponding to that state deviation;
and determining the sample correction action based on the state deviation, the sample estimated states obtained by correcting the sample actual state corresponding to the state deviation with each candidate sample correction action in the candidate set, and the reference state corresponding to the state deviation.
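A minimal sketch of this selection step follows. The forward model `predict` and the discrete candidate set are assumptions; the embodiment fixes only that each candidate is scored by how close its sample estimated state lands to the reference state:

```python
import numpy as np

def select_correction(actual_states, reference_states, candidates,
                      predict, threshold):
    """Scan the state deviations in time order, stop at the first one that
    exceeds the threshold, and return the candidate correction action whose
    estimated state is closest to the corresponding reference state."""
    for t, (s, ref) in enumerate(zip(actual_states, reference_states)):
        deviation = np.linalg.norm(s - ref)
        if deviation > threshold:                      # first deviation over threshold
            best = min(candidates,
                       key=lambda a: np.linalg.norm(predict(s, a) - ref))
            return t, best
    return None, None                                  # every deviation within range

# Illustrative use with a trivial forward model (state + action): the
# deviation first exceeds 0.5 at step 1, where -0.8 restores the reference.
actual = [np.array([1.0]), np.array([1.8]), np.array([3.1])]
refs = [np.array([1.0]), np.array([1.0]), np.array([1.0])]
cands = [np.array([-1.0]), np.array([-0.8]), np.array([0.5])]
t, best = select_correction(actual, refs, cands,
                            predict=lambda s, a: s + a, threshold=0.5)
```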
On the basis of the foregoing embodiment, in the robot control policy migration apparatus provided in the embodiment of the present invention, the difference policy determining module is further specifically configured to:
the determining of the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action specifically includes:
constructing a training target based on the sample correction actions corresponding to the multiple migrations and the sample actual states corresponding to the sample correction actions, and training on these pairs of sample actual states and sample correction actions with the training target to obtain the difference strategy.
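As a concrete (and deliberately simplified) illustration of such a training target, a linear difference strategy can be fitted by least squares, with the mean squared error between the strategy output and the sample correction actions playing the role of the training objective. The linear form is an assumption; a practical system would use a neural policy:

```python
import numpy as np

def train_difference_strategy(states, corrections):
    """Fit a linear map a = W s + b from sample actual states to sample
    correction actions; the training target minimized here is the mean
    squared error over the collected pairs."""
    X = np.hstack([states, np.ones((len(states), 1))])     # append bias column
    W, *_ = np.linalg.lstsq(X, corrections, rcond=None)
    def policy(s):
        s = np.atleast_2d(s)
        return s @ W[:-1] + W[-1]
    mse = float(np.mean((X @ W - corrections) ** 2))       # value of the training target
    return policy, mse

# Illustrative fit: the needed correction is simply the negated state.
S = np.array([[0.0], [1.0], [2.0], [3.0]])
A = -S
policy, mse = train_difference_strategy(S, A)
```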
On the basis of the foregoing embodiment, the robot control policy migration apparatus provided in the embodiment of the present invention further includes a rejection module, configured to:
eliminating repeated sample correction actions from the sample correction actions, together with the sample actual states corresponding to the repeated sample correction actions.
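A minimal sketch of this elimination step is given below. Rounding to a fixed number of decimals to define "repeated" is an assumption; the embodiment does not fix the equality test:

```python
import numpy as np

def eliminate_repeats(states, corrections, decimals=6):
    """Drop repeated sample correction actions together with their
    corresponding sample actual states, keeping each first occurrence."""
    seen, keep = set(), []
    for i, a in enumerate(corrections):
        key = tuple(np.round(a, decimals))   # tolerate floating-point noise
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return states[keep], corrections[keep]

# Illustrative use: the third correction repeats the first and is removed
# together with its state.
S = np.array([[10.0], [20.0], [30.0]])
A = np.array([[1.0], [2.0], [1.0]])
S2, A2 = eliminate_repeats(S, A)
```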
On the basis of the above embodiments, in the robot control strategy migration apparatus provided in the embodiment of the present invention, the task strategy is obtained by pre-training based on a reinforcement learning method, and the reward function used in the training is determined based on a distance function between the actual position and the target position of the target object involved in the task to be executed by the target robot.
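One common realization of such a distance-based reward is the negative Euclidean distance; this exact form is a sketch, since the embodiment fixes only that the reward derives from a distance function between the object's actual and target positions:

```python
import numpy as np

def reward(object_pos, target_pos, scale=1.0):
    """Shaped reward for reinforcement-learning pre-training of the task
    strategy: the reward grows (toward 0) as the target object approaches
    its goal position."""
    object_pos = np.asarray(object_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    return -scale * float(np.linalg.norm(object_pos - target_pos))
```

With such shaping, every step that moves the object closer to the goal yields a higher reward, which is what makes the pre-training tractable.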
Specifically, the functions of the modules in the robot control strategy migration apparatus provided in the embodiment of the present invention correspond one-to-one to the operation flows of the steps in the method embodiments, and achieve the same effects.
As shown in fig. 17, on the basis of the foregoing embodiments, an embodiment of the present invention provides a robot control strategy migration system, including: an imaging device 171 and the robot control strategy migration apparatus 172 described in the above embodiments, the apparatus 172 being connected to the imaging device 171; the imaging device 171 is configured to acquire the actual state of the target robot.
Fig. 18 illustrates a physical structure diagram of an electronic device. As shown in fig. 18, the electronic device may include: a processor 1810, a communication interface 1820, a memory 1830, and a communication bus 1840, wherein the processor 1810, the communication interface 1820, and the memory 1830 communicate with each other via the communication bus 1840. The processor 1810 may invoke logic instructions in the memory 1830 to perform the robot control strategy migration method provided by the above embodiments, the method including: migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system; if it is determined that the deviation between the actual state and a reference state determined based on the task strategy is outside a preset range, migrating a difference strategy of the target robot to the actual control system, so that the actual control system executes a coupling action of the task strategy and the difference strategy and thereby determines the actual state of the target robot at the next moment; wherein the difference strategy is determined based on a state deviation set, formed between the sample actual state set obtained after migrating the task strategy to the actual control system multiple times and the reference state set output by the task strategy, together with the sample correction action corresponding to each migration.
In addition, the logic instructions in the memory 1830 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the robot control strategy migration method provided by the above embodiments, the method including: migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system; if it is determined that the deviation between the actual state and a reference state determined based on the task strategy is outside a preset range, migrating a difference strategy of the target robot to the actual control system, so that the actual control system executes a coupling action of the task strategy and the difference strategy and thereby determines the actual state of the target robot at the next moment; wherein the difference strategy is determined based on a state deviation set, formed between the sample actual state set obtained after migrating the task strategy to the actual control system multiple times and the reference state set output by the task strategy, together with the sample correction action corresponding to each migration.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the robot control strategy migration method provided in the foregoing embodiments, the method including: migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system; if it is determined that the deviation between the actual state and a reference state determined based on the task strategy is outside a preset range, migrating a difference strategy of the target robot to the actual control system, so that the actual control system executes a coupling action of the task strategy and the difference strategy and thereby determines the actual state of the target robot at the next moment; wherein the difference strategy is determined based on a state deviation set, formed between the sample actual state set obtained after migrating the task strategy to the actual control system multiple times and the reference state set output by the task strategy, together with the sample correction action corresponding to each migration.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A robot control strategy migration method is characterized by comprising the following steps:
migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system;
if it is determined that the deviation between the actual state and a reference state determined based on the task strategy is outside a preset range, migrating a difference strategy of the target robot to the actual control system, so that the actual control system executes a coupling action of the task strategy and the difference strategy and thereby determines the actual state of the target robot at the next moment;
wherein the difference strategy is determined based on a state deviation set, formed between the sample actual state set obtained after migrating the task strategy to the actual control system multiple times and the reference state set output by the task strategy, together with the sample correction action corresponding to each migration.
2. The robot control strategy migration method of claim 1, wherein the difference strategy is specifically determined by:
migrating the task strategy to the actual control system multiple times, and determining a sample actual state set of the target robot based on the sample actions obtained each time the actual control system executes the task strategy;
for any one migration, determining the sample correction action corresponding to that migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration;
and determining the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action.
3. The robot control strategy migration method according to claim 2, wherein determining the sample correction action corresponding to any one migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration specifically comprises:
selecting, in time sequence, the first state deviation in the state deviation set that exceeds a threshold, and determining a candidate sample correction action set corresponding to that state deviation;
and determining the sample correction action based on the state deviation, the sample estimated states obtained by correcting the sample actual state corresponding to the state deviation with each candidate sample correction action in the candidate set, and the reference state corresponding to the state deviation.
4. The robot control strategy migration method according to claim 2, wherein determining the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action specifically comprises:
constructing a training target based on the sample correction actions corresponding to the multiple migrations and the sample actual states corresponding to the sample correction actions, and training on these pairs of sample actual states and sample correction actions with the training target to obtain the difference strategy.
5. The robot control strategy migration method according to claim 4, wherein the determining of the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual states corresponding to the sample correction actions further comprises:
eliminating repeated sample correction actions from the sample correction actions, together with the sample actual states corresponding to the repeated sample correction actions.
6. The robot control strategy migration method according to any one of claims 1-5, wherein the task strategy is obtained by pre-training based on a reinforcement learning method, and the reward function used in the training is determined based on a distance function between the actual position and the target position of the target object involved in the task to be executed by the target robot.
7. A robotic control strategy migration apparatus, comprising:
a task strategy migration module, configured to migrate a task strategy of the target robot to an actual control system of the target robot, and to determine the actual state of the target robot at the current moment based on the actual control system;
a difference strategy migration module, configured to migrate a difference strategy of the target robot to the actual control system if it is determined that the deviation between the actual state and a reference state determined based on the task strategy is outside a preset range, so that the actual control system executes a coupling action of the task strategy and the difference strategy and thereby determines the actual state of the target robot at the next moment;
wherein the difference strategy is determined based on a state deviation set, formed between the sample actual state set obtained after migrating the task strategy to the actual control system multiple times and the reference state set output by the task strategy, together with the sample correction action corresponding to each migration.
8. A robot control strategy migration system, comprising: a camera device and the robot control strategy migration apparatus according to claim 7, wherein the robot control strategy migration apparatus is connected with the camera device;
the camera device is used for acquiring the actual state of the target robot.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the robot control strategy migration method according to any of claims 1 to 6 are implemented when the program is executed by the processor.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the robot control strategy migration method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110603540.9A CN113050433B (en) | 2021-05-31 | 2021-05-31 | Robot control strategy migration method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113050433A true CN113050433A (en) | 2021-06-29 |
CN113050433B CN113050433B (en) | 2021-09-14 |
Family
ID=76518581
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110603540.9A Active CN113050433B (en) | 2021-05-31 | 2021-05-31 | Robot control strategy migration method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113050433B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2384863A2 (en) * | 2010-01-21 | 2011-11-09 | Institutul de Mecanica Solidelor al Academiei Romane | Method and device for dynamic control of a walking robot |
CN109765820A (en) * | 2019-01-14 | 2019-05-17 | 南栖仙策(南京)科技有限公司 | A kind of training system for automatic Pilot control strategy |
CN110000785A (en) * | 2019-04-11 | 2019-07-12 | 上海交通大学 | Agriculture scene is without calibration robot motion's vision collaboration method of servo-controlling and equipment |
CN110083080A (en) * | 2018-01-25 | 2019-08-02 | 发那科株式会社 | Machine learning device and method, servo motor control unit and system |
CN111667513A (en) * | 2020-06-01 | 2020-09-15 | 西北工业大学 | Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning |
WO2020207789A1 (en) * | 2019-04-12 | 2020-10-15 | Robert Bosch Gmbh | Method and device for controlling a technical apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN113050433B (en) | 2021-09-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||