CN113050433A - Robot control strategy migration method, device and system - Google Patents


Info

Publication number: CN113050433A (application CN202110603540.9A; granted as CN113050433B)
Authority: CN (China)
Inventors: 刘智勇, 吴亮东
Applicant and current assignee: Institute of Automation, Chinese Academy of Sciences
Legal status: Granted; active

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems that are electric
    • G05B13/04: Electric adaptive control systems involving the use of models or simulators
    • G05B13/042: Electric adaptive control systems involving models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention provides a robot control strategy migration method, device and system. A difference strategy is introduced when the difference between the actual state and the reference state determined based on the task strategy falls outside a preset range. Through the dual-strategy collaborative migration of the task strategy and the difference strategy, the application effect of the task strategy in the actual control system is preserved, so that the actual control system controls the target robot more accurately and the task to be executed is completed smoothly.

Description

Robot control strategy migration method, device and system
Technical Field
The invention relates to the technical field of reinforcement learning and robot control, in particular to a robot control strategy migration method, device and system.
Background
In recent years, applying reinforcement learning to robot control has become a research focus. The core mechanism of reinforcement learning, however, is trial and error over a large number of samples, from which a qualified control strategy is trained. Training directly on the actual control system of a robot faces a series of practical problems of high cost, high risk and low efficiency, such as hardware wear, safety hazards and long training times, which has pushed much of the research to the simulation level. Transferring a control strategy trained in simulation to the actual control system is therefore a natural idea for researchers.
In reality, the gap between the simulation source domain and the real target domain creates the simulation-to-reality difference problem: with high probability, the application effect of the migrated control strategy on the robot's actual control system falls far short of its effect at the simulation level. Solving this difference problem has become a major difficulty for researchers. Current solutions can be divided into three broad categories: methods based on system identification, methods based on domain adaptation, and methods based on domain randomization.
Although all three kinds of methods are clearly effective against the difference problem, how to better reproduce the application effect of the control strategy in the actual control system still needs to be studied in depth.
Disclosure of Invention
The invention provides a robot control strategy migration method, device and system, which are used for overcoming the defects in the prior art.
The invention provides a robot control strategy migration method, which comprises the following steps:
migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system;
if the difference value between the actual state and the reference state determined based on the task strategy is judged to be outside the preset range, migrating the difference strategy of the target robot to the actual control system, so that the actual control system executes the coupled action under the task strategy and the difference strategy, and thereby determining the actual state of the target robot at the moment following the current moment;
wherein the difference strategy is determined based on a state deviation set between a sample actual state set, obtained after the task strategy is migrated to the actual control system multiple times, and a reference state set output by the task strategy, and on a sample correction action corresponding to each migration.
According to the robot control strategy migration method provided by the invention, the difference strategy is specifically determined by the following method:
transferring the task strategy to the actual control system for multiple times, and determining a sample actual state set of the target robot based on sample actions obtained by the actual control system executing the task strategy each time;
for any transition, determining a sample correction action corresponding to the any transition based on a state deviation set between a reference state set of the target robot and a sample actual state set corresponding to the any transition;
and determining the difference strategy based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action.
According to the robot control strategy migration method provided by the invention, the method for determining the sample correction action corresponding to any migration based on the reference state set of the target robot and the state deviation set between the sample actual state sets corresponding to any migration specifically comprises the following steps:
selecting a first state deviation exceeding a threshold value from the state deviation set according to a time sequence, and determining an alternative sample correction action set corresponding to the state deviation;
and determining the sample correcting action based on the state deviation, a sample estimated state obtained by correcting the actual state of the sample corresponding to the state deviation through each candidate sample correcting action in the candidate sample correcting action set and a reference state corresponding to the state deviation.
According to the robot control strategy migration method provided by the invention, the determining the difference strategy based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action specifically comprises the following steps:
and constructing a training target based on the sample correcting actions corresponding to the multiple times of migration and the sample actual state corresponding to each sample correcting action, and training the sample correcting actions corresponding to the multiple times of migration and the sample actual states corresponding to each sample correcting action based on the training target to obtain the difference strategy.
According to the robot control strategy migration method provided by the invention, the difference strategy is determined based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action, and the method also comprises the following steps:
and eliminating repeated sample correction actions and the sample actual states corresponding to those repeated sample correction actions.
According to the robot control strategy migration method provided by the invention, the task strategy is obtained by pre-training based on a reinforcement learning method, and the reward function adopted during training is determined based on a distance function between the actual position and the target position of a target object involved in the task to be executed by the target robot.
The invention also provides a robot control strategy migration device, which comprises:
the task strategy migration module is used for migrating a task strategy of the target robot to an actual control system of the target robot and determining the actual state of the target robot at the current moment based on the actual control system;
a difference strategy migration module, configured to migrate the difference strategy of the target robot to the actual control system if it is determined that the difference between the actual state and the reference state determined based on the task strategy is outside the preset range, so that the actual control system executes the coupled action under the task strategy and the difference strategy and thereby determines the actual state of the target robot at the moment following the current moment;
wherein the difference strategy is determined based on a state deviation set between a sample actual state set, obtained after the task strategy is migrated to the actual control system multiple times, and a reference state set output by the task strategy, and on a sample correction action corresponding to each migration.
The invention also provides a robot control strategy migration system, which comprises: the robot control strategy migration device is connected with the camera device;
the camera device is used for acquiring the actual state of the target robot.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the robot control strategy migration method.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the robot control strategy migration method according to any of the above-described methods.
The invention provides a robot control strategy migration method, device and system. First, a task strategy of a target robot is migrated to an actual control system of the target robot, and the actual state of the target robot at the current moment is determined based on the actual control system. Then, when the difference value between the actual state and the reference state determined based on the task strategy is outside a preset range, the difference strategy of the target robot is migrated to the actual control system, so that the actual control system executes the coupled action under the task strategy and the difference strategy and thereby determines the actual state of the target robot at the moment following the current moment. Because the difference strategy is introduced when the difference between the actual state and the reference state falls outside the preset range, the dual-strategy collaborative migration of the task strategy and the difference strategy preserves the application effect of the task strategy in the actual control system, allowing the actual control system to control the target robot more accurately and the task to be executed to be completed smoothly.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a robot control strategy migration method provided by the present invention;
FIG. 2 is a second schematic flow chart of a robot control strategy migration method according to the present invention;
FIG. 3 is a third schematic flowchart of a robot control strategy migration method according to the present invention;
fig. 4 is a schematic diagram of a trajectory of a target object on a motion plane during task strategy migration in the robot control strategy migration method provided by the present invention;
FIG. 5 is a schematic diagram of a target object deflected clockwise by the difference effect in the robot control strategy migration method provided by the present invention;
FIG. 6 is a schematic diagram of a target object deflected counterclockwise due to the difference effect in the robot control strategy migration method provided by the present invention;
FIG. 7 is a schematic top view of a coupling action in the robot control strategy migration method according to the present invention;
FIG. 8 is a schematic top view of the difference strategy training task when the target object initially deflects clockwise in the robot control strategy migration method provided by the present invention;
FIG. 9 is a schematic top view of the difference strategy training task when the target object initially deflects counterclockwise in the robot control strategy migration method provided by the present invention;
FIG. 10 is a schematic diagram of the in-plane movement trajectory of the plain paper box pushed under the prior-art single-strategy migration method;
FIG. 11 is a schematic diagram of the in-plane movement trajectory of the plain paper box pushed under the robot control strategy migration method provided by the present invention;
FIG. 12 is a schematic diagram of the in-plane movement trajectory of the paper box pushed with an iron block placed in its upper part under the prior-art single-strategy migration method;
FIG. 13 is a schematic diagram of the in-plane movement trajectory of the paper box pushed with an iron block placed in its upper part under the robot control strategy migration method provided by the present invention;
FIG. 14 is a schematic diagram of the in-plane movement trajectory of the paper box pushed with an iron block placed in its lower part under the prior-art single-strategy migration method;
FIG. 15 is a schematic diagram of the in-plane movement trajectory of the paper box pushed with an iron block placed in its lower part under the robot control strategy migration method provided by the present invention;
FIG. 16 is a schematic structural diagram of a robot control strategy migration apparatus provided by the present invention;
FIG. 17 is a schematic structural diagram of a robot control strategy migration system provided by the present invention;
fig. 18 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Because the difference between the simulation source domain and the real target domain causes the simulation-to-reality difference problem, the application effect of a migrated control strategy on the robot's actual control system falls, with high probability, far short of its effect at the simulation level. The prior art addresses this difference problem with three general kinds of solutions: methods based on system identification, methods based on domain adaptation, and methods based on domain randomization. Although all three are clearly effective against the difference problem, how to better reproduce the application effect of the control strategy in the actual control system still needs in-depth study, for example how to reproduce the application effect as faithfully as possible and how to shorten the training time of the control strategy. The embodiment of the invention therefore provides a robot control strategy migration method.
Fig. 1 is a schematic flowchart of a robot control policy migration method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
s1, transferring the task strategy of the target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system;
s2, if the difference value between the actual state and the reference state determined based on the task strategy is judged to be out of the preset range, transferring the difference strategy of the target robot to the actual control system, so that the actual control system executes the task strategy and the coupling action under the difference strategy, and further determining the actual state of the target robot at the next moment of the current moment;
wherein the difference strategy is determined based on a state deviation set between a sample actual state set, obtained after the task strategy is migrated to the actual control system multiple times, and a reference state set output by the task strategy, and on a sample correction action corresponding to each migration.
Specifically, in the robot control policy migration method provided in the embodiment of the present invention, an execution subject is a robot control policy migration apparatus, the apparatus may be configured in a server, where the server may be a local server or a cloud server, and the local server may specifically be a computer, which is not specifically limited in the embodiment of the present invention.
In general, the difference problem is given in the following form:

$$ s_{t+1} \sim P(\cdot \mid s_t, a_t), \qquad \hat{s}_{t+1} \sim \hat{P}(\cdot \mid s_t, a_t) $$

where $P$ and $\hat{P}$ are the transition probability distributions describing the dynamic characteristics of the virtual simulation end and the actual control system respectively, $O$ and $\hat{O}$ are the generation probability distributions of the virtual simulation end and the actual control system for image information, $R$ denotes the actual control system, $a_t$ is the task action at time $t$, $s_t$ is the reference state at time $t$, $s_{t+1}$ is the reference state of the virtual simulation end at time $t+1$, $\hat{s}_{t+1}$ is the actual state in the actual control system at time $t+1$, and $\sim$ indicates that the left side is obtained from the right side.

Because the probability distributions of the virtual and real sides differ, when a control strategy trained at the virtual simulation end is migrated to the actual control system, $\hat{s}_{t+1}$ will differ from $s_{t+1}$ and a difference is created. The embodiment of the invention focuses on the virtual-real difference of the dynamic characteristics, $P \neq \hat{P}$. Since image information is not used in the learning and migration of the strategy, the method and device of the invention omit the influence of $O$ and $\hat{O}$.
At present, strategy migration from the virtual simulation end to the actual control system is basically the migration of a single strategy, and for the embodied difference problem much research is devoted to making that strategy more robust and more general, so as to compensate the difference. Unlike this idea, the embodiment of the present invention first decouples a general policy $\pi_{\theta}$ into a task policy $\pi_{\theta_T}$ and a difference policy $\pi_{\theta_G}$; the former is dedicated to task skills, the latter to overcoming differences, and the two are further migrated to the actual control system cooperatively, achieving a migration effect with low difference and strong robustness. Here $\theta$ denotes the strategy parameters obtained by training, the subscript $T$ denotes the task, and $G$ denotes the difference.

The difference strategy requires the experimental feedback given after the task strategy is migrated to the actual control system as prior knowledge for the simulation training setting; this feedback is used to construct the learning target of the difference strategy, which is then trained back at the virtual simulation end, reflecting the virtual-real interaction from "virtual" to "real" and back to "virtual". Migration of a single strategy revolves around $\hat{P}$: it either tries to obtain, under virtual simulation, a $\tilde{P}$ close to the real system's $\hat{P}$, or sets a parameter space at the virtual simulation end that covers the dynamics distribution of the actual control system so as to strengthen the robustness of the strategy. In contrast, the present invention focuses on how to make the actual state approach the reference state, i.e. how to satisfy $\hat{s}_{t+1} \to s_{t+1}$. With $\hat{P}$ determined and $s_t$ known, changing $\hat{s}_{t+1}$ hinges on the action $a_t$; and since the task policy has already been trained, one can take $a_t^T = \pi_{\theta_T}(s_t)$, so that $\hat{s}_{t+1} \sim \hat{P}(\cdot \mid s_t, a_t^T)$. Although $a_t^T$ alone cannot make the actual state approach the reference state, it does guarantee the movement tendency, so the embodiment of the invention writes further:

$$ \hat{s}_{t+1} \sim \hat{P}\left(\cdot \mid s_t,\; a_t^T + a_t^G\right) $$

where $a_t^G$ is the correction action, which can be coupled with $a_t^T$ to obtain the coupled action $a_t^C = a_t^T + a_t^G$. Assuming further that $a_t^G = \pi_{\theta_G}(\hat{s}_t)$ and changing the actions in the above formula into policy form gives:

$$ \hat{s}_{t+1} \sim \hat{P}\left(\cdot \mid s_t,\; \pi_{\theta_T}(s_t) + \pi_{\theta_G}(\hat{s}_t)\right) $$

In the above formula $\pi_{\theta_G}$ is not readily available, but $\hat{P}$ is already embodied by the actual control system: from the real migration the quantities $a_t^T$, $s_t$, $\hat{s}_{t+1}$ and $s_{t+1}$ are all known, so the correction action $a_t^G$ can be deduced, and the simulation setting for training $\pi_{\theta_G}$ can then be adjusted based on these actions and states. Defining $\pi_{\theta_T}$ in the above formula as the task strategy and $\pi_{\theta_G}$ as the difference strategy, the general strategy with difference-overcoming capability is thus decoupled into a task strategy and a difference strategy.
As shown in fig. 2, the general strategy is decoupled into a task strategy and a difference strategy; the two strategies are obtained through virtual-real interaction and act cooperatively on the actual control system of the target robot.
First, step S1 is executed to migrate the task strategy of the target robot to the actual control system of the target robot. The target robot is the robot to be controlled and executes the task to be executed, which may be pushing a target object to a target position. The structure of the target robot may include a robot gripper; the gripper holds the target object and pushes it to the target position so that the task to be executed is carried out.
The task strategy can be a pre-trained preset task execution model, the input of the task strategy can be the actual state of the target robot at each moment in the process of executing the task to be executed, and the output can be the task action required to be executed by the target robot at each moment in the process of executing the task to be executed. Wherein, the basic model of the task strategy can be a neural network model.
The actual control system of the target robot means a control system for controlling an actual state of the target robot and a task action to be performed for performing a task to be performed. After the actual control system of the target robot is migrated into the task strategy, the actual state of the target robot at the previous moment is controlled to change through the task action output by the task strategy at the previous moment, and the actual state at the current moment is obtained.
The state of the target robot can be represented by information such as Cartesian position coordinates of the tail end of the clamping jaw of the robot, pose of each joint of the robot and the like. The task action required to be executed by the target robot to execute the task to be executed can be represented by information such as the coordinate change of the Cartesian position of the tail end of the clamping jaw of the robot, the pose change of each joint of the robot and the like.
Then, step S2 is executed to determine whether the difference between the actual state of the target robot at the current moment and the reference state determined based on the task strategy is outside the preset range. The reference state refers to the state of each moment in the process that the target robot executes the task to be executed under the ideal condition that the task strategy is not combined with an actual control system. At each moment, the target robot corresponds to an actual state and a reference state, and then for the current moment, a difference value between the actual state and the reference state of the target robot at the current moment can be calculated, and the difference value can be a difference value between two position coordinates, namely a distance.
The preset range may be a state deviation range which can be allowed and is given in advance, and if the difference value is within the preset range, it indicates that the actual state at the current moment is consistent with the reference state determined by the task strategy to some extent, so that the actual control system may continue to implement control over the target robot through the task strategy until the task to be executed is completed. If the difference value is out of the preset range, the difference value between the actual state at the current moment and the reference state determined by the task strategy is large and cannot be ignored, so that the difference value needs to be corrected by introducing a difference strategy, namely the difference strategy of the target robot is transferred to an actual control system.
The difference strategy can be a difference correction model trained in advance, the input of the difference strategy can be the actual state of the target robot at the current moment, and the output can be the correction action required to be executed by the target robot to enable the actual state at the current moment to be consistent with the reference state or enable the state difference value of the actual state and the reference state to be within a preset range. Wherein the basic model of the difference strategy may be a neural network model.
After the difference strategy is migrated into the actual control system, the correction action output by the difference strategy is coupled with the task action output by the task strategy to obtain a coupling action. The coupling action may be a combination of a correction action and a task action, for example, the correction action is in a north direction, the task action is in a west direction, and the coupling action is in the north-west direction. After determining the coupling action, the actual control system of the target robot can control the target robot to execute the coupling action, and further obtain the actual state of the target robot at the next moment of the current moment.
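As an illustration of the coupling described above, the following is a minimal sketch (the function and variable names are hypothetical; the patent does not give an implementation). Since, as shown later in the description, both strategies output Cartesian coordinates, coupling reduces to vector addition:

```python
import numpy as np

def couple_actions(task_action: np.ndarray, correction_action: np.ndarray) -> np.ndarray:
    """Couple the task action and the correction action.

    Both actions are Cartesian displacement targets for the gripper tip,
    so coupling reduces to vector addition (see the formula
    a_C = a_T + a_G later in the description).
    """
    return task_action + correction_action

# Example from the text: a correction due north plus a task action due west
# yields a coupled action pointing north-west.
north = np.array([0.0, 1.0, 0.0])   # correction action
west = np.array([-1.0, 0.0, 0.0])   # task action
print(couple_actions(west, north))  # -> [-1.  1.  0.], i.e. north-west
```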
In the embodiment of the invention, the difference strategy can be determined by a state deviation set between a sample actual state set obtained after the task strategy is transferred to the actual control system for multiple times and a reference state set output by the task strategy. The number of times of migration may be set according to the circumstances, such as once, three times, ten times, one hundred times, and the like, and this is not particularly limited in the embodiment of the present invention. Each time of migration, a sample task action at each time of the migration is obtained, and the actual control system can obtain a sample actual state set including the sample actual state at each time of the migration by executing the sample task action. The reference state set is a set formed by reference states at all times in the process of executing a task to be executed by the target robot, and the state deviation set is a set of state differences between the reference states and the actual states at the same time in the sample actual state set and the reference state set. Through the set, the difference strategy can be determined by combining the sample correction action corresponding to each migration.
The robot control strategy migration method provided by the embodiment of the invention first migrates the task strategy of the target robot to the actual control system of the target robot and determines the actual state of the target robot at the current moment based on the actual control system. Then, when the difference value between the actual state and the reference state determined based on the task strategy is outside the preset range, the difference strategy of the target robot is migrated to the actual control system, so that the actual control system executes the coupled action under the task strategy and the difference strategy and thereby determines the actual state of the target robot at the moment following the current moment. Because the difference strategy is introduced when the difference between the actual state and the reference state falls outside the preset range, the dual-strategy collaborative migration of the task strategy and the difference strategy preserves the application effect of the task strategy in the actual control system, allowing the actual control system to control the target robot more accurately and the task to be executed to be completed smoothly.
On the basis of the foregoing embodiment, in the robot control policy migration method provided in the embodiment of the present invention, the difference policy is specifically determined by the following method:
transferring the task strategy to the actual control system for multiple times, and determining a sample actual state set of the target robot based on sample actions obtained by the actual control system executing the task strategy each time;
for any transition, determining a sample correction action corresponding to the any transition based on a state deviation set between a reference state set of the target robot and a sample actual state set corresponding to the any transition;
and determining the difference strategy based on the sample correction actions corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action.
Specifically, in the embodiment of the present invention, when determining the difference policy, the task policy may be migrated to the actual control system multiple times. On each migration, the actual control system executes the sample action given by the task policy at each moment of that migration, from which a sample actual state set of the target robot is determined; the set contains the sample actual states of the target robot at the different moments of that migration.
Since the process is the same for every migration, take any one of the multiple migrations as an example. For that migration, the reference state set can be expressed as $\{s_1, s_2, \ldots, s_n\}$, the sample actual state set as $\{\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_n\}$, and the state deviation set as $\{\Delta s_1, \Delta s_2, \ldots, \Delta s_n\}$ with $\Delta s_i = \hat{s}_i - s_i$, where $n$ is the number of states involved in executing the task to be executed. The sample correction action corresponding to the migration is then determined from the state deviation set between the reference state set of the target robot and the sample actual state set of that migration. By matching the sample correction action with the sample task action, the state deviation can be reduced, that is, the sample actual state is driven toward the reference state.
And finally, the difference strategy can be obtained by taking the sample correction action corresponding to the multiple times of migration and the actual state of the sample corresponding to each sample correction action as training samples.
In the embodiment of the invention, the difference strategy is obtained by training the sample actual state set obtained by the multiple times of migration of the task strategy and the sample correction action corresponding to each sample actual state as the training sample, so that the correction effect of the difference strategy on the task strategy migrated to the actual control system is better.
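A minimal sketch of this sample-collection step (all names are illustrative assumptions; the patent specifies only the data flow, not code):

```python
import numpy as np

def state_deviation_set(reference_states, actual_states):
    """Deviation set {Δs_1, ..., Δs_n}: per-moment difference between the
    sample actual state and the reference state of the same moment."""
    return [s_hat - s for s, s_hat in zip(reference_states, actual_states)]

def collect_training_samples(migrations):
    """Gather (sample actual state, sample correction action) pairs over
    multiple migrations; these pairs are the training samples for the
    difference strategy."""
    samples = []
    for actual_states, correction_actions in migrations:
        samples.extend(zip(actual_states, correction_actions))
    return samples

# Toy usage with 2-D planar states (illustrative values only).
ref = [np.array([0.0, 0.0]), np.array([0.1, 0.0])]
act = [np.array([0.0, 0.0]), np.array([0.1, 0.03])]
print(state_deviation_set(ref, act))  # second step deviates in y
```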
On the basis of the foregoing embodiment, the robot control strategy migration method provided in the embodiment of the present invention is a method for determining a sample correction action corresponding to any one of the transitions based on a reference state set of the target robot and a state deviation set between sample actual state sets corresponding to the any one of the transitions, and specifically includes:
selecting a first state deviation exceeding a threshold value from the state deviation set according to a time sequence, and determining an alternative sample correction action set corresponding to the state deviation;
and determining the sample correcting action based on the state deviation, a sample estimated state obtained by correcting the actual state of the sample corresponding to the state deviation through each candidate sample correcting action in the candidate sample correcting action set and a reference state corresponding to the state deviation.
Specifically, in the embodiment of the present invention, when determining the sample correction action, the first state deviation $\Delta s$ exceeding a threshold $c$ is selected from the state deviation set in time order. The threshold may be a distance threshold and can be set as needed; the embodiment of the present invention does not specifically limit it. Combining $\Delta s$ with the same-moment states $\hat{s}$ and $s$ yields the triple $(s, \hat{s}, \Delta s)$, and the state-deviation trajectory obtained after $k$ migrations can be expressed as $\{\Delta s^{(1)}, \Delta s^{(2)}, \ldots, \Delta s^{(k)}\}$. From the state deviation, a candidate sample correction action set corresponding to that deviation can be determined; a candidate sample correction action may be a manually selected action that could plausibly realize the correction.

The sample correction action is then determined from the state deviation $\Delta s$, the sample estimated states obtained by correcting the sample actual state corresponding to the deviation with each candidate action in the candidate set, and the reference state corresponding to the deviation. For example, one may first compute the absolute state deviation between the sample estimated state, obtained by correcting the sample actual state with each candidate sample correction action, and the corresponding reference state, and then compare it with $|\Delta s|$. If there is a candidate action $a^*$ with

$$ \left| M(\hat{s}_h, a^*) - s_h \right| < \left| \Delta s_h \right|, \qquad h = 1, \ldots, k, $$

where $h$ runs from 1 to $k$ in natural order, $M(\hat{s}_h, a^*)$ is the sample estimated state at the $h$-th migration, $s_h$ is the reference state at the $h$-th migration, and $M$ denotes the correction, then the candidate $a^*$ satisfies the correction requirement and is taken as the sample correction action.
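A sketch of this selection rule, assuming a correction model $M$ is available as a callable (M, the candidate set, and all names are assumptions for illustration):

```python
import numpy as np

def select_correction_action(candidates, actual_state, reference_state, M):
    """Return the candidate correction action whose estimated corrected
    state M(s_hat, a) lands closest to the reference state, provided it
    actually shrinks the deviation |Δs| = |s_hat - s|."""
    def residual(a):
        return np.linalg.norm(M(actual_state, a) - reference_state)

    best = min(candidates, key=residual)
    if residual(best) < np.linalg.norm(actual_state - reference_state):
        return best
    return None  # no candidate improves on the uncorrected deviation
```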
The embodiment of the invention provides a method for determining the sample correcting action, so that the accuracy of the difference strategy obtained by training the sample correcting action can be ensured.
On the basis of the foregoing embodiment, the robot control policy migration method provided in the embodiment of the present invention determines the difference policy based on sample correction actions corresponding to multiple migrations and a sample actual state corresponding to each sample correction action, and specifically includes:
and constructing a training target based on the sample correcting actions corresponding to the multiple times of migration and the sample actual state corresponding to each sample correcting action, and training the sample correcting actions corresponding to the multiple times of migration and the sample actual states corresponding to each sample correcting action based on the training target to obtain the difference strategy.
Specifically, when determining the difference policy in the embodiment of the present invention, a training target may be first constructed according to sample correction actions corresponding to multiple migrations and sample actual states corresponding to the sample correction actions, where a form of the training target is related to task content of a task to be executed, and this is not specifically limited in the embodiment of the present invention. For example, the training target may be such that the difference between the actual state of the target robot and the reference state is minimal.
Then, according to the training target, the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action are used for training to obtain the difference strategy $\pi_{\theta_G}$. In the embodiment of the invention, obtaining the difference strategy by training makes it more reliable.
On the basis of the foregoing embodiment, the robot control strategy migration method provided in the embodiment of the present invention determines the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual states corresponding to the sample correction actions, and before the determining, further includes:
and eliminating repeated sample correction actions in the sample actual states corresponding to the sample correction actions and the repeated sample correction actions corresponding to the repeated sample correction actions.
Specifically, in the embodiment of the invention, after $k$ migrations, each pair $(\hat{s}_h, a_h^G)$ can be extracted to obtain a number of representative pairs $(\hat{s}, a^G)$; that is, repeated sample correction actions and the sample actual states corresponding to those repeated actions are eliminated, where $\hat{s}_h$ denotes the sample actual state at the $h$-th migration. Then, when determining the difference policy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action, the difference policy can be determined from the retained representative pairs $(\hat{s}, a^G)$.
In the embodiment of the invention, the calculation amount in the process of determining the difference strategy can be reduced by eliminating the repeated sample correction action and the actual sample state corresponding to the repeated sample correction action, so that the process of determining the difference strategy is simplified.
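A minimal deduplication sketch consistent with this step (the tolerance and names are illustrative assumptions):

```python
import numpy as np

def deduplicate_pairs(pairs, tol=1e-6):
    """Keep one representative of each (sample actual state, sample
    correction action) pair, dropping repeats, to cut the amount of
    computation in the subsequent difference-strategy training."""
    kept = []
    for state, action in pairs:
        repeated = any(np.allclose(state, s, atol=tol) and
                       np.allclose(action, a, atol=tol)
                       for s, a in kept)
        if not repeated:
            kept.append((state, action))
    return kept
```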
On the basis of the above embodiment, in the robot control strategy migration method provided in the embodiment of the present invention, the task strategy is obtained by pre-training based on a reinforcement learning method, and the reward function adopted during training is determined based on a distance function between the actual position and the target position of the target object involved in the task to be executed by the target robot.
Specifically, in the embodiment of the present invention, the task policy may be obtained by training with a reinforcement learning method, such as Deep Deterministic Policy Gradient (DDPG) or Proximal Policy Optimization (PPO).
In the process of training the task strategy, the reward function used can be determined according to a distance function between the actual position and the target position of the target object involved in the task to be performed of the target robot. For example:
$$ r_t = d_0 - d_t $$

where $r_t$ is the reward at time $t$, $d_t$ is the distance between the actual position of the target object and the target position at time $t$, and $d_0$ is the distance between the actual position of the target object and the target position at the initial moment.
The reward function involved in the embodiment of the invention is a dense reward function; compared with 0/1-type sparse reward functions, training is faster, and the reward grows gradually as the distance shrinks. In addition, unlike typical pushing tasks of the same type, the embodiment of the invention does not randomly update the initial position and target position of the target object in each round; the two positions are fixed according to the situation of the practical application. This reduces training time, though at the cost of some robustness. The principle followed in training the task strategy is therefore to shorten training time as much as possible while the strategy still meets the task execution requirement, simplifying the simulation setting and weakening the robustness and generalization of the strategy trained under virtual simulation, which in turn highlights the effectiveness and robustness of the migration method itself.
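A sketch of this dense reward under the reconstruction above (the exact functional form appears only as an image in the original, so this is an assumption consistent with the stated behavior):

```python
import numpy as np

def dense_push_reward(object_pos, target_pos, d0):
    """Dense reward r_t = d0 - d_t: grows as the object-to-target
    distance d_t shrinks below the initial distance d0."""
    d_t = np.linalg.norm(np.asarray(object_pos) - np.asarray(target_pos))
    return d0 - d_t
```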
As shown in fig. 3, on the basis of the above embodiment, controlling the target robot in the embodiment of the present invention proceeds as follows. First the task policy is obtained and migrated to the actual control system. Then it is judged whether the difference between the actual state at the current moment, determined by the actual control system, and the reference state is within the preset range. If so, execution continues until the task to be executed is completed. If not, the difference policy is introduced, its action is coupled with that of the task policy, and the coupled action is applied to the actual control system; it is then judged whether the difference between the actual state at the next moment and the reference state of the target robot is within the preset range.
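The flow of fig. 3 can be summarized in the following control-loop sketch (all interfaces, such as `system.step` and the policy callables, are assumptions for illustration, not part of the patent):

```python
import numpy as np

def run_dual_policy_control(task_policy, difference_policy, system,
                            reference_states, tolerance, max_steps):
    """Execute the task policy alone while the actual state tracks the
    reference state; switch to the coupled action once the deviation
    leaves the preset range (the tolerance)."""
    state = system.current_state()
    for t in range(max_steps):
        action = task_policy(state)
        if np.linalg.norm(state - reference_states[t]) > tolerance:
            action = action + difference_policy(state)  # coupled action
        state = system.step(action)
    return state
```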
On the basis of the above embodiments, the effectiveness and robustness of the robot control strategy migration method provided by the embodiment of the invention are verified based on the adaptive object pushing experiment of the UR3 robot.
Experimental setup of the virtual simulation: the simulation environment is built from the MuJoCo physics engine and OpenAI's Gym library, using a UR3 robot with 6 joint degrees of freedom; the end of the robot carries two fingers, which are kept closed to push the experimental object during the experiment. The simulation step size is 0.002 s, with 50 steps per round. The output of the task strategy is the Cartesian coordinates that the tip of the robot gripper should reach, with the Z coordinate fixed to keep the tip height constant. The inputs of the task strategy comprise the Cartesian position coordinates of the experimental object, the target and the robot gripper tip, as well as the deflection posture of the experimental object. The simulation coordinate system takes the center of the UR3 robot base as the zero point of the world coordinates.
Experimental setup of the actual control system: the actual control system is the Robot Operating System (ROS), and the target robot is a UR3 robot. Communication among a computer, a Kinect2 camera and the UR3 robot is built on ROS. The end of the robot carries a finger gripper; since gripping is not involved, the gripper is kept in a normally closed state. The camera acquires information such as the pose of the target object and the target position, and the various state data of the target robot are acquired with the MoveIt package under ROS. The tip of the robot gripper is kept perpendicular to the motion plane at a fixed height, so that collisions of the tip are avoided. In addition, in the embodiment of the invention, the same action command produces the same execution in the actual robot and the simulated robot. In fact a slight error does exist, but compared with the difference influence of the experimental object, the robot's own execution error is negligible and does not significantly affect the experimental results of the embodiment of the present invention.
In order to ensure the rigor of the experiment, the data types of strategy input and strategy output during simulation training are consistent with those during actual migration, and the coordinates of the two environments are correspondingly the same.
Simulation training of the task strategy: for the task strategy training of the UR3 robot pushing experiment, the corresponding simulation environment is designed first, as described above. The reinforcement learning algorithm used for training is SAC, used in combination with HER. The neural network settings and hyperparameters of the algorithm use the corresponding defaults in the stable-baselines library.
Analysis and summary of the task strategy real-migration experiment: the trained task strategy is first tested in simulation, and the specific position of the pushed target object in the motion plane at each step is recorded and collected as a reference trajectory, as shown in fig. 4, where the initial point 42 is the actual position of the target object at the initial moment and the end point 41 is the target position reached by the target object. The task strategy is then migrated to the UR3 robot system. Because physical characteristics such as the actual object mass, center of gravity and friction are unknown, difference effects arise: over many pushing experiments, a number of object states departing from the reference trajectory are observed, with changes in both position and deflection, i.e. deviations $\Delta s$. A further generalization can be drawn: even without knowing the physical parameters of the target object, the influence of the difference at the outset falls into essentially two types, deflection to either side of the reference trajectory along the motion direction, differing only in degree, as shown in fig. 5 and fig. 6. The squares in figs. 5 and 6 represent the target object, the circles represent the contact point of the target robot with the target object, and the arrows give the moving direction of the target object at time $t$. Fig. 5 shows the target object deflecting clockwise at times $t$, $t+1$, $t+2$ and $t+3$, and fig. 6 shows it deflecting counterclockwise at the same times.
As the task actions proceed, the deflection gradually increases, the task strategy cannot provide a corresponding action adjustment, the position drifts as well, and the task finally fails. This suggests the hypothesis that giving a timely correction action whenever deflection occurs would raise the task success rate. These deviation summaries and correction findings are exactly the prior knowledge needed for the next step, the simulation training of the difference strategy, and are the reason the method first migrates the task strategy to reality.
Analysis and simulation training of the difference strategy: exploration confined to the virtual simulation level alone, however long it accumulates, cannot guarantee that existing methods will achieve the strongly robust migration of a single strategy. Migrating two strategies simultaneously, one in charge of the "task" and one in charge of "correction", jointly overcoming the difference and completing the task, is an idea that is gradually being applied. For the UR3 robot pushing experiment, after training and migration of the task strategy, the deviation $\Delta s$ that concretely embodies the difference influence is found to be a two-sided deflection, so the corresponding correction should clearly be an action in an oblique direction relative to the motion direction of the object pushing. Judging from the motion direction of the migrated task strategy against the reference trajectory, the task action $a_t^T$ given by the task policy is basically consistent with the motion direction and also satisfies the motion requirement under the reward function. The simulation setup for the difference strategy is then clear: when the object deflects clockwise, the robot end requires the difference strategy to give a correction action $a_t^G$ vertically upward relative to the motion direction; when the object deflects counterclockwise, the correction action $a_t^G$ given by the difference strategy should be vertically downward. After the actions given by the two strategies are coupled, the coupled action $a_t^C$ points obliquely upward or obliquely downward relative to the motion direction; the corresponding schematic top view is shown in fig. 7, which briefly shows, from the start of the experiment, the change in action execution and the deflection state of the target object from time $t$ to time $t+1$. The action outputs of both strategies are Cartesian coordinates for the next moment, so action coupling in the embodiment of the invention is vector addition of the coordinates:

$$ a_t^C = a_t^T + a_t^G $$
To make the output action meet this requirement, the training of the difference strategy is designed as a task in which the tip of the robot gripper reaches a random point in a designated area; top-view schematics are shown in fig. 8 and fig. 9. Fig. 8 shows the case where the target object has deflected clockwise at the start of migration: the gripper tip initially appears at a random point A on the fitted line segment from initial point 81 to end point 82, and the training task is for the tip to move to a point B perpendicular to the fitted line at A and at a fixed distance from A. Fig. 9 shows the case where the target object has deflected counterclockwise at the start of migration: the tip initially appears at a random point A on the fitted line segment from initial point 91 to end point 92, and the training task is likewise for the tip to move to a point B perpendicular to the line at A and at a fixed distance from A. The distance AB in the embodiment of the present invention may be set to 0.10 m.
It can be seen that the essential purpose of designing the difference strategy in the robot control strategy migration method provided by the embodiment of the present invention is to obtain the correction action, and how to design that action depends on the concrete manifestation of the real migration difference to be overcome, a difference that arises because the task strategy is obtained only under virtual simulation. This again reflects the interactive nature of the strategy learning, from "virtual" to "real" and back to "virtual".
Collaborative migration of the task strategy and the difference strategy: after the difference strategy is obtained through training, it is combined with the task strategy and migrated to the actual system. The program can follow a dual-process pattern so that the two strategies run in parallel. Action execution is adjusted in real time according to the specific situation: if no deflection difference occurs, only the action given by the task strategy is executed; if deflection occurs, the coupled action of the two strategies is executed. Specifically, at the start of strategy migration, the target object's initial angle about the Z axis is 135 degrees, and deflection about the X and Y axes is ignored. It is further set that when the angle is less than or equal to 130 degrees, clockwise deflection is considered to have occurred: the difference strategy is required to give a vertically upward action, which is coupled with the task action before execution, so that the coupled action points obliquely upward of the movement direction; the object is then pushed along the task direction while the deflection is corrected, until the deflection angle returns to 135 degrees, after which single execution of the task strategy resumes. When the angle is greater than or equal to 140 degrees, anticlockwise deflection is considered to have occurred, the coupled action points obliquely downward of the movement direction, and the execution logic is the same as above.
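A minimal sketch of this decision logic, including the hysteresis implied by "until the deflection angle returns to 135 degrees" (the return tolerance and all names are assumptions):

```python
import numpy as np

def make_action_selector(init_yaw=135.0, low=130.0, high=140.0, tol=0.5):
    """Build a selector for the dual-strategy execution logic above.

    Correction starts once the yaw angle (degrees about the Z axis) leaves
    the band (low, high), and continues until the yaw has returned to the
    initial 135 degrees, after which the task action is executed alone.
    """
    state = {"correcting": False}

    def select(yaw_deg, task_action, diff_action):
        if state["correcting"] and abs(yaw_deg - init_yaw) <= tol:
            state["correcting"] = False          # deflection corrected
        if yaw_deg <= low or yaw_deg >= high:
            state["correcting"] = True           # clockwise or anticlockwise
        if state["correcting"]:                  # execute the coupled action
            return np.asarray(task_action, float) + np.asarray(diff_action, float)
        return task_action                       # task strategy alone

    return select

select = make_action_selector()
a = select(128.0, [0.02, 0.0], [0.0, 0.01])  # clockwise deflection -> coupled
```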
In the embodiment of the invention, single migration of the task strategy serves as the baseline experiment, against which dual-strategy migration is compared. The object pushed in the experiments is a carton with a two-dimensional code attached, of size 0.15 m × 0.05 m and weighing about 60 g. To verify the effectiveness and robustness of the method, in addition to the experiments on the plain carton, a 1000 g iron block was further placed at a non-geometric-center position in the upper or lower part of the carton's interior, so that the mass, friction, and center-of-gravity position of the carton as a whole were significantly increased and changed, and the experiments were carried out again.
A successful trial is defined as follows: the distance between the center of the carton's two-dimensional code and the center of the target position's two-dimensional code is less than 0.02 m, and the deflection angle differs from the initial angle by no more than ±5 degrees. The experimental end point $p_{end}$ and starting point $p_{start}$ are given in meters; the carton's starting position is not measured precisely in each trial but is placed at the approximate location of the starting point. Each set of experiments was run 50 times and the corresponding success rates recorded; the specific results are shown in table 1.
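As a sketch, the success check for a single trial directly encodes the two criteria just stated (positions in meters, angles in degrees; names are illustrative):

```python
import numpy as np

def trial_succeeded(p_box, p_target, yaw_deg, init_yaw_deg=135.0):
    """Success criterion described above: the two-dimensional-code centers
    of the carton and the target position are within 0.02 m, and the
    deflection angle differs from the initial angle by at most 5 degrees."""
    close = np.linalg.norm(np.asarray(p_box) - np.asarray(p_target)) < 0.02
    undeflected = abs(yaw_deg - init_yaw_deg) <= 5.0
    return close and undeflected
```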
Table 1 success rate of experiments on three types of cartons using two migration methods
Experimental results show that the robot control strategy migration method provided by the embodiment of the invention is strongly robust to changes in the physical attributes of the carton and completes the pushing task with a high success rate; single-strategy migration has some task-completion capability only on the carton without an iron block, and cannot complete the task once an iron block is loaded. In addition, the in-plane movement trajectories of the three carton types pushed under the two migration methods are shown in figs. 10 to 15, in which the solid points all represent target positions. Figs. 10, 12, and 14 show the task completion for the three carton types (no iron block, iron block placed above, and iron block placed below) under single-strategy migration, and figs. 11, 13, and 15 show the corresponding task completion under dual-strategy migration. Comparing fig. 10 with fig. 11, dual-strategy migration achieves better task completion; comparing fig. 12 with fig. 13, the task cannot be completed with single-strategy migration but can be completed with dual-strategy migration; the same holds when comparing fig. 14 with fig. 15. It follows that dual-strategy migration can adjust to the influence of the difference in real time, whereas single migration cannot. The broken lines appearing in figs. 11, 13, and 15 are trajectory changes caused by the introduction of the difference strategy.
It should be noted that the task strategy used for single migration was not specifically retrained; the same strategy and the same program were used in the experiments on all three carton types. Likewise, when the dual-strategy migration method provided by the embodiment of the invention was used for the pushing experiments on the three carton types, no manual adjustment was made to the task strategy or the difference strategy, and the same program was run with the same strategies. This reflects the strong robustness of the method: the strategies do not need to be retrained when the physical attributes of the experimental object change.
Researchers have previously studied training robot pushing strategies with domain randomization systematically, but the variation in the physical attributes of the pushed object was limited, with only a friction-increasing sheet of paper added at the bottom. Moreover, that approach places high demands on equipment, requiring the strategy to be trained for 8 hours on a 100-core machine, whereas the present method needs only about 3 hours of training on a conventional 4-core computer with an 8 GB graphics card (this training time being the sum of the training times of the two strategies).
The virtual-real interactive dual-strategy migration pushing experiments carried out on the UR3 robot reflect the underlying logic: learning and training the task skill in the "virtual"; summarizing, in the "real", the corrections for the observed deviations; learning and training the difference compensation back in the "virtual"; and executing, in the "real", the strongly robust, low-difference dual strategy. Under the synergy of the two strategies, the uncertain influence caused by the difference problem is effectively overcome, and the task success rate is significantly improved.
In conclusion, the robot control strategy migration method provided herein, which addresses the difference problem from simulation to reality, is strongly robust, highly efficient, and realizes virtual-real interaction.
As shown in fig. 16, on the basis of the above embodiments, an embodiment of the present invention provides a robot control strategy migration apparatus, including:
a task strategy migration module 161, configured to migrate a task strategy of a target robot to an actual control system of the target robot, and determine the actual state of the target robot at the current moment based on the actual control system;
a difference strategy migration module 162, configured to migrate the difference strategy of the target robot to the actual control system if it is determined that the difference between the actual state and the reference state determined based on the task strategy is outside a preset range, so that the actual control system executes the coupled action under the task strategy and the difference strategy, and further determines the actual state of the target robot at the moment next to the current moment;
wherein the difference strategy is determined based on the state deviation sets between the sample actual state sets, obtained from multiple migrations of the task strategy to the actual control system, and the reference state set output by the task strategy, and on the sample correction action corresponding to each migration.
On the basis of the foregoing embodiment, the robot control strategy migration apparatus provided in the embodiment of the present invention further includes a difference strategy determination module, configured to:
migrate the task strategy to the actual control system multiple times, and determine a sample actual state set of the target robot based on the sample actions obtained each time the actual control system executes the task strategy;
for any one migration, determine the sample correction action corresponding to that migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration;
and determine the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action.
On the basis of the foregoing embodiment, in the robot control strategy migration apparatus provided in the embodiment of the present invention, the difference strategy determination module is specifically configured to:
select, in time order, the first state deviation exceeding a threshold value from the state deviation set, and determine a candidate sample correction action set corresponding to that state deviation;
and determine the sample correction action based on that state deviation, the sample estimated states obtained by correcting the sample actual state corresponding to that state deviation with each candidate sample correction action in the candidate set, and the reference state corresponding to that state deviation.
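One plausible reading of this selection step is sketched below; the scoring rule (keep the candidate whose estimated state lands closest to the reference state) is an assumption, since the module description does not fix the metric, and all names are illustrative:

```python
import numpy as np

def pick_sample_correction(deviations, actual_states, reference_states,
                           candidates, threshold):
    """Sketch of the correction-selection step described above.

    Walk the state deviations in time order and stop at the first one
    exceeding the threshold; among the candidate corrective actions, keep
    the one whose estimated state (the actual state shifted by the
    candidate) is closest to the corresponding reference state.
    """
    for t, dev in enumerate(deviations):
        if dev > threshold:
            dists = [np.linalg.norm(np.asarray(actual_states[t]) + np.asarray(a)
                                    - np.asarray(reference_states[t]))
                     for a in candidates]
            return candidates[int(np.argmin(dists))], t
    return None, None   # no deviation exceeded the threshold
```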
On the basis of the foregoing embodiment, in the robot control strategy migration apparatus provided in the embodiment of the present invention, the difference strategy determination module is further specifically configured to:
determine the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action, specifically by:
constructing a training target based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action, and training on those sample correction actions and their corresponding sample actual states under the training target to obtain the difference strategy.
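A minimal supervised sketch of this training step, assuming an affine model fitted by least squares on placeholder data (the embodiment fixes only the pairing of sample actual states with sample correction actions, not the model class or loss, so both are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))    # placeholder sample actual states
actions = rng.normal(size=(500, 2))   # placeholder sample correction actions

X = np.hstack([states, np.ones((len(states), 1))])  # add a bias column
W, *_ = np.linalg.lstsq(X, actions, rcond=None)     # training target: MSE

def diff_policy(state):
    """Corrective action predicted for a single sample actual state."""
    return np.append(state, 1.0) @ W
```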
On the basis of the foregoing embodiment, the robot control strategy migration apparatus provided in the embodiment of the present invention further includes a rejection module, configured to:
eliminate repeated sample correction actions and the sample actual states corresponding to the repeated sample correction actions.
On the basis of the above embodiments, in the robot control strategy migration apparatus provided in the embodiment of the present invention, the task strategy is obtained by pre-training based on a reinforcement learning method, and the reward function used in training is determined based on a distance function between the actual position and the target position of the target object involved in the task to be executed by the target robot.
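One plausible reward of this kind, assuming the distance function is the Euclidean distance (the exact function and any shaping terms are not fixed here, so this is only an assumption):

```python
import numpy as np

def reward(p_object, p_target):
    """Negative Euclidean distance between the pushed object's actual
    position and the target position of the task to be executed."""
    return -float(np.linalg.norm(np.asarray(p_object) - np.asarray(p_target)))
```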
Specifically, the functions of the modules in the robot control policy migration apparatus provided in the embodiment of the present invention correspond to the operation flows of the steps in the method embodiments one to one, and the implementation effects are also consistent.
As shown in fig. 17, on the basis of the foregoing embodiments, an embodiment of the present invention provides a robot control strategy migration system, including: a camera device 171 and the robot control strategy migration apparatus 172 described in the above embodiments, the migration apparatus 172 being connected to the camera device 171; the camera device 171 is used to acquire the actual state of the target robot.
Fig. 18 illustrates a physical structure diagram of an electronic device. As shown in fig. 18, the electronic device may include: a processor 1810, a communication interface 1820, a memory 1830, and a communication bus 1840, wherein the processor 1810, the communication interface 1820, and the memory 1830 communicate with each other via the communication bus 1840. The processor 1810 may invoke logic instructions in the memory 1830 to perform the robot control strategy migration method provided by the above embodiments, the method including: migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system; if it is determined that the difference between the actual state and the reference state determined based on the task strategy is outside a preset range, migrating the difference strategy of the target robot to the actual control system so that the actual control system executes the coupled action under the task strategy and the difference strategy, and further determining the actual state of the target robot at the moment next to the current moment; wherein the difference strategy is determined based on the state deviation sets between the sample actual state sets, obtained from multiple migrations of the task strategy to the actual control system, and the reference state set output by the task strategy, and on the sample correction action corresponding to each migration.
In addition, the logic instructions in the memory 1830 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute the robot control strategy migration method provided by the above embodiments, the method including: migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system; if it is determined that the difference between the actual state and the reference state determined based on the task strategy is outside a preset range, migrating the difference strategy of the target robot to the actual control system so that the actual control system executes the coupled action under the task strategy and the difference strategy, and further determining the actual state of the target robot at the moment next to the current moment; wherein the difference strategy is determined based on the state deviation sets between the sample actual state sets, obtained from multiple migrations of the task strategy to the actual control system, and the reference state set output by the task strategy, and on the sample correction action corresponding to each migration.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the robot control strategy migration method provided in the foregoing embodiments, the method including: migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system; if it is determined that the difference between the actual state and the reference state determined based on the task strategy is outside a preset range, migrating the difference strategy of the target robot to the actual control system so that the actual control system executes the coupled action under the task strategy and the difference strategy, and further determining the actual state of the target robot at the moment next to the current moment; wherein the difference strategy is determined based on the state deviation sets between the sample actual state sets, obtained from multiple migrations of the task strategy to the actual control system, and the reference state set output by the task strategy, and on the sample correction action corresponding to each migration.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A robot control strategy migration method is characterized by comprising the following steps:
migrating a task strategy of a target robot to an actual control system of the target robot, and determining the actual state of the target robot at the current moment based on the actual control system;
if it is determined that the difference between the actual state and the reference state determined based on the task strategy is outside a preset range, migrating the difference strategy of the target robot to the actual control system, so that the actual control system executes the coupled action under the task strategy and the difference strategy, and further determining the actual state of the target robot at the moment next to the current moment;
wherein the difference strategy is determined based on the state deviation sets between the sample actual state sets, obtained from multiple migrations of the task strategy to the actual control system, and the reference state set output by the task strategy, and on the sample correction action corresponding to each migration.
2. The robot control strategy migration method of claim 1, wherein the difference strategy is specifically determined by:
migrating the task strategy to the actual control system multiple times, and determining a sample actual state set of the target robot based on the sample actions obtained each time the actual control system executes the task strategy;
for any one migration, determining the sample correction action corresponding to that migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration;
and determining the difference strategy based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action.
3. The robot control strategy migration method according to claim 2, wherein determining the sample correction action corresponding to any one migration based on the state deviation set between the reference state set of the target robot and the sample actual state set corresponding to that migration specifically comprises:
selecting, in time order, the first state deviation exceeding a threshold value from the state deviation set, and determining a candidate sample correction action set corresponding to that state deviation;
and determining the sample correction action based on that state deviation, the sample estimated states obtained by correcting the sample actual state corresponding to that state deviation with each candidate sample correction action in the candidate set, and the reference state corresponding to that state deviation.
4. The robot control strategy migration method according to claim 2, wherein the determining the difference strategy based on the sample correction actions corresponding to the multiple migrations and the actual state of the sample corresponding to each sample correction action specifically comprises:
constructing a training target based on the sample correction actions corresponding to the multiple migrations and the sample actual state corresponding to each sample correction action, and training on those sample correction actions and their corresponding sample actual states under the training target to obtain the difference strategy.
5. The robot control strategy migration method according to claim 4, wherein the determining the difference strategy based on the sample correction actions corresponding to the plurality of migrations and the sample actual states corresponding to the respective sample correction actions further comprises:
eliminating repeated sample correction actions and the sample actual states corresponding to the repeated sample correction actions.
6. The robot control strategy migration method according to any one of claims 1 to 5, wherein the task strategy is obtained by pre-training based on a reinforcement learning method, and the reward function used in training is determined based on a distance function between the actual position and the target position of the target object involved in the task to be executed by the target robot.
7. A robot control strategy migration apparatus, comprising:
the task strategy migration module is used for migrating a task strategy of the target robot to an actual control system of the target robot and determining the actual state of the target robot at the current moment based on the actual control system;
a difference strategy migration module, configured to migrate the difference strategy of the target robot to the actual control system if it is determined that the difference between the actual state and the reference state determined based on the task strategy is outside a preset range, so that the actual control system executes the coupled action under the task strategy and the difference strategy, and further determines the actual state of the target robot at the moment next to the current moment;
wherein the difference strategy is determined based on the state deviation sets between the sample actual state sets, obtained from multiple migrations of the task strategy to the actual control system, and the reference state set output by the task strategy, and on the sample correction action corresponding to each migration.
8. A robot control strategy migration system, comprising: a camera device and the robot control strategy migration apparatus according to claim 7, wherein the migration apparatus is connected with the camera device;
the camera device is used for acquiring the actual state of the target robot.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the robot control strategy migration method according to any of claims 1 to 6 are implemented when the program is executed by the processor.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the robot control strategy migration method according to any one of claims 1 to 6.
CN202110603540.9A 2021-05-31 2021-05-31 Robot control strategy migration method, device and system Active CN113050433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110603540.9A CN113050433B (en) 2021-05-31 2021-05-31 Robot control strategy migration method, device and system

Publications (2)

Publication Number Publication Date
CN113050433A true CN113050433A (en) 2021-06-29
CN113050433B CN113050433B (en) 2021-09-14

Family

ID=76518581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110603540.9A Active CN113050433B (en) 2021-05-31 2021-05-31 Robot control strategy migration method, device and system

Country Status (1)

Country Link
CN (1) CN113050433B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2384863A2 (en) * 2010-01-21 2011-11-09 Institutul de Mecanica Solidelor al Academiei Romane Method and device for dynamic control of a walking robot
CN109765820A (en) * 2019-01-14 2019-05-17 南栖仙策(南京)科技有限公司 A kind of training system for automatic Pilot control strategy
CN110000785A (en) * 2019-04-11 2019-07-12 上海交通大学 Agriculture scene is without calibration robot motion's vision collaboration method of servo-controlling and equipment
CN110083080A (en) * 2018-01-25 2019-08-02 发那科株式会社 Machine learning device and method, servo motor control unit and system
CN111667513A (en) * 2020-06-01 2020-09-15 西北工业大学 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
WO2020207789A1 (en) * 2019-04-12 2020-10-15 Robert Bosch Gmbh Method and device for controlling a technical apparatus

Also Published As

Publication number Publication date
CN113050433B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
Koç et al. Online optimal trajectory generation for robot table tennis
Peters et al. Reinforcement learning by reward-weighted regression for operational space control
US11823048B1 (en) Generating simulated training examples for training of machine learning model used for robot control
Mouret et al. Crossing the reality gap: a short introduction to the transferability approach
Felip et al. Manipulation primitives: A paradigm for abstraction and execution of grasping and manipulation tasks
CN113076615B (en) High-robustness mechanical arm operation method and system based on antagonistic deep reinforcement learning
Huang et al. Grasping novel objects with a dexterous robotic hand through neuroevolution
Hazard et al. Automated design of robotic hands for in-hand manipulation tasks
Dragan et al. Online customization of teleoperation interfaces
CN113050433B (en) Robot control strategy migration method, device and system
Chen et al. Dextransfer: Real world multi-fingered dexterous grasping with minimal human demonstrations
Liu et al. Grasp pose learning from human demonstration with task constraints
Le et al. Deformation-aware data-driven grasp synthesis
US11679496B2 (en) Robot controller that controls robot, learned model, method of controlling robot, and storage medium
Hilleli et al. Toward deep reinforcement learning without a simulator: An autonomous steering example
Huang et al. Tradeoffs in neuroevolutionary learning-based real-time robotic task design in the imprecise computation framework
CN116968024A (en) Method, computing device and medium for obtaining control strategy for generating shape closure grabbing pose
CN114585487A (en) Mitigating reality gaps by training simulations to real models using vision-based robot task models
Roesler et al. Action learning and grounding in simulated human–robot interactions
US20220410380A1 (en) Learning robotic skills with imitation and reinforcement at scale
Sui et al. Transfer of robot perception module with adversarial learning
Alizadeh Kolagar et al. NAO robot learns to interact with humans through imitation learning from video observation
Argall et al. Learning mobile robot motion control from demonstrated primitives and human feedback
KR20190088093A (en) Learning method for robot
CN114529010A (en) Robot autonomous learning method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant