CN114347043A - Manipulator model learning method and device, electronic equipment and storage medium - Google Patents

Manipulator model learning method and device, electronic equipment and storage medium

Info

Publication number
CN114347043A
CN114347043A
Authority
CN
China
Prior art keywords
expert
learning
strategy
demonstration data
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210257626.5A
Other languages
Chinese (zh)
Other versions
CN114347043B (en)
Inventor
焦家辉
张晟东
王济宇
李志建
蔡维嘉
李腾
张立华
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202210257626.5A priority Critical patent/CN114347043B/en
Publication of CN114347043A publication Critical patent/CN114347043A/en
Application granted granted Critical
Publication of CN114347043B publication Critical patent/CN114347043B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Manipulator (AREA)

Abstract

The invention relates to the technical field of intelligent manipulators and in particular discloses a manipulator model learning method and device, an electronic device and a storage medium. The learning method comprises the following steps: acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from; generating an expert strategy associated with a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data; minimizing the learning cost function to obtain an optimal expert strategy; and training the manipulator model according to the optimal expert strategy. The optimal expert strategy finally obtained by the method pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.

Description

Manipulator model learning method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of intelligent manipulators, in particular to a manipulator model learning method and device, electronic equipment and a storage medium.
Background
In production applications, a manipulator can enhance the universality of its automated interaction and efficiently complete complex tasks through reinforcement learning. Existing reinforcement learning models can generally accelerate model convergence by learning an optimal expert behavior strategy in combination with demonstration data, but the final reinforcement learning model is prone to failing to accurately imitate and complete the expert demonstration behavior, either because the expert behavior strategy drifts or because learning is driven only by the minimum simulation cost.
In view of the above problems, no effective technical solution exists at present.
Disclosure of Invention
The application aims to provide a manipulator model learning method and device, an electronic device and a storage medium, so that a manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In a first aspect, the present application provides a manipulator model learning method for training a manipulator model, the learning method including the following steps:
acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
generating an expert strategy associated with a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data;
minimizing the learning cost function to obtain an optimal expert strategy;
and training the manipulator model according to the optimal expert strategy.
According to the manipulator model learning method, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
The robot model learning method described above, wherein the step of acquiring a plurality of sets of expert demonstration data for the robot model to learn about the same execution task includes:
the robot is directly operated by a human expert repeatedly in virtual reality or a task demonstration about the same execution task is performed by a human expert teaching mobile robot repeatedly in a real scene to collect a plurality of sets of expert demonstration data.
In the learning method of this example, inputting expert demonstration data as a supervised learning object enables the manipulator model to quickly complete learning of the actions required to perform the task.
The manipulator model learning method according to the present invention, wherein the step of generating an expert strategy associated with a learning cost function from the expert demonstration data includes:
generating the expert strategy according to the expert demonstration data, wherein the expert strategy is used for guiding a manipulator model to generate imitation actions for imitating the expert demonstration data;
obtaining the simulation cost according to the imitation actions;
acquiring the reinforcement learning cost according to the density of the expert demonstration data corresponding to the imitation action among all the expert demonstration data;
and establishing a learning cost function according to the simulation cost and the reinforcement learning cost.
In the learning method of this example, since multiple groups of expert demonstration behaviors are input, the merit of an imitation action can be judged from its position within the distribution of the expert demonstration behaviors, that is, from whether the imitation action can complete the execution task well; the reinforcement learning cost is therefore set according to the position of the imitation action in the distribution of the expert demonstration behaviors and used as a reward-and-punishment condition to drive reinforcement learning of the manipulator model.
The manipulator model learning method, wherein the step of minimizing the learning cost function to obtain an optimal expert strategy comprises:
training the expert strategy according to a plurality of groups of expert demonstration data to minimize and converge simulation cost in the learning cost function;
minimizing the reinforcement learning cost to obtain an optimal expert strategy.
The learning method of this example preferably first trains the expert strategy so that the simulation cost converges to a minimum, and then performs reinforcement learning updates until the reinforcement learning cost is minimized; updating the behavior step by step in this way reduces the number of variables updated at once and improves the efficiency of obtaining the optimal expert strategy.
The method for learning a manipulator model, wherein the training of the expert strategy according to the plurality of sets of expert demonstration data to minimize and converge the simulation cost in the learning cost function comprises:
extracting part of the expert demonstration data to train the expert strategy, so that the simulation cost in the learning cost function is converged;
and updating the learning cost function with gradients computed on the expert demonstration data, so that the already converged simulation cost is minimized.
The robot model learning method further includes, before the step of acquiring a plurality of sets of expert demonstration data about the same execution task for the robot model learning, the steps of:
and establishing a state space for training the manipulator model according to the image data characteristics of the operation scene and the data of the airborne sensor.
According to this learning method, the manipulator's own sensing obtained through on-board sensors on the manipulator body is directly combined with image data features acquired in advance to establish a state space that replaces the state input of the reinforcement learning training task. This achieves the same sample efficiency and performance as state-based learning while omitting the process of calibrating the environment with external equipment, thereby reducing the dependence on environmental equipment and lowering the operating cost.
The manipulator model learning method comprises the steps that image data features are extracted according to image information of a working scene based on a preset feature extractor, and the preset feature extractor is generated based on life scene images and/or Imagenet database training.
In a second aspect, the present application further provides a manipulator model learning device for training a manipulator model, the learning device includes:
the acquisition module is used for acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
the strategy generation module is used for generating an expert strategy related to a learning cost function according to the expert demonstration data, and the learning cost function is established based on the simulation cost required by simulating the expert demonstration data and the density of the expert demonstration data;
an optimization feedback module for minimizing the learning cost function to obtain an optimal expert strategy;
and the training module is used for training the manipulator model according to the optimal expert strategy.
In the manipulator model learning device provided by the application, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In a third aspect, the present application further provides an electronic device, comprising a processor and a memory, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, perform the steps of the method as provided in the first aspect.
In a fourth aspect, the present application also provides a storage medium having a computer program stored thereon, which when executed by a processor performs the steps of the method as provided in the first aspect above.
From the above, the application provides a manipulator model learning method and device, an electronic device and a storage medium. In the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established based on the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
Drawings
Fig. 1 is a flowchart of a robot model learning method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a manipulator model learning device according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 201. an acquisition module; 202. a policy generation module; 203. an optimization feedback module; 204. a training module; 301. a processor; 302. a memory; 303. a communication bus.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
The existing manipulator reinforcement learning model converges slowly during training and has poor robustness. Supervised learning is generally performed by providing demonstration data, so that the manipulator model learns an optimal expert behavior strategy to accelerate model convergence, and the manipulator model is generally driven to perform reinforcement learning by inputting multiple groups of expert data. However, when multiple groups of expert demonstration data are input, the manipulator model is usually driven to select an expert strategy only by the minimum distance cost; such an expert strategy may complete the task poorly, and the strategy obtained from that behavior may only barely be able to execute the task, causing the manipulator to complete the task with low precision.
In a first aspect, please refer to fig. 1, fig. 1 is a robot model learning method for training a robot model in some embodiments of the present application, the learning method includes the following steps:
s1, acquiring a plurality of groups of expert demonstration data for the mechanical arm model to learn and about the same execution task;
specifically, the manipulator model learning process is a process of finding a strategy to simulate expert demonstration behaviors, and therefore, expert demonstration data representing expert demonstration behaviors, which can be supplied to the manipulator model for learning, needs to be input in advance before the manipulator model is learned, and the expert demonstration data is characteristic data captured in the expert demonstration behaviors.
More specifically, the manipulator model learning process is a learning and correcting process of independent behavior, so that each learning is performed on expert demonstration data corresponding to one execution task.
More specifically, in order to prevent the data deviation of expert demonstration data generated by a single expert demonstration behavior from affecting the learning result of the manipulator model, the learning method provided by the embodiment of the application sets multiple groups of expert demonstration data for the same execution task. That is, the expert demonstration behavior is repeatedly demonstrated for the same execution task to generate multiple groups of expert demonstration data for the manipulator model to learn from, which ensures that the manipulator has enough learning objects to realize behavior learning more accurately.
S2, generating an expert strategy related to a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required by simulating the expert demonstration data and the density of the expert demonstration data;
specifically, the expert strategy is a learning strategy for a manipulator model to simulate expert demonstration data, and in the embodiment of the application, the manipulator learns the learning strategy for the expert demonstration data, for the manipulator model, multiple learning strategies can be provided for the same execution task, and a learning strategy with the lowest learning cost as possible needs to be searched on the basis of quick search to serve as an optimal expert strategy so as to drive the manipulator model to complete learning.
More specifically, obtaining the optimal expert strategy essentially means obtaining the lowest-cost learning strategy under the induced state distribution, i.e. minimizing the cost of substituting the manipulator behavior with the expert demonstration behavior. It is therefore necessary to establish an expert strategy associated with the learning cost function, i.e. an expert strategy in which, in combination with the learning cost function, a mechanical behavior of the manipulator model is converted into an expert demonstration behavior. Since there are multiple groups of expert demonstration data, the expert strategy associated with the learning cost function can take multiple forms; step S2 therefore generates multiple groups of expert strategies to form an expert strategy set, and the value of the learning cost function must be evaluated to determine, within this set, the expert strategy finally used for manipulator model learning.
More specifically, the learning cost dominates the determination of the expert strategy, so the establishment of the learning cost function is particularly critical. The prior art generally establishes the learning cost function only from the distance between the manipulator behavior and the expert demonstration behavior; the manipulator model can then quickly complete learning for a single group of expert demonstration data, but with multiple groups of expert demonstration data it can only learn the expert demonstration behavior closest to its own behavior, and in this case it easily learns an expert demonstration behavior that deviates considerably from the task. Therefore, in the learning method of the embodiment of the application, the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data, so that the learning cost function can reflect both the learning difficulty and the quality of the behavior after learning.
More specifically, the simulation cost required for simulating expert demo data is a distance cost in a state distribution generated by the expert policy, that is, a distance cost in which the manipulator behavior changes to the corresponding expert demo behavior in that state.
More specifically, the density of the expert demonstration data is the degree to which the expert demonstration behavior corresponding to the expert strategy is densely distributed among all the expert demonstration behaviors. In the multiple groups of expert demonstration data generated by multiple groups of expert demonstration behaviors, regions where the expert demonstration data are more densely distributed represent more accurate behaviors; likewise, regions where the expert demonstration data are more sparsely distributed represent behaviors that deviate from the actions required to execute the task, i.e. less accurate behaviors. Therefore, the learning method provided by the embodiment of the application sets both the simulation cost and the density of the expert demonstration data in the learning cost function, so that the expert strategy can be searched for based on the two characteristics and the manipulator model can determine the expert strategy for learning from at least two scales.
More specifically, the learning cost function increases when the simulation cost alone increases, and it also increases when the density of the expert demonstration data alone decreases.
S3, minimizing a learning cost function to obtain an optimal expert strategy;
specifically, the process of minimizing the learning cost function is to converge the learning cost function to the minimum, and at this time, the optimal expert strategy determined according to the converged learning cost function is the expert strategy with the optimal comprehensive simulation cost and the optimal expert demonstration data density.
And S4, training the manipulator model according to the optimal expert strategy.
Specifically, the manipulator model can execute the manipulator behavior corresponding to the optimal expert strategy after training and learning according to the optimal expert strategy, namely accurately imitate and complete the expert demonstration behavior with the imitation cost as low as possible.
Specifically, the learning method of the embodiment of the application uses the expert demonstration data for supervised learning of the manipulator model and guides the manipulator model to perform reinforcement learning according to the density-based learning cost, so that the reinforcement learning cost and the supervised behavior simulation cost are minimized together.
According to the manipulator model learning method, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In some preferred embodiments, the step of acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from comprises:
S11, collecting a plurality of groups of expert demonstration data by having a human expert repeatedly operate the manipulator directly in virtual reality, or by having a human expert in a real scene repeatedly teach the moving manipulator an operation demonstration of the same execution task.
Specifically, virtual reality (VR) here is a virtual scene calibrated in advance, that is, the state space of the virtual scene is set in advance, and expert demonstration data in the state space of the virtual reality can be generated quickly by performing operation demonstrations in virtual reality. The real scene is consistent with the state space acquired by the manipulator, so the manipulator can be moved directly by a human expert to acquire expert demonstration data in that state space. Inputting the expert demonstration data as the supervised learning object enables the manipulator model to rapidly complete learning of the actions required to execute the task.
More specifically, repeated operation in virtual reality or in a real scene yields multiple groups of expert demonstration data, so that the manipulator model has enough learning samples to perform reinforcement learning.
More specifically, because the repeated operations are performed by a human expert, each operation actually differs slightly, so the groups of expert demonstration data differ slightly from one another. With enough expert demonstration data collected, the regions where the expert demonstration data are denser correspond to more accurate behavior characteristic data, i.e. to the standard behavior actions required to execute the task.
In some preferred embodiments, there are 30 or more groups of expert demonstration data, thereby ensuring that there is sufficient data to characterize the distribution of the expert demonstration data.
In some preferred embodiments, the step of generating an expert strategy associated with the learning cost function from the expert presentation data comprises:
s21, generating an expert strategy according to the expert demonstration data, wherein the expert strategy is used for guiding the manipulator model to generate simulation actions for simulating the expert demonstration data;
specifically, the simulation learning process of the manipulator model is to search a strategy to drive the manipulator behavior to be converted into an expert demonstration behavior with minimum cost under the distribution of an induction state, that is, the existing behavior strategy of the manipulator model is converted into an expert strategy, so that the expert strategy needs to be generated according to expert demonstration data to drive the manipulator model to learn.
More specifically, since the expert demonstration data are a plurality of groups, the step can generate a set of expert strategies according to the plurality of groups of expert demonstration data, so that the manipulator model can search for a proper expert strategy in the set of expert strategies for learning.
More specifically, the imitation action refers to the behavior obtained after the manipulator model, starting from its own behavior under the induced state distribution, imitates the expert demonstration data and converts toward it, that is, after the existing behavior strategy of the manipulator model has been converted into the expert strategy.
S22, obtaining the simulation cost according to the simulation action;
specifically, the simulation cost is an alternative cost required for the manipulator behavior simulation to be converted into the behavior after the expert demonstration data, and the cost is generally a distance cost, namely an action cost required to be paid for characterizing a change distance between the front behavior and the back behavior.
S23, acquiring reinforcement learning cost according to the density of the expert demonstration data corresponding to the imitation action in all the expert demonstration data;
specifically, the simulated action is a behavior generated by simulating a learning expert demonstration behavior, the merits of the simulated action can be known by analyzing the differences between the simulated action and the expert demonstration behavior generated after simulation, and because a plurality of groups of expert demonstration behaviors are input, the merits of the simulated action can be known according to the positions of the simulated action in the expert demonstration behavior distribution, namely whether the simulated action can well complete an execution task or not is known, so that the simulation action is set at a reinforced learning cost at the positions of the expert demonstration behavior distribution (namely the intensive degree in all expert demonstration data) to serve as a reward condition to drive the manipulator model to be reinforced and learned.
More specifically, the reinforcement learning cost is an uncertainty cost, which is equivalent to a reward and punishment cost, and when the simulation action is at a position where all expert demonstration data are densely distributed, a smaller cost value is given, whereas, when all expert demonstration data are sparsely distributed, a larger cost value is given.
More specifically, the density of the expert demonstration data is inversely related to the variance of the expert strategy, so that the embodiment of the application can set the reinforcement learning cost based on the strategy variance, when the expert strategy variance is larger, the reinforcement learning cost is larger, and conversely, when the expert strategy variance is smaller, the reinforcement learning cost is smaller.
And S24, establishing a learning cost function according to the simulation cost and the reinforcement learning cost.
Specifically, the learning cost function established in this step by combining the two costs can comprehensively represent both the action distance required for the manipulator model's imitation learning and the quality of the imitated action after learning.
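For illustration only, the following Python sketch shows one way the two costs of steps S22 to S24 could be combined numerically. The Gaussian kernel-density estimate used here to score how densely an imitated action falls within the demonstration data is an assumption chosen for simplicity (Example 1 below instead derives the density term from the variance of an expert-policy ensemble), and all function and variable names are hypothetical rather than part of the patent.

```python
import numpy as np

def imitation_cost(imitated_actions, expert_actions):
    # Simulation (imitation) cost: mean distance between each imitation action
    # and the expert demonstration action it tries to reproduce.
    return float(np.mean(np.linalg.norm(imitated_actions - expert_actions, axis=-1)))

def density_cost(imitated_actions, all_expert_actions, bandwidth=0.1):
    # Reward-and-punishment term: a Gaussian kernel-density estimate of each
    # imitation action under ALL expert demonstrations.  Dense regions give a
    # small cost, sparse regions give a large cost.
    diffs = imitated_actions[:, None, :] - all_expert_actions[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    density = np.mean(np.exp(-sq_dists / (2.0 * bandwidth ** 2)), axis=1)
    return float(np.mean(-np.log(density + 1e-8)))

def learning_cost(imitated_actions, expert_actions, all_expert_actions, weight=1.0):
    # Learning cost function of step S24: simulation cost plus the
    # reinforcement learning cost derived from demonstration-data density.
    return (imitation_cost(imitated_actions, expert_actions)
            + weight * density_cost(imitated_actions, all_expert_actions))
```

In this sketch a larger `weight` emphasizes the reward-and-punishment term, so an imitation action that lands in a sparse region of the demonstration data is penalized even when its raw imitation distance is small.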
In some preferred embodiments, the step of minimizing the learning cost function to obtain the optimal expert strategy comprises:
s31, training expert strategies according to multiple groups of expert demonstration data, and enabling simulation cost in the learning cost function to be minimized and converged;
and S32, minimizing the reinforcement learning cost to obtain the optimal expert strategy.
Specifically, during the reinforcement learning process, in order to simplify the whole learning process and improve the efficiency of obtaining the optimal expert strategy, the embodiment of the application preferably first trains the expert strategy so that the simulation cost converges to a minimum and then performs reinforcement learning updates until the reinforcement learning cost is minimized; updating the behavior step by step in this way reduces the number of variables updated at once and improves the efficiency of obtaining the optimal expert strategy.
More specifically, the reinforcement learning cost is minimized through reinforcement learning gradient updates of the expert strategy until the reinforcement learning cost of this stage is minimized.
In some preferred embodiments, training the expert strategy based on a plurality of sets of expert presentation data, the step of converging the simulated cost minimization in the learning cost function comprises:
s311, extracting part of expert demonstration data to train an expert strategy, and enabling simulation cost in a learning cost function to be converged;
and S312, updating the learning cost function according to the expert demonstration data gradient, and minimizing and converging the simulation cost after convergence.
Specifically, in order to improve the convergence efficiency of the simulation cost in the learning cost function, preliminary training is first performed by sampling, and the learning cost function is then updated with gradients computed on the expert demonstration data so that the simulation cost is minimized and converges, which improves the reinforcement learning efficiency of the manipulator model.
In some preferred embodiments, the learning methods further comprise the step performed before the step of obtaining a plurality of sets of expert demonstration data for the manipulator model to learn about the same task to be performed:
and S0, establishing a state space for manipulator model training according to the image data characteristics of the operation scene and the onboard sensor data.
Specifically, a manipulator in the prior art generally needs external equipment (equipment that is fixed relative to the spatial environment and has no direct connection to the manipulator) to calibrate the state space and thereby determine the position and state of the manipulator in space. The learning method of the embodiment of the application does not depend on calibration by external equipment: the manipulator's own sensing is directly combined with image data features acquired in advance through on-board sensors on the manipulator body (a vision sensor, touch sensors, joint encoders and the like) to establish a state space that replaces the state input of the reinforcement learning training task. This achieves the same sample efficiency and performance as state-based learning, omits the process of calibrating the environment with external equipment, and reduces the dependence on environmental equipment so as to lower the operating cost.
More specifically, the on-board sensor data is based on motion information acquired by the robot joint encoder.
In some preferred embodiments, the image data features are extracted according to image information of a job scene based on a preset feature extractor, and the preset feature extractor is generated based on life scene images and/or Imagenet database training.
Specifically, this step trains and acquires the feature extractor through an image database, where the image database is generated by labeling life-scene images and behaviors or is obtained from the ImageNet database, so that the feature extractor has enough training data to extract features smoothly for different scenes and behaviors and can therefore smoothly complete the extraction of image data features of the scene where the manipulator is located.
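As a concrete but non-authoritative illustration of this step, the sketch below obtains a preset feature extractor by taking a torchvision network pre-trained on the ImageNet database and discarding its classification head; the choice of ResNet-18, and the assumption that a ready-made pre-trained backbone stands in for training the extractor from a labeled life-scene image database, are both illustrative.

```python
import torch
import torchvision.models as models

def build_feature_extractor():
    # Backbone pre-trained on the ImageNet database; a network trained on
    # labeled life-scene images could be substituted in the same way.
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 512-d features
    backbone.eval()                     # the extractor is only used for inference here
    return backbone

feature_extractor = build_feature_extractor()
image = torch.rand(1, 3, 224, 224)      # stand-in for a work-scene image
with torch.no_grad():
    z = feature_extractor(image)        # image data features, shape (1, 512)
```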
In a second aspect, please refer to fig. 2, fig. 2 is a robot model learning apparatus for training a robot model provided in some embodiments of the present application, the learning apparatus includes:
an obtaining module 201, configured to obtain multiple sets of expert demonstration data for the manipulator model to learn and about the same execution task;
a strategy generation module 202, configured to generate an expert strategy associated with a learning cost function according to expert demonstration data, where the learning cost function is established based on an emulation cost required for emulating the expert demonstration data and an intensity of the expert demonstration data;
an optimization feedback module 203 for minimizing a learning cost function to obtain an optimal expert strategy;
and the training module 204 is used for training the manipulator model according to the optimal expert strategy.
In the manipulator model learning device provided by the embodiment of the application, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In some preferred embodiments, the manipulator model learning apparatus according to the embodiment of the present application is configured to perform the manipulator model learning method according to the first aspect.
In some preferred embodiments, the learning apparatus further comprises:
and the state space establishing module is used for establishing a state space for training the manipulator model according to the image data characteristics of the operation scene and the data of the airborne sensor.
In a third aspect, referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the present application provides an electronic device, including: the processor 301 and the memory 302, the processor 301 and the memory 302 being interconnected and communicating with each other via a communication bus 303 and/or other form of connection mechanism (not shown), the memory 302 storing a computer program executable by the processor 301, the processor 301 executing the computer program when the computing device is running to perform the method of any of the alternative implementations of the embodiments described above.
In a fourth aspect, the present application provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the method in any optional implementation manner of the foregoing embodiments. The storage medium may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
Example 1
In order to more clearly describe the learning process of the manipulator model in the manipulator model learning method provided by the embodiment of the present application, a more detailed embodiment of the manipulator model learning method is described, where the learning method includes the following steps:
1. Acquiring original image data of various real-life production scenes, labeling the targets and behaviors in the original images to establish an image database, or directly using the world's largest image recognition database, ImageNet, as the image database, and pre-training on the image database to obtain a feature extractor;
2. Acquiring image information of the current operation scene through the vision sensor on the manipulator body, and learning the image data features of the manipulator operation, denoted $z$, from the image information with the pre-trained feature extractor.

The learned image data features $z$ of the manipulator operation and the on-board sensor data $q$ obtained from the other on-board sensors of the manipulator are combined into $[z, q]$, which is taken as the state space $s$ and input into the reinforcement learning strategy-learning process of the manipulator model.

Here the on-board sensor data $q$ are determined from the motion information of the manipulator obtained by its joint encoders, and the image data features $z$ are obtained by mapping the image information $I$ through the feature extractor, i.e.

$$z = \phi(I)$$

where $I \in \mathbb{R}^{H \times W \times 3}$ is a high-dimensional image (a three-dimensional array) and $\phi$ is the feature extractor.
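Assuming a feature extractor like the one sketched earlier, the state space $s = [z, q]$ of this step can be assembled by concatenating the extracted image features with the joint-encoder readings; the 7-joint dimensionality, the use of positions plus velocities for $q$, and all names below are illustrative assumptions rather than requirements of the patent.

```python
import torch

def build_state(feature_extractor, image, joint_positions, joint_velocities):
    # z: image data features extracted from the work-scene image I by phi.
    with torch.no_grad():
        z = feature_extractor(image.unsqueeze(0)).squeeze(0)
    # q: on-board sensor data derived from the manipulator's joint encoders.
    q = torch.cat([joint_positions, joint_velocities])
    # State space s = [z, q] fed into the reinforcement-learning training task.
    return torch.cat([z, q])

# Example with an assumed 7-joint manipulator:
# s = build_state(feature_extractor, torch.rand(3, 224, 224),
#                 torch.zeros(7), torch.zeros(7))
```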
3. Before the manipulator reinforcement learning training, demonstration data are collected from operation demonstrations in which a human expert operates the manipulator in virtual reality (VR), or from operation demonstrations in which a human expert teaches the moving manipulator in a real scene, thereby forming the expert demonstration data; the demonstration is repeated 30 times to obtain 30 groups of expert demonstration data as the data basis for strategy search.
4. Training an expert strategy set on the multiple groups of expert demonstration data; taking the prediction deviation of the strategies, which reflects the density of the expert demonstration data, as a cost; establishing the learning cost function by combining this cost with the simulation cost required for simulating the expert demonstration behavior into a total learning cost; and carrying out optimization feedback of the expert strategies based on the learning cost function, which finally pushes the expert strategies toward the region where the expert demonstration data are concentrated and ensures that the expert behavior is imitated within the distribution of the expert demonstration data.
This step mainly adds, on top of the standard behavior demonstration cost, an additional cost based on how densely the expert demonstration data are distributed. The additional cost characterizes the action differences produced, given the demonstration data, by different strategies sampled from the posterior, so that when learning a strategy the manipulator executes operations highly similar to the expert's within the distribution of the expert demonstration data; if the expert strategy deviates from the distribution of the expert demonstration data, the additional cost urges the strategy back toward the data distribution of the expert demonstrations.
The specific flow of the step is as follows:
An expert strategy set is trained on the multiple groups of expert demonstration data. Denote the expert strategy by $\pi^*$, and search among the manipulator strategies for a strategy $\pi$. The distance between the actions taken by the strategy $\pi$ and by the expert strategy $\pi^*$, written $C_{\mathrm{BC}}(\pi)$, is:

$$C_{\mathrm{BC}}(\pi) = \mathbb{E}_{s \sim d_{\pi}}\left[\, \mathrm{TV}\!\left(\pi(\cdot \mid s),\ \pi^{*}(\cdot \mid s)\right) \right]$$

wherein $\mathbb{E}$ denotes expectation, $s \sim d_{\pi}$ indicates that the state $s$ satisfies the state distribution $d_{\pi}$ induced by the policy $\pi$, $\pi(\cdot \mid s)$ denotes executing the policy $\pi$ in the state $s$, $\pi^{*}(\cdot \mid s)$ denotes executing the expert policy $\pi^{*}$ in the state $s$, and $\mathrm{TV}(\cdot,\cdot)$ denotes the total variation distance between the two strategies.
Outside the expert demonstration distribution the data are sparse and the strategy variance is large, whereas within the expert demonstration distribution the data are dense and the strategy variance is low. An uncertainty cost, namely the additional cost, is therefore defined and driven to a minimum through the reinforcement learning reward-and-punishment mechanism so as to encourage the expert strategy to return to the expert demonstration distribution (namely the data-dense region). The uncertainty cost $C_{U}(s, a)$ is defined as

$$C_{U}(s, a) = \mathrm{Var}_{\pi \sim P(\pi \mid D)}\left[\, \pi(a \mid s) \,\right]$$

wherein $\pi \sim P(\pi \mid D)$ indicates that the policy $\pi$ satisfies the posterior probability distribution $P(\pi \mid D)$ given the demonstration data $D$, $\mathrm{Var}$ is the variance of a random variable, $s$ is a state and $a$ is an action.
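A minimal numerical sketch of this uncertainty cost, assuming the posterior $P(\pi \mid D)$ is approximated by an ensemble of expert policies trained on different bootstrap samples of the demonstration data (as described later in this example); the diagonal-Gaussian policy form and all names are assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(action, mean, std):
    # Probability density of the action under a diagonal-Gaussian policy pi_e(a | s).
    return float(np.prod(np.exp(-0.5 * ((action - mean) / std) ** 2)
                         / (std * np.sqrt(2.0 * np.pi))))

def uncertainty_cost(state, action, ensemble):
    # C_U(s, a) = Var_{pi ~ P(pi | D)}[ pi(a | s) ], with the posterior approximated
    # by an ensemble of expert policies; each ensemble member is a pair of callables
    # (mean_fn, std_fn) returning the Gaussian parameters for a given state.
    probabilities = [gaussian_pdf(action, mean_fn(state), std_fn(state))
                     for mean_fn, std_fn in ensemble]
    return float(np.var(probabilities))
```

Ensemble members agree where the demonstrations are dense, giving a low variance and a small cost, and disagree where they are sparse, giving a large cost.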
The total learning cost function $J(\pi)$ is set as follows:

$$J(\pi) = \mathbb{E}_{s \sim d_{\pi^{*}}}\left[\, \mathrm{TV}\!\left(\pi(\cdot \mid s),\ \pi^{*}(\cdot \mid s)\right) \right] + \mathbb{E}_{s \sim d_{\pi},\ a \sim \pi(\cdot \mid s)}\left[\, C_{U}(s, a) \,\right]$$

wherein $a \sim \pi(\cdot \mid s)$ indicates that the action $a$ satisfies the strategy $\pi$ conditioned on the state $s$, $\mathbb{E}$ denotes expectation, and $s \sim d_{\pi^{*}}$ indicates that the state $s$ satisfies the state distribution $d_{\pi^{*}}$ induced by the expert policy $\pi^{*}$.

The first half of the learning cost function is the standard behavior imitation cost, which is computed over the state distribution generated by the expert strategy; the latter half is the reinforcement learning cost, which is computed over the state distribution generated by the current strategy and optimized using policy gradients.
The optimization of the reinforcement learning cost can be performed with any policy gradient method, for example the advantage actor-critic (A2C) algorithm.
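To make this concrete, the fragment below sketches a single REINFORCE-style policy-gradient step in which the truncated uncertainty cost acts as a negative reward; it is a simplification of the advantage actor-critic update mentioned above (no critic or advantage baseline), and the policy interface and all names are assumptions.

```python
import torch

def reinforcement_cost_step(policy, optimizer, states, actions, uncertainty_costs, clip=1.0):
    # states/actions: a batch collected by rolling out the current strategy pi;
    # uncertainty_costs: C_U(s, a) evaluated at those state-action pairs.
    dist = policy(states)                               # a torch.distributions object (diagonal Gaussian)
    log_probs = dist.log_prob(actions).sum(-1)          # log pi(a | s) per sample
    reward = -torch.clamp(uncertainty_costs, max=clip)  # truncated additional cost as negative reward
    loss = -(log_probs * reward.detach()).mean()        # REINFORCE-style surrogate objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```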
The reinforcement learning cost pushes the manipulator strategy toward the expert distribution, while the supervised standard behavior imitation cost ensures that the manipulator strategy imitates the expert behavior within the expert distribution. Expert models are obtained by training on different bootstrap samples of the demonstration data, and the ensemble of expert models is used to approximate the posterior probability distribution $P(\pi \mid D)$. Since the demonstration data are fixed, the method finally alternates between supervised behavior demonstration updates and reinforcement learning strategy gradient updates so as to minimize the posterior variance.
The optimization process of reinforcement learning cost can be carried out in the following way:
Let $D$ be the demonstration sample, and input the expert demonstration data $D = \{(s_i, a_i)\}_{i=1}^{N}$, with $N \geq 2$, where $i$ indexes the corresponding expert demonstration data in the demonstration sample. Initialize the strategy $\pi$ and the expert policy set $\{\pi_e\}_{e=1}^{E}$. Sample $D_e$ from $D$, with $D_e \sim D$ and $|D_e| = |D|$; after $E$ rounds of expert demonstration training, train the expert strategy $\pi_e$ on the sample $D_e$ so that its behavior imitation cost on $D_e$ is minimized and converges; then perform gradient updates on small batches drawn from the demonstration data $D$ to minimize the imitation cost on $D$. Finally, perform gradient updates of the reinforcement learning strategy until the truncated additional cost is minimized, thereby completing the optimization of the reinforcement learning cost, that is, obtaining the optimal expert strategy corresponding to the minimum of the learning cost function.
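Putting the pieces of this flow together, the following self-contained toy (discrete states and actions, tabular policies, all sizes and names assumed) sketches the order of operations only: bootstrap-sample the demonstrations, fit an ensemble of expert policies, converge the imitation term on mini-batches of D, then apply the truncated-uncertainty policy-gradient update. It is not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, N_ENSEMBLE = 5, 3, 4

# Expert demonstration data D: (state, action) pairs from a fictitious expert
# that mostly picks action (state % N_ACTIONS), with a little noise.
D = [(int(s), int((s + (rng.random() < 0.1)) % N_ACTIONS))
     for s in rng.integers(0, N_STATES, 60)]

def fit_tabular_policy(data):
    # Behavior cloning for a tabular policy: smoothed action counts per state.
    counts = np.ones((N_STATES, N_ACTIONS))
    for s, a in data:
        counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# 1) Expert policy ensemble {pi_e}: each member fit on a bootstrap sample D_e ~ D, |D_e| = |D|.
ensemble = [fit_tabular_policy([D[i] for i in rng.integers(0, len(D), len(D))])
            for _ in range(N_ENSEMBLE)]

# 2) Uncertainty (reinforcement learning) cost C_U(s, a): variance across the ensemble.
C_U = np.var(np.stack(ensemble), axis=0)

# 3) Learner policy as a softmax table; imitation term minimized on mini-batches of D.
logits = np.zeros((N_STATES, N_ACTIONS))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.5
for _ in range(200):
    batch = [D[i] for i in rng.integers(0, len(D), 16)]
    pi = softmax(logits)
    for s, a in batch:                      # cross-entropy gradient of the imitation term
        grad = pi[s].copy()
        grad[a] -= 1.0
        logits[s] -= lr * grad / len(batch)

# 4) Policy-gradient updates driving down the truncated uncertainty cost
#    (in the real method the states would come from rolling out the manipulator policy).
for _ in range(200):
    pi = softmax(logits)
    s = rng.integers(0, N_STATES)
    a = rng.choice(N_ACTIONS, p=pi[s])
    reward = -min(C_U[s, a], 1.0)           # truncated additional cost as negative reward
    grad = -pi[s]                           # d log pi(a|s) / d logits[s] = onehot(a) - pi[s]
    grad[a] += 1.0
    logits[s] += lr * reward * grad

print("final learner policy:\n", np.round(softmax(logits), 2))
```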
In summary, the embodiment of the application provides a manipulator model learning method and device, an electronic device and a storage medium. In the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established based on the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple groups of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A manipulator model learning method is used for training a manipulator model, and is characterized by comprising the following steps:
acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
generating an expert strategy associated with a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data;
minimizing the learning cost function to obtain an optimal expert strategy;
and training the manipulator model according to the optimal expert strategy.
2. The robot model learning method according to claim 1, wherein the step of acquiring a plurality of sets of expert demonstration data on the same execution task for the robot model learning comprises:
the robot is directly operated by a human expert repeatedly in virtual reality or a task demonstration about the same execution task is performed by a human expert teaching mobile robot repeatedly in a real scene to collect a plurality of sets of expert demonstration data.
3. The manipulator model learning method according to claim 1, wherein the step of generating an expert strategy associated with a learning cost function from the expert demonstration data comprises:
generating the expert strategy according to the expert demonstration data, wherein the expert strategy is used for guiding a manipulator model to generate imitation actions for imitating the expert demonstration data;
obtaining the simulation cost according to the imitation actions;
acquiring the reinforcement learning cost according to the density of the expert demonstration data corresponding to the imitation action among all the expert demonstration data;
and establishing a learning cost function according to the simulation cost and the reinforcement learning cost.
4. A manipulator model learning method according to claim 3, wherein the step of minimizing the learning cost function to obtain an optimal expert strategy comprises:
training the expert strategy according to a plurality of groups of expert demonstration data to minimize and converge simulation cost in the learning cost function;
minimizing the reinforcement learning cost to obtain an optimal expert strategy.
5. The manipulator model learning method according to claim 4, wherein the training of the expert strategy according to the plurality of sets of expert demonstration data to converge the minimization of the simulation cost in the learning cost function comprises:
extracting part of the expert demonstration data to train the expert strategy, so that the simulation cost in the learning cost function is converged;
and updating the learning cost function according to the expert demonstration data gradient to minimize and converge the simulation cost after convergence.
6. A manipulator model learning method according to claim 1, further comprising a step performed before the step of acquiring a plurality of sets of expert demonstration data on the same task to be performed for the manipulator model learning:
and establishing a state space for training the manipulator model according to the image data characteristics of the operation scene and the data of the airborne sensor.
7. The manipulator model learning method according to claim 6, wherein the image data features are extracted from image information of a job scene based on a preset feature extractor, and the preset feature extractor is generated based on a life scene image and/or Imagenet database training.
8. A manipulator model learning device for training a manipulator model, the learning device comprising:
the acquisition module is used for acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
the strategy generation module is used for generating an expert strategy related to a learning cost function according to the expert demonstration data, and the learning cost function is established based on the simulation cost required by simulating the expert demonstration data and the density of the expert demonstration data;
an optimization feedback module for minimizing the learning cost function to obtain an optimal expert strategy;
and the training module is used for training the manipulator model according to the optimal expert strategy.
9. An electronic device comprising a processor and a memory, said memory storing computer readable instructions which, when executed by said processor, perform the steps of the method according to any one of claims 1 to 7.
10. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method according to any one of claims 1-7.
CN202210257626.5A 2022-03-16 2022-03-16 Manipulator model learning method and device, electronic equipment and storage medium Active CN114347043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210257626.5A CN114347043B (en) 2022-03-16 2022-03-16 Manipulator model learning method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114347043A true CN114347043A (en) 2022-04-15
CN114347043B CN114347043B (en) 2022-06-03

Family

ID=81094796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210257626.5A Active CN114347043B (en) 2022-03-16 2022-03-16 Manipulator model learning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114347043B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190091859A1 (en) * 2017-09-26 2019-03-28 Siemens Aktiengesellschaft Method and system for automatic robot control policy generation via cad-based deep inverse reinforcement learning
CN110473162A (en) * 2018-05-11 2019-11-19 精工爱普生株式会社 The generation method of machine learning device, photography time estimation device and learning model
CN110784618A (en) * 2018-07-25 2020-02-11 精工爱普生株式会社 Scanning system, storage medium, and machine learning apparatus
CN112447065A (en) * 2019-08-16 2021-03-05 北京地平线机器人技术研发有限公司 Trajectory planning method and device
CN110745136A (en) * 2019-09-20 2020-02-04 中国科学技术大学 Driving self-adaptive control method
CN112540671A (en) * 2019-09-20 2021-03-23 辉达公司 Remote operation of a vision-based smart robotic system
CN113043275A (en) * 2021-03-29 2021-06-29 南京工业职业技术大学 Micro-part assembling method based on expert demonstration and reinforcement learning
CN113887845A (en) * 2021-12-07 2022-01-04 中国南方电网有限责任公司超高压输电公司广州局 Extreme event prediction method, apparatus, device, medium, and computer program product
CN113971746A (en) * 2021-12-24 2022-01-25 季华实验室 Garbage classification method and device based on single hand teaching and intelligent sorting system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YANQIN MA ET AL.: "An Efficient Robot Precision Assembly Skill Learning Framework Based on Several Demonstrations", 《 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 》 *
YANQIN MA ET AL.: "Efficient Insertion Control for Precision Assembly Based on Demonstration Learning and Reinforcement Learning", 《IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS》 *
张祺琛: "Research and Application of Meta-Reinforcement Learning", 《中国优秀硕士学位论文全文电子期刊网》 *
李帅龙: "A Survey of Imitation Learning Methods and Their Applications in Robotics", 《计算机工程与应用》 *
李超: "Research on Key Technologies of a Robot Compliant Assembly Control System for Energetic Components", 《中国硕士学位论文全文电子期刊网 信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277264A (en) * 2022-09-28 2022-11-01 季华实验室 Subtitle generating method based on federal learning, electronic equipment and storage medium
CN115277264B (en) * 2022-09-28 2023-03-24 季华实验室 Subtitle generating method based on federal learning, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114347043B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN110781765B (en) Human body posture recognition method, device, equipment and storage medium
WO2022100363A1 (en) Robot control method, apparatus and device, and storage medium and program product
Daniel et al. Active reward learning with a novel acquisition function
CN110472594A (en) Method for tracking target, information insertion method and equipment
Judah et al. Active lmitation learning: formal and practical reductions to IID learning.
CN113408621B (en) Rapid simulation learning method, system and equipment for robot skill learning
CN109940614B (en) Mechanical arm multi-scene rapid motion planning method integrating memory mechanism
CN114347043B (en) Manipulator model learning method and device, electronic equipment and storage medium
CN106529838A (en) Virtual assembling method and device
CN109925718A (en) A kind of system and method for distributing the micro- end map of game
CN114139637A (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
Toma et al. Pathbench: A benchmarking platform for classical and learned path planning algorithms
CN108255059B (en) Robot control method based on simulator training
CN115761905A (en) Diver action identification method based on skeleton joint points
JPWO2020003670A1 (en) Information processing device and information processing method
CN102737279A (en) Information processing device, information processing method, and program
Nazarczuk et al. V2A-Vision to Action: Learning robotic arm actions based on vision and language
CN116276973A (en) Visual perception grabbing training method based on deep learning
CN116362265A (en) Text translation method, device, equipment and storage medium
Ren et al. InsActor: Instruction-driven Physics-based Characters
CN113476833A (en) Game action recognition method and device, electronic equipment and storage medium
CN113284257A (en) Modularized generation and display method and system for virtual scene content
CN112785721A (en) LeapMotion gesture recognition-based VR electrical and electronic experiment system design method
CN113761355A (en) Information recommendation method, device, equipment and computer readable storage medium
Babadi et al. Learning Task-Agnostic Action Spaces for Movement Optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant