CN114347043A - Manipulator model learning method and device, electronic equipment and storage medium - Google Patents

Manipulator model learning method and device, electronic equipment and storage medium

Info

Publication number
CN114347043A
CN114347043A
Authority
CN
China
Prior art keywords
expert
learning
strategy
demonstration data
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210257626.5A
Other languages
Chinese (zh)
Other versions
CN114347043B (en)
Inventor
焦家辉
张晟东
王济宇
李志建
蔡维嘉
李腾
张立华
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202210257626.5A priority Critical patent/CN114347043B/en
Publication of CN114347043A publication Critical patent/CN114347043A/en
Application granted granted Critical
Publication of CN114347043B publication Critical patent/CN114347043B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Manipulator (AREA)

Abstract

The invention relates to the technical field of intelligent manipulators and in particular discloses a manipulator model learning method and device, an electronic device and a storage medium. The learning method comprises the following steps: acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from; generating an expert strategy associated with a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data; minimizing the learning cost function to obtain an optimal expert strategy; and training the manipulator model according to the optimal expert strategy. The optimal expert strategy finally obtained by the method pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.

Description

Manipulator model learning method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of intelligent manipulators, in particular to a manipulator model learning method and device, electronic equipment and a storage medium.
Background
In production applications, a manipulator can enhance the universality of its automated interaction and efficiently complete complex tasks through reinforcement learning. Existing reinforcement learning models can generally accelerate model convergence by learning an optimal expert behavior strategy in combination with demonstration data, but the final reinforcement learning model is prone to failing to accurately imitate and complete the expert demonstration behavior, either because the expert behavior strategy drifts or because learning is driven only by the minimum simulation cost.
In view of the above problems, no effective technical solution exists at present.
Disclosure of Invention
The application aims to provide a manipulator model learning method and device, an electronic device and a storage medium, so that a manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In a first aspect, the present application provides a manipulator model learning method for training a manipulator model, the learning method including the following steps:
acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
generating an expert strategy associated with a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data;
minimizing the learning cost function to obtain an optimal expert strategy;
and training the manipulator model according to the optimal expert strategy.
According to the manipulator model learning method, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
The robot model learning method described above, wherein the step of acquiring a plurality of sets of expert demonstration data for the robot model to learn about the same execution task includes:
the robot is directly operated by a human expert repeatedly in virtual reality or a task demonstration about the same execution task is performed by a human expert teaching mobile robot repeatedly in a real scene to collect a plurality of sets of expert demonstration data.
In the learning method of this example, inputting expert demonstration data as a supervised learning object enables the manipulator model to quickly complete learning of the actions required to perform the task.
The manipulator model learning method according to the present invention, wherein the step of generating an expert strategy associated with a learning cost function from the expert demonstration data includes:
generating the expert strategy according to the expert demonstration data, wherein the expert strategy is used for guiding a manipulator model to generate imitation actions for imitating the expert demonstration data;
obtaining the simulation cost according to the imitation actions;
acquiring the reinforcement learning cost according to the density of the expert demonstration data corresponding to the imitation action among all the expert demonstration data;
and establishing a learning cost function according to the simulation cost and the reinforcement learning cost.
In the learning method of this example, since multiple groups of expert demonstration behaviors are input, the merit of an imitation action can be judged from its position within the distribution of the expert demonstration behaviors, that is, from whether the imitation action can complete the execution task well; the reinforcement learning cost is therefore set according to the position of the imitation action in the distribution of the expert demonstration behaviors and used as a reward-and-punishment condition to drive reinforcement learning of the manipulator model.
The manipulator model learning method, wherein the step of minimizing the learning cost function to obtain an optimal expert strategy comprises:
training the expert strategy according to a plurality of groups of expert demonstration data to minimize and converge simulation cost in the learning cost function;
minimizing the reinforcement learning cost to obtain an optimal expert strategy.
The learning method of this example preferably first trains the expert strategy so that the simulation cost converges to a minimum, and then performs reinforcement learning updates until the reinforcement learning cost is minimized; updating the behavior step by step in this way reduces the number of variables updated at once and improves the efficiency of obtaining the optimal expert strategy.
The method for learning a manipulator model, wherein the training of the expert strategy according to the plurality of sets of expert demonstration data to minimize and converge the simulation cost in the learning cost function comprises:
extracting part of the expert demonstration data to train the expert strategy, so that the simulation cost in the learning cost function is converged;
and updating the learning cost function with gradients computed on the expert demonstration data, so that the already converged simulation cost is minimized.
The robot model learning method further includes, before the step of acquiring a plurality of sets of expert demonstration data about the same execution task for the robot model learning, the steps of:
and establishing a state space for training the manipulator model according to the image data characteristics of the operation scene and the data of the airborne sensor.
According to this learning method, the manipulator's own sensing obtained through on-board sensors on the manipulator body is directly combined with image data features acquired in advance to establish a state space that replaces the state input of the reinforcement learning training task. This achieves the same sample efficiency and performance as state-based learning while omitting the process of calibrating the environment with external equipment, thereby reducing the dependence on environmental equipment and lowering the operating cost.
The manipulator model learning method comprises the steps that image data features are extracted according to image information of a working scene based on a preset feature extractor, and the preset feature extractor is generated based on life scene images and/or Imagenet database training.
In a second aspect, the present application further provides a manipulator model learning device for training a manipulator model, the learning device includes:
the acquisition module is used for acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
the strategy generation module is used for generating an expert strategy related to a learning cost function according to the expert demonstration data, and the learning cost function is established based on the simulation cost required by simulating the expert demonstration data and the density of the expert demonstration data;
an optimization feedback module for minimizing the learning cost function to obtain an optimal expert strategy;
and the training module is used for training the manipulator model according to the optimal expert strategy.
In the manipulator model learning device provided by the application, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In a third aspect, the present application further provides an electronic device, comprising a processor and a memory, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, perform the steps of the method as provided in the first aspect.
In a fourth aspect, the present application also provides a storage medium having a computer program stored thereon, which when executed by a processor performs the steps of the method as provided in the first aspect above.
From the above, the application provides a manipulator model learning method and device, an electronic device and a storage medium. In the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established based on the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
Drawings
Fig. 1 is a flowchart of a robot model learning method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a manipulator model learning device according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 201. an acquisition module; 202. a policy generation module; 203. an optimization feedback module; 204. a training module; 301. a processor; 302. a memory; 303. a communication bus.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
The existing manipulator reinforcement learning model converges slowly during training and has poor robustness. Supervised learning is generally performed by providing demonstration data, so that the manipulator model learns an optimal expert behavior strategy to accelerate model convergence, and the manipulator model is generally driven to perform reinforcement learning by inputting multiple groups of expert data. However, when multiple groups of expert demonstration data are input, the manipulator model is usually driven to select an expert strategy only by the minimum distance cost; such an expert strategy may complete the task poorly, and the strategy obtained from that behavior may only barely be able to execute the task, causing the manipulator to complete the task with low precision.
In a first aspect, please refer to fig. 1, fig. 1 is a robot model learning method for training a robot model in some embodiments of the present application, the learning method includes the following steps:
s1, acquiring a plurality of groups of expert demonstration data for the mechanical arm model to learn and about the same execution task;
specifically, the manipulator model learning process is a process of finding a strategy to simulate expert demonstration behaviors, and therefore, expert demonstration data representing expert demonstration behaviors, which can be supplied to the manipulator model for learning, needs to be input in advance before the manipulator model is learned, and the expert demonstration data is characteristic data captured in the expert demonstration behaviors.
More specifically, the manipulator model learning process is a learning and correcting process of independent behavior, so that each learning is performed on expert demonstration data corresponding to one execution task.
More specifically, in order to prevent the data deviation of expert demonstration data generated by a single expert demonstration behavior from affecting the learning result of the manipulator model, the learning method provided by the embodiment of the application sets multiple groups of expert demonstration data for the same execution task. That is, the expert demonstration behavior is repeatedly demonstrated for the same execution task to generate multiple groups of expert demonstration data for the manipulator model to learn from, which ensures that the manipulator has enough learning objects to realize behavior learning more accurately.
S2, generating an expert strategy related to a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required by simulating the expert demonstration data and the density of the expert demonstration data;
specifically, the expert strategy is a learning strategy for a manipulator model to simulate expert demonstration data, and in the embodiment of the application, the manipulator learns the learning strategy for the expert demonstration data, for the manipulator model, multiple learning strategies can be provided for the same execution task, and a learning strategy with the lowest learning cost as possible needs to be searched on the basis of quick search to serve as an optimal expert strategy so as to drive the manipulator model to complete learning.
More specifically, obtaining the optimal expert strategy essentially means obtaining the lowest-cost learning strategy under the induced state distribution, i.e. minimizing the cost of substituting the manipulator behavior with the expert demonstration behavior. It is therefore necessary to establish an expert strategy associated with the learning cost function, i.e. an expert strategy in which, in combination with the learning cost function, a mechanical behavior of the manipulator model is converted into an expert demonstration behavior. Since there are multiple groups of expert demonstration data, the expert strategy associated with the learning cost function can take multiple forms; step S2 therefore generates multiple groups of expert strategies to form an expert strategy set, and the value of the learning cost function must be evaluated to determine, within this set, the expert strategy finally used for manipulator model learning.
More specifically, the learning cost dominates the determination of the expert strategy, so the establishment of the learning cost function is particularly critical. The prior art generally establishes the learning cost function only from the distance between the manipulator behavior and the expert demonstration behavior; the manipulator model can then quickly complete learning for a single group of expert demonstration data, but with multiple groups of expert demonstration data it can only learn the expert demonstration behavior closest to its own behavior, and in this case it easily learns an expert demonstration behavior that deviates considerably from the task. Therefore, in the learning method of the embodiment of the application, the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data, so that the learning cost function can reflect both the learning difficulty and the quality of the behavior after learning.
More specifically, the simulation cost required for simulating expert demo data is a distance cost in a state distribution generated by the expert policy, that is, a distance cost in which the manipulator behavior changes to the corresponding expert demo behavior in that state.
More specifically, the density of the expert demonstration data is the degree to which the expert demonstration behavior corresponding to the expert strategy is densely distributed among all the expert demonstration behaviors. In the multiple groups of expert demonstration data generated by multiple groups of expert demonstration behaviors, regions where the expert demonstration data are more densely distributed represent more accurate behaviors; likewise, regions where the expert demonstration data are more sparsely distributed represent behaviors that deviate from the actions required to execute the task, i.e. less accurate behaviors. Therefore, the learning method provided by the embodiment of the application sets both the simulation cost and the density of the expert demonstration data in the learning cost function, so that the expert strategy can be searched for based on the two characteristics and the manipulator model can determine the expert strategy for learning from at least two scales.
More specifically, the learning cost function increases when the simulation cost alone increases, and it also increases when the density of the expert demonstration data alone decreases.
S3, minimizing a learning cost function to obtain an optimal expert strategy;
specifically, the process of minimizing the learning cost function is to converge the learning cost function to the minimum, and at this time, the optimal expert strategy determined according to the converged learning cost function is the expert strategy with the optimal comprehensive simulation cost and the optimal expert demonstration data density.
And S4, training the manipulator model according to the optimal expert strategy.
Specifically, the manipulator model can execute the manipulator behavior corresponding to the optimal expert strategy after training and learning according to the optimal expert strategy, namely accurately imitate and complete the expert demonstration behavior with the imitation cost as low as possible.
Specifically, the learning method of the embodiment of the application uses the expert demonstration data for supervised learning of the manipulator model and guides the manipulator model to perform reinforcement learning according to the density-based learning cost, so that the reinforcement learning cost and the supervised behavior simulation cost are minimized together.
According to the manipulator model learning method, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In some preferred embodiments, the step of acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from comprises:
S11, collecting a plurality of groups of expert demonstration data by having a human expert repeatedly operate the manipulator directly in virtual reality, or by having a human expert in a real scene repeatedly teach the moving manipulator an operation demonstration of the same execution task.
Specifically, virtual reality (VR) here is a virtual scene calibrated in advance, that is, the state space of the virtual scene is set in advance, and expert demonstration data in the state space of the virtual reality can be generated quickly by performing operation demonstrations in virtual reality. The real scene is consistent with the state space acquired by the manipulator, so the manipulator can be moved directly by a human expert to acquire expert demonstration data in that state space. Inputting the expert demonstration data as the supervised learning object enables the manipulator model to rapidly complete learning of the actions required to execute the task.
More specifically, repeated operation in virtual reality or in a real scene yields multiple groups of expert demonstration data, so that the manipulator model has enough learning samples to perform reinforcement learning.
More specifically, because the repeated operations are performed by a human expert, each operation actually differs slightly, so the groups of expert demonstration data differ slightly from one another. With enough expert demonstration data collected, the regions where the expert demonstration data are denser correspond to more accurate behavior characteristic data, i.e. to the standard behavior actions required to execute the task.
In some preferred embodiments, there are 30 or more groups of expert demonstration data, thereby ensuring that there is sufficient data to characterize the distribution of the expert demonstration data.
In some preferred embodiments, the step of generating an expert strategy associated with the learning cost function from the expert presentation data comprises:
s21, generating an expert strategy according to the expert demonstration data, wherein the expert strategy is used for guiding the manipulator model to generate simulation actions for simulating the expert demonstration data;
specifically, the simulation learning process of the manipulator model is to search a strategy to drive the manipulator behavior to be converted into an expert demonstration behavior with minimum cost under the distribution of an induction state, that is, the existing behavior strategy of the manipulator model is converted into an expert strategy, so that the expert strategy needs to be generated according to expert demonstration data to drive the manipulator model to learn.
More specifically, since the expert demonstration data are a plurality of groups, the step can generate a set of expert strategies according to the plurality of groups of expert demonstration data, so that the manipulator model can search for a proper expert strategy in the set of expert strategies for learning.
More specifically, the imitation action refers to the behavior obtained after the manipulator model, starting from its own behavior under the induced state distribution, imitates the expert demonstration data and converts toward it, that is, after the existing behavior strategy of the manipulator model has been converted into the expert strategy.
S22, obtaining the simulation cost according to the simulation action;
specifically, the simulation cost is an alternative cost required for the manipulator behavior simulation to be converted into the behavior after the expert demonstration data, and the cost is generally a distance cost, namely an action cost required to be paid for characterizing a change distance between the front behavior and the back behavior.
S23, acquiring reinforcement learning cost according to the density of the expert demonstration data corresponding to the imitation action in all the expert demonstration data;
specifically, the simulated action is a behavior generated by simulating a learning expert demonstration behavior, the merits of the simulated action can be known by analyzing the differences between the simulated action and the expert demonstration behavior generated after simulation, and because a plurality of groups of expert demonstration behaviors are input, the merits of the simulated action can be known according to the positions of the simulated action in the expert demonstration behavior distribution, namely whether the simulated action can well complete an execution task or not is known, so that the simulation action is set at a reinforced learning cost at the positions of the expert demonstration behavior distribution (namely the intensive degree in all expert demonstration data) to serve as a reward condition to drive the manipulator model to be reinforced and learned.
More specifically, the reinforcement learning cost is an uncertainty cost, which is equivalent to a reward and punishment cost, and when the simulation action is at a position where all expert demonstration data are densely distributed, a smaller cost value is given, whereas, when all expert demonstration data are sparsely distributed, a larger cost value is given.
More specifically, the density of the expert demonstration data is inversely related to the variance of the expert strategy, so that the embodiment of the application can set the reinforcement learning cost based on the strategy variance, when the expert strategy variance is larger, the reinforcement learning cost is larger, and conversely, when the expert strategy variance is smaller, the reinforcement learning cost is smaller.
And S24, establishing a learning cost function according to the simulation cost and the reinforcement learning cost.
Specifically, the learning cost function established in this step by combining the two costs can comprehensively represent both the action distance required for the manipulator model's imitation learning and the quality of the imitated action after learning.
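For illustration only, the following Python sketch shows one way the two costs of steps S22 to S24 could be combined numerically. The Gaussian kernel-density estimate used here to score how densely an imitated action falls within the demonstration data is an assumption chosen for simplicity (Example 1 below instead derives the density term from the variance of an expert-policy ensemble), and all function and variable names are hypothetical rather than part of the patent.

```python
import numpy as np

def imitation_cost(imitated_actions, expert_actions):
    # Simulation (imitation) cost: mean distance between each imitation action
    # and the expert demonstration action it tries to reproduce.
    return float(np.mean(np.linalg.norm(imitated_actions - expert_actions, axis=-1)))

def density_cost(imitated_actions, all_expert_actions, bandwidth=0.1):
    # Reward-and-punishment term: a Gaussian kernel-density estimate of each
    # imitation action under ALL expert demonstrations.  Dense regions give a
    # small cost, sparse regions give a large cost.
    diffs = imitated_actions[:, None, :] - all_expert_actions[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    density = np.mean(np.exp(-sq_dists / (2.0 * bandwidth ** 2)), axis=1)
    return float(np.mean(-np.log(density + 1e-8)))

def learning_cost(imitated_actions, expert_actions, all_expert_actions, weight=1.0):
    # Learning cost function of step S24: simulation cost plus the
    # reinforcement learning cost derived from demonstration-data density.
    return (imitation_cost(imitated_actions, expert_actions)
            + weight * density_cost(imitated_actions, all_expert_actions))
```

In this sketch a larger `weight` emphasizes the reward-and-punishment term, so an imitation action that lands in a sparse region of the demonstration data is penalized even when its raw imitation distance is small.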
In some preferred embodiments, the step of minimizing the learning cost function to obtain the optimal expert strategy comprises:
s31, training expert strategies according to multiple groups of expert demonstration data, and enabling simulation cost in the learning cost function to be minimized and converged;
and S32, minimizing the reinforcement learning cost to obtain the optimal expert strategy.
Specifically, during the reinforcement learning process, in order to simplify the whole learning process and improve the efficiency of obtaining the optimal expert strategy, the embodiment of the application preferably first trains the expert strategy so that the simulation cost converges to a minimum and then performs reinforcement learning updates until the reinforcement learning cost is minimized; updating the behavior step by step in this way reduces the number of variables updated at once and improves the efficiency of obtaining the optimal expert strategy.
More specifically, the reinforcement learning cost is minimized through reinforcement learning gradient updates of the expert strategy until the reinforcement learning cost of this stage is minimized.
In some preferred embodiments, training the expert strategy based on a plurality of sets of expert presentation data, the step of converging the simulated cost minimization in the learning cost function comprises:
s311, extracting part of expert demonstration data to train an expert strategy, and enabling simulation cost in a learning cost function to be converged;
and S312, updating the learning cost function according to the expert demonstration data gradient, and minimizing and converging the simulation cost after convergence.
Specifically, in order to improve the convergence efficiency of the simulation cost in the learning cost function, preliminary training is first performed by sampling, and the learning cost function is then updated with gradients computed on the expert demonstration data so that the simulation cost is minimized and converges, which improves the reinforcement learning efficiency of the manipulator model.
In some preferred embodiments, the learning methods further comprise the step performed before the step of obtaining a plurality of sets of expert demonstration data for the manipulator model to learn about the same task to be performed:
and S0, establishing a state space for manipulator model training according to the image data characteristics of the operation scene and the onboard sensor data.
Specifically, a manipulator in the prior art generally needs external equipment (equipment that is fixed relative to the spatial environment and has no direct connection to the manipulator) to calibrate the state space and thereby determine the position and state of the manipulator in space. The learning method of the embodiment of the application does not depend on calibration by external equipment: the manipulator's own sensing is directly combined with image data features acquired in advance through on-board sensors on the manipulator body (a vision sensor, touch sensors, joint encoders and the like) to establish a state space that replaces the state input of the reinforcement learning training task. This achieves the same sample efficiency and performance as state-based learning, omits the process of calibrating the environment with external equipment, and reduces the dependence on environmental equipment so as to lower the operating cost.
More specifically, the on-board sensor data is based on motion information acquired by the robot joint encoder.
In some preferred embodiments, the image data features are extracted according to image information of a job scene based on a preset feature extractor, and the preset feature extractor is generated based on life scene images and/or Imagenet database training.
Specifically, this step trains and acquires the feature extractor through an image database, where the image database is generated by labeling life-scene images and behaviors or is obtained from the ImageNet database, so that the feature extractor has enough training data to extract features smoothly for different scenes and behaviors and can therefore smoothly complete the extraction of image data features of the scene where the manipulator is located.
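As a concrete but non-authoritative illustration of this step, the sketch below obtains a preset feature extractor by taking a torchvision network pre-trained on the ImageNet database and discarding its classification head; the choice of ResNet-18, and the assumption that a ready-made pre-trained backbone stands in for training the extractor from a labeled life-scene image database, are both illustrative.

```python
import torch
import torchvision.models as models

def build_feature_extractor():
    # Backbone pre-trained on the ImageNet database; a network trained on
    # labeled life-scene images could be substituted in the same way.
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 512-d features
    backbone.eval()                     # the extractor is only used for inference here
    return backbone

feature_extractor = build_feature_extractor()
image = torch.rand(1, 3, 224, 224)      # stand-in for a work-scene image
with torch.no_grad():
    z = feature_extractor(image)        # image data features, shape (1, 512)
```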
In a second aspect, please refer to fig. 2, fig. 2 is a robot model learning apparatus for training a robot model provided in some embodiments of the present application, the learning apparatus includes:
an obtaining module 201, configured to obtain multiple sets of expert demonstration data for the manipulator model to learn and about the same execution task;
a strategy generation module 202, configured to generate an expert strategy associated with a learning cost function according to expert demonstration data, where the learning cost function is established based on an emulation cost required for emulating the expert demonstration data and an intensity of the expert demonstration data;
an optimization feedback module 203 for minimizing a learning cost function to obtain an optimal expert strategy;
and the training module 204 is used for training the manipulator model according to the optimal expert strategy.
In the manipulator model learning device provided by the embodiment of the application, in the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established from the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In some preferred embodiments, the manipulator model learning apparatus according to the embodiment of the present application is configured to perform the manipulator model learning method according to the first aspect.
In some preferred embodiments, the learning apparatus further comprises:
and the state space establishing module is used for establishing a state space for training the manipulator model according to the image data characteristics of the operation scene and the data of the airborne sensor.
In a third aspect, referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the present application provides an electronic device, including: the processor 301 and the memory 302, the processor 301 and the memory 302 being interconnected and communicating with each other via a communication bus 303 and/or other form of connection mechanism (not shown), the memory 302 storing a computer program executable by the processor 301, the processor 301 executing the computer program when the computing device is running to perform the method of any of the alternative implementations of the embodiments described above.
In a fourth aspect, the present application provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the method in any optional implementation manner of the foregoing embodiments. The storage medium may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
Example 1
In order to more clearly describe the learning process of the manipulator model in the manipulator model learning method provided by the embodiment of the present application, a more detailed embodiment of the manipulator model learning method is described, where the learning method includes the following steps:
1. Acquiring original image data of various real-life production scenes, labeling the targets and behaviors in the original images to establish an image database, or directly using the world's largest image recognition database, ImageNet, as the image database, and pre-training on the image database to obtain a feature extractor;
2. Acquiring image information of the current operation scene through the vision sensor on the manipulator body, and learning the image data features of the manipulator operation, denoted $z$, from the image information with the pre-trained feature extractor.

The learned image data features $z$ of the manipulator operation and the on-board sensor data $q$ obtained from the other on-board sensors of the manipulator are combined into $[z, q]$, which is taken as the state space $s$ and input into the reinforcement learning strategy-learning process of the manipulator model.

Here the on-board sensor data $q$ are determined from the motion information of the manipulator obtained by its joint encoders, and the image data features $z$ are obtained by mapping the image information $I$ through the feature extractor, i.e.

$$z = \phi(I)$$

where $I \in \mathbb{R}^{H \times W \times 3}$ is a high-dimensional image (a three-dimensional array) and $\phi$ is the feature extractor.
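Assuming a feature extractor like the one sketched earlier, the state space $s = [z, q]$ of this step can be assembled by concatenating the extracted image features with the joint-encoder readings; the 7-joint dimensionality, the use of positions plus velocities for $q$, and all names below are illustrative assumptions rather than requirements of the patent.

```python
import torch

def build_state(feature_extractor, image, joint_positions, joint_velocities):
    # z: image data features extracted from the work-scene image I by phi.
    with torch.no_grad():
        z = feature_extractor(image.unsqueeze(0)).squeeze(0)
    # q: on-board sensor data derived from the manipulator's joint encoders.
    q = torch.cat([joint_positions, joint_velocities])
    # State space s = [z, q] fed into the reinforcement-learning training task.
    return torch.cat([z, q])

# Example with an assumed 7-joint manipulator:
# s = build_state(feature_extractor, torch.rand(3, 224, 224),
#                 torch.zeros(7), torch.zeros(7))
```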
3. Before the manipulator reinforcement learning training, demonstration data are collected from operation demonstrations in which a human expert operates the manipulator in virtual reality (VR), or from operation demonstrations in which a human expert teaches the moving manipulator in a real scene, thereby forming the expert demonstration data; the demonstration is repeated 30 times to obtain 30 groups of expert demonstration data as the data basis for strategy search.
4. Training an expert strategy set on the multiple groups of expert demonstration data; taking the prediction deviation of the strategies, which reflects the density of the expert demonstration data, as a cost; establishing the learning cost function by combining this cost with the simulation cost required for simulating the expert demonstration behavior into a total learning cost; and carrying out optimization feedback of the expert strategies based on the learning cost function, which finally pushes the expert strategies toward the region where the expert demonstration data are concentrated and ensures that the expert behavior is imitated within the distribution of the expert demonstration data.
This step mainly adds, on top of the standard behavior demonstration cost, an additional cost based on how densely the expert demonstration data are distributed. The additional cost characterizes the action differences produced, given the demonstration data, by different strategies sampled from the posterior, so that when learning a strategy the manipulator executes operations highly similar to the expert's within the distribution of the expert demonstration data; if the expert strategy deviates from the distribution of the expert demonstration data, the additional cost urges the strategy back toward the data distribution of the expert demonstrations.
The specific flow of the step is as follows:
An expert strategy set is trained on the multiple groups of expert demonstration data. Denote the expert strategy by $\pi^*$, and search among the manipulator strategies for a strategy $\pi$. The distance between the actions taken by the strategy $\pi$ and by the expert strategy $\pi^*$, written $C_{\mathrm{BC}}(\pi)$, is:

$$C_{\mathrm{BC}}(\pi) = \mathbb{E}_{s \sim d_{\pi}}\left[\, \mathrm{TV}\!\left(\pi(\cdot \mid s),\ \pi^{*}(\cdot \mid s)\right) \right]$$

wherein $\mathbb{E}$ denotes expectation, $s \sim d_{\pi}$ indicates that the state $s$ satisfies the state distribution $d_{\pi}$ induced by the policy $\pi$, $\pi(\cdot \mid s)$ denotes executing the policy $\pi$ in the state $s$, $\pi^{*}(\cdot \mid s)$ denotes executing the expert policy $\pi^{*}$ in the state $s$, and $\mathrm{TV}(\cdot,\cdot)$ denotes the total variation distance between the two strategies.
Outside the expert demonstration distribution the data are sparse and the strategy variance is large, whereas within the expert demonstration distribution the data are dense and the strategy variance is low. An uncertainty cost, namely the additional cost, is therefore defined and driven to a minimum through the reinforcement learning reward-and-punishment mechanism so as to encourage the expert strategy to return to the expert demonstration distribution (namely the data-dense region). The uncertainty cost $C_{U}(s, a)$ is defined as

$$C_{U}(s, a) = \mathrm{Var}_{\pi \sim P(\pi \mid D)}\left[\, \pi(a \mid s) \,\right]$$

wherein $\pi \sim P(\pi \mid D)$ indicates that the policy $\pi$ satisfies the posterior probability distribution $P(\pi \mid D)$ given the demonstration data $D$, $\mathrm{Var}$ is the variance of a random variable, $s$ is a state and $a$ is an action.
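A minimal numerical sketch of this uncertainty cost, assuming the posterior $P(\pi \mid D)$ is approximated by an ensemble of expert policies trained on different bootstrap samples of the demonstration data (as described later in this example); the diagonal-Gaussian policy form and all names are assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(action, mean, std):
    # Probability density of the action under a diagonal-Gaussian policy pi_e(a | s).
    return float(np.prod(np.exp(-0.5 * ((action - mean) / std) ** 2)
                         / (std * np.sqrt(2.0 * np.pi))))

def uncertainty_cost(state, action, ensemble):
    # C_U(s, a) = Var_{pi ~ P(pi | D)}[ pi(a | s) ], with the posterior approximated
    # by an ensemble of expert policies; each ensemble member is a pair of callables
    # (mean_fn, std_fn) returning the Gaussian parameters for a given state.
    probabilities = [gaussian_pdf(action, mean_fn(state), std_fn(state))
                     for mean_fn, std_fn in ensemble]
    return float(np.var(probabilities))
```

Ensemble members agree where the demonstrations are dense, giving a low variance and a small cost, and disagree where they are sparse, giving a large cost.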
The total learning cost function $J(\pi)$ is set as follows:

$$J(\pi) = \mathbb{E}_{s \sim d_{\pi^{*}}}\left[\, \mathrm{TV}\!\left(\pi(\cdot \mid s),\ \pi^{*}(\cdot \mid s)\right) \right] + \mathbb{E}_{s \sim d_{\pi},\ a \sim \pi(\cdot \mid s)}\left[\, C_{U}(s, a) \,\right]$$

wherein $a \sim \pi(\cdot \mid s)$ indicates that the action $a$ satisfies the strategy $\pi$ conditioned on the state $s$, $\mathbb{E}$ denotes expectation, and $s \sim d_{\pi^{*}}$ indicates that the state $s$ satisfies the state distribution $d_{\pi^{*}}$ induced by the expert policy $\pi^{*}$.

The first half of the learning cost function is the standard behavior imitation cost, which is computed over the state distribution generated by the expert strategy; the latter half is the reinforcement learning cost, which is computed over the state distribution generated by the current strategy and optimized using policy gradients.
The optimization of the reinforcement learning cost can be performed with any policy gradient method, for example the advantage actor-critic (A2C) algorithm.
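To make this concrete, the fragment below sketches a single REINFORCE-style policy-gradient step in which the truncated uncertainty cost acts as a negative reward; it is a simplification of the advantage actor-critic update mentioned above (no critic or advantage baseline), and the policy interface and all names are assumptions.

```python
import torch

def reinforcement_cost_step(policy, optimizer, states, actions, uncertainty_costs, clip=1.0):
    # states/actions: a batch collected by rolling out the current strategy pi;
    # uncertainty_costs: C_U(s, a) evaluated at those state-action pairs.
    dist = policy(states)                               # a torch.distributions object (diagonal Gaussian)
    log_probs = dist.log_prob(actions).sum(-1)          # log pi(a | s) per sample
    reward = -torch.clamp(uncertainty_costs, max=clip)  # truncated additional cost as negative reward
    loss = -(log_probs * reward.detach()).mean()        # REINFORCE-style surrogate objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```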
The reinforcement learning cost pushes the manipulator strategy toward the expert distribution, while the supervised standard behavior imitation cost ensures that the manipulator strategy imitates the expert behavior within the expert distribution. Expert models are obtained by training on different bootstrap samples of the demonstration data, and the ensemble of expert models is used to approximate the posterior probability distribution $P(\pi \mid D)$. Since the demonstration data are fixed, the method finally alternates between supervised behavior demonstration updates and reinforcement learning strategy gradient updates so as to minimize the posterior variance.
The optimization process of reinforcement learning cost can be carried out in the following way:
Let $D$ be the demonstration sample, and input the expert demonstration data $D = \{(s_i, a_i)\}_{i=1}^{N}$, with $N \geq 2$, where $i$ indexes the corresponding expert demonstration data in the demonstration sample. Initialize the strategy $\pi$ and the expert policy set $\{\pi_e\}_{e=1}^{E}$. Sample $D_e$ from $D$, with $D_e \sim D$ and $|D_e| = |D|$; after $E$ rounds of expert demonstration training, train the expert strategy $\pi_e$ on the sample $D_e$ so that its behavior imitation cost on $D_e$ is minimized and converges; then perform gradient updates on small batches drawn from the demonstration data $D$ to minimize the imitation cost on $D$. Finally, perform gradient updates of the reinforcement learning strategy until the truncated additional cost is minimized, thereby completing the optimization of the reinforcement learning cost, that is, obtaining the optimal expert strategy corresponding to the minimum of the learning cost function.
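Putting the pieces of this flow together, the following self-contained toy (discrete states and actions, tabular policies, all sizes and names assumed) sketches the order of operations only: bootstrap-sample the demonstrations, fit an ensemble of expert policies, converge the imitation term on mini-batches of D, then apply the truncated-uncertainty policy-gradient update. It is not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, N_ENSEMBLE = 5, 3, 4

# Expert demonstration data D: (state, action) pairs from a fictitious expert
# that mostly picks action (state % N_ACTIONS), with a little noise.
D = [(int(s), int((s + (rng.random() < 0.1)) % N_ACTIONS))
     for s in rng.integers(0, N_STATES, 60)]

def fit_tabular_policy(data):
    # Behavior cloning for a tabular policy: smoothed action counts per state.
    counts = np.ones((N_STATES, N_ACTIONS))
    for s, a in data:
        counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# 1) Expert policy ensemble {pi_e}: each member fit on a bootstrap sample D_e ~ D, |D_e| = |D|.
ensemble = [fit_tabular_policy([D[i] for i in rng.integers(0, len(D), len(D))])
            for _ in range(N_ENSEMBLE)]

# 2) Uncertainty (reinforcement learning) cost C_U(s, a): variance across the ensemble.
C_U = np.var(np.stack(ensemble), axis=0)

# 3) Learner policy as a softmax table; imitation term minimized on mini-batches of D.
logits = np.zeros((N_STATES, N_ACTIONS))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.5
for _ in range(200):
    batch = [D[i] for i in rng.integers(0, len(D), 16)]
    pi = softmax(logits)
    for s, a in batch:                      # cross-entropy gradient of the imitation term
        grad = pi[s].copy()
        grad[a] -= 1.0
        logits[s] -= lr * grad / len(batch)

# 4) Policy-gradient updates driving down the truncated uncertainty cost
#    (in the real method the states would come from rolling out the manipulator policy).
for _ in range(200):
    pi = softmax(logits)
    s = rng.integers(0, N_STATES)
    a = rng.choice(N_ACTIONS, p=pi[s])
    reward = -min(C_U[s, a], 1.0)           # truncated additional cost as negative reward
    grad = -pi[s]                           # d log pi(a|s) / d logits[s] = onehot(a) - pi[s]
    grad[a] += 1.0
    logits[s] += lr * reward * grad

print("final learner policy:\n", np.round(softmax(logits), 2))
```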
In summary, the embodiment of the application provides a manipulator model learning method and device, an electronic device and a storage medium. In the process of acquiring the expert strategy, the density of the expert demonstration data is set as a reward-and-punishment condition for judging the learning cost, so that the learning cost function established based on the simulation cost and the density of the expert demonstration data receives optimization feedback from both reinforcement learning and the supervised behavior simulation cost. The optimal expert strategy finally obtained pushes the manipulator model to imitate the expert demonstration behavior within the range where the expert demonstration data are densely distributed, so that the manipulator model can accurately imitate and complete the expert demonstration behavior at the lowest possible simulation cost.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple groups of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A manipulator model learning method is used for training a manipulator model, and is characterized by comprising the following steps:
acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
generating an expert strategy associated with a learning cost function according to the expert demonstration data, wherein the learning cost function is established based on the simulation cost required for simulating the expert demonstration data and the density of the expert demonstration data;
minimizing the learning cost function to obtain an optimal expert strategy;
and training the manipulator model according to the optimal expert strategy.
2. The robot model learning method according to claim 1, wherein the step of acquiring a plurality of sets of expert demonstration data on the same execution task for the robot model learning comprises:
the robot is directly operated by a human expert repeatedly in virtual reality or a task demonstration about the same execution task is performed by a human expert teaching mobile robot repeatedly in a real scene to collect a plurality of sets of expert demonstration data.
3. The manipulator model learning method according to claim 1, wherein the step of generating an expert strategy associated with a learning cost function from the expert demonstration data comprises:
generating the expert strategy according to the expert demonstration data, wherein the expert strategy is used for guiding a manipulator model to generate imitation actions for imitating the expert demonstration data;
obtaining the simulation cost according to the imitation actions;
acquiring the reinforcement learning cost according to the density of the expert demonstration data corresponding to the imitation action among all the expert demonstration data;
and establishing a learning cost function according to the simulation cost and the reinforcement learning cost.
4. A manipulator model learning method according to claim 3, wherein the step of minimizing the learning cost function to obtain an optimal expert strategy comprises:
training the expert strategy according to a plurality of groups of expert demonstration data to minimize and converge simulation cost in the learning cost function;
minimizing the reinforcement learning cost to obtain an optimal expert strategy.
5. The manipulator model learning method according to claim 4, wherein the training of the expert strategy according to the plurality of sets of expert demonstration data to converge the minimization of the simulation cost in the learning cost function comprises:
extracting part of the expert demonstration data to train the expert strategy, so that the simulation cost in the learning cost function is converged;
and updating the learning cost function according to the expert demonstration data gradient to minimize and converge the simulation cost after convergence.
6. A manipulator model learning method according to claim 1, further comprising a step performed before the step of acquiring a plurality of sets of expert demonstration data on the same task to be performed for the manipulator model learning:
and establishing a state space for training the manipulator model according to the image data characteristics of the operation scene and the data of the airborne sensor.
7. The manipulator model learning method according to claim 6, wherein the image data features are extracted from image information of a job scene based on a preset feature extractor, and the preset feature extractor is generated based on a life scene image and/or Imagenet database training.
8. A manipulator model learning device for training a manipulator model, the learning device comprising:
the acquisition module is used for acquiring a plurality of groups of expert demonstration data, relating to the same execution task, for the manipulator model to learn from;
the strategy generation module is used for generating an expert strategy related to a learning cost function according to the expert demonstration data, and the learning cost function is established based on the simulation cost required by simulating the expert demonstration data and the density of the expert demonstration data;
an optimization feedback module for minimizing the learning cost function to obtain an optimal expert strategy;
and the training module is used for training the manipulator model according to the optimal expert strategy.
9. An electronic device comprising a processor and a memory, said memory storing computer readable instructions which, when executed by said processor, perform the steps of the method according to any one of claims 1 to 7.
10. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method according to any one of claims 1-7.
CN202210257626.5A 2022-03-16 2022-03-16 Manipulator model learning method and device, electronic equipment and storage medium Active CN114347043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210257626.5A CN114347043B (en) 2022-03-16 2022-03-16 Manipulator model learning method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114347043A true CN114347043A (en) 2022-04-15
CN114347043B CN114347043B (en) 2022-06-03

Family

ID=81094796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210257626.5A Active CN114347043B (en) 2022-03-16 2022-03-16 Manipulator model learning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114347043B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190091859A1 (en) * 2017-09-26 2019-03-28 Siemens Aktiengesellschaft Method and system for automatic robot control policy generation via cad-based deep inverse reinforcement learning
CN110473162A (en) * 2018-05-11 2019-11-19 精工爱普生株式会社 The generation method of machine learning device, photography time estimation device and learning model
CN110784618A (en) * 2018-07-25 2020-02-11 精工爱普生株式会社 Scanning system, storage medium, and machine learning apparatus
CN112447065A (en) * 2019-08-16 2021-03-05 北京地平线机器人技术研发有限公司 Trajectory planning method and device
CN110745136A (en) * 2019-09-20 2020-02-04 中国科学技术大学 Driving self-adaptive control method
CN112540671A (en) * 2019-09-20 2021-03-23 辉达公司 Remote operation of a vision-based smart robotic system
CN113043275A (en) * 2021-03-29 2021-06-29 南京工业职业技术大学 Micro-part assembling method based on expert demonstration and reinforcement learning
CN113887845A (en) * 2021-12-07 2022-01-04 中国南方电网有限责任公司超高压输电公司广州局 Extreme event prediction method, apparatus, device, medium, and computer program product
CN113971746A (en) * 2021-12-24 2022-01-25 季华实验室 Garbage classification method and device based on single hand teaching and intelligent sorting system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YANQIN MA ET AL.: "An Efficient Robot Precision Assembly Skill Learning Framework Based on Several Demonstrations", 《 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 》 *
YANQIN MA ET AL.: "Efficient Insertion Control for Precision Assembly Based on Demonstration Learning and Reinforcement Learning", 《IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS》 *
张祺琛: "Research and Application of Meta-Reinforcement Learning", 《中国优秀硕士学位论文全文电子期刊网》 *
李帅龙: "A Survey of Imitation Learning Methods and Their Applications in Robotics", 《计算机工程与应用》 *
李超: "Research on Key Technologies of a Robot Compliant Assembly Control System for Energetic Components", 《中国硕士学位论文全文电子期刊网 信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277264A (en) * 2022-09-28 2022-11-01 季华实验室 Subtitle generating method based on federal learning, electronic equipment and storage medium
CN115277264B (en) * 2022-09-28 2023-03-24 季华实验室 Subtitle generating method based on federal learning, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114347043B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN110781765B (en) Human body posture recognition method, device, equipment and storage medium
WO2022100363A1 (en) Robot control method, apparatus and device, and storage medium and program product
Daniel et al. Active reward learning with a novel acquisition function
CN110472594A (en) Method for tracking target, information insertion method and equipment
Judah et al. Active lmitation learning: formal and practical reductions to IID learning.
CN113408621B (en) Rapid simulation learning method, system and equipment for robot skill learning
CN109940614B (en) Mechanical arm multi-scene rapid motion planning method integrating memory mechanism
CN114347043B (en) Manipulator model learning method and device, electronic equipment and storage medium
CN106529838A (en) Virtual assembling method and device
CN109925718A (en) A kind of system and method for distributing the micro- end map of game
CN114139637A (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
Toma et al. Pathbench: A benchmarking platform for classical and learned path planning algorithms
CN108255059B (en) Robot control method based on simulator training
CN115761905A (en) Diver action identification method based on skeleton joint points
JPWO2020003670A1 (en) Information processing device and information processing method
CN102737279A (en) Information processing device, information processing method, and program
Nazarczuk et al. V2A-Vision to Action: Learning robotic arm actions based on vision and language
CN116276973A (en) Visual perception grabbing training method based on deep learning
CN116362265A (en) Text translation method, device, equipment and storage medium
Ren et al. InsActor: Instruction-driven Physics-based Characters
CN113476833A (en) Game action recognition method and device, electronic equipment and storage medium
CN113284257A (en) Modularized generation and display method and system for virtual scene content
CN112785721A (en) LeapMotion gesture recognition-based VR electrical and electronic experiment system design method
CN113761355A (en) Information recommendation method, device, equipment and computer readable storage medium
Babadi et al. Learning Task-Agnostic Action Spaces for Movement Optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant