CN111136659A - Mechanical arm action learning method and system based on third-person imitation learning - Google Patents

Mechanical arm action learning method and system based on third-person imitation learning

Info

Publication number
CN111136659A
CN111136659A
Authority
CN
China
Prior art keywords
discriminator
sample
mechanical arm
demonstration
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010040178.4A
Other languages
Chinese (zh)
Other versions
CN111136659B (en)
Inventor
章宗长 (Zongzhang Zhang)
俞扬 (Yang Yu)
姜冲 (Chong Jiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010040178.4A
Publication of CN111136659A
Application granted
Publication of CN111136659B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a mechanical arm action learning method and system based on third-person imitation learning, used for automatic control of a mechanical arm, so that the arm can automatically learn how to complete a corresponding control task by watching a third-party demonstration. In the invention, the samples exist in video form, without using a large number of sensors to acquire state information. An image-difference method is used in the discriminator module, so that the discriminator can ignore the appearance and environmental background of the learning object and imitation learning can be performed with third-party demonstration data, greatly reducing the acquisition cost of samples. A variational discriminator bottleneck is used in the discriminator module to restrict the discriminator's accuracy on the demonstrations generated by the mechanical arm, so that the training of the discriminator module and the control strategy module is better balanced. The invention can quickly imitate the user's demonstrated actions, is simple and flexible to operate, and places low requirements on the environment and the demonstrator.

Description

Mechanical arm action learning method and system based on third-person imitation learning
Technical Field
The invention relates to a mechanical arm action learning method and system based on third-person imitation learning, and belongs to the technical field of automatic mechanical arm action learning.
Background
The mechanical arm is the primary actuator of today's robots and the most widely applied automatic mechanical device. Traditional mechanical arm control must be realized through motion-planning programming; this approach is highly complex, demands considerable professional knowledge from the user, and offers very low learning efficiency and intelligence. As the action tasks required in practice grow increasingly complex, traditional mechanical arm motion-control systems struggle to meet users' needs.
Imitation is the most direct and effective way for humans to learn motor skills: by watching others demonstrate, a human can quickly acquire the corresponding skill. Imitation learning gives a robot this fast, human-like learning ability, so that the robot can learn corresponding operations from demonstrations just as humans do. Compared with traditional automatic mechanical arm control, this human-like learning mode offers higher learning efficiency and intelligence, and also reduces the operator's burden: operators no longer need to learn a special programming language to carry out motion-planning programming.
Generative adversarial imitation learning (GAIL) is one of the most representative methods in current imitation learning. GAIL constructs two agents that play a game against each other and continuously improve in the process: a generator and a discriminator. The generator aims to produce samples identical to the demonstration samples, so that the discriminator cannot judge a sample's source; the discriminator aims to distinguish demonstration samples from generated samples as well as possible. According to the discrimination results, the generator and the discriminator each update their parameters and begin the next round of the game. Through continuous play and improvement, the two finally reach a Nash equilibrium, at which point the samples produced by the generator are indistinguishable from real ones and the discriminator can no longer accurately judge a sample's source. The game between the two can be formalized as follows:
min_θ max_ω  E_{τ_E}[ log D_ω(s, a) ] + E_{π_θ}[ log(1 − D_ω(s, a)) ]

where (s, a) is a state-action pair, indicating that the demonstrator or the generator takes action a in state s; D_ω is the discriminator; π_θ is the generator (i.e., the policy); the subscript π_θ indicates that the sample comes from the generator; and the subscript τ_E indicates that the sample comes from the demonstration.
Imitation learning enables the robot to learn corresponding operations from demonstrations provided by an operator. However, imitation learning methods usually require the demonstration to come from the first-person perspective, i.e., the operator guides the arm through the motion by hand, and the state and action information during the demonstration (such as joint angles, movement speeds, etc.) is recorded as the demonstration sample. To obtain first-person operation demonstrations, a large number of sensors, such as infrared ranging sensors, pressure sensors, and photoelectric encoders, must be installed on the mechanical arm, which greatly increases its cost of use. In addition, sensor data may differ greatly across different mechanical arms, so demonstration samples generalize poorly between arms, further increasing the cost of using demonstrations.
One solution to this problem is to use demonstration samples in video form, i.e., third-party demonstration data. However, third-party demonstration data contains only observation images from a third-party viewing angle, without detailed state and action information; moreover, the environment background and the demonstrator's appearance in a third-party demonstration video may differ from the mechanical arm's own, i.e., there may be domain-feature differences between the two. In this case, without a one-to-one correspondence between the demonstration data and the samples generated by the mechanical arm itself, it is difficult for the arm to learn the corresponding control strategy from the demonstration data. Stadie et al. proposed Third-Person Imitation Learning for this situation; it introduces the concept of domain confusion on top of generative adversarial imitation learning and can blur the domain information in the samples, so that an agent can perform imitation learning using third-party demonstration data. However, this method requires an additional type of demonstration data, generated in the demonstrator's domain by executing a random policy, and the introduction of such demonstrations also greatly increases the cost of learning.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and deficiencies in the prior art, the invention provides a mechanical arm action learning method and system based on third-person imitation learning. First, the method uses an image-difference operation to remove the domain features in the samples; second, it uses the Variational Discriminator Bottleneck (VDB) algorithm to further weaken the influence of domain-difference information on strategy learning.
The invention is applicable to mechanical arm action imitation tasks with third-party demonstrations, enabling the arm to learn operations from demonstration samples consisting only of observations (such as a demonstration video), without considering the domain-feature differences between the demonstrator and the mechanical arm.
The technical scheme is as follows: a mechanical arm action learning method based on third-person imitation learning enables the robot to learn corresponding operations solely from observations of a demonstration (e.g., by watching demonstration videos), without using numerous sensors to acquire demonstration information; that is, the state-action pair (s, a) of imitation learning is replaced with an observation pair (o, o') consisting only of observation images. Moreover, the environment background of the demonstration sample, the appearance of the demonstrator, etc., may differ from the mechanical arm's. The method gives the robot a learning ability closer to a human's, with lower sample-acquisition cost and stronger generality. The method comprises the following steps:
S1, the input demonstration sample τ_E consists only of an observation image sequence {o_1, o_2, o_3, ..., o_T}, where T is the maximum time step and each o is an RGB image extracted directly from the video;
S2, the mechanical arm executes the current control strategy π_θ, and the observation image sequence during strategy execution is recorded to obtain a sample τ̂ consisting only of an observation sequence;
S3, the demonstration sample τ_E and the arm-generated sample τ̂ are input to the discriminator D_ω. The discriminator is a binary-classification neural network consisting of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer. The input to the discriminator is a sample from the demonstrator or the mechanical arm in the form of an observation pair (o_t, o_{t+n}); the input layer performs a difference operation on the input pair, i.e., o_{t+n} − o_t, which removes domain features such as the environment background and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the difference image to extract the sample features σ; the encoder Enc encodes the extracted features to obtain a code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ of samples from τ̂ and their codes z, limiting the information flow of the arm-generated samples through the discriminator network; the code z is then used as the input of the fully connected layers to obtain the discriminator output D_ω(z);
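The input-layer difference operation can be illustrated with a minimal sketch in plain Python (toy single-channel frames, not the patent's implementation): pixels that stay constant between the two frames cancel to zero, while motion-related changes survive.

```python
def frame_difference(o_t, o_tn):
    """Pixel-wise difference o_{t+n} - o_t of two equal-size frames.

    Domain features that are constant across the two frames (background,
    appearance) cancel to 0; only motion-related changes remain."""
    rows, cols = len(o_t), len(o_t[0])
    return [[o_tn[i][j] - o_t[i][j] for j in range(cols)] for i in range(rows)]

# toy 2x3 single-channel frames: uniform background 5, object moves left to right
o_t  = [[9, 5, 5],
        [5, 5, 5]]
o_tn = [[5, 5, 9],
        [5, 5, 5]]
diff = frame_difference(o_t, o_tn)  # background cells cancel to 0
```

The surviving nonzero entries form a rough, appearance-invariant behavior signature that the feature extractor F then processes.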
S4, the discriminator parameter ω is updated according to the discrimination result using a policy-gradient method, with gradient g_ω:

g_ω = ∇_ω [ E_{τ_E}[ log D_ω(z) ] + E_{τ̂}[ log(1 − D_ω(z)) ] − β ( E_{τ̂}[ KL( Enc(z|σ) ∥ r(z) ) ] − I_c ) ]

where KL denotes the KL divergence (Kullback-Leibler divergence), used to measure the difference between two distributions; β is a Lagrange multiplier with initial value 1 × 10^-3; E_{τ_E} and E_{τ̂} denote expectations, whose subscripts indicate the source of the code z used in the expectation: τ̂ denotes the set of samples produced by the mechanical arm and τ_E the set of samples produced by the demonstrator; and r(z) is the standard Gaussian distribution.
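As a hedged sketch in plain Python (not the patent's network code), the bottleneck term can be monitored with the closed-form KL between a diagonal-Gaussian encoder output Enc(z|σ) and the standard normal r(z), with β adjusted by dual gradient ascent; the dual-ascent rule and its step size are assumptions borrowed from the standard VDB formulation, since the patent only specifies β's initial value.

```python
import math

def kl_diag_gauss_vs_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over
    latent dimensions; this is the E[KL(Enc(z|sigma) || r(z))] term."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def update_beta(beta, mean_kl, i_c, step=1e-5):
    """Dual ascent on the Lagrange multiplier (assumed rule): beta grows
    while the bottleneck constraint E[KL] <= I_c is violated, shrinks
    otherwise, and is kept non-negative."""
    return max(0.0, beta + step * (mean_kl - i_c))
```

An encoder matching the standard normal incurs zero KL, so β decays toward zero; when the average KL exceeds I_c, β rises and the penalty tightens the bottleneck.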
S5, the mechanical arm's control strategy π_θ uses the discriminator output as an approximate reward, i.e., r(o_t, o_{t+n}) = −log(1 − D_ω(z)), and updates the control strategy network parameters θ using the ACKTR method. The above steps are repeated until the discriminator can no longer distinguish the arm-generated samples from the demonstration samples; at this point the mechanical arm can successfully imitate the demonstration to complete the corresponding operation, and the mechanical arm action learning model is obtained.
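The reward shaping in S5 is a one-liner; the sketch below (plain Python, with an assumed epsilon guard against log(0)) shows how a discriminator output near 1, i.e., "looks like a demonstration", yields a large reward.

```python
import math

def imitation_reward(d_out, eps=1e-8):
    """Approximate reward r = -log(1 - D_w(z)) from the discriminator
    output d_out in [0, 1); eps (an assumption) guards against log(0)."""
    return -math.log(max(1.0 - d_out, eps))
```

The reward is 0 when the discriminator is certain the sample is arm-generated (D = 0) and grows without bound as D approaches 1, pushing the policy toward demonstration-like behavior.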
S6, when the user begins to use the mechanical arm action learning model and system, a demonstration-sample source is selected: a manually input demonstration video or a live demonstration. If the user selects a live demonstration, the camera records the user's demonstration and each frame of the video is extracted to form the demonstration sample τ_E; if the user selects a manually input demonstration video, the system directly extracts each frame from the user's video to form the demonstration sample τ_E.
S7, the update goal of the control strategy is to maximize the accumulated reward, i.e., to maximize E_{τ̂}[ log D_ω(z) ], the probability that the discriminator classifies the samples generated by the control strategy as demonstration samples, so as to produce control behavior as similar to the demonstration samples as possible.
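A Monte-Carlo estimate of this objective over a batch of discriminator outputs can be sketched in a few lines of plain Python (an illustrative helper, not from the patent):

```python
import math

def policy_objective(disc_outputs):
    """Estimate E_{tau_hat}[log D_w(z)]: the mean log-probability the
    discriminator assigns to 'demonstration' on arm-generated samples.
    The control strategy is updated to drive this quantity up."""
    return sum(math.log(d) for d in disc_outputs) / len(disc_outputs)
```

The estimate approaches its maximum of 0 as the discriminator outputs approach 1, i.e., as the arm's samples become indistinguishable from demonstrations.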
In the present invention, the Wasserstein distance is used to measure the difference between demonstration samples and arm-generated samples, and the discriminator D_ω is defined as a 1-Lipschitz function updated as ω ← ω + ψ · RMSProp(ω, g_ω), where ψ is the learning rate and RMSProp denotes the root-mean-square propagation algorithm, a gradient-based optimization method. In the present invention, the learning rates of all networks are set to 1 × 10^-4. During learning, to prevent the gradient-explosion problem, the model clips each gradient so that it does not exceed a fixed threshold, i.e., g ← g · θ_c/‖g‖ whenever ‖g‖ > θ_c, where θ_c is the threshold. Meanwhile, to prevent the difference between adjacent video frames from being too small to yield useful behavior features, the invention composes each sample from frames n apart, where n can be chosen flexibly in the range [3, 5].
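The gradient-clipping rule can be sketched in plain Python (a global-norm clip, assuming the threshold applies to the gradient's Euclidean norm):

```python
import math

def clip_gradient(grad, threshold):
    """Global-norm clipping: if ||g|| exceeds the threshold, rescale g
    so its norm equals the threshold; otherwise return it unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return grad
```

Rescaling preserves the gradient's direction while bounding its magnitude, which keeps a single large update from destabilizing training.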
The invention enables the mechanical arm to learn corresponding operations by watching third-party demonstrations. A third-party demonstration sample usually differs in domain features from the samples generated by the mechanical arm; in generative adversarial imitation learning, this difference makes the discriminator too strong, breaking the balance of the game between the discriminator and the policy, so the arm cannot obtain useful information from the discriminator's feedback to update its control strategy. Therefore, the method first uses the image-difference operation to quickly remove the domain-feature information in the samples, minimizing the influence of domain-feature differences on the discriminator-policy game. Meanwhile, this operation preliminarily extracts the behavior features in the samples, reducing the model's computation and accelerating training. Second, the invention uses the variational discriminator bottleneck to further constrain the discriminator, i.e., it imposes an upper bound I_c on the mutual information I(σ, z) between the features σ of samples from τ̂ and their codes z, limiting the discriminator's information flow so that its discrimination of arm-generated samples is disturbed and the balance of the game between the discriminator and the control strategy is better maintained.
In order to achieve the above object, the present invention provides a mechanical arm action learning system based on third-person imitation learning, comprising:
the sample acquisition module, used for acquiring the demonstration sample input by the user and the sample generated by the mechanical arm executing the control strategy, then preprocessing the samples: each frame is extracted to form an observation sequence, which is stored in memory for the mechanical arm action learning model to fetch.
The discriminator module is used for training the mechanical arm control strategy; the goal of the discriminator is to distinguish demonstration samples from arm-generated samples as well as possible. The discriminator is a binary-classification neural network consisting of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer. The input to the discriminator is a sample from the demonstrator or the mechanical arm in the form of an observation pair (o_t, o_{t+n}); the input layer performs a difference operation on the input pair, i.e., o_{t+n} − o_t, which removes domain features such as the environment background and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and extracts the sample features σ; the encoder Enc encodes the extracted features to obtain a code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ of samples from τ̂ and their codes z, limiting the information flow of the arm-generated samples in the discriminator network; the code z is used as input to the fully connected layers to obtain the discriminator output D_ω(z). For each input sample, the discriminator outputs the probability that the sample is a demonstration sample. After each round of discrimination, the discriminator updates its network parameters according to the error between its judgments and the true sources, so that the next round of discrimination is more accurate. For demonstration samples the discriminator should output a probability as high as possible, and for arm-generated samples a probability as low as possible; an optimal discriminator outputs 0 for any arm-generated sample and 1 for any demonstration sample. If there are differences in domain features (such as environment background or color and appearance) between the demonstration samples and the arm-generated samples, the discriminator can quickly distinguish the samples by those domain features alone; it then becomes too strong, and the control strategy can hardly learn from its discrimination results. The image-difference operation in the discriminator's input layer removes most domain features related to environment background and appearance, and the upper bound I_c added in the discriminator constrains it further, alleviating the over-strong-discriminator problem caused by domain-feature differences.
The discriminator and the control strategy play a game against each other; after continuous play they finally reach a Nash equilibrium, at which point the discriminator's accuracy on the samples is 50% and it can no longer distinguish the two kinds of samples.
The control strategy module is used for controlling the mechanical arm and consists of 1 input layer, 2 convolutional layers, 2 pooling layers, 2 hidden layers, and 1 output layer. The input layer takes an observation image of the current mechanical arm state, captured directly by a camera; the output layer outputs the control signal of the mechanical arm, i.e., the action information; the action information is passed to the control module, where it is converted into pulse signals to drive the mechanical arm.
The input to the discriminator module is a batch of observation pairs from the samples τ̂ generated by the mechanical arm under the control strategy module and a batch of observation pairs from the demonstration samples τ_E; the batch size is 1024 pairs. The two kinds of samples are input to the discriminator for discrimination, and the discriminator uses the Wasserstein distance to measure the difference between the two kinds of samples.
The demonstration sample τ_E and the arm-generated sample τ̂ both consist of an observation image sequence {o_1, o_2, o_3, ..., o_T}. Before input to the discriminator, each sequence is split into observation pairs (o_1, o_{1+n}), (o_2, o_{2+n}), ..., (o_i, o_{i+n}), ..., (o_{T−n}, o_T); the pairs are shuffled and a batch is randomly sampled as the input of the discriminator module. Here n means that each observation pair consists of two observation images n frames apart; n can be chosen flexibly in the range [3, 5].
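The pair construction and batch sampling can be sketched in plain Python (toy batch size instead of the 1024 pairs used in the invention; the helper name and fixed seed are illustrative):

```python
import random

def make_observation_pairs(frames, n=3, batch_size=4, seed=0):
    """Split an observation sequence {o_1, ..., o_T} into pairs
    (o_i, o_{i+n}) of frames n apart, shuffle, and sample one batch."""
    pairs = [(frames[i], frames[i + n]) for i in range(len(frames) - n)]
    rng = random.Random(seed)      # fixed seed only for reproducibility
    rng.shuffle(pairs)
    return pairs[:batch_size]

seq = list(range(1, 11))           # stand-in for frames o_1..o_10
batch = make_observation_pairs(seq, n=3, batch_size=4)
# every sampled pair is exactly n frames apart
```

A sequence of T frames yields T − n pairs, so with T = 10 and n = 3 the pool holds 7 pairs from which each batch is drawn.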
Beneficial effects: compared with the prior art, the mechanical arm action learning method and system based on third-person imitation learning provided by the invention have the following advantages. (1) The method suits imitation-learning tasks that include only observations, relaxing traditional imitation learning's requirements for state and action information and greatly reducing the model's dependence on sensor equipment. (2) The action learning model obtained by training with generative adversarial imitation learning has stronger generalization ability and higher learning efficiency. (3) The image-difference method removes the domain features of the samples, so the model can perform imitation learning from third-party demonstrations without considering the domain-feature differences between the demonstration and the arm-generated samples, lowering the acquisition difficulty of demonstration samples and improving their generality. (4) The image-difference method also preliminarily extracts the behavior features in the samples, reducing the computation during training and improving sample efficiency. (5) The introduction of the variational discriminator bottleneck lets the model better balance the discriminator and the strategy, making training more stable. In conclusion, the invention has great practical value and significance.
Drawings
FIG. 1 is an exemplary view of a sample observation of an embodiment of the present invention;
FIG. 2 is a block diagram of an arbiter module according to an embodiment of the present invention;
FIG. 3 is a block diagram of an embodiment of the present invention;
fig. 4 is a control strategy structure diagram according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
A mechanical arm action learning method based on third-person imitation learning comprises the following steps:
S1, the input demonstration sample τ_E consists only of an observation image sequence {o_1, o_2, o_3, ..., o_T}, rather than the state-action sequence {s_1, a_1, s_2, a_2, ..., s_{T−1}, a_{T−1}, s_T} of traditional imitation learning, where T is the maximum time step and o is an RGB image extracted directly from the video;
S2, the mechanical arm executes the current control strategy π_θ, and the observation image sequence during strategy execution is recorded to obtain a sample τ̂ consisting only of an observation sequence. The demonstration sample τ_E and the sample τ̂ produced by the mechanical arm may differ in domain features, i.e., the demonstrator's appearance, environment background, etc., may differ from the mechanical arm's, as shown in FIG. 1: the first row gives example observations from the demonstration sample and the second row example observations from the arm-generated sample;
S3, the demonstration sample τ_E and the arm-generated sample τ̂ are input to the discriminator D_ω. The discriminator is a binary-classification neural network composed of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer; the network structure is shown in FIG. 2. The input to the discriminator is a sample from the demonstrator or mechanical arm in the form (o_t, o_{t+n}); to prevent the difference between adjacent video frames from being too small to yield useful behavior features, this embodiment composes each sample from images n frames apart, with n = 3. The input layer performs the difference operation o_{t+n} − o_t, removing domain features such as the environment background and preliminarily extracting the behavior features related to the strategy. The feature extractor F consists of 2 convolutional layers and 2 pooling layers and extracts the sample features σ. The encoder Enc, composed of 1 fully connected layer, encodes the extracted features to obtain a code z ~ Enc(z|σ). At the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ of the arm-generated samples τ̂ and their codes z, limiting their information flow in the discriminator network; in this embodiment I_c is set to 0.1, and the constrained optimization problem is solved by the Lagrange-multiplier method, with Lagrange multiplier β initialized to 1 × 10^-3. The code z is used as input to the fully connected layers to obtain the discriminator output D_ω(z);
S4, the discriminator parameter ω is updated by the policy-gradient method according to the discrimination result, with gradient:

g_ω = ∇_ω [ E_{τ_E}[ log D_ω(z) ] + E_{τ̂}[ log(1 − D_ω(z)) ] − β ( E_{τ̂}[ KL( Enc(z|σ) ∥ r(z) ) ] − I_c ) ]

S5, the mechanical arm's control strategy π_θ uses the discriminator output as an approximate reward, i.e., r(o_t, o_{t+n}) = −log(1 − D_ω(z)), and updates the policy network parameters θ using the ACKTR method. Steps S2 to S5 are repeated until the discriminator can no longer distinguish the arm-generated samples, at which point the mechanical arm can successfully imitate the demonstration to complete the corresponding operations, and the mechanical arm action learning model is obtained.
S6, when the user begins to use the mechanical arm action learning model and system, the demonstration-sample source may be selected manually: a manually input demonstration video or a live demonstration. If the user selects a live demonstration, the camera records the user's demonstration and each frame of the video is then extracted to form the demonstration sample τ_E; if the user selects a manually input demonstration video, the system directly extracts each frame from the user's video to form the demonstration sample τ_E, without considering the differences in domain features such as appearance and background between the demonstrator in the video and the mechanical arm.
S7, the mechanical arm control strategy is updated to maximize the accumulated reward, i.e., to maximize E_{τ̂}[ log D_ω(z) ], the probability that the discriminator classifies the samples generated by the control strategy as demonstration samples, so as to produce control behavior as similar to the demonstration samples as possible.
The following is the mechanical arm action learning system based on third-person imitation learning of this embodiment, comprising: a sample acquisition module, a discriminator module, a control strategy module, and a control module.
The sample acquisition module acquires the demonstration sample input by the user, which may be manually input or recorded on site, and the sample generated by the mechanical arm executing the control strategy. The samples are then preprocessed: each frame is extracted to form an observation sequence, which is stored in memory for the mechanical arm action learning model to fetch.
The discriminator module is used for training the mechanical arm control strategy; the purpose of the discriminator is to distinguish the demonstration samples from the arm-generated samples as well as possible. For each input sample, the discriminator outputs the probability that the sample is a demonstration sample: for a demonstration sample it should output a probability as high as possible, and for an arm-generated sample a probability as low as possible. If there are differences in domain features (such as environment background or color and appearance) between the demonstration samples and the arm-generated samples, the discriminator can quickly distinguish the samples by those domain features alone, becoming too strong, and the control strategy can then hardly learn from its discrimination results. The image-difference operation in the discriminator's input layer removes most domain features related to environment background and appearance, and the upper bound I_c added in the discriminator constrains it further, alleviating the over-strong-discriminator problem caused by domain-feature differences. The discriminator and the control strategy play a game against each other; after continuous play they finally reach a Nash equilibrium, at which point the discriminator's accuracy on the samples is 50% and it can no longer distinguish the two kinds of samples. The framework of this embodiment is shown in FIG. 3; the discriminator is a binary-classification neural network consisting of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer.
The input to the discriminator is a sample from the demonstrator or the mechanical arm, in the form (o_t, o_{t+n}). The input layer performs a difference operation on the input sample, i.e. o_{t+n} − o_t, eliminating domain features such as the environment background and preliminarily extracting the behavior features related to the strategy. The feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the image to extract the sample features σ. The encoder Enc encodes the extracted features to obtain the code z ~ Enc(z|σ). At the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ and the code z of the samples from the mechanical arm sample τ^i, to limit the corresponding information flow in the discriminator network. The code z is used as the input of the fully connected layers to obtain the output D_ω(z) of the discriminator network; for each input sample, the discriminator outputs the probability that the sample is a demonstration sample.
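A minimal numerical sketch of this forward pass, with small stand-ins for the convolutional feature extractor F and a Gaussian encoder for Enc (all layer shapes, names, and the Gaussian form are assumptions, not the patent's implementation); it also shows how the difference operation cancels a constant background shift:

```python
import numpy as np

def discriminator_forward(o_t, o_tn, W_enc, w_out, seed=0):
    """Sketch of the discriminator: difference input layer, toy feature
    extractor, stochastic encoder z ~ Enc(z|sigma), probability output."""
    rng = np.random.default_rng(seed)
    diff = o_tn - o_t                        # input-layer difference op
    sigma = diff.reshape(-1) @ W_enc         # stand-in for F (conv + pool)
    mu, log_std = sigma[:4], sigma[4:]       # Gaussian encoder head
    z = mu + np.exp(log_std) * rng.standard_normal(4)  # z ~ Enc(z|sigma)
    return 1.0 / (1.0 + np.exp(-(z @ w_out)))          # P(demonstration)

rng = np.random.default_rng(42)
o_t, o_tn = rng.random((8, 8)), rng.random((8, 8))
W_enc = rng.standard_normal((64, 8)) * 0.1
w_out = rng.standard_normal(4)

p_plain = discriminator_forward(o_t, o_tn, W_enc, w_out, seed=7)
# A constant background shift added to both frames cancels in the
# difference operation, so the output is (numerically) unchanged:
p_shift = discriminator_forward(o_t + 0.3, o_tn + 0.3, W_enc, w_out, seed=7)
```

The cancellation is the mechanism by which the input layer removes background-related domain features.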
The control strategy module is used for controlling the mechanical arm and consists of 1 input layer, 2 convolutional layers, 2 pooling layers, 2 hidden layers, and 1 output layer; the network structure of the control strategy module is shown in fig. 4. The input layer takes an observation image of the current mechanical arm state, shot directly by a camera; the output layer outputs the control signal of the mechanical arm, i.e. the action information. The action information is passed to the control module, where it is converted into pulse signals to control the mechanical arm.
The input of the discriminator module is a set of observation pairs {(o_t^i, o_{t+n}^i)} from a sample generated by the control strategy and a set of observation pairs {(o_t^E, o_{t+n}^E)} from the demonstration sample; the size of a batch is 1024 pairs. The two kinds of samples are input into the discriminator together for discrimination; the discriminator uses the Wasserstein distance to measure the difference between the two kinds of samples.
Both the demonstration sample τ^E and the sample τ^i generated by the mechanical arm consist of an observation image sequence o_1, o_2, o_3, ..., o_T. Before being input to the discriminator, each sequence is split into observation pairs (o_1, o_{1+n}), (o_2, o_{2+n}), ..., (o_i, o_{i+n}), ..., (o_{T−n}, o_T); the observation pairs are shuffled and a batch is randomly sampled as the input of the discriminator module. Here n means that each observation pair is formed by two observation images spaced n frames apart; n can be chosen flexibly, with a value range of [3, 5].
The discriminator bottleneck I_c in the discriminator module limits the mutual information between the data before and after encoding by the encoder, thereby limiting the information flow of the discriminator; this weakens the discriminator and regulates its discrimination accuracy. The game process between the discriminator and the control strategy can therefore be expressed as:
min_θ max_{ω, Enc}  E_{(o_t, o_{t+n}) ~ τ^E}[ log D_ω(z) ] + E_{(o_t, o_{t+n}) ~ τ^i}[ log(1 − D_ω(z)) ]

subject to  I(σ, z) ≤ I_c  for features σ extracted from the mechanical arm sample τ^i, with z ~ Enc(z|σ)
where τ^i denotes the sample generated by the mechanical arm control strategy (i.e. only the information flow from the samples produced by the mechanical arm is limited), z is the code, D_ω is the discriminator network, and D_ω(z) is the output of the discriminator network.
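One common way to enforce such a mutual-information bound, as in variational discriminator bottleneck methods, is to upper-bound I(σ, z) by the KL divergence between the encoder distribution and a fixed prior, and to adapt a Lagrange multiplier β by dual ascent. The sketch below illustrates that mechanism under the assumption of a Gaussian encoder and a standard normal prior (the patent does not spell out these details):

```python
import numpy as np

def gaussian_kl(mu, log_std):
    """KL( N(mu, std^2) || N(0, 1) ): a variational upper bound on the
    mutual information I(sigma, z) used to enforce I_c."""
    var = np.exp(2 * log_std)
    return 0.5 * np.sum(mu**2 + var - 2 * log_std - 1)

def update_beta(beta, kl_value, i_c, step=0.1):
    """Dual ascent on the multiplier: grow beta while the bottleneck
    constraint KL <= I_c is violated, shrink it (toward 0) otherwise."""
    return max(0.0, beta + step * (kl_value - i_c))

mu, log_std = np.array([0.5, -0.5]), np.array([0.0, 0.0])
kl = gaussian_kl(mu, log_std)                       # 0.25 for these values
beta = update_beta(beta=1.0, kl_value=kl, i_c=0.2)  # constraint violated
```

When the KL term exceeds I_c, β grows and the encoder is pushed to discard information, which is what weakens the discriminator.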

Claims (9)

1. A mechanical arm action learning method based on third-person imitation learning, characterized by comprising the following steps:
S1, inputting a demonstration sample τ^E consisting only of an observation image sequence o_1, o_2, o_3, ..., o_T, where T is the maximum time step and each o is an RGB image directly extracted from the video;
S2, executing the current control strategy π_θ with the mechanical arm, and recording the observation image sequence during strategy execution to obtain a sample τ^i consisting only of the observation sequence;
S3, inputting the demonstration sample τ^E and the sample τ^i generated by the mechanical arm into the discriminator D_ω;
S4, according to the discrimination result of the discriminator D_ω, updating the discriminator parameter ω by a gradient method with the gradient g_ω = ∇_ω[ (1/m) Σ_{t=1}^{m} D_ω(z_t^E) − (1/m) Σ_{t=1}^{m} D_ω(z_t^i) ], where m is the batch size;
S5, repeating the above steps until the discriminator can no longer distinguish the samples generated by the mechanical arm; at this point the mechanical arm can successfully imitate the demonstration and complete the corresponding operations, and the mechanical arm action learning model is obtained.
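The loop in steps S1–S5 can be sketched as pure control flow with stub components (every class, threshold, and decay rule here is a placeholder, not the patent's implementation; the stub discriminator simply decays toward 50% accuracy so that the stopping condition of S5 fires):

```python
def train(demo_pairs, policy, discriminator, iterations=10):
    """Control flow of S1-S5: roll out the policy, discriminate the two
    kinds of samples, update both players, stop at indistinguishability."""
    history = []
    for _ in range(iterations):
        gen_pairs = policy.rollout()                       # S2
        acc = discriminator.update(demo_pairs, gen_pairs)  # S3 + S4
        policy.update(discriminator)                       # reward from D
        history.append(acc)
        if abs(acc - 0.5) < 0.01:    # S5: D cannot tell samples apart
            break
    return history

class StubPolicy:
    def rollout(self):    return [(0, 1)]
    def update(self, d):  pass

class StubDiscriminator:
    def __init__(self):   self.acc = 0.9
    def update(self, demo, gen):
        # Toy dynamics: accuracy halves its distance to 50% each round.
        self.acc = 0.5 + (self.acc - 0.5) * 0.5
        return self.acc

history = train([(0, 1)], StubPolicy(), StubDiscriminator())
```

The returned accuracy history decreases monotonically toward the 50% Nash-equilibrium level at which training stops.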
2. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein when a user begins using the mechanical arm action learning model and system, a source of demonstration samples is selected: a manually input demonstration video or a live demonstration; if the user selects live demonstration, the user's demonstration process is recorded by the camera and each frame picture of the video is extracted to form the demonstration sample τ^E; if the user selects manual input of a demonstration video, each frame picture is directly extracted from the video input by the user to form the demonstration sample τ^E.
3. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein the discriminator is a two-class neural network composed of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer; the input to the discriminator is a sample from the demonstrator or the mechanical arm, in the form (o_t, o_{t+n}); the input layer performs a difference operation on the input sample and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the image to extract the sample features σ; the encoder Enc encodes the extracted features to obtain the code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ and the code z of the samples from the mechanical arm sample τ^i, to limit the corresponding information flow in the discriminator network; and the code z is used as the input of the fully connected layers to obtain the output D_ω(z) of the discriminator network.
4. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein the control strategy of the mechanical arm is π_θ, and the output of the discriminator serves as an approximate reward, i.e. r(o_t, o_{t+n}) = −log(1 − D_ω(z)); the policy network parameters are updated using the ACKTR method; the update of the control strategy aims to maximize the cumulative reward obtained by the control strategy, i.e. to maximize the probability that the discriminator judges the samples generated by the control strategy to be demonstration samples: max_θ E_{τ^i}[ Σ_t −log(1 − D_ω(z_t)) ].
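A tiny numeric illustration of the approximate reward r = −log(1 − D_ω(z)) from claim 4: pairs that the discriminator judges demonstration-like receive a larger reward, which is what drives the control strategy toward the demonstration.

```python
import math

def approximate_reward(d_output):
    """Approximate reward from the discriminator output: the more
    demonstration-like the pair looks (D close to 1), the larger r."""
    return -math.log(1.0 - d_output)

r_low  = approximate_reward(0.1)   # pair judged unlike the demonstration
r_high = approximate_reward(0.9)   # pair judged demonstration-like
```

Note the reward is unbounded as D_ω(z) approaches 1, so implementations typically clip or clamp D's output for numerical stability (a practical consideration, not stated in the patent).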
5. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein the parameter ω of the discriminator is updated with the gradient g_ω according to the discrimination result, g_ω = ∇_ω[ (1/m) Σ_{t=1}^{m} D_ω(z_t^E) − (1/m) Σ_{t=1}^{m} D_ω(z_t^i) ], where m is the batch size; the Wasserstein distance is used as the measure of the difference between the demonstration sample and the sample generated by the mechanical arm, the discriminator D_ω is defined as a 1-Lipschitz function, and the update is ω = ω + ψ·RMSProp(ω, g_ω), where ψ is the learning rate.
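A minimal sketch of this update with a linear critic D_ω(z) = ω·z standing in for the real network (the linear form, sizes, and hyperparameters are assumptions); the critic takes one RMSProp ascent step on the Wasserstein estimate, i.e. ω = ω + ψ·RMSProp(ω, g_ω):

```python
import numpy as np

def critic_gradient(z_demo, z_gen):
    """Gradient of (1/m) sum D_w(z^E) - (1/m) sum D_w(z^i) with respect
    to w, for the linear stand-in critic D_w(z) = w . z."""
    return z_demo.mean(axis=0) - z_gen.mean(axis=0)

def rmsprop_ascent(w, g, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp *ascent* step: the critic maximizes the Wasserstein
    estimate, hence the '+' in the parameter update."""
    cache = decay * cache + (1 - decay) * g**2
    return w + lr * g / (np.sqrt(cache) + eps), cache

rng = np.random.default_rng(0)
z_demo = rng.normal(1.0, 0.1, size=(8, 3))   # codes from demonstrations
z_gen  = rng.normal(0.0, 0.1, size=(8, 3))   # codes from the robot arm
w, cache = np.zeros(3), np.zeros(3)
g = critic_gradient(z_demo, z_gen)
w, cache = rmsprop_ascent(w, g, cache)
```

After one step the critic scores demonstration codes higher than generated codes, i.e. the Wasserstein estimate increases. (A real WGAN critic would also enforce the 1-Lipschitz property, e.g. by weight clipping; that step is omitted here.)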
6. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein, during learning, in order to prevent gradient explosion, the model clips the gradient so that its norm does not exceed a fixed threshold Θ; meanwhile, in order to prevent the difference between adjacent video frames from being too small to yield useful behavior features, images spaced n frames apart are used to construct the samples.
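Gradient clipping by norm, as described in claim 6, can be sketched as follows (the threshold value is illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale the gradient so its norm never exceeds the threshold,
    preventing gradient explosion during learning."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

small = clip_gradient(np.array([0.3, 0.4]), threshold=1.0)  # left untouched
big   = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)  # rescaled to norm 1
```

Rescaling (rather than elementwise truncation) preserves the gradient's direction while bounding its magnitude.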
7. A mechanical arm action learning system based on third-person imitation learning, characterized by comprising:
a sample acquisition module, used for acquiring a demonstration sample input by a user and a sample generated by the mechanical arm executing a control strategy, then preprocessing the samples, extracting each frame picture in a sample to form an observation sequence, and storing the observation sequence in the memory for the mechanical arm action learning model to retrieve;
a discriminator module, used for training the mechanical arm control strategy, wherein for each input sample the discriminator outputs the probability that the sample is a demonstration sample; domain features are removed by the image difference operation in the input layer of the discriminator; the discriminator is constrained by the upper bound I_c added in the discriminator; the discriminator and the control strategy play a game against each other, and after continuous play they finally reach a Nash equilibrium, at which point the discrimination accuracy of the discriminator on the samples is 50% and the two kinds of samples can no longer be distinguished; and
a control strategy module, used for controlling the mechanical arm and consisting of 1 input layer, 2 convolutional layers, 2 pooling layers, 2 hidden layers, and 1 output layer; the input layer takes an observation image of the current mechanical arm state; the output layer outputs the control signal of the mechanical arm, i.e. the action information; the action information is passed to the control module, where it is converted into pulse signals to control the mechanical arm.
8. The mechanical arm action learning system based on third-person imitation learning of claim 7, wherein the input of the discriminator module is a set of observation pairs {(o_t^i, o_{t+n}^i)} from a sample generated by the control strategy and a set of observation pairs {(o_t^E, o_{t+n}^E)} from the demonstration sample; the two kinds of samples are input into the discriminator together for discrimination; and the discriminator uses the Wasserstein distance to measure the difference between the two kinds of samples.
9. The mechanical arm action learning system based on third-person imitation learning of claim 7, wherein the discriminator is a two-class neural network composed of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer; the input to the discriminator is a sample from the demonstrator or the mechanical arm, in the form (o_t, o_{t+n}); the input layer performs a difference operation on the input sample and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the image to extract the sample features σ; the encoder Enc encodes the extracted features to obtain the code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ and the code z of the samples from the mechanical arm sample τ^i, to limit the corresponding information flow in the discriminator network; and the code z is used as the input of the fully connected layers to obtain the output D_ω(z) of the discriminator network.
CN202010040178.4A 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning Active CN111136659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010040178.4A CN111136659B (en) 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010040178.4A CN111136659B (en) 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning

Publications (2)

Publication Number Publication Date
CN111136659A true CN111136659A (en) 2020-05-12
CN111136659B CN111136659B (en) 2022-06-21

Family

ID=70525016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010040178.4A Active CN111136659B (en) 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning

Country Status (1)

Country Link
CN (1) CN111136659B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162564A (en) * 2020-09-25 2021-01-01 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112529160A (en) * 2020-12-09 2021-03-19 南京大学 High-dimensional simulation learning method for video image data recorded by camera equipment
CN112809689A (en) * 2021-02-26 2021-05-18 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN112975968A (en) * 2021-02-26 2021-06-18 同济大学 Mechanical arm simulation learning method based on third visual angle variable main body demonstration video
CN113552871A (en) * 2021-01-08 2021-10-26 腾讯科技(深圳)有限公司 Robot control method and device based on artificial intelligence and electronic equipment
CN114660947A (en) * 2022-05-19 2022-06-24 季华实验室 Robot gait autonomous learning method and device, electronic equipment and storage medium
CN114779661A (en) * 2022-04-22 2022-07-22 北京科技大学 Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
CN117464683A (en) * 2023-11-23 2024-01-30 中机生产力促进中心有限公司 Method for controlling mechanical arm to simulate video motion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109015661A (en) * 2018-09-29 2018-12-18 重庆固高科技长江研究院有限公司 The method of industrial robot iterative learning amendment trajectory error
CN109702744A (en) * 2019-01-15 2019-05-03 北京工业大学 A method of the robot learning by imitation based on dynamic system model
CN109794937A (en) * 2019-01-29 2019-05-24 南京邮电大学 A kind of Soccer robot collaboration method based on intensified learning
CN109948117A (en) * 2019-03-13 2019-06-28 南京航空航天大学 A kind of satellite method for detecting abnormality fighting network self-encoding encoder
CN109948781A (en) * 2019-03-21 2019-06-28 中国人民解放军国防科技大学 Continuous action online learning control method and system for automatic driving vehicle
CN110121749A (en) * 2016-11-23 2019-08-13 通用电气公司 Deep learning medical system and method for Image Acquisition


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162564A (en) * 2020-09-25 2021-01-01 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112162564B (en) * 2020-09-25 2021-09-28 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112529160A (en) * 2020-12-09 2021-03-19 南京大学 High-dimensional simulation learning method for video image data recorded by camera equipment
CN113552871A (en) * 2021-01-08 2021-10-26 腾讯科技(深圳)有限公司 Robot control method and device based on artificial intelligence and electronic equipment
CN113552871B (en) * 2021-01-08 2022-11-29 腾讯科技(深圳)有限公司 Robot control method and device based on artificial intelligence and electronic equipment
CN112809689B (en) * 2021-02-26 2022-06-14 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN112975968A (en) * 2021-02-26 2021-06-18 同济大学 Mechanical arm simulation learning method based on third visual angle variable main body demonstration video
CN112975968B (en) * 2021-02-26 2022-06-28 同济大学 Mechanical arm imitation learning method based on third visual angle variable main body demonstration video
CN112809689A (en) * 2021-02-26 2021-05-18 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN114779661A (en) * 2022-04-22 2022-07-22 北京科技大学 Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
CN114660947A (en) * 2022-05-19 2022-06-24 季华实验室 Robot gait autonomous learning method and device, electronic equipment and storage medium
CN114660947B (en) * 2022-05-19 2022-07-29 季华实验室 Robot gait autonomous learning method and device, electronic equipment and storage medium
CN117464683A (en) * 2023-11-23 2024-01-30 中机生产力促进中心有限公司 Method for controlling mechanical arm to simulate video motion
CN117464683B (en) * 2023-11-23 2024-05-14 中机生产力促进中心有限公司 Method for controlling mechanical arm to simulate video motion

Also Published As

Publication number Publication date
CN111136659B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN111136659B (en) Mechanical arm action learning method and system based on third person scale imitation learning
CN110364049B (en) Professional skill training auxiliary teaching system with automatic deviation degree feedback data closed-loop deviation rectification control and auxiliary teaching method
CN108319932A (en) A kind of method and device for the more image faces alignment fighting network based on production
CN110503053A (en) Human motion recognition method based on cyclic convolution neural network
CN107240049B (en) Automatic evaluation method and system for remote action teaching quality in immersive environment
CN108831238A (en) A kind of educational system control method based on virtual experimental
CN110580470A (en) Monitoring method and device based on face recognition, storage medium and computer equipment
WO2019053052A1 (en) A method for (re-)training a machine learning component
CN113076615B (en) High-robustness mechanical arm operation method and system based on antagonistic deep reinforcement learning
CN108920805B (en) Driver behavior modeling system with state feature extraction function
CN107492377A (en) Method and apparatus for controlling self-timer aircraft
CN113031437A (en) Water pouring service robot control method based on dynamic model reinforcement learning
CN117218498B (en) Multi-modal large language model training method and system based on multi-modal encoder
CN111966217A (en) Unmanned aerial vehicle control method and system based on gestures and eye movements
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN114967937B (en) Virtual human motion generation method and system
CN116844084A (en) Sports motion analysis and correction method and system integrating blockchain
CN113633983A (en) Method, device, electronic equipment and medium for controlling expression of virtual character
CN116957866A (en) Individualized teaching device of digital man teacher
CN115909839A (en) Medical education training assessment system and method based on VR technology
CN110766216A (en) End-to-end mobile robot path navigation simulation method and system
CN113408443B (en) Gesture posture prediction method and system based on multi-view images
CN112446253A (en) Skeleton behavior identification method and device
CN114078347B (en) Teenager STEAM education system and method
CN115565146A (en) Perception model training method and system for acquiring aerial view characteristics based on self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant