CN111136659A - Mechanical arm action learning method and system based on third-person imitation learning - Google Patents

Mechanical arm action learning method and system based on third-person imitation learning

Info

Publication number
CN111136659A
CN111136659A
Authority
CN
China
Prior art keywords
discriminator
sample
mechanical arm
demonstration
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010040178.4A
Other languages
Chinese (zh)
Other versions
CN111136659B (en)
Inventor
章宗长 (Zongzhang Zhang)
俞扬 (Yang Yu)
姜冲 (Chong Jiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010040178.4A
Publication of CN111136659A
Application granted
Publication of CN111136659B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a mechanical arm action learning method and system based on third-person imitation learning, used for automatic control of a mechanical arm, so that the arm can automatically learn how to complete a corresponding control task by watching a third-party demonstration. In the invention, the samples exist in video form, without using a large number of sensors to acquire state information. An image-difference method is used in the discriminator module, so that the discriminator can ignore the appearance and environmental background of the learning object and imitation learning can be performed with third-party demonstration data, greatly reducing the acquisition cost of samples. A variational discriminator bottleneck is used in the discriminator module to restrict the discriminator's accuracy on the demonstrations generated by the mechanical arm, so that the training of the discriminator module and the control strategy module is better balanced. The invention can quickly imitate the user's demonstrated actions, is simple and flexible to operate, and places low requirements on the environment and the demonstrator.

Description

Mechanical arm action learning method and system based on third-person imitation learning
Technical Field
The invention relates to a mechanical arm action learning method and system based on third-person imitation learning, and belongs to the technical field of automatic mechanical arm action learning.
Background
The mechanical arm is the primary actuator of today's robots and the most widely applied automatic mechanical device. Traditional mechanical arm control must be realized through motion-planning programming; this approach is highly complex, demands considerable professional knowledge from the user, and offers very low learning efficiency and intelligence. As the action tasks required in practice grow increasingly complex, traditional mechanical arm motion-control systems struggle to meet users' needs.
Imitation is the most direct and effective way for humans to learn motor skills: by watching others demonstrate, a human can quickly acquire the corresponding skill. Imitation learning gives a robot this fast, human-like learning ability, so that the robot can learn corresponding operations from demonstrations just as humans do. Compared with traditional automatic mechanical arm control, this human-like learning mode offers higher learning efficiency and intelligence, and also reduces the operator's burden: operators no longer need to learn a special programming language to carry out motion-planning programming.
Generative adversarial imitation learning (GAIL) is one of the most representative methods in current imitation learning. GAIL constructs two agents that play a game against each other and continuously improve in the process: a generator and a discriminator. The generator aims to produce samples identical to the demonstration samples, so that the discriminator cannot judge a sample's source; the discriminator aims to distinguish demonstration samples from generated samples as well as possible. According to the discrimination results, the generator and the discriminator each update their parameters and begin the next round of the game. Through continuous play and improvement, the two finally reach a Nash equilibrium, at which point the samples produced by the generator are indistinguishable from real ones and the discriminator can no longer accurately judge a sample's source. The game between the two can be formalized as follows:
min_θ max_ω  E_{τ_E}[ log D_ω(s, a) ] + E_{π_θ}[ log(1 − D_ω(s, a)) ]

where (s, a) is a state-action pair, indicating that the demonstrator or the generator takes action a in state s; D_ω is the discriminator; π_θ is the generator (i.e., the policy); the subscript π_θ indicates that the sample comes from the generator; and the subscript τ_E indicates that the sample comes from the demonstration.
Imitation learning enables the robot to learn corresponding operations from demonstrations provided by an operator. However, imitation learning methods usually require the demonstration to come from the first-person perspective, i.e., the operator guides the arm through the motion by hand, and the state and action information during the demonstration (such as joint angles, movement speeds, etc.) is recorded as the demonstration sample. To obtain first-person operation demonstrations, a large number of sensors, such as infrared ranging sensors, pressure sensors, and photoelectric encoders, must be installed on the mechanical arm, which greatly increases its cost of use. In addition, sensor data may differ greatly across different mechanical arms, so demonstration samples generalize poorly between arms, further increasing the cost of using demonstrations.
One solution to this problem is to use demonstration samples in video form, i.e., third-party demonstration data. However, third-party demonstration data contains only observation images from a third-party viewing angle, without detailed state and action information; moreover, the environment background and the demonstrator's appearance in a third-party demonstration video may differ from the mechanical arm's own, i.e., there may be domain-feature differences between the two. In this case, without a one-to-one correspondence between the demonstration data and the samples generated by the mechanical arm itself, it is difficult for the arm to learn the corresponding control strategy from the demonstration data. Stadie et al. proposed Third-Person Imitation Learning for this situation; it introduces the concept of domain confusion on top of generative adversarial imitation learning and can blur the domain information in the samples, so that an agent can perform imitation learning using third-party demonstration data. However, this method requires an additional type of demonstration data, generated in the demonstrator's domain by executing a random policy, and the introduction of such demonstrations also greatly increases the cost of learning.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and deficiencies in the prior art, the invention provides a mechanical arm action learning method and system based on third-person imitation learning. First, the method uses an image-difference operation to remove the domain features in the samples; second, it uses the Variational Discriminator Bottleneck (VDB) algorithm to further weaken the influence of domain-difference information on strategy learning.
The invention is applicable to mechanical arm action imitation tasks with third-party demonstrations, enabling the arm to learn operations from demonstration samples consisting only of observations (such as a demonstration video), without considering the domain-feature differences between the demonstrator and the mechanical arm.
The technical scheme is as follows: a mechanical arm action learning method based on third-person imitation learning enables the robot to learn corresponding operations solely from observations of a demonstration (e.g., by watching demonstration videos), without using numerous sensors to acquire demonstration information; that is, the state-action pair (s, a) of imitation learning is replaced with an observation pair (o, o') consisting only of observation images. Moreover, the environment background of the demonstration sample, the appearance of the demonstrator, etc., may differ from the mechanical arm's. The method gives the robot a learning ability closer to a human's, with lower sample-acquisition cost and stronger generality. The method comprises the following steps:
S1, the input demonstration sample τ_E consists only of an observation image sequence {o_1, o_2, o_3, ..., o_T}, where T is the maximum time step and each o is an RGB image extracted directly from the video;
S2, the mechanical arm executes the current control strategy π_θ, and the observation image sequence during strategy execution is recorded to obtain a sample τ̂ consisting only of an observation sequence;
S3, the demonstration sample τ_E and the arm-generated sample τ̂ are input to the discriminator D_ω. The discriminator is a binary-classification neural network consisting of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer. The input to the discriminator is a sample from the demonstrator or the mechanical arm in the form of an observation pair (o_t, o_{t+n}); the input layer performs a difference operation on the input pair, i.e., o_{t+n} − o_t, which removes domain features such as the environment background and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the difference image to extract the sample features σ; the encoder Enc encodes the extracted features to obtain a code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ of samples from τ̂ and their codes z, limiting the information flow of the arm-generated samples through the discriminator network; the code z is then used as the input of the fully connected layers to obtain the discriminator output D_ω(z);
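The input-layer difference operation can be illustrated with a minimal sketch in plain Python (toy single-channel frames, not the patent's implementation): pixels that stay constant between the two frames cancel to zero, while motion-related changes survive.

```python
def frame_difference(o_t, o_tn):
    """Pixel-wise difference o_{t+n} - o_t of two equal-size frames.

    Domain features that are constant across the two frames (background,
    appearance) cancel to 0; only motion-related changes remain."""
    rows, cols = len(o_t), len(o_t[0])
    return [[o_tn[i][j] - o_t[i][j] for j in range(cols)] for i in range(rows)]

# toy 2x3 single-channel frames: uniform background 5, object moves left to right
o_t  = [[9, 5, 5],
        [5, 5, 5]]
o_tn = [[5, 5, 9],
        [5, 5, 5]]
diff = frame_difference(o_t, o_tn)  # background cells cancel to 0
```

The surviving nonzero entries form a rough, appearance-invariant behavior signature that the feature extractor F then processes.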
S4, the discriminator parameter ω is updated according to the discrimination result using a policy-gradient method, with gradient g_ω:

g_ω = ∇_ω [ E_{τ_E}[ log D_ω(z) ] + E_{τ̂}[ log(1 − D_ω(z)) ] − β ( E_{τ̂}[ KL( Enc(z|σ) ∥ r(z) ) ] − I_c ) ]

where KL denotes the KL divergence (Kullback-Leibler divergence), used to measure the difference between two distributions; β is a Lagrange multiplier with initial value 1 × 10^-3; E_{τ_E} and E_{τ̂} denote expectations, whose subscripts indicate the source of the code z used in the expectation: τ̂ denotes the set of samples produced by the mechanical arm and τ_E the set of samples produced by the demonstrator; and r(z) is the standard Gaussian distribution.
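As a hedged sketch in plain Python (not the patent's network code), the bottleneck term can be monitored with the closed-form KL between a diagonal-Gaussian encoder output Enc(z|σ) and the standard normal r(z), with β adjusted by dual gradient ascent; the dual-ascent rule and its step size are assumptions borrowed from the standard VDB formulation, since the patent only specifies β's initial value.

```python
import math

def kl_diag_gauss_vs_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over
    latent dimensions; this is the E[KL(Enc(z|sigma) || r(z))] term."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def update_beta(beta, mean_kl, i_c, step=1e-5):
    """Dual ascent on the Lagrange multiplier (assumed rule): beta grows
    while the bottleneck constraint E[KL] <= I_c is violated, shrinks
    otherwise, and is kept non-negative."""
    return max(0.0, beta + step * (mean_kl - i_c))
```

An encoder matching the standard normal incurs zero KL, so β decays toward zero; when the average KL exceeds I_c, β rises and the penalty tightens the bottleneck.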
S5, the mechanical arm's control strategy π_θ uses the discriminator output as an approximate reward, i.e., r(o_t, o_{t+n}) = −log(1 − D_ω(z)), and updates the control strategy network parameters θ using the ACKTR method. The above steps are repeated until the discriminator can no longer distinguish the arm-generated samples from the demonstration samples; at this point the mechanical arm can successfully imitate the demonstration to complete the corresponding operation, and the mechanical arm action learning model is obtained.
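The reward shaping in S5 is a one-liner; the sketch below (plain Python, with an assumed epsilon guard against log(0)) shows how a discriminator output near 1, i.e., "looks like a demonstration", yields a large reward.

```python
import math

def imitation_reward(d_out, eps=1e-8):
    """Approximate reward r = -log(1 - D_w(z)) from the discriminator
    output d_out in [0, 1); eps (an assumption) guards against log(0)."""
    return -math.log(max(1.0 - d_out, eps))
```

The reward is 0 when the discriminator is certain the sample is arm-generated (D = 0) and grows without bound as D approaches 1, pushing the policy toward demonstration-like behavior.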
S6, when the user begins to use the mechanical arm action learning model and system, a demonstration-sample source is selected: a manually input demonstration video or a live demonstration. If the user selects a live demonstration, the camera records the user's demonstration and each frame of the video is extracted to form the demonstration sample τ_E; if the user selects a manually input demonstration video, the system directly extracts each frame from the user's video to form the demonstration sample τ_E.
S7, the update goal of the control strategy is to maximize the accumulated reward, i.e., to maximize E_{τ̂}[ log D_ω(z) ], the probability that the discriminator classifies the samples generated by the control strategy as demonstration samples, so as to produce control behavior as similar to the demonstration samples as possible.
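A Monte-Carlo estimate of this objective over a batch of discriminator outputs can be sketched in a few lines of plain Python (an illustrative helper, not from the patent):

```python
import math

def policy_objective(disc_outputs):
    """Estimate E_{tau_hat}[log D_w(z)]: the mean log-probability the
    discriminator assigns to 'demonstration' on arm-generated samples.
    The control strategy is updated to drive this quantity up."""
    return sum(math.log(d) for d in disc_outputs) / len(disc_outputs)
```

The estimate approaches its maximum of 0 as the discriminator outputs approach 1, i.e., as the arm's samples become indistinguishable from demonstrations.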
In the present invention, the Wasserstein distance is used to measure the difference between demonstration samples and arm-generated samples, and the discriminator D_ω is defined as a 1-Lipschitz function updated as ω ← ω + ψ · RMSProp(ω, g_ω), where ψ is the learning rate and RMSProp denotes the root-mean-square propagation algorithm, a gradient-based optimization method. In the present invention, the learning rates of all networks are set to 1 × 10^-4. During learning, to prevent the gradient-explosion problem, the model clips each gradient so that it does not exceed a fixed threshold, i.e., g ← g · θ_c/‖g‖ whenever ‖g‖ > θ_c, where θ_c is the threshold. Meanwhile, to prevent the difference between adjacent video frames from being too small to yield useful behavior features, the invention composes each sample from frames n apart, where n can be chosen flexibly in the range [3, 5].
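The gradient-clipping rule can be sketched in plain Python (a global-norm clip, assuming the threshold applies to the gradient's Euclidean norm):

```python
import math

def clip_gradient(grad, threshold):
    """Global-norm clipping: if ||g|| exceeds the threshold, rescale g
    so its norm equals the threshold; otherwise return it unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return grad
```

Rescaling preserves the gradient's direction while bounding its magnitude, which keeps a single large update from destabilizing training.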
The invention enables the mechanical arm to learn corresponding operations by watching third-party demonstrations. A third-party demonstration sample usually differs in domain features from the samples generated by the mechanical arm; in generative adversarial imitation learning, this difference makes the discriminator too strong, breaking the balance of the game between the discriminator and the policy, so the arm cannot obtain useful information from the discriminator's feedback to update its control strategy. Therefore, the method first uses the image-difference operation to quickly remove the domain-feature information in the samples, minimizing the influence of domain-feature differences on the discriminator-policy game. Meanwhile, this operation preliminarily extracts the behavior features in the samples, reducing the model's computation and accelerating training. Second, the invention uses the variational discriminator bottleneck to further constrain the discriminator, i.e., it imposes an upper bound I_c on the mutual information I(σ, z) between the features σ of samples from τ̂ and their codes z, limiting the discriminator's information flow so that its discrimination of arm-generated samples is disturbed and the balance of the game between the discriminator and the control strategy is better maintained.
In order to achieve the above object, the present invention provides a mechanical arm action learning system based on third-person imitation learning, comprising:
the sample acquisition module, used for acquiring the demonstration sample input by the user and the sample generated by the mechanical arm executing the control strategy, then preprocessing the samples: each frame is extracted to form an observation sequence, which is stored in memory for the mechanical arm action learning model to fetch.
The discriminator module is used for training the mechanical arm control strategy; the goal of the discriminator is to distinguish demonstration samples from arm-generated samples as well as possible. The discriminator is a binary-classification neural network consisting of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer. The input to the discriminator is a sample from the demonstrator or the mechanical arm in the form of an observation pair (o_t, o_{t+n}); the input layer performs a difference operation on the input pair, i.e., o_{t+n} − o_t, which removes domain features such as the environment background and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and extracts the sample features σ; the encoder Enc encodes the extracted features to obtain a code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ of samples from τ̂ and their codes z, limiting the information flow of the arm-generated samples in the discriminator network; the code z is used as input to the fully connected layers to obtain the discriminator output D_ω(z). For each input sample, the discriminator outputs the probability that the sample is a demonstration sample. After each round of discrimination, the discriminator updates its network parameters according to the error between its judgments and the true sources, so that the next round of discrimination is more accurate. For demonstration samples the discriminator should output a probability as high as possible, and for arm-generated samples a probability as low as possible; an optimal discriminator outputs 0 for any arm-generated sample and 1 for any demonstration sample. If there are differences in domain features (such as environment background or color and appearance) between the demonstration samples and the arm-generated samples, the discriminator can quickly distinguish the samples by those domain features alone; it then becomes too strong, and the control strategy can hardly learn from its discrimination results. The image-difference operation in the discriminator's input layer removes most domain features related to environment background and appearance, and the upper bound I_c added in the discriminator constrains it further, alleviating the over-strong-discriminator problem caused by domain-feature differences.
The discriminator and the control strategy play a game against each other; after continuous play they finally reach a Nash equilibrium, at which point the discriminator's accuracy on the samples is 50% and it can no longer distinguish the two kinds of samples.
The control strategy module is used for controlling the mechanical arm and consists of 1 input layer, 2 convolutional layers, 2 pooling layers, 2 hidden layers, and 1 output layer. The input layer takes an observation image of the current mechanical arm state, captured directly by a camera; the output layer outputs the control signal of the mechanical arm, i.e., the action information; the action information is passed to the control module, where it is converted into pulse signals to drive the mechanical arm.
The input to the discriminator module is a batch of observation pairs from the samples τ̂ generated by the mechanical arm under the control strategy module and a batch of observation pairs from the demonstration samples τ_E; the batch size is 1024 pairs. The two kinds of samples are input to the discriminator for discrimination, and the discriminator uses the Wasserstein distance to measure the difference between the two kinds of samples.
The demonstration sample τ_E and the arm-generated sample τ̂ both consist of an observation image sequence {o_1, o_2, o_3, ..., o_T}. Before input to the discriminator, each sequence is split into observation pairs (o_1, o_{1+n}), (o_2, o_{2+n}), ..., (o_i, o_{i+n}), ..., (o_{T−n}, o_T); the pairs are shuffled and a batch is randomly sampled as the input of the discriminator module. Here n means that each observation pair consists of two observation images n frames apart; n can be chosen flexibly in the range [3, 5].
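The pair construction and batch sampling can be sketched in plain Python (toy batch size instead of the 1024 pairs used in the invention; the helper name and fixed seed are illustrative):

```python
import random

def make_observation_pairs(frames, n=3, batch_size=4, seed=0):
    """Split an observation sequence {o_1, ..., o_T} into pairs
    (o_i, o_{i+n}) of frames n apart, shuffle, and sample one batch."""
    pairs = [(frames[i], frames[i + n]) for i in range(len(frames) - n)]
    rng = random.Random(seed)      # fixed seed only for reproducibility
    rng.shuffle(pairs)
    return pairs[:batch_size]

seq = list(range(1, 11))           # stand-in for frames o_1..o_10
batch = make_observation_pairs(seq, n=3, batch_size=4)
# every sampled pair is exactly n frames apart
```

A sequence of T frames yields T − n pairs, so with T = 10 and n = 3 the pool holds 7 pairs from which each batch is drawn.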
Beneficial effects: compared with the prior art, the mechanical arm action learning method and system based on third-person imitation learning provided by the invention have the following advantages. (1) The method suits imitation-learning tasks that include only observations, relaxing traditional imitation learning's requirements for state and action information and greatly reducing the model's dependence on sensor equipment. (2) The action learning model obtained by training with generative adversarial imitation learning has stronger generalization ability and higher learning efficiency. (3) The image-difference method removes the domain features of the samples, so the model can perform imitation learning from third-party demonstrations without considering the domain-feature differences between the demonstration and the arm-generated samples, lowering the acquisition difficulty of demonstration samples and improving their generality. (4) The image-difference method also preliminarily extracts the behavior features in the samples, reducing the computation during training and improving sample efficiency. (5) The introduction of the variational discriminator bottleneck lets the model better balance the discriminator and the strategy, making training more stable. In conclusion, the invention has great practical value and significance.
Drawings
FIG. 1 is an exemplary view of a sample observation of an embodiment of the present invention;
FIG. 2 is a block diagram of an arbiter module according to an embodiment of the present invention;
FIG. 3 is a block diagram of an embodiment of the present invention;
fig. 4 is a control strategy structure diagram according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
A mechanical arm action learning method based on third-person imitation learning comprises the following steps:
S1, the input demonstration sample τ_E consists only of an observation image sequence {o_1, o_2, o_3, ..., o_T}, rather than the state-action sequence {s_1, a_1, s_2, a_2, ..., s_{T−1}, a_{T−1}, s_T} of traditional imitation learning, where T is the maximum time step and o is an RGB image extracted directly from the video;
S2, the mechanical arm executes the current control strategy π_θ, and the observation image sequence during strategy execution is recorded to obtain a sample τ̂ consisting only of an observation sequence. The demonstration sample τ_E and the sample τ̂ produced by the mechanical arm may differ in domain features, i.e., the demonstrator's appearance, environment background, etc., may differ from the mechanical arm's, as shown in FIG. 1: the first row gives example observations from the demonstration sample and the second row example observations from the arm-generated sample;
S3, the demonstration sample τ_E and the arm-generated sample τ̂ are input to the discriminator D_ω. The discriminator is a binary-classification neural network composed of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer; the network structure is shown in FIG. 2. The input to the discriminator is a sample from the demonstrator or mechanical arm in the form (o_t, o_{t+n}); to prevent the difference between adjacent video frames from being too small to yield useful behavior features, this embodiment composes each sample from images n frames apart, with n = 3. The input layer performs the difference operation o_{t+n} − o_t, removing domain features such as the environment background and preliminarily extracting the behavior features related to the strategy. The feature extractor F consists of 2 convolutional layers and 2 pooling layers and extracts the sample features σ. The encoder Enc, composed of 1 fully connected layer, encodes the extracted features to obtain a code z ~ Enc(z|σ). At the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ of the arm-generated samples τ̂ and their codes z, limiting their information flow in the discriminator network; in this embodiment I_c is set to 0.1, and the constrained optimization problem is solved by the Lagrange-multiplier method, with Lagrange multiplier β initialized to 1 × 10^-3. The code z is used as input to the fully connected layers to obtain the discriminator output D_ω(z);
S4, the discriminator parameter ω is updated by the policy-gradient method according to the discrimination result, with gradient:

g_ω = ∇_ω [ E_{τ_E}[ log D_ω(z) ] + E_{τ̂}[ log(1 − D_ω(z)) ] − β ( E_{τ̂}[ KL( Enc(z|σ) ∥ r(z) ) ] − I_c ) ]

S5, the mechanical arm's control strategy π_θ uses the discriminator output as an approximate reward, i.e., r(o_t, o_{t+n}) = −log(1 − D_ω(z)), and updates the policy network parameters θ using the ACKTR method. Steps S2 to S5 are repeated until the discriminator can no longer distinguish the arm-generated samples, at which point the mechanical arm can successfully imitate the demonstration to complete the corresponding operations, and the mechanical arm action learning model is obtained.
S6, when the user begins to use the mechanical arm action learning model and system, the demonstration-sample source may be selected manually: a manually input demonstration video or a live demonstration. If the user selects a live demonstration, the camera records the user's demonstration and each frame of the video is then extracted to form the demonstration sample τ_E; if the user selects a manually input demonstration video, the system directly extracts each frame from the user's video to form the demonstration sample τ_E, without considering the differences in domain features such as appearance and background between the demonstrator in the video and the mechanical arm.
S7, the mechanical arm control strategy is updated to maximize the accumulated reward, i.e., to maximize E_{τ̂}[ log D_ω(z) ], the probability that the discriminator classifies the samples generated by the control strategy as demonstration samples, so as to produce control behavior as similar to the demonstration samples as possible.
The following is the mechanical arm action learning system based on third-person imitation learning of this embodiment, comprising: a sample acquisition module, a discriminator module, a control strategy module, and a control module.
The sample acquisition module acquires the demonstration sample input by the user, which may be manually input or recorded on site, and the sample generated by the mechanical arm executing the control strategy. The samples are then preprocessed: each frame is extracted to form an observation sequence, which is stored in memory for the mechanical arm action learning model to fetch.
The discriminator module is used for training the mechanical arm control strategy; the purpose of the discriminator is to distinguish the demonstration samples from the arm-generated samples as well as possible. For each input sample, the discriminator outputs the probability that the sample is a demonstration sample: for a demonstration sample it should output a probability as high as possible, and for an arm-generated sample a probability as low as possible. If there are differences in domain features (such as environment background or color and appearance) between the demonstration samples and the arm-generated samples, the discriminator can quickly distinguish the samples by those domain features alone, becoming too strong, and the control strategy can then hardly learn from its discrimination results. The image-difference operation in the discriminator's input layer removes most domain features related to environment background and appearance, and the upper bound I_c added in the discriminator constrains it further, alleviating the over-strong-discriminator problem caused by domain-feature differences. The discriminator and the control strategy play a game against each other; after continuous play they finally reach a Nash equilibrium, at which point the discriminator's accuracy on the samples is 50% and it can no longer distinguish the two kinds of samples. The framework of this embodiment is shown in FIG. 3; the discriminator is a binary-classification neural network consisting of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer.
The input to the discriminator is a sample from the demonstrator or the mechanical arm, in the form (o_t, o_{t+n}). The input layer performs a difference operation on the input sample, i.e. o_{t+n} − o_t, eliminating domain features such as the environment background and preliminarily extracting the behavior features related to the strategy. The feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the image to extract the sample features σ. The encoder Enc encodes the extracted features to obtain the code z ~ Enc(z|σ). At the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ and the code z of the samples from the mechanical arm sample τ^i, to limit the corresponding information flow in the discriminator network. The code z is used as the input of the fully connected layers to obtain the output D_ω(z) of the discriminator network; for each input sample, the discriminator outputs the probability that the sample is a demonstration sample.
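A minimal numerical sketch of this forward pass, with small stand-ins for the convolutional feature extractor F and a Gaussian encoder for Enc (all layer shapes, names, and the Gaussian form are assumptions, not the patent's implementation); it also shows how the difference operation cancels a constant background shift:

```python
import numpy as np

def discriminator_forward(o_t, o_tn, W_enc, w_out, seed=0):
    """Sketch of the discriminator: difference input layer, toy feature
    extractor, stochastic encoder z ~ Enc(z|sigma), probability output."""
    rng = np.random.default_rng(seed)
    diff = o_tn - o_t                        # input-layer difference op
    sigma = diff.reshape(-1) @ W_enc         # stand-in for F (conv + pool)
    mu, log_std = sigma[:4], sigma[4:]       # Gaussian encoder head
    z = mu + np.exp(log_std) * rng.standard_normal(4)  # z ~ Enc(z|sigma)
    return 1.0 / (1.0 + np.exp(-(z @ w_out)))          # P(demonstration)

rng = np.random.default_rng(42)
o_t, o_tn = rng.random((8, 8)), rng.random((8, 8))
W_enc = rng.standard_normal((64, 8)) * 0.1
w_out = rng.standard_normal(4)

p_plain = discriminator_forward(o_t, o_tn, W_enc, w_out, seed=7)
# A constant background shift added to both frames cancels in the
# difference operation, so the output is (numerically) unchanged:
p_shift = discriminator_forward(o_t + 0.3, o_tn + 0.3, W_enc, w_out, seed=7)
```

The cancellation is the mechanism by which the input layer removes background-related domain features.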
The control strategy module is used for controlling the mechanical arm and consists of 1 input layer, 2 convolutional layers, 2 pooling layers, 2 hidden layers, and 1 output layer; the network structure of the control strategy module is shown in fig. 4. The input layer takes an observation image of the current mechanical arm state, shot directly by a camera; the output layer outputs the control signal of the mechanical arm, i.e. the action information. The action information is passed to the control module, where it is converted into pulse signals to control the mechanical arm.
The input of the discriminator module is a set of observation pairs {(o_t^i, o_{t+n}^i)} from a sample generated by the control strategy and a set of observation pairs {(o_t^E, o_{t+n}^E)} from the demonstration sample; the size of a batch is 1024 pairs. The two kinds of samples are input into the discriminator together for discrimination; the discriminator uses the Wasserstein distance to measure the difference between the two kinds of samples.
Both the demonstration sample τ^E and the sample τ^i generated by the mechanical arm consist of an observation image sequence o_1, o_2, o_3, ..., o_T. Before being input to the discriminator, each sequence is split into observation pairs (o_1, o_{1+n}), (o_2, o_{2+n}), ..., (o_i, o_{i+n}), ..., (o_{T−n}, o_T); the observation pairs are shuffled and a batch is randomly sampled as the input of the discriminator module. Here n means that each observation pair is formed by two observation images spaced n frames apart; n can be chosen flexibly, with a value range of [3, 5].
The discriminator bottleneck I_c in the discriminator module limits the mutual information between the data before and after encoding by the encoder, thereby limiting the information flow of the discriminator; this weakens the discriminator and regulates its discrimination accuracy. The game process between the discriminator and the control strategy can therefore be expressed as:
min_θ max_{ω, Enc}  E_{(o_t, o_{t+n}) ~ τ^E}[ log D_ω(z) ] + E_{(o_t, o_{t+n}) ~ τ^i}[ log(1 − D_ω(z)) ]

subject to  I(σ, z) ≤ I_c  for features σ extracted from the mechanical arm sample τ^i, with z ~ Enc(z|σ)
where τ^i denotes the sample generated by the mechanical arm control strategy (i.e. only the information flow from the samples produced by the mechanical arm is limited), z is the code, D_ω is the discriminator network, and D_ω(z) is the output of the discriminator network.
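One common way to enforce such a mutual-information bound, as in variational discriminator bottleneck methods, is to upper-bound I(σ, z) by the KL divergence between the encoder distribution and a fixed prior, and to adapt a Lagrange multiplier β by dual ascent. The sketch below illustrates that mechanism under the assumption of a Gaussian encoder and a standard normal prior (the patent does not spell out these details):

```python
import numpy as np

def gaussian_kl(mu, log_std):
    """KL( N(mu, std^2) || N(0, 1) ): a variational upper bound on the
    mutual information I(sigma, z) used to enforce I_c."""
    var = np.exp(2 * log_std)
    return 0.5 * np.sum(mu**2 + var - 2 * log_std - 1)

def update_beta(beta, kl_value, i_c, step=0.1):
    """Dual ascent on the multiplier: grow beta while the bottleneck
    constraint KL <= I_c is violated, shrink it (toward 0) otherwise."""
    return max(0.0, beta + step * (kl_value - i_c))

mu, log_std = np.array([0.5, -0.5]), np.array([0.0, 0.0])
kl = gaussian_kl(mu, log_std)                       # 0.25 for these values
beta = update_beta(beta=1.0, kl_value=kl, i_c=0.2)  # constraint violated
```

When the KL term exceeds I_c, β grows and the encoder is pushed to discard information, which is what weakens the discriminator.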

Claims (9)

1. A mechanical arm action learning method based on third-person imitation learning, characterized by comprising the following steps:
S1, inputting a demonstration sample τ^E consisting only of an observation image sequence o_1, o_2, o_3, ..., o_T, where T is the maximum time step and each o is an RGB image directly extracted from the video;
S2, executing the current control strategy π_θ with the mechanical arm, and recording the observation image sequence during strategy execution to obtain a sample τ^i consisting only of the observation sequence;
S3, inputting the demonstration sample τ^E and the sample τ^i generated by the mechanical arm into the discriminator D_ω;
S4, according to the discrimination result of the discriminator D_ω, updating the discriminator parameter ω by a gradient method with the gradient g_ω = ∇_ω[ (1/m) Σ_{t=1}^{m} D_ω(z_t^E) − (1/m) Σ_{t=1}^{m} D_ω(z_t^i) ], where m is the batch size;
S5, repeating the above steps until the discriminator can no longer distinguish the samples generated by the mechanical arm; at this point the mechanical arm can successfully imitate the demonstration and complete the corresponding operations, and the mechanical arm action learning model is obtained.
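The loop in steps S1–S5 can be sketched as pure control flow with stub components (every class, threshold, and decay rule here is a placeholder, not the patent's implementation; the stub discriminator simply decays toward 50% accuracy so that the stopping condition of S5 fires):

```python
def train(demo_pairs, policy, discriminator, iterations=10):
    """Control flow of S1-S5: roll out the policy, discriminate the two
    kinds of samples, update both players, stop at indistinguishability."""
    history = []
    for _ in range(iterations):
        gen_pairs = policy.rollout()                       # S2
        acc = discriminator.update(demo_pairs, gen_pairs)  # S3 + S4
        policy.update(discriminator)                       # reward from D
        history.append(acc)
        if abs(acc - 0.5) < 0.01:    # S5: D cannot tell samples apart
            break
    return history

class StubPolicy:
    def rollout(self):    return [(0, 1)]
    def update(self, d):  pass

class StubDiscriminator:
    def __init__(self):   self.acc = 0.9
    def update(self, demo, gen):
        # Toy dynamics: accuracy halves its distance to 50% each round.
        self.acc = 0.5 + (self.acc - 0.5) * 0.5
        return self.acc

history = train([(0, 1)], StubPolicy(), StubDiscriminator())
```

The returned accuracy history decreases monotonically toward the 50% Nash-equilibrium level at which training stops.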
2. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein when a user begins using the mechanical arm action learning model and system, a source of demonstration samples is selected: a manually input demonstration video or a live demonstration; if the user selects live demonstration, the user's demonstration process is recorded by the camera and each frame picture of the video is extracted to form the demonstration sample τ^E; if the user selects manual input of a demonstration video, each frame picture is directly extracted from the video input by the user to form the demonstration sample τ^E.
3. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein the discriminator is a two-class neural network composed of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer; the input to the discriminator is a sample from the demonstrator or the mechanical arm, in the form (o_t, o_{t+n}); the input layer performs a difference operation on the input sample and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the image to extract the sample features σ; the encoder Enc encodes the extracted features to obtain the code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ and the code z of the samples from the mechanical arm sample τ^i, to limit the corresponding information flow in the discriminator network; and the code z is used as the input of the fully connected layers to obtain the output D_ω(z) of the discriminator network.
4. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein the control strategy of the mechanical arm is π_θ, and the output of the discriminator serves as an approximate reward, i.e. r(o_t, o_{t+n}) = −log(1 − D_ω(z)); the policy network parameters are updated using the ACKTR method; the update of the control strategy aims to maximize the cumulative reward obtained by the control strategy, i.e. to maximize the probability that the discriminator judges the samples generated by the control strategy to be demonstration samples: max_θ E_{τ^i}[ Σ_t −log(1 − D_ω(z_t)) ].
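A tiny numeric illustration of the approximate reward r = −log(1 − D_ω(z)) from claim 4: pairs that the discriminator judges demonstration-like receive a larger reward, which is what drives the control strategy toward the demonstration.

```python
import math

def approximate_reward(d_output):
    """Approximate reward from the discriminator output: the more
    demonstration-like the pair looks (D close to 1), the larger r."""
    return -math.log(1.0 - d_output)

r_low  = approximate_reward(0.1)   # pair judged unlike the demonstration
r_high = approximate_reward(0.9)   # pair judged demonstration-like
```

Note the reward is unbounded as D_ω(z) approaches 1, so implementations typically clip or clamp D's output for numerical stability (a practical consideration, not stated in the patent).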
5. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein the parameter ω of the discriminator is updated with the gradient g_ω according to the discrimination result, g_ω = ∇_ω[ (1/m) Σ_{t=1}^{m} D_ω(z_t^E) − (1/m) Σ_{t=1}^{m} D_ω(z_t^i) ], where m is the batch size; the Wasserstein distance is used as the measure of the difference between the demonstration sample and the sample generated by the mechanical arm, the discriminator D_ω is defined as a 1-Lipschitz function, and the update is ω = ω + ψ·RMSProp(ω, g_ω), where ψ is the learning rate.
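A minimal sketch of this update with a linear critic D_ω(z) = ω·z standing in for the real network (the linear form, sizes, and hyperparameters are assumptions); the critic takes one RMSProp ascent step on the Wasserstein estimate, i.e. ω = ω + ψ·RMSProp(ω, g_ω):

```python
import numpy as np

def critic_gradient(z_demo, z_gen):
    """Gradient of (1/m) sum D_w(z^E) - (1/m) sum D_w(z^i) with respect
    to w, for the linear stand-in critic D_w(z) = w . z."""
    return z_demo.mean(axis=0) - z_gen.mean(axis=0)

def rmsprop_ascent(w, g, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp *ascent* step: the critic maximizes the Wasserstein
    estimate, hence the '+' in the parameter update."""
    cache = decay * cache + (1 - decay) * g**2
    return w + lr * g / (np.sqrt(cache) + eps), cache

rng = np.random.default_rng(0)
z_demo = rng.normal(1.0, 0.1, size=(8, 3))   # codes from demonstrations
z_gen  = rng.normal(0.0, 0.1, size=(8, 3))   # codes from the robot arm
w, cache = np.zeros(3), np.zeros(3)
g = critic_gradient(z_demo, z_gen)
w, cache = rmsprop_ascent(w, g, cache)
```

After one step the critic scores demonstration codes higher than generated codes, i.e. the Wasserstein estimate increases. (A real WGAN critic would also enforce the 1-Lipschitz property, e.g. by weight clipping; that step is omitted here.)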
6. The mechanical arm action learning method based on third-person imitation learning of claim 1, wherein, during learning, in order to prevent gradient explosion, the model clips the gradient so that its norm does not exceed a fixed threshold Θ; meanwhile, in order to prevent the difference between adjacent video frames from being too small to yield useful behavior features, images spaced n frames apart are used to construct the samples.
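Gradient clipping by norm, as described in claim 6, can be sketched as follows (the threshold value is illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale the gradient so its norm never exceeds the threshold,
    preventing gradient explosion during learning."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

small = clip_gradient(np.array([0.3, 0.4]), threshold=1.0)  # left untouched
big   = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)  # rescaled to norm 1
```

Rescaling (rather than elementwise truncation) preserves the gradient's direction while bounding its magnitude.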
7. A mechanical arm action learning system based on third-person imitation learning, characterized by comprising:
a sample acquisition module, used for acquiring a demonstration sample input by a user and a sample generated by the mechanical arm executing a control strategy, then preprocessing the samples, extracting each frame picture in a sample to form an observation sequence, and storing the observation sequence in the memory for the mechanical arm action learning model to retrieve;
a discriminator module, used for training the mechanical arm control strategy, wherein for each input sample the discriminator outputs the probability that the sample is a demonstration sample; domain features are removed by the image difference operation in the input layer of the discriminator; the discriminator is constrained by the upper bound I_c added in the discriminator; the discriminator and the control strategy play a game against each other, and after continuous play they finally reach a Nash equilibrium, at which point the discrimination accuracy of the discriminator on the samples is 50% and the two kinds of samples can no longer be distinguished; and
a control strategy module, used for controlling the mechanical arm and consisting of 1 input layer, 2 convolutional layers, 2 pooling layers, 2 hidden layers, and 1 output layer; the input layer takes an observation image of the current mechanical arm state; the output layer outputs the control signal of the mechanical arm, i.e. the action information; the action information is passed to the control module, where it is converted into pulse signals to control the mechanical arm.
8. The mechanical arm action learning system based on third-person imitation learning of claim 7, wherein the input of the discriminator module is a set of observation pairs {(o_t^i, o_{t+n}^i)} from a sample generated by the control strategy and a set of observation pairs {(o_t^E, o_{t+n}^E)} from the demonstration sample; the two kinds of samples are input into the discriminator together for discrimination; and the discriminator uses the Wasserstein distance to measure the difference between the two kinds of samples.
9. The mechanical arm action learning system based on third-person imitation learning of claim 7, wherein the discriminator is a two-class neural network composed of 1 input layer, 1 feature extractor F, 1 encoder Enc, 2 fully connected layers, and 1 output layer; the input to the discriminator is a sample from the demonstrator or the mechanical arm, in the form (o_t, o_{t+n}); the input layer performs a difference operation on the input sample and preliminarily extracts the behavior features related to the strategy; the feature extractor F consists of 2 convolutional layers and 2 pooling layers and processes the image to extract the sample features σ; the encoder Enc encodes the extracted features to obtain the code z ~ Enc(z|σ); at the same time, an upper bound I_c is imposed on the mutual information I(σ, z) between the features σ and the code z of the samples from the mechanical arm sample τ^i, to limit the corresponding information flow in the discriminator network; and the code z is used as the input of the fully connected layers to obtain the output D_ω(z) of the discriminator network.
CN202010040178.4A 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning Active CN111136659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010040178.4A CN111136659B (en) 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010040178.4A CN111136659B (en) 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning

Publications (2)

Publication Number Publication Date
CN111136659A true CN111136659A (en) 2020-05-12
CN111136659B CN111136659B (en) 2022-06-21

Family

ID=70525016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010040178.4A Active CN111136659B (en) 2020-01-15 2020-01-15 Mechanical arm action learning method and system based on third person scale imitation learning

Country Status (1)

Country Link
CN (1) CN111136659B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162564A (en) * 2020-09-25 2021-01-01 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112529160A (en) * 2020-12-09 2021-03-19 南京大学 High-dimensional simulation learning method for video image data recorded by camera equipment
CN112809689A (en) * 2021-02-26 2021-05-18 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN112975968A (en) * 2021-02-26 2021-06-18 同济大学 Mechanical arm simulation learning method based on third visual angle variable main body demonstration video
CN113552871A (en) * 2021-01-08 2021-10-26 腾讯科技(深圳)有限公司 Robot control method and device based on artificial intelligence and electronic equipment
CN114660947A (en) * 2022-05-19 2022-06-24 季华实验室 Robot gait autonomous learning method and device, electronic equipment and storage medium
CN114779661A (en) * 2022-04-22 2022-07-22 北京科技大学 Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
CN117464683A (en) * 2023-11-23 2024-01-30 中机生产力促进中心有限公司 Method for controlling mechanical arm to simulate video motion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109015661A (en) * 2018-09-29 2018-12-18 重庆固高科技长江研究院有限公司 The method of industrial robot iterative learning amendment trajectory error
CN109702744A (en) * 2019-01-15 2019-05-03 北京工业大学 A method of the robot learning by imitation based on dynamic system model
CN109794937A (en) * 2019-01-29 2019-05-24 南京邮电大学 A kind of Soccer robot collaboration method based on intensified learning
CN109948117A (en) * 2019-03-13 2019-06-28 南京航空航天大学 A kind of satellite method for detecting abnormality fighting network self-encoding encoder
CN109948781A (en) * 2019-03-21 2019-06-28 中国人民解放军国防科技大学 Continuous action online learning control method and system for automatic driving vehicle
CN110121749A (en) * 2016-11-23 2019-08-13 通用电气公司 Deep learning medical system and method for Image Acquisition


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162564A (en) * 2020-09-25 2021-01-01 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112162564B (en) * 2020-09-25 2021-09-28 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112529160A (en) * 2020-12-09 2021-03-19 南京大学 High-dimensional simulation learning method for video image data recorded by camera equipment
CN113552871A (en) * 2021-01-08 2021-10-26 腾讯科技(深圳)有限公司 Robot control method and device based on artificial intelligence and electronic equipment
CN113552871B (en) * 2021-01-08 2022-11-29 腾讯科技(深圳)有限公司 Robot control method and device based on artificial intelligence and electronic equipment
CN112809689B (en) * 2021-02-26 2022-06-14 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN112975968A (en) * 2021-02-26 2021-06-18 同济大学 Mechanical arm simulation learning method based on third visual angle variable main body demonstration video
CN112975968B (en) * 2021-02-26 2022-06-28 同济大学 Mechanical arm imitation learning method based on third visual angle variable main body demonstration video
CN112809689A (en) * 2021-02-26 2021-05-18 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN114779661A (en) * 2022-04-22 2022-07-22 北京科技大学 Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
CN114660947A (en) * 2022-05-19 2022-06-24 季华实验室 Robot gait autonomous learning method and device, electronic equipment and storage medium
CN114660947B (en) * 2022-05-19 2022-07-29 季华实验室 Robot gait autonomous learning method and device, electronic equipment and storage medium
CN117464683A (en) * 2023-11-23 2024-01-30 中机生产力促进中心有限公司 Method for controlling mechanical arm to simulate video motion
CN117464683B (en) * 2023-11-23 2024-05-14 中机生产力促进中心有限公司 Method for controlling mechanical arm to simulate video motion

Also Published As

Publication number Publication date
CN111136659B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN111136659B (en) Mechanical arm action learning method and system based on third person scale imitation learning
CN110364049B (en) Professional skill training auxiliary teaching system with automatic deviation degree feedback data closed-loop deviation rectification control and auxiliary teaching method
CN108319932A (en) A kind of method and device for the more image faces alignment fighting network based on production
CN110503053A (en) Human motion recognition method based on cyclic convolution neural network
CN107240049B (en) Automatic evaluation method and system for remote action teaching quality in immersive environment
CN108831238A (en) A kind of educational system control method based on virtual experimental
CN110580470A (en) Monitoring method and device based on face recognition, storage medium and computer equipment
WO2019053052A1 (en) A method for (re-)training a machine learning component
CN113076615B (en) High-robustness mechanical arm operation method and system based on antagonistic deep reinforcement learning
CN108920805B (en) Driver behavior modeling system with state feature extraction function
CN107492377A (en) Method and apparatus for controlling self-timer aircraft
CN113031437A (en) Water pouring service robot control method based on dynamic model reinforcement learning
CN117218498B (en) Multi-modal large language model training method and system based on multi-modal encoder
CN111966217A (en) Unmanned aerial vehicle control method and system based on gestures and eye movements
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN114967937B (en) Virtual human motion generation method and system
CN116844084A (en) Sports motion analysis and correction method and system integrating blockchain
CN113633983A (en) Method, device, electronic equipment and medium for controlling expression of virtual character
CN116957866A (en) Individualized teaching device of digital man teacher
CN115909839A (en) Medical education training assessment system and method based on VR technology
CN110766216A (en) End-to-end mobile robot path navigation simulation method and system
CN113408443B (en) Gesture posture prediction method and system based on multi-view images
CN112446253A (en) Skeleton behavior identification method and device
CN114078347B (en) Teenager STEAM education system and method
CN115565146A (en) Perception model training method and system for acquiring aerial view characteristics based on self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant