CN116151385A - Robot autonomous learning method based on generative adversarial network - Google Patents

Robot autonomous learning method based on a generative adversarial network

Info

Publication number
CN116151385A
CN116151385A (application CN202111344484.8A)
Authority
CN
China
Prior art keywords
sample
function
robot
samples
autonomous learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111344484.8A
Other languages
Chinese (zh)
Inventor
库涛
俞宁
林乐新
刘金鑫
李进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Automation of CAS
Original Assignee
Shenyang Institute of Automation of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Automation of CAS filed Critical Shenyang Institute of Automation of CAS
Priority to CN202111344484.8A
Publication of CN116151385A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a robot autonomous learning method based on a generative adversarial network, applied to robot autonomous learning with few or zero samples in industrial scenarios. The method comprises the following steps: 1) establishing a chain model of the robot's behavior based on a Markov chain; 2) acquiring additional samples with a generative adversarial network from the existing samples or expert data; 3) obtaining a reward function and training an optimal decision through inverse reinforcement learning; 4) acquiring an optimal value function and an optimal policy function from the reward function and the optimal policy; 5) completing the construction of the robot's autonomous learning model. The method is mainly aimed at industrial scenarios that lack experience samples, and achieves autonomous robot learning by combining a generative adversarial network with inverse reinforcement learning, thereby improving the automation and intelligence level of the robot.

Description

Robot autonomous learning method based on a generative adversarial network
Technical Field
The invention belongs to the fields of intelligent robot control and robot autonomous learning, and particularly relates to a robot autonomous learning method based on a generative adversarial network.
Background
A robot autonomous learning method mainly refers to a machine learning method that enables a robot to accumulate experience data through its own interaction with the environment and thereby make action decisions autonomously. Robot autonomous learning is one of the important means of robot control and often plays an important role in functions such as environment perception, behavior control, dynamic decision-making, and automatic execution within an intelligent integrated control system. It not only requires the decision-making method learned by the robot to be highly optimized, but also places extremely high demands on indices such as learning speed and reaction speed. Therefore, continuously improving robot autonomous learning methods is an important subject of current robotics research.
Typically, such learning methods require extensive sample training and manual setting of key parameters to ensure learning efficiency and accuracy. As a result, the robot's learning outcome is often limited by the size of the data set and by human parameter settings. Moreover, if the data set contains contaminated data, the final degree of optimization is likely to be greatly reduced and may even fail to meet practical requirements. In addition, these methods require the designer to have substantial experience with the actual scenario in order to set the parameters accurately; if the designer cannot judge the actual requirements correctly, the learning direction is likely to deviate and the expected decision-making capability will not be achieved. These are the problems that robot autonomous learning currently needs to solve.
Disclosure of Invention
The invention combines generative adversarial network technology with an inverse reinforcement learning method and provides a robot autonomous learning method that aims to reduce the dependence of robot autonomous learning on expert samples, improve the robot's learning efficiency, and increase the degree of optimization of the robot's autonomous decisions.
The technical solution adopted by the invention to achieve this purpose is as follows:
A robot autonomous learning method based on a generative adversarial network, comprising the following steps:
constructing a Markov chain model, acquiring complete action trajectories and decision steps of the robot, sampling them to generate a real sample set representing the actions, and storing the real sample set in a real sample pool;
randomly generating signals and feeding them into a generator, generating samples with the generator, and storing the generated samples in a virtual sample pool;
feeding the generated samples into a discriminator, comparing the generated samples with the real samples in the discriminator, dynamically adjusting the generated samples according to the comparison result, and updating the virtual sample pool;
mixing the updated virtual sample pool with the real sample pool to form a mixed sample pool, and randomly extracting data from the mixed sample pool;
randomly generating a policy and executing it;
sampling the executed policy and comparing the sampling result with the data extracted from the mixed sample pool to obtain a reward function and an optimal policy;
training the Markov chain model according to the reward function, taking the robot's state as the model input and obtaining the corresponding action, thereby completing the robot's autonomous learning.
The Markov chain model is constructed as follows: a five-tuple (S, A, P, R, γ) is established according to the Markov chain model, where the set S represents the current state set, the set A represents the action set at the next moment, P gives the probabilities of the actions in A, R is the reward function, and γ ∈ (0, 1) is the discount coefficient.
The discriminator compares the generated samples with the real samples as follows: the generated samples and the real samples are mixed to form training samples, the training samples are fed into the discriminator for discrimination, and the probability D(x) that a training sample comes from the generated samples is output.
The generated samples are dynamically adjusted according to the comparison result as follows: the loss function of the discriminator and the loss function of the generator are calculated from the probability D(x), and when the two loss functions reach a Nash equilibrium, the adjustment of the generated samples is stopped.
The loss function $L_{discriminator}(D)$ of the discriminator is:

$L_{discriminator}(D) = E_{x\sim P}[-\log D(x)] + E_{x\sim G}[-\log(1-D(x))]$

where $E_{x\sim P}[-\log D(x)]$ represents the loss of classifying real samples as generated samples, and $E_{x\sim G}[-\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
The loss function $L_{generator}(G)$ of the generator is:

$L_{generator}(G) = E_{x\sim G}[-\log D(x)] + E_{x\sim G}[\log(1-D(x))]$

where $E_{x\sim G}[-\log D(x)]$ represents the loss of the discriminator classifying generated samples as generated samples, and $E_{x\sim G}[\log(1-D(x))]$ represents the loss of the discriminator classifying generated samples as real samples.
A value function is used to evaluate a policy; it includes the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s,a)$, where:

$V^{\pi}(s) = \sum_{a\in A}\pi(s,a)\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{\pi}(s')\Big]$

$Q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\sum_{a'\in A}\pi(s',a')\,Q^{\pi}(s',a')$

where $\pi(s,a)$ is the policy in state s for action a, R is the reward function, $P(s,a,s')$ is the probability of the transition $s\to s'$, and $a'$ is the action taken in the next state $s'$.
The executed policy is sampled and the sampling result is compared with the data extracted from the mixed sample pool to obtain the reward function and the optimal policy, specifically:

the optimal value functions $V^{*}(s)$ and $Q^{*}(s,a)$ are obtained by the following formulas,

$V^{*}(s) = \max_{a\in A}\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{*}(s')\Big]$

$Q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\max_{a'\in A}Q^{*}(s',a')$

and the condition for action $a_1$ to be the optimal policy $\pi^{*}(s)$ is:

$\sum_{s'}P_{sa_1}(s')\,V^{*}(s') \;\ge\; \sum_{s'}P_{sa}(s')\,V^{*}(s'),\qquad \forall a\in A$

where $P_{sa_1}(s')$ is the probability of the transition $s\to s'$ when action $a_1$ is executed in state s, $V^{*}(s')$ denotes the optimal state-value function of state $s'$, and $P_{sa}(s')$ is the probability of the transition $s\to s'$ when an arbitrary action $a\in A$ is executed in state s.
Combining this with $V^{*}(s)$ gives

$(P_{a_1}-P_{a})\,(I-\gamma P_{a_1})^{-1}R \;\succeq\; 0,\qquad \forall a\in A$

and finally the reward function and the optimal policy are obtained.
The Markov chain model is trained according to the reward function, specifically, the parameters of the reward function are obtained by calculating the maximum entropy, and the Markov chain model is thereby determined.
The maximum-entropy calculation is specifically:

$\max_{p}\; -\sum_{i} p(l_i)\log p(l_i)$

$\text{s.t.}\quad \sum_{i} p(l_i)\,f(l_i) = f_E = \frac{1}{|D|}\sum_{\tau_i\in D} f(\tau_i),\qquad \sum_{i} p(l_i) = 1$

where p is the probability, $l_i$ represents the i-th trajectory in the probabilistic model, f represents the feature expectation, $f_E$ represents the expert feature expectation, and $\tau_i$ is the i-th element of the expert sample set;

$R = \lambda_0 f_0 + \lambda_1 f_1 + \cdots + \lambda_n f_n = \lambda^{T}f$

where $\lambda_i\ (i=0\sim n)$ is the i-th parameter of the reward function R.
The invention has the following beneficial effects and advantages:
1. The invention provides a robot autonomous learning method based on a generative adversarial network, mainly aimed at the learning problem in robot application scenarios. By combining a generative adversarial network model with an inverse reinforcement learning method, it realizes robot learning under few-sample conditions, reduces dependence on the size of the sample data set, and effectively improves learning efficiency.
2. The learning method is carried out autonomously by the robot and requires essentially no human intervention, which reduces interference from human factors and improves the degree of optimization of the robot's decisions.
3. The learning method adopts inverse reinforcement learning, can obtain a suitable reward function from the environment, and finally trains an optimal policy function, which greatly improves the robot's generalization performance.
4. The learning method adopts a generative adversarial network model and can generate a large number of near-real samples, so that a large amount of data can be learned even when real samples are scarce and more highly optimized samples can be obtained. The final performance is therefore not limited by the degree of optimization of the samples, which effectively improves the robot's intelligence.
Drawings
FIG. 1 is a flow chart of the robot autonomous learning process;
FIG. 2 is a diagram of the relationship between the components.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in FIG. 1 and FIG. 2, step one: build the Markov chain model. A five-tuple (S, A, P, R, γ) can be established according to the Markov chain model, where the set S represents the current state set, the set A represents the action set at the next moment, P gives the probabilities of the actions in A, R is the reward function, and γ ∈ (0, 1) is the discount coefficient used to compute the accumulated reward value.
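As a non-limiting illustration, the five-tuple of step one can be held in a simple Python container such as the one below; the class name, dictionary layout, and type aliases are assumptions introduced here for clarity, not part of the patented method.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, List

State = str
Action = str

@dataclass
class MDP:
    """Markov chain (MDP) five-tuple (S, A, P, R, gamma)."""
    states: List[State]                          # S: current state set
    actions: List[Action]                        # A: action set at the next moment
    P: Dict[Tuple[State, Action, State], float]  # P(s, a, s'): transition probability
    R: Dict[Tuple[State, Action], float]         # R(s, a): reward function
    gamma: float                                 # discount coefficient, 0 < gamma < 1

    def __post_init__(self):
        assert 0.0 < self.gamma < 1.0, "discount coefficient must lie in (0, 1)"
```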
Step two: provide a small number of expert samples or examples and place them in the real sample pool. Expert samples or examples generally refer to complete motion trajectories or decision steps; after systematic sampling, a series of action sets D = {τ_1, τ_2, …, τ_n} is generated. The sampled action sets are stored in the real sample pool D_1 for subsequent comparison.
Step three: randomly generate signals and feed them into the generator to produce corresponding data. The random signal is typically noise and is used to characterize some random environmental element. The generator G generates corresponding samples x from the signal, written as x ~ G.
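For illustration only, a generator of this kind might look as follows in PyTorch; the MLP architecture, layer widths, and dimensions are assumptions, since the patent does not prescribe a particular network structure.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a random noise signal z to a generated sample x ~ G (illustrative MLP sketch)."""
    def __init__(self, noise_dim=16, sample_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, sample_dim),
        )

    def forward(self, z):
        return self.net(z)

# a random signal characterising an environmental element is fed to the generator
z = torch.randn(32, 16)
x_generated = Generator()(z)   # samples x ~ G, later stored in the virtual pool
```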
Step four: the generator passes the data to the discriminator, which compares them and feeds the result back to the generator. The task of the discriminator D is to classify an input sample either as output of the generator or as an actual sample from the underlying data distribution p(x). Similar samples are retrieved from the real sample pool, mixed with the samples x produced by the generator, and fed into the discriminator for discrimination, which outputs the probability D(x) that a training sample comes from the generated samples. The discriminator loss can then be calculated: it is the average log-probability assigned to the correct class, evaluated on the mixed set of actual samples and generator outputs,
$L_{discriminator}(D) = E_{x\sim P}[-\log D(x)] + E_{x\sim G}[-\log(1-D(x))]$

where $E_{x\sim P}[-\log D(x)]$ represents the loss of classifying real samples as generated samples, and $E_{x\sim G}[-\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
Step five: the generator adjusts its samples according to the discriminator's feedback. The task of the generator is to produce outputs that the discriminator classifies as coming from the underlying data distribution. A large discriminator loss indicates that the generator producing this set of samples is of high quality; otherwise, the discriminator is of high quality. The generator loss is the sum of the loss for classifying a generated sample as generated and the loss for classifying it as real,
$L_{generator}(G) = E_{x\sim G}[-\log D(x)] + E_{x\sim G}[\log(1-D(x))]$

where $E_{x\sim G}[-\log D(x)]$ represents the loss of classifying generated samples as generated samples, and $E_{x\sim G}[\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
Step six: mix the real sample pool and the virtual sample pool. By optimizing the two loss functions, the generator model G and the discriminator model D eventually reach a Nash equilibrium; the generated samples then have a high similarity to the real data and are stored in the sample pool D_2, called the virtual sample pool. At this point the real sample pool and the virtual sample pool are fully mixed into a mixed sample pool D_d, so that when samples are drawn, real samples or generated samples are obtained with a certain probability.
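An alternating update loop of the following kind, run until the two losses roughly balance, is one way to approximate the Nash-equilibrium stopping criterion and to fill the virtual pool D_2. It is a sketch that reuses the Generator, Discriminator, and loss functions sketched above; the optimizer choice, learning rate, and step count are assumptions.

```python
import torch

def train_gan(G, D, real_batch_fn, noise_dim=16, steps=1000, lr=2e-4):
    """Alternately update D and G, then populate the virtual sample pool D2."""
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(steps):
        real_x = real_batch_fn()                    # batch drawn from the real pool D1
        z = torch.randn(real_x.shape[0], noise_dim)
        fake_x = G(z)

        opt_d.zero_grad()
        d_loss = discriminator_loss(D, real_x, fake_x.detach())
        d_loss.backward()
        opt_d.step()

        opt_g.zero_grad()
        g_loss = generator_loss(D, G(z))
        g_loss.backward()
        opt_g.step()

    virtual_pool = []
    with torch.no_grad():                           # generated samples stored in D2
        virtual_pool.extend(G(torch.randn(256, noise_dim)))
    return virtual_pool
```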
Step seven: randomly extract data D'_d from the mixed sample pool. Either a generated sample or a real sample may be drawn, but because the generative adversarial network is continually updated, the generated samples have a quality similar to that of the real samples.
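A minimal way to realize the mixed pool D_d and the random draw D'_d is shown below; the mixing probability p_real is an assumption, since the patent only requires that real or generated samples be drawn with a certain probability.

```python
import random

def mix_and_draw(real_pool, virtual_pool, batch_size, p_real=0.5):
    """Draw a batch D'_d from the mixed pool D_d, choosing the real pool with
    probability p_real and the virtual pool otherwise."""
    batch = []
    for _ in range(batch_size):
        pool = real_pool if random.random() < p_real else virtual_pool
        batch.append(random.choice(pool))
    return batch
```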
Step eight: randomly generate a policy and execute it. A policy q_k can be generated at random according to the environment, executed, and sampled. The concept of a value function is typically introduced to evaluate a policy. In general, $V^{\pi}(s)$ denotes the state-value function and $Q^{\pi}(s,a)$ the action-value function. They are computed as

$V^{\pi}(s) = \sum_{a\in A}\pi(s,a)\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{\pi}(s')\Big]$

$Q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\sum_{a'\in A}\pi(s',a')\,Q^{\pi}(s',a')$

where $\pi(s,a)$ is the policy in state s for action a, R is the reward function, and $P(s,a,s')$ is the probability of the transition $s\to s'$.
Step nine: compare the executed policy with the data in the sample pool and update the reward value. The executed policy is sampled and compared with the high-quality samples in the mixed sample pool; using the current policy samples D'_s and the high-quality samples D'_d, the optimal reward function under the current conditions is found. The optimal value functions $V^{*}(s)$ and $Q^{*}(s,a)$ can be obtained by the following formulas,

$V^{*}(s) = \max_{a\in A}\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{*}(s')\Big]$

$Q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\max_{a'\in A}Q^{*}(s',a')$
further, the action a is known 1 Is the optimal strategy pi * The filling conditions of(s) are that
Figure BDA0003353483310000072
And can be written as
Figure BDA0003353483310000073
Bond V * (s) it can be seen that
Figure BDA0003353483310000074
Finally, the reward function and the optimal policy can be obtained.
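The optimal value functions and a greedy optimal policy can be approximated with standard value iteration, as in the sketch below; it again relies on the assumed tabular MDP container and is an illustration rather than the patent's prescribed procedure.

```python
def value_iteration(mdp, tol=1e-6):
    """Compute V*(s), Q*(s, a) and a greedy policy pi*(s) by Bellman optimality backups."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = max(
                mdp.R[(s, a)] + mdp.gamma * sum(
                    mdp.P.get((s, a, s2), 0.0) * V[s2] for s2 in mdp.states)
                for a in mdp.actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    Q = {(s, a): mdp.R[(s, a)] + mdp.gamma * sum(
            mdp.P.get((s, a, s2), 0.0) * V[s2] for s2 in mdp.states)
         for s in mdp.states for a in mdp.actions}
    pi_star = {s: max(mdp.actions, key=lambda a: Q[(s, a)]) for s in mdp.states}
    return V, Q, pi_star
```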
Step ten: optimize the policy function. Once the optimal reward function has been found, the current reward value can be determined, and the policy function is optimized according to the reward value and the high-quality samples, improving its performance.
Step eleven: obtain the reward function and the optimal decision. Through continuous optimization, the policy corresponding to the optimal reward function under any condition is finally obtained.
Step twelve: train the model according to the reward function, obtain the optimal value function and policy function, and complete the model construction. With the reward function, reinforcement-learning training can be performed, finally yielding the optimal value function and policy function. The reward function obtained by inverse reinforcement learning is typically ambiguous, so the maximum entropy must usually be found to avoid this ambiguity, i.e., the following problem is solved,
$\max_{p}\; -\sum_{i} p(l_i)\log p(l_i)$

$\text{s.t.}\quad \sum_{i} p(l_i)\,f(l_i) = f_E,\qquad \sum_{i} p(l_i) = 1$

where p is the probability, $l_i$ represents a trajectory in the probabilistic model, f represents the feature expectation, and $f_E$ represents the expert feature expectation;

$R = \lambda_0 f_0 + \lambda_1 f_1 + \cdots + \lambda_n f_n = \lambda^{T}f$

where $\lambda_i\ (i=0\sim n)$ is a parameter of the reward function.
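For the maximum-entropy step, a simplified fit of the linear reward weights λ over an enumerated set of candidate trajectories could proceed as below. The explicit trajectory enumeration, gradient-ascent update, learning rate, and iteration count are all assumptions; full maximum-entropy inverse reinforcement learning normally uses dynamic programming over the MDP rather than enumeration.

```python
import numpy as np

def maxent_irl_weights(feat_expert, feat_trajs, lr=0.1, iters=200):
    """Fit lambda in R = sum_i lambda_i * f_i by matching feature expectations under the
    maximum-entropy distribution p(l) ∝ exp(lambda^T f(l)) to the expert expectation f_E.
    feat_expert: f_E, shape [n]; feat_trajs: stacked f(l_i) for candidate trajectories, shape [m, n]."""
    lam = np.zeros(feat_expert.shape[0])
    for _ in range(iters):
        logits = feat_trajs @ lam
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # max-entropy distribution over trajectories
        grad = feat_expert - p @ feat_trajs   # gradient: f_E - E_p[f]
        lam += lr * grad
    return lam
```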
As described above, while the present invention has been particularly shown and described, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A robot autonomous learning method based on a generative adversarial network, characterized by comprising the following steps:
constructing a Markov chain model, acquiring complete action trajectories and decision steps of the robot, sampling them to generate a real sample set representing the actions, and storing the real sample set in a real sample pool;
randomly generating signals and feeding them into a generator, generating samples with the generator, and storing the generated samples in a virtual sample pool;
feeding the generated samples into a discriminator, comparing the generated samples with the real samples in the discriminator, dynamically adjusting the generated samples according to the comparison result, and updating the virtual sample pool;
mixing the updated virtual sample pool with the real sample pool to form a mixed sample pool, and randomly extracting data from the mixed sample pool;
randomly generating a policy and executing it;
sampling the executed policy and comparing the sampling result with the data extracted from the mixed sample pool to obtain a reward function and an optimal policy;
training the Markov chain model according to the reward function, taking the robot's state as the model input and obtaining the corresponding action, thereby completing the robot's autonomous learning.
2. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the Markov chain model is constructed as follows: a five-tuple (S, A, P, R, γ) is established according to the Markov chain model, where the set S represents the current state set, the set A represents the action set at the next moment, P gives the probabilities of the actions in A, R is the reward function, and γ ∈ (0, 1) is the discount coefficient.
3. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the discriminator compares the generated samples with the real samples as follows: the generated samples and the real samples are mixed to form training samples, the training samples are fed into the discriminator for discrimination, and the probability D(x) that a training sample comes from the generated samples is output.
4. The robot autonomous learning method based on a generative adversarial network according to claim 1 or 3, wherein the generated samples are dynamically adjusted according to the comparison result, specifically, the loss function of the discriminator and the loss function of the generator are calculated from the probability D(x), and when the two loss functions reach a Nash equilibrium, the adjustment of the generated samples is stopped.
5. The robot autonomous learning method based on a generative adversarial network according to claim 4, wherein the loss function $L_{discriminator}(D)$ of the discriminator is:

$L_{discriminator}(D) = E_{x\sim P}[-\log D(x)] + E_{x\sim G}[-\log(1-D(x))]$

where $E_{x\sim P}[-\log D(x)]$ represents the loss of classifying real samples as generated samples, and $E_{x\sim G}[-\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
6. The robot autonomous learning method based on a generative adversarial network according to claim 4, wherein the loss function $L_{generator}(G)$ of the generator is:

$L_{generator}(G) = E_{x\sim G}[-\log D(x)] + E_{x\sim G}[\log(1-D(x))]$

where $E_{x\sim G}[-\log D(x)]$ represents the loss of the discriminator classifying generated samples as generated samples, and $E_{x\sim G}[\log(1-D(x))]$ represents the loss of the discriminator classifying generated samples as real samples.
7. The robot autonomous learning method based on a generative adversarial network according to claim 1, characterized in that a value function is used to evaluate a policy, the value function including the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s,a)$, where:

$V^{\pi}(s) = \sum_{a\in A}\pi(s,a)\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{\pi}(s')\Big]$

$Q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\sum_{a'\in A}\pi(s',a')\,Q^{\pi}(s',a')$

where $\pi(s,a)$ is the policy in state s for action a, R is the reward function, $P(s,a,s')$ is the probability of the transition $s\to s'$, and $a'$ is the action taken in the next state $s'$.
8. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the executed policy is sampled and the sampling result is compared with the data extracted from the mixed sample pool to obtain the reward function and the optimal policy, specifically:

the optimal value functions $V^{*}(s)$ and $Q^{*}(s,a)$ are obtained by the following formulas,

$V^{*}(s) = \max_{a\in A}\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{*}(s')\Big]$

$Q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\max_{a'\in A}Q^{*}(s',a')$

and the condition for action $a_1$ to be the optimal policy $\pi^{*}(s)$ is:

$\sum_{s'}P_{sa_1}(s')\,V^{*}(s') \;\ge\; \sum_{s'}P_{sa}(s')\,V^{*}(s'),\qquad \forall a\in A$

where $P_{sa_1}(s')$ is the probability of the transition $s\to s'$ when action $a_1$ is executed in state s, $V^{*}(s')$ denotes the optimal state-value function of state $s'$, and $P_{sa}(s')$ is the probability of the transition $s\to s'$ when an arbitrary action $a\in A$ is executed in state s;
combining this with $V^{*}(s)$ gives

$(P_{a_1}-P_{a})\,(I-\gamma P_{a_1})^{-1}R \;\succeq\; 0,\qquad \forall a\in A$

and finally the reward function and the optimal policy are obtained.
9. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the Markov chain model is trained according to the reward function, specifically, the parameters of the reward function are obtained by calculating the maximum entropy, and the Markov chain model is thereby determined.
10. The robot autonomous learning method based on a generative adversarial network according to claim 9, wherein the maximum-entropy calculation is specifically:

$\max_{p}\; -\sum_{i} p(l_i)\log p(l_i)$

$\text{s.t.}\quad \sum_{i} p(l_i)\,f(l_i) = f_E = \frac{1}{|D|}\sum_{\tau_i\in D} f(\tau_i),\qquad \sum_{i} p(l_i) = 1$

where p is the probability, $l_i$ represents the i-th trajectory in the probabilistic model, f represents the feature expectation, $f_E$ represents the expert feature expectation, and $\tau_i$ is the i-th element of the expert sample set;

$R = \lambda_0 f_0 + \lambda_1 f_1 + \cdots + \lambda_n f_n = \lambda^{T}f$

where $\lambda_i\ (i=0\sim n)$ is the i-th parameter of the reward function R.
CN202111344484.8A 2021-11-15 2021-11-15 Robot autonomous learning method based on generative adversarial network Pending CN116151385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111344484.8A CN116151385A (en) Robot autonomous learning method based on generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111344484.8A CN116151385A (en) Robot autonomous learning method based on generative adversarial network

Publications (1)

Publication Number Publication Date
CN116151385A 2023-05-23

Family

ID=86354821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111344484.8A Pending CN116151385A (en) 2021-11-15 2021-11-15 Robot autonomous learning method based on generation of countermeasure network

Country Status (1)

Country Link
CN (1) CN116151385A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117250970A (en) * 2023-11-13 2023-12-19 青岛澎湃海洋探索技术有限公司 Method for realizing AUV fault detection based on model embedding generation countermeasure network
CN117250970B (en) * 2023-11-13 2024-02-02 青岛澎湃海洋探索技术有限公司 Method for realizing AUV fault detection based on model embedding generation countermeasure network

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN113511082B (en) Hybrid electric vehicle energy management method based on rule and double-depth Q network
Feng Controller synthesis of fuzzy dynamic systems based on piecewise Lyapunov functions
CN110488861A (en) Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN110481536B (en) Control method and device applied to hybrid electric vehicle
CN103679139B (en) Face identification method based on particle swarm optimization BP network
CN111191769B (en) Self-adaptive neural network training and reasoning device
CN110594317B (en) Starting control strategy based on double-clutch type automatic transmission
WO2022252457A1 (en) Autonomous driving control method, apparatus and device, and readable storage medium
CN116151385A (en) Robot autonomous learning method based on generative adversarial network
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
Schuman et al. Low size, weight, and power neuromorphic computing to improve combustion engine efficiency
CN112487933B (en) Radar waveform identification method and system based on automatic deep learning
CN117709712A (en) Situation prediction method and terminal for power distribution network based on hybrid neural network
CN108388115A (en) NCS method for compensating network delay based on generalized predictive control
Puccetti et al. Speed tracking control using model-based reinforcement learning in a real vehicle
Gladwin et al. A controlled migration genetic algorithm operator for hardware-in-the-loop experimentation
Zheng et al. Variance reduction based partial trajectory reuse to accelerate policy gradient optimization
Riid et al. Interpretability of fuzzy systems and its application to process control
Lee et al. A real-time intelligent speed optimization planner using reinforcement learning
CN116755046B (en) Multifunctional radar interference decision-making method based on imperfect expert strategy
CN113065693B (en) Traffic flow prediction method based on radial basis function neural network
CN116176606A (en) Method and device for reinforcement learning of intelligent agent for controlling vehicle driving
Guo et al. Knowledge Transfer in Multi-Agent Reinforcement Learning Using Decision Trees
CN118278495A (en) Method for generating power grid operation mode based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination