Background
At present, robots are entering people's lives in ever greater numbers to reduce human labor in daily life and to assist people in accomplishing tasks such as navigation, tracking, object grasping, part assembly, and transport of high-risk articles. However, the existing control methods for robots are often fixed strategies: after repeated trials by professionals for a specific task, actions are executed strictly according to a fixed programmed flow, so that no small amount of manpower must be invested before the task can be executed. In addition, since the robots in daily life are of various types and the performance parameters of individuals differ, such as sensor parameters, body parameters, and range-of-motion parameters, even when the same task is performed, a uniform fixed program flow preset by a professional cannot serve every individual well, and each individual must be debugged separately.
In the field of automatic control, a feasible action strategy can be solved and updated in real time by numerical calculation while the robot executes a task, but a large number of distributional assumptions are introduced and the relevant performance parameters of the robot must be input in advance. Although this reduces the manpower required to some extent, human participation is still needed, and the resulting action strategy is very sensitive to the input performance parameters, so high-precision performance parameters of the robot must be supplied when the task is executed. Reinforcement learning offers a further alternative: the robot continuously interacts with the environment in a simulator by trial and error, and its action strategy is optimized until it finally meets the task requirements. However, the action strategy finally learned by reinforcement learning is also highly correlated with the performance parameters of the robot that was trained, so an effective action strategy still cannot be obtained for unknown robots with different performance parameters on the same task.
Therefore, a new technical solution is needed for robot task execution, particularly when the robots performing the same task have unknown and differing performance parameters.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects of the prior art, the invention provides a robot control method based on simulator training, which can obtain action strategies for robots with unknown, differing performance parameters on the same task; under these action strategies the robots can effectively meet the task requirements.
The technical scheme is as follows: a robot control method based on simulator training comprises the following steps:
Step 1: perform simulation modeling of the task environment to be executed, establish a simulator that is the same as or similar to the task environment, and design the four elements of reinforcement learning for the task: state s, action a, reward function R(s, a), and state transition probability P(s'|s, a);
Step 2: in the simulator, randomly generate T robots with different performance parameters, and train each robot separately with a reinforcement learning algorithm to obtain its own action strategy π as a base strategy, finally obtaining the base strategy set Π = {π_1, π_2, ..., π_T} and the combined strategy
π_w(a|s) = Σ_{t=1}^{T} w_t π_t(a|s),
where w = (w_1, ..., w_T) are weight coefficients;
Step 3: in the simulator, randomly generate M robots with different performance parameters, and for each robot i optimize the optimal combination weight of the base strategy set for task execution,
w_i* = argmax_w E_{τ∼p(τ|π_w)} [R(τ)],
where τ is the trajectory formed by the state-action pairs (s_0, a_0, s_1, a_1, ..., s_t, a_t) of the robot while performing the task, p(τ|π_w) is the probability of generating trajectory τ when the robot executes the combined strategy π_w, and R(τ) is the total reward obtained on trajectory τ. Then make each of the M robots execute a given string of initial random actions A, and take the output states of robot i after executing A as its feature F_i(A). Using each robot's feature F_i(A) and optimal combination weight w_i* respectively as the input and label of a regression model, optimize to obtain the optimal regression model θ, i.e.
θ* = argmin_θ Σ_{i=1}^{M} ‖θ(F_i(A)) − w_i*‖²;
Step 4: in the simulator, randomly generate N robots with different performance parameters, and optimize on the N robots the optimal action string A*, i.e. the string of actions that maximizes the total reward obtained when each robot executes A, its resulting feature F_i(A) is input to the regression model θ, and the robot then performs the task under the combined strategy with the predicted weights θ(F_i(A)):
A* = argmax_A Σ_{i=1}^{N} E_{τ∼p(τ|π_{θ(F_i(A))})} [R(τ)];
Step 5: in the same task, make a robot with unknown, differing performance parameters execute the optimal action A*, take its output states as the feature F(A*), and input this feature to the regression model θ to obtain the combination weights w* = θ(F(A*)), thereby obtaining the robot's optimal action strategy π_{w*}.
The reinforcement learning algorithm used in step 2 is the trust region policy optimization algorithm (TRPO), and the weight coefficients w take values in the range 0 to 1.
The algorithm used in step 3 to optimize the optimal combination weights of the base strategy set is the sequential randomized coordinate shrinking algorithm (SRACOS), the regression model is optimized with a gradient descent algorithm, and the given string of initial random actions A comprises 5 actions.
The optimal action optimization algorithm used in step 4 is the sequential randomized coordinate shrinking algorithm (SRACOS), and the optimal action A* comprises 5 actions.
Beneficial effects: the action strategies of some robots are fixed program flows written by professionals after repeated trials in advance, which requires a large amount of manpower. Although an action strategy can instead be solved automatically by numerical calculation, the task requirements can be completed well only if high-precision performance parameters of the robot are input manually. Traditional reinforcement learning, introduced later, greatly reduces manual participation, but because the action strategy it learns is highly correlated with the robot's performance parameters, it does not generalize to unknown robots of different performance on the same task, and no effective action strategy can be obtained for them directly.
Compared with the prior art, the robot control method based on simulator training provided by the invention optimizes a group of optimal actions in the simulator. When a robot with unknown, differing performance parameters executes these actions, a group of output states is obtained, so that the characteristics of the robot can be recognized indirectly; the past action strategies of similar robots are then combined, and the robot's action strategy for the task is obtained directly, effectively meeting the task requirements.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, the robot control method based on simulator training includes the following steps:
Step 1: perform simulation modeling of the task environment to be executed, establish a simulator that is the same as or similar to the task environment, and design the four elements of reinforcement learning for the task: state s, action a, reward function R(s, a), and state transition probability P(s'|s, a);
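By way of illustration only, the following is a minimal Python sketch of step 1 for a hypothetical one-dimensional "reach the origin" task; the task, the class name ReachSimulator, and the parameters motor_gain and noise_std are illustrative assumptions of this sketch, not the actual task environment of the invention. It shows the four elements: state s, action a, reward R(s, a), and stochastic transition P(s'|s, a).

```python
import numpy as np

class ReachSimulator:
    """Toy simulator: a robot must drive its 1-D position to the origin."""

    def __init__(self, motor_gain=1.0, noise_std=0.05):
        # motor_gain and noise_std stand in for the "performance parameters"
        # that differ from robot to robot.
        self.motor_gain = motor_gain
        self.noise_std = noise_std
        self.state = None

    def reset(self):
        # Initial state s, drawn at random.
        self.state = np.random.uniform(-1.0, 1.0)
        return self.state

    def reward(self, state, action):
        # R(s, a): negative distance to the target plus a small action penalty.
        return -abs(state) - 0.01 * action ** 2

    def step(self, action):
        # P(s' | s, a): the next state is reached through this robot's own
        # motor gain plus Gaussian actuation noise.
        r = self.reward(self.state, action)
        self.state = self.state + self.motor_gain * action \
            + np.random.normal(0.0, self.noise_std)
        return self.state, r
```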
Step 2: in the simulator, randomly generate T robots with different performance parameters, and train each robot separately with a reinforcement learning algorithm to obtain its own action strategy π as a base strategy, finally obtaining the base strategy set Π = {π_1, π_2, ..., π_T} and the combined strategy
π_w(a|s) = Σ_{t=1}^{T} w_t π_t(a|s),
where w = (w_1, ..., w_T) are weight coefficients and π_t(a|s) denotes the policy model π_t taking the state s as input and outputting the action a;
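Continuing the illustrative sketch above, the following shows step 2 schematically: T simulated robots with randomized performance parameters, one base policy per robot, and the combined strategy π_w sampled as a mixture of the base policies. The function train_policy is a deliberately trivial stand-in for TRPO training and is an assumption of this sketch, not the training algorithm of the invention.

```python
import numpy as np

def train_policy(sim):
    """Stand-in for TRPO training: returns a stochastic policy pi(a|s)."""
    gain = -1.0 / sim.motor_gain              # a toy "learned" proportional law
    def pi(s):
        return np.random.normal(gain * s, 0.1)   # sample a ~ pi(.|s)
    return pi

T = 10
train_sims = [ReachSimulator(motor_gain=np.random.uniform(0.5, 2.0))
              for _ in range(T)]
base_policies = [train_policy(sim) for sim in train_sims]  # base strategy set

def combined_policy(w, state):
    """Sample an action from pi_w(a|s) = sum_t w_t * pi_t(a|s)."""
    # With each w_t in [0, 1], normalizing w turns the combination into a
    # mixture: pick base policy t with probability w_t / sum(w), then sample.
    p = np.asarray(w, dtype=float)
    p = p / p.sum()
    t = np.random.choice(T, p=p)
    return base_policies[t](state)
```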
Step 3: in the simulator, randomly generate M robots with different performance parameters, and for each robot i optimize the optimal combination weight of the base strategy set for task execution,
w_i* = argmax_w E_{τ∼p(τ|π_w)} [R(τ)],
where τ is the trajectory formed by the state-action pairs (s_0, a_0, s_1, a_1, ..., s_t, a_t) of the robot while performing the task, p(τ|π_w) is the probability of generating trajectory τ when the robot executes the combined strategy π_w, and R(τ) is the total reward obtained on trajectory τ. Then make each of the M robots execute a given string of initial random actions A, and take the output states of robot i after executing A as its feature F_i(A). Using each robot's feature F_i(A) and optimal combination weight w_i* respectively as the input and label of a regression model, optimize to obtain the optimal regression model θ, i.e.
θ* = argmin_θ Σ_{i=1}^{M} ‖θ(F_i(A)) − w_i*‖²;
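Continuing the sketch, step 3 can be illustrated as follows: for each of M simulated robots, search for the combination weights w_i* with the highest average return (plain random search stands in here for SRACOS), probe the robot with a fixed string of 5 random actions A, and fit a regression model θ by gradient descent from the features F_i(A) to the labels w_i*. The linear model, step size, and search budget are assumptions of this sketch.

```python
import numpy as np

def rollout_return(sim, w, horizon=50):
    """R(tau) for one trajectory tau = (s0, a0, s1, a1, ...) under pi_w."""
    s, total = sim.reset(), 0.0
    for _ in range(horizon):
        a = combined_policy(w, s)
        s, r = sim.step(a)
        total += r
    return total

def best_weights(sim, n_trials=200):
    """Search for w_i* (plain random search stands in for SRACOS)."""
    best_w, best_r = None, -np.inf
    for _ in range(n_trials):
        w = np.random.uniform(0.0, 1.0, size=T)
        r = np.mean([rollout_return(sim, w) for _ in range(5)])
        if r > best_r:
            best_w, best_r = w, r
    return best_w

A = np.random.uniform(-1.0, 1.0, size=5)  # the given string of 5 random actions

def features(sim, A):
    """F_i(A): the robot's output states after executing the probe actions A."""
    sim.reset()
    return np.array([sim.step(a)[0] for a in A])

M = 50
m_sims = [ReachSimulator(motor_gain=np.random.uniform(0.5, 2.0)) for _ in range(M)]
X = np.stack([np.r_[features(sim, A), 1.0] for sim in m_sims])  # inputs F_i(A)
Y = np.stack([best_weights(sim) for sim in m_sims])             # labels w_i*

theta = np.zeros((X.shape[1], T))     # a linear regression model theta
for _ in range(5000):                 # gradient descent on the squared loss
    theta -= 0.005 * X.T @ (X @ theta - Y) / M   # small step for stability
```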
Step 4: in the simulator, randomly generate N robots with different performance parameters, and optimize on the N robots the optimal action string A*, i.e. the string of actions that maximizes the total reward obtained when each robot executes A, its resulting feature F_i(A) is input to the regression model θ, and the robot then performs the task under the combined strategy with the predicted weights θ(F_i(A)):
A* = argmax_A Σ_{i=1}^{N} E_{τ∼p(τ|π_{θ(F_i(A))})} [R(τ)];
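Step 4 can then be sketched as a derivative-free search over candidate action strings: each candidate A is scored by probing N fresh robots, predicting their weights with θ, and averaging the resulting returns. Random search again stands in for SRACOS in this illustration, and the budget and clipping below are assumptions.

```python
import numpy as np

def predict_weights(theta, f):
    """Weights proposed by the regression model theta for feature vector f."""
    w = np.r_[f, 1.0] @ theta
    return np.clip(w, 1e-6, 1.0)    # keep the weights inside [0, 1]

N = 20
n_sims = [ReachSimulator(motor_gain=np.random.uniform(0.5, 2.0)) for _ in range(N)]

def score_actions(A_cand):
    """Average return over the N robots when each is probed with A_cand and
    then run under the weights the model predicts from its feature."""
    total = 0.0
    for sim in n_sims:
        w = predict_weights(theta, features(sim, A_cand))
        total += rollout_return(sim, w)
    return total / N

best_A, best_score = None, -np.inf
for _ in range(200):                        # random search in place of SRACOS
    A_cand = np.random.uniform(-1.0, 1.0, size=5)
    s = score_actions(A_cand)
    if s > best_score:
        best_A, best_score = A_cand, s      # best_A approximates A*
```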
Step 5: in the same task, make a robot with unknown, differing performance parameters execute the optimal action A*, take its output states as the feature F(A*), and input this feature to the regression model θ to obtain the combination weights w* = θ(F(A*)), thereby obtaining the robot's optimal action strategy π_{w*}.
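Finally, step 5 reduces to a few lines of the sketch: probe a robot with unknown parameters using A*, feed the observed states to θ, and run the combined strategy under the predicted weights. The specific parameter values below are, of course, hypothetical.

```python
import numpy as np

# A robot whose performance parameters are unknown to the controller.
unknown_robot = ReachSimulator(motor_gain=1.37, noise_std=0.08)

f = features(unknown_robot, best_A)   # probe with A*, read the output states
w_star = predict_weights(theta, f)    # w* = theta(F(A*))

s = unknown_robot.reset()             # run the adapted strategy pi_w*
for _ in range(50):
    a = combined_policy(w_star, s)
    s, _ = unknown_robot.step(a)
```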
The reinforcement learning algorithm used in the method is the trust region policy optimization algorithm (TRPO), and the weight coefficients w take values in the range 0 to 1.
The algorithm used in the method to optimize the optimal combination weights of the base strategy set is the sequential randomized coordinate shrinking algorithm (SRACOS), the regression model is optimized with a gradient descent algorithm, and the given string of initial random actions A comprises 5 actions.
The optimal action optimization algorithm used in the method is the sequential randomized coordinate shrinking algorithm (SRACOS), and the optimal action A* comprises 5 actions.
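For readers unfamiliar with SRACOS, the following is a loose, simplified sketch in the spirit of randomized coordinate shrinking: a small pool of the best solutions found so far is kept, the sampling box is shrunk around the current best solution one coordinate at a time, and new samples replace the worst member of the pool. The published SRACOS additionally learns a classification region and mixes in uniform sampling; this sketch omits those details and is not the published algorithm.

```python
import numpy as np

def sracos_like(objective, dim, lower, upper, budget=200, pool=5):
    """Maximize `objective` over a box by sequentially shrinking one randomly
    chosen coordinate of the sampling region around the best solution found."""
    xs = np.random.uniform(lower, upper, size=(pool, dim))
    ys = np.array([objective(x) for x in xs])
    lo = np.full(dim, float(lower))
    hi = np.full(dim, float(upper))
    for _ in range(budget - pool):
        best = xs[np.argmax(ys)]
        k = np.random.randint(dim)                   # coordinate to shrink
        width = hi[k] - lo[k]
        lo[k] = max(float(lower), best[k] - 0.4 * width)
        hi[k] = min(float(upper), best[k] + 0.4 * width)
        x = np.random.uniform(lo, hi)                # sample in the shrunk box
        y = objective(x)
        worst = np.argmin(ys)
        if y > ys[worst]:                            # keep a pool of the best
            xs[worst], ys[worst] = x, y
    return xs[np.argmax(ys)]

# e.g. an alternative to the random search of step 4 above:
# best_A = sracos_like(score_actions, dim=5, lower=-1.0, upper=1.0)
```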