Background
At present, robots are entering people's lives in ever greater numbers to reduce human labor in daily life and to assist people in accomplishing tasks such as navigation, tracking, object grasping, part assembly, and transport of high-risk articles. However, the existing control methods for robots are often fixed strategies: after repeated trials by professionals for a specific task, actions are executed strictly according to a fixed programmed flow, so that no small amount of manpower must be invested before the task can be executed. In addition, since the robots in daily life are of various types and the performance parameters of individuals differ, such as sensor parameters, body parameters, and range-of-motion parameters, even when the same task is performed, a uniform fixed program flow preset by a professional cannot serve every individual well, and each individual must be debugged separately.
In the field of automatic control, a feasible action strategy can be solved and updated in real time by numerical calculation while the robot executes a task, but a large number of distributional assumptions are introduced and the relevant performance parameters of the robot must be input in advance. Although this reduces the manpower required to some extent, human participation is still needed, and the resulting action strategy is very sensitive to the input performance parameters, so high-precision performance parameters of the robot must be supplied when the task is executed. Reinforcement learning offers a further alternative: the robot continuously interacts with the environment in a simulator by trial and error, and its action strategy is optimized until it finally meets the task requirements. However, the action strategy finally learned by reinforcement learning is also highly correlated with the performance parameters of the robot that was trained, so an effective action strategy still cannot be obtained for unknown robots with different performance parameters on the same task.
Therefore, a new technical solution is needed for robot task execution, particularly when the robots performing the same task have unknown and differing performance parameters.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects of the prior art, the invention provides a robot control method based on simulator training, which can obtain action strategies for robots with unknown, differing performance parameters on the same task; under these action strategies the robots can effectively meet the task requirements.
The technical scheme is as follows: a robot control method based on simulator training comprises the following steps:
Step 1: perform simulation modeling of the task environment to be executed, establish a simulator that is the same as or similar to the task environment, and design the four elements of reinforcement learning for the task: state s, action a, reward function R(s, a), and state transition probability P(s'|s, a);
Step 2: in the simulator, randomly generate T robots with different performance parameters, and train each robot separately with a reinforcement learning algorithm to obtain its own action strategy π as a base strategy, finally obtaining the base strategy set Π = {π_1, π_2, ..., π_T} and the combined strategy
π_w(a|s) = Σ_{t=1}^{T} w_t π_t(a|s),
where w = (w_1, ..., w_T) are weight coefficients;
Step 3: in the simulator, randomly generate M robots with different performance parameters, and for each robot i optimize the optimal combination weight of the base strategy set for task execution,
w_i* = argmax_w E_{τ∼p(τ|π_w)} [R(τ)],
where τ is the trajectory formed by the state-action pairs (s_0, a_0, s_1, a_1, ..., s_t, a_t) of the robot while performing the task, p(τ|π_w) is the probability of generating trajectory τ when the robot executes the combined strategy π_w, and R(τ) is the total reward obtained on trajectory τ. Then make each of the M robots execute a given string of initial random actions A, and take the output states of robot i after executing A as its feature F_i(A). Using each robot's feature F_i(A) and optimal combination weight w_i* respectively as the input and label of a regression model, optimize to obtain the optimal regression model θ, i.e.
θ* = argmin_θ Σ_{i=1}^{M} ‖θ(F_i(A)) − w_i*‖²;
Step 4: in the simulator, randomly generate N robots with different performance parameters, and optimize on the N robots the optimal action string A*, i.e. the string of actions that maximizes the total reward obtained when each robot executes A, its resulting feature F_i(A) is input to the regression model θ, and the robot then performs the task under the combined strategy with the predicted weights θ(F_i(A)):
A* = argmax_A Σ_{i=1}^{N} E_{τ∼p(τ|π_{θ(F_i(A))})} [R(τ)];
Step 5: in the same task, make a robot with unknown, differing performance parameters execute the optimal action A*, take its output states as the feature F(A*), and input this feature to the regression model θ to obtain the combination weights w* = θ(F(A*)), thereby obtaining the robot's optimal action strategy π_{w*}.
The reinforcement learning algorithm used in step 2 is the trust region policy optimization algorithm (TRPO), and the weight coefficients w take values in the range 0 to 1.
The algorithm used in step 3 to optimize the optimal combination weights of the base strategy set is the sequential randomized coordinate shrinking algorithm (SRACOS), the regression model is optimized with a gradient descent algorithm, and the given string of initial random actions A comprises 5 actions.
The optimal action optimization algorithm used in step 4 is the sequential randomized coordinate shrinking algorithm (SRACOS), and the optimal action A* comprises 5 actions.
Beneficial effects: the action strategies of some robots are fixed program flows written by professionals after repeated trials in advance, which requires a large amount of manpower. Although an action strategy can instead be solved automatically by numerical calculation, the task requirements can be completed well only if high-precision performance parameters of the robot are input manually. Traditional reinforcement learning, introduced later, greatly reduces manual participation, but because the action strategy it learns is highly correlated with the robot's performance parameters, it does not generalize to unknown robots of different performance on the same task, and no effective action strategy can be obtained for them directly.
Compared with the prior art, the robot control method based on simulator training provided by the invention optimizes a group of optimal actions in the simulator. When a robot with unknown, differing performance parameters executes these actions, a group of output states is obtained, so that the characteristics of the robot can be recognized indirectly; the past action strategies of similar robots are then combined, and the robot's action strategy for the task is obtained directly, effectively meeting the task requirements.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, the robot control method based on simulator training includes the following steps:
Step 1: perform simulation modeling of the task environment to be executed, establish a simulator that is the same as or similar to the task environment, and design the four elements of reinforcement learning for the task: state s, action a, reward function R(s, a), and state transition probability P(s'|s, a);
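By way of illustration only, the following is a minimal Python sketch of step 1 for a hypothetical one-dimensional "reach the origin" task; the task, the class name ReachSimulator, and the parameters motor_gain and noise_std are illustrative assumptions of this sketch, not the actual task environment of the invention. It shows the four elements: state s, action a, reward R(s, a), and stochastic transition P(s'|s, a).

```python
import numpy as np

class ReachSimulator:
    """Toy simulator: a robot must drive its 1-D position to the origin."""

    def __init__(self, motor_gain=1.0, noise_std=0.05):
        # motor_gain and noise_std stand in for the "performance parameters"
        # that differ from robot to robot.
        self.motor_gain = motor_gain
        self.noise_std = noise_std
        self.state = None

    def reset(self):
        # Initial state s, drawn at random.
        self.state = np.random.uniform(-1.0, 1.0)
        return self.state

    def reward(self, state, action):
        # R(s, a): negative distance to the target plus a small action penalty.
        return -abs(state) - 0.01 * action ** 2

    def step(self, action):
        # P(s' | s, a): the next state is reached through this robot's own
        # motor gain plus Gaussian actuation noise.
        r = self.reward(self.state, action)
        self.state = self.state + self.motor_gain * action \
            + np.random.normal(0.0, self.noise_std)
        return self.state, r
```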
Step 2: in the simulator, randomly generate T robots with different performance parameters, and train each robot separately with a reinforcement learning algorithm to obtain its own action strategy π as a base strategy, finally obtaining the base strategy set Π = {π_1, π_2, ..., π_T} and the combined strategy
π_w(a|s) = Σ_{t=1}^{T} w_t π_t(a|s),
where w = (w_1, ..., w_T) are weight coefficients and π_t(a|s) denotes the policy model π_t taking the state s as input and outputting the action a;
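Continuing the illustrative sketch above, the following shows step 2 schematically: T simulated robots with randomized performance parameters, one base policy per robot, and the combined strategy π_w sampled as a mixture of the base policies. The function train_policy is a deliberately trivial stand-in for TRPO training and is an assumption of this sketch, not the training algorithm of the invention.

```python
import numpy as np

def train_policy(sim):
    """Stand-in for TRPO training: returns a stochastic policy pi(a|s)."""
    gain = -1.0 / sim.motor_gain              # a toy "learned" proportional law
    def pi(s):
        return np.random.normal(gain * s, 0.1)   # sample a ~ pi(.|s)
    return pi

T = 10
train_sims = [ReachSimulator(motor_gain=np.random.uniform(0.5, 2.0))
              for _ in range(T)]
base_policies = [train_policy(sim) for sim in train_sims]  # base strategy set

def combined_policy(w, state):
    """Sample an action from pi_w(a|s) = sum_t w_t * pi_t(a|s)."""
    # With each w_t in [0, 1], normalizing w turns the combination into a
    # mixture: pick base policy t with probability w_t / sum(w), then sample.
    p = np.asarray(w, dtype=float)
    p = p / p.sum()
    t = np.random.choice(T, p=p)
    return base_policies[t](state)
```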
Step 3: in the simulator, randomly generate M robots with different performance parameters, and for each robot i optimize the optimal combination weight of the base strategy set for task execution,
w_i* = argmax_w E_{τ∼p(τ|π_w)} [R(τ)],
where τ is the trajectory formed by the state-action pairs (s_0, a_0, s_1, a_1, ..., s_t, a_t) of the robot while performing the task, p(τ|π_w) is the probability of generating trajectory τ when the robot executes the combined strategy π_w, and R(τ) is the total reward obtained on trajectory τ. Then make each of the M robots execute a given string of initial random actions A, and take the output states of robot i after executing A as its feature F_i(A). Using each robot's feature F_i(A) and optimal combination weight w_i* respectively as the input and label of a regression model, optimize to obtain the optimal regression model θ, i.e.
θ* = argmin_θ Σ_{i=1}^{M} ‖θ(F_i(A)) − w_i*‖²;
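Continuing the sketch, step 3 can be illustrated as follows: for each of M simulated robots, search for the combination weights w_i* with the highest average return (plain random search stands in here for SRACOS), probe the robot with a fixed string of 5 random actions A, and fit a regression model θ by gradient descent from the features F_i(A) to the labels w_i*. The linear model, step size, and search budget are assumptions of this sketch.

```python
import numpy as np

def rollout_return(sim, w, horizon=50):
    """R(tau) for one trajectory tau = (s0, a0, s1, a1, ...) under pi_w."""
    s, total = sim.reset(), 0.0
    for _ in range(horizon):
        a = combined_policy(w, s)
        s, r = sim.step(a)
        total += r
    return total

def best_weights(sim, n_trials=200):
    """Search for w_i* (plain random search stands in for SRACOS)."""
    best_w, best_r = None, -np.inf
    for _ in range(n_trials):
        w = np.random.uniform(0.0, 1.0, size=T)
        r = np.mean([rollout_return(sim, w) for _ in range(5)])
        if r > best_r:
            best_w, best_r = w, r
    return best_w

A = np.random.uniform(-1.0, 1.0, size=5)  # the given string of 5 random actions

def features(sim, A):
    """F_i(A): the robot's output states after executing the probe actions A."""
    sim.reset()
    return np.array([sim.step(a)[0] for a in A])

M = 50
m_sims = [ReachSimulator(motor_gain=np.random.uniform(0.5, 2.0)) for _ in range(M)]
X = np.stack([np.r_[features(sim, A), 1.0] for sim in m_sims])  # inputs F_i(A)
Y = np.stack([best_weights(sim) for sim in m_sims])             # labels w_i*

theta = np.zeros((X.shape[1], T))     # a linear regression model theta
for _ in range(5000):                 # gradient descent on the squared loss
    theta -= 0.005 * X.T @ (X @ theta - Y) / M   # small step for stability
```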
Step 4: in the simulator, randomly generate N robots with different performance parameters, and optimize on the N robots the optimal action string A*, i.e. the string of actions that maximizes the total reward obtained when each robot executes A, its resulting feature F_i(A) is input to the regression model θ, and the robot then performs the task under the combined strategy with the predicted weights θ(F_i(A)):
A* = argmax_A Σ_{i=1}^{N} E_{τ∼p(τ|π_{θ(F_i(A))})} [R(τ)];
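Step 4 can then be sketched as a derivative-free search over candidate action strings: each candidate A is scored by probing N fresh robots, predicting their weights with θ, and averaging the resulting returns. Random search again stands in for SRACOS in this illustration, and the budget and clipping below are assumptions.

```python
import numpy as np

def predict_weights(theta, f):
    """Weights proposed by the regression model theta for feature vector f."""
    w = np.r_[f, 1.0] @ theta
    return np.clip(w, 1e-6, 1.0)    # keep the weights inside [0, 1]

N = 20
n_sims = [ReachSimulator(motor_gain=np.random.uniform(0.5, 2.0)) for _ in range(N)]

def score_actions(A_cand):
    """Average return over the N robots when each is probed with A_cand and
    then run under the weights the model predicts from its feature."""
    total = 0.0
    for sim in n_sims:
        w = predict_weights(theta, features(sim, A_cand))
        total += rollout_return(sim, w)
    return total / N

best_A, best_score = None, -np.inf
for _ in range(200):                        # random search in place of SRACOS
    A_cand = np.random.uniform(-1.0, 1.0, size=5)
    s = score_actions(A_cand)
    if s > best_score:
        best_A, best_score = A_cand, s      # best_A approximates A*
```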
Step 5: in the same task, make a robot with unknown, differing performance parameters execute the optimal action A*, take its output states as the feature F(A*), and input this feature to the regression model θ to obtain the combination weights w* = θ(F(A*)), thereby obtaining the robot's optimal action strategy π_{w*}.
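Finally, step 5 reduces to a few lines of the sketch: probe a robot with unknown parameters using A*, feed the observed states to θ, and run the combined strategy under the predicted weights. The specific parameter values below are, of course, hypothetical.

```python
import numpy as np

# A robot whose performance parameters are unknown to the controller.
unknown_robot = ReachSimulator(motor_gain=1.37, noise_std=0.08)

f = features(unknown_robot, best_A)   # probe with A*, read the output states
w_star = predict_weights(theta, f)    # w* = theta(F(A*))

s = unknown_robot.reset()             # run the adapted strategy pi_w*
for _ in range(50):
    a = combined_policy(w_star, s)
    s, _ = unknown_robot.step(a)
```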
The reinforcement learning algorithm used in the method is the trust region policy optimization algorithm (TRPO), and the weight coefficients w take values in the range 0 to 1.
The algorithm used in the method to optimize the optimal combination weights of the base strategy set is the sequential randomized coordinate shrinking algorithm (SRACOS), the regression model is optimized with a gradient descent algorithm, and the given string of initial random actions A comprises 5 actions.
The optimal action optimization algorithm used in the method is the sequential randomized coordinate shrinking algorithm (SRACOS), and the optimal action A* comprises 5 actions.
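For readers unfamiliar with SRACOS, the following is a loose, simplified sketch in the spirit of randomized coordinate shrinking: a small pool of the best solutions found so far is kept, the sampling box is shrunk around the current best solution one coordinate at a time, and new samples replace the worst member of the pool. The published SRACOS additionally learns a classification region and mixes in uniform sampling; this sketch omits those details and is not the published algorithm.

```python
import numpy as np

def sracos_like(objective, dim, lower, upper, budget=200, pool=5):
    """Maximize `objective` over a box by sequentially shrinking one randomly
    chosen coordinate of the sampling region around the best solution found."""
    xs = np.random.uniform(lower, upper, size=(pool, dim))
    ys = np.array([objective(x) for x in xs])
    lo = np.full(dim, float(lower))
    hi = np.full(dim, float(upper))
    for _ in range(budget - pool):
        best = xs[np.argmax(ys)]
        k = np.random.randint(dim)                   # coordinate to shrink
        width = hi[k] - lo[k]
        lo[k] = max(float(lower), best[k] - 0.4 * width)
        hi[k] = min(float(upper), best[k] + 0.4 * width)
        x = np.random.uniform(lo, hi)                # sample in the shrunk box
        y = objective(x)
        worst = np.argmin(ys)
        if y > ys[worst]:                            # keep a pool of the best
            xs[worst], ys[worst] = x, y
    return xs[np.argmax(ys)]

# e.g. an alternative to the random search of step 4 above:
# best_A = sracos_like(score_actions, dim=5, lower=-1.0, upper=1.0)
```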