CN113919475B - Robot skill learning method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113919475B
CN113919475B (application number CN202111537547.1A; application publication CN113919475A)
Authority
CN
China
Prior art keywords
robot
action
skill learning
description information
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111537547.1A
Other languages
Chinese (zh)
Other versions
CN113919475A (en)
Inventor
王睿 (Wang Rui)
张天栋 (Zhang Tiandong)
王宇 (Wang Yu)
王硕 (Wang Shuo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111537547.1A priority Critical patent/CN113919475B/en
Publication of CN113919475A publication Critical patent/CN113919475A/en
Application granted granted Critical
Publication of CN113919475B publication Critical patent/CN113919475B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N 3/008 - Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour

Abstract

The invention discloses a robot skill learning method and device, electronic equipment, and a storage medium. The method comprises the following steps: acquiring environment states at a plurality of consecutive equally spaced moments, each environment state comprising a robot state and a task phase identifier; inputting the environment states at the plurality of consecutive equally spaced moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; and determining the action sequence executed by the robot according to the action description information sequence. By inputting environment states at a plurality of consecutive equally spaced moments into the robot skill learning model to obtain the action description information sequence of the learned skill, the method realizes robot skill learning, overcomes the convergence difficulty and low success rate encountered in multi-stage complex tasks, improves robustness, and achieves efficient and accurate learning of complex robot skills.

Description

Robot skill learning method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, and in particular to a robot skill learning method and device, an electronic device, and a storage medium.
Background
At present, various autonomous intelligent robots are widely applied in manufacturing, the ocean, national defense, and other fields. With the development of robotics and artificial intelligence, the autonomy of robots is continuously improving, and robots can replace humans in completing complex tasks in ever more fields.
As a widely applied robot skill learning algorithm, reinforcement learning uses the robot's interaction with the environment to learn a mapping from states to actions, optimizing a policy network under the guidance of a reward function so as to guide the robot to complete a specified task autonomously. Compared with traditional control methods, existing robot skill learning methods still face many problems and challenges in practical use; in particular, when facing multi-stage complex tasks, problems such as overlong learning time, convergence difficulty, and low success rate easily arise.
In summary, there is a need for a method for robot skill learning to solve the above problems in the prior art.
Disclosure of Invention
In view of the problems of existing methods, the invention provides a robot skill learning method and device, an electronic device, and a storage medium.
In a first aspect, the present invention provides a method of robot skill learning, comprising:
acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier;
inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill;
determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Before inputting the environment states at the plurality of consecutive equally spaced moments into the trained robot skill learning model to obtain the action description information sequence of the robot skill, the method further includes:
acquiring a task of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the dividing difficulty of each of the N subtasks and generating M subtasks includes:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model includes:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
Further, the acquiring a preset number of training sample sets includes:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the determining the action reward according to the second environment state after the action corresponding to the action description information is executed includes:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
In a second aspect, the present invention provides an apparatus for robot skill learning, comprising:
the acquisition module is used for acquiring a plurality of environment states at continuous equal interval time; the environment state comprises a robot state and a task stage identifier;
the processing module is used for inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Further, the processing module is further configured to:
acquiring a task of robot skill learning before the environment states at the continuous equal interval time are input to a trained robot skill learning model to obtain an action description information sequence of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the processing module is specifically configured to:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the processing module is specifically configured to:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
Further, the processing module is specifically configured to:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the processing module is specifically configured to:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for robot skill learning according to the first aspect when executing the computer program.
In a fourth aspect, the invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of robot skill learning as described in the first aspect.
According to the above technical solutions, the robot skill learning method and device, electronic equipment, and storage medium provided by the invention obtain the action description information sequence of the robot's learned skill by inputting environment states at a plurality of consecutive equally spaced moments into the robot skill learning model, thereby realizing robot skill learning, overcoming the convergence difficulty and low success rate encountered in multi-stage complex tasks, improving robustness, and achieving efficient and accurate learning of complex robot skills.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a system framework for a method of robot skill learning provided by the present invention;
FIG. 2 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 3 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 4 is a schematic diagram of a series-parallel course generating method provided by the present invention;
FIG. 5 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 6 is a schematic diagram of a position curve of the underwater robot heading experiment provided by the present invention;
FIG. 7 is a schematic diagram illustrating the variation of the heading control frequency of the underwater robot provided by the present invention;
FIG. 8 is a schematic structural diagram of a device for robot skill learning provided by the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The method for robot skill learning provided by the embodiment of the present invention may be applied to a system architecture as shown in fig. 1, where the system architecture includes a camera 100 and a robot skill learning model 200.
Specifically, the camera 100 is used to acquire the environmental status at a plurality of consecutive equally spaced times.
The environment state includes a robot state and a task phase identifier. The robot skill learning model 200 is used to obtain an action description information sequence of the robot skill after inputting a plurality of environment states at consecutive equal intervals.
Further, the action sequence executed by the robot is determined according to the action description information sequence.
It should be noted that the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
It should be noted that fig. 1 is only an example of a system architecture according to the embodiment of the present invention, and the present invention is not limited to this specifically.
Based on the above illustrated system architecture, fig. 2 is a flowchart corresponding to a method for robot skill learning according to an embodiment of the present invention, as shown in fig. 2, the method includes:
Step 201, environment states at a plurality of consecutive equally spaced moments are acquired.
The environment state includes a robot state and a task phase identifier.
For example, in underwater robot heading skill learning, robot states include robot pose, robot speed, goal point position, and shooting angle.
In one possible embodiment, the robot is a simulated leopard bream underwater robot.
Step 202, inputting a plurality of continuous environment states at equal intervals to the trained robot skill learning model to obtain an action description information sequence of the robot skill.
And step 203, determining the action sequence executed by the robot according to the action description information sequence.
It should be noted that the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
According to the above scheme, the action description information sequence of the robot's learned skill is obtained by inputting environment states at a plurality of consecutive equally spaced moments into the robot skill learning model, thereby realizing robot skill learning, overcoming the convergence difficulty and low success rate encountered in multi-stage complex tasks, improving robustness, and achieving efficient and accurate learning of complex robot skills.
Before the environment states at a plurality of consecutive equally spaced moments are input into the trained robot skill learning model to obtain the action description information sequence of the robot skill, the embodiment of the invention performs the following steps, as shown in fig. 3:
and 301, acquiring a task of robot skill learning.
Step 302, divide the task into N subtasks.
In the embodiment of the invention, the task is divided into a plurality of subtasks according to their logical sequence, and a target is set for each subtask.
Specifically, according to the task of robot skill learning, each subtask is given a parameterized description, which serves as that subtask's target.
In the embodiment of the invention, the target task $T$ is determined and divided, according to the logical order of completion, into $N$ stages, so that the overall task $T$ can be defined as $N$ stage subtasks:

$$T = \{T_1, T_2, \ldots, T_N\}$$

Further, a course target is set for each subtask:

$$G = \{g_1, g_2, \ldots, g_N\}$$

where $g_i$ is the target of the $i$-th subtask.

For example, in a robot heading task, the heading task $T$ is for the underwater robot to push the water polo into a preset goal. Specifically, the heading task $T$ is described as a two-stage task, a preparation stage followed by a shooting stage, i.e. it is divided into two subtasks:

$$T = \{T_1, T_2\}$$
step 303, dividing the difficulty of each of the N subtasks and generating M subtasks.
M, N is a positive integer.
Specifically, a subtask target of each subtask in the N subtasks is obtained;
determining the allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining a plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
In the embodiment of the invention, the course difficulty of each subtask is divided in sequence according to the subtask target.
Specifically, a subtask $T_i$ with course target $g_i$ is divided into $m_i$ course difficulties $D_i = \{d_i^1, d_i^2, \ldots, d_i^{m_i}\}$, each course difficulty being determined as:

$$d_i^j = f(g_i, j), \quad j = 1, 2, \ldots, m_i$$

where $d_i^j$ is the $j$-th course difficulty of subtask $T_i$, $g_i$ is the course target of subtask $T_i$, and $f(\cdot)$ is the difficulty-increasing function, which can be set to be linear, nonlinear, etc.
Further, the course difficulties of the subtasks are combined, and a complete course for training is generated using a series-parallel strategy.
Specifically, from the divided course difficulties $d_i^j$ of each subtask, the course difficulty sequence of the overall task is generated sequentially in a series-parallel manner:

$$C = \{c_1, c_2, \ldots, c_M\}$$

Further, the initial course difficulty $c_1$ of the overall task is generated from the initial difficulties of all the subtasks, and every other course difficulty of the overall task is generated from the difficulty of the previous course together with one newly added subtask course difficulty, as shown in the following formula:

$$c_1 = (d_1^1, d_2^1, \ldots, d_N^1), \qquad c_k = c_{k-1} \oplus d_i^{j+1}$$

where $c_k$ is the $k$-th course difficulty of the overall task, and $c_{k-1} \oplus d_i^{j+1}$ denotes raising exactly one subtask $T_i$ from its current difficulty $d_i^j$ to $d_i^{j+1}$. Under the series-parallel strategy, the parallel difficulty, i.e. the difficulty of the next subtask, is increased first; once it has been raised, the serial difficulty is increased, after which the parallel difficulty continues to increase, the two alternating until every subtask reaches its final difficulty.

Finally, the complete course of the overall task is generated by this iteration, with $M$ difficulties in total for skill training ($M = \sum_i m_i - (N - 1)$ under the alternating scheme), as the sketch after this paragraph illustrates.
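The series-parallel combination reconstructed above can be sketched as follows, assuming the alternating scheme in which exactly one subtask's difficulty is raised per course and the "next" subtask is raised before the current one (the function name and representation are illustrative):

```python
from itertools import cycle

def series_parallel_courses(difficulties):
    """difficulties: per-subtask difficulty lists, e.g.
    [[d11, d12, d13, d14], [d21, d22, d23, d24]].
    Returns course tuples c_1..c_M, raising one subtask's difficulty
    per step until every subtask reaches its final difficulty.
    """
    idx = [0] * len(difficulties)          # current difficulty index per subtask
    courses = [tuple(d[i] for d, i in zip(difficulties, idx))]
    # round-robin over subtasks, the "next" (parallel) subtask first
    for k in cycle(range(len(difficulties))[::-1]):
        if all(i == len(d) - 1 for d, i in zip(difficulties, idx)):
            break
        if idx[k] < len(difficulties[k]) - 1:
            idx[k] += 1
            courses.append(tuple(d[i] for d, i in zip(difficulties, idx)))
    return courses

d1 = ["d11", "d12", "d13", "d14"]
d2 = ["d21", "d22", "d23", "d24"]
print(series_parallel_courses([d1, d2]))   # 7 courses for the heading task
```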
Taking the robot heading task as an example, in the preparation stage the shooting point position and the shooting angle are determined from the goal and water polo positions. With $p_g = (x_g, y_g)$ and $p_b = (x_b, y_b)$ denoting the goal position and the water polo position respectively, the shooting angle $\psi_s$ and the shooting point $p_s$ are calculated as:

$$\psi_s = \operatorname{atan2}(y_g - y_b, \; x_g - x_b)$$

$$p_s = p_b - d \, (\cos\psi_s, \; \sin\psi_s)$$

where $d$ is the shooting sprint distance reserved for the robot, so that the shooting point lies a distance $d$ behind the ball on the goal-ball line.
Further, the preparation stage subtask is for the robot to move rapidly to the shooting point position and adjust to a proper shooting angle. The parameterized description of this subtask is:

$$g_1: \; \|p - p_s\| < \epsilon_1 \;\;\text{and}\;\; |\psi - \psi_s| < \epsilon_1$$

where $p_s$ is the shooting point position determined from the goal and water polo positions, $\psi_s$ is the shooting angle, $(p, \psi)$ is the pose of the robot, and $\epsilon_1$ is the allowable error.

In the embodiment of the invention, when the error between the robot and the shooting point is smaller than the allowable error and remains stable, the subtask of this stage is judged to be completed. Thus, for the task target $g_1$ of the preparation stage, the task difficulty is adjusted by adjusting the size of $\epsilon_1$: the smaller $\epsilon_1$ is, the higher the task difficulty.
Specifically, during the shooting stage, the robot starts from the shooting point $p_s$ and accelerates forward while keeping its heading within the valid range of the shooting angle $\psi_s$. The parameterized description of this subtask is:

$$g_2: \; |\psi - \psi_s| < \epsilon_2 \;\;\text{and}\;\; v \ge v_{\min}$$

where $v_{\min}$ is the desired minimum heading speed, $v$ is the speed of the robot in the advancing direction, and $\epsilon_2$ is the allowable error. When the error between the robot's heading and the shooting angle is smaller than the allowable error and remains stable, and the speed is higher than the desired minimum heading speed, the subtask of this stage is judged to be completed. Thus, for the task target $g_2$ of the shooting stage, the task difficulty is adjusted by adjusting the size of $\epsilon_2$: the smaller $\epsilon_2$ is, the higher the task difficulty.

Based on this, the objective of the underwater robot heading task is:

$$G = \{g_1(\epsilon_1), \; g_2(\epsilon_2)\}$$
it should be noted that, in the embodiment of the present invention, the real-time position of the underwater robot is acquired by a global vision system, such as a camera
Figure 884143DEST_PATH_IMAGE045
Course of the vehicle
Figure 816196DEST_PATH_IMAGE046
Goal position
Figure 315310DEST_PATH_IMAGE030
And the position of the water ball.
For example, based on the two subtask objectives $g_1$ and $g_2$, each subtask is divided into 4 course difficulties, $\{d_1^1, \ldots, d_1^4\}$ and $\{d_2^1, \ldots, d_2^4\}$. The specific difficulties adopt the difficulty-increasing function:

$$d_i^j = f(g_i, j), \quad j = 1, \ldots, 4$$

where $d_i^j$ is the $j$-th difficulty of subtask $T_i$, $g_i$ is the target of subtask $T_i$, and $f(\cdot)$ is the difficulty-increasing function, for which an exponential function is adopted in the embodiment of the invention.

Further, the course difficulties of the subtasks are combined, and a complete course for training is generated using the series-parallel strategy.

Specifically, from the divided difficulties $d_i^j$ of each subtask, the course difficulty sequence of the overall task is generated sequentially in the series-parallel manner:

$$C = \{c_1, c_2, \ldots, c_7\}$$

thus obtaining 7 sub-courses.
Further, fig. 4 exemplarily shows a schematic diagram of a series-parallel course generation method for the underwater robot heading skill learning according to the embodiment of the present invention.
For example, suppose the 4 course difficulties of subtask $T_1$ are $\{d_1^1, d_1^2, d_1^3, d_1^4\}$ and the 4 course difficulties of subtask $T_2$ are $\{d_2^1, d_2^2, d_2^3, d_2^4\}$. The series-parallel strategy is adopted as follows:

$$c_1 = (d_1^1, d_2^1), \qquad c_k = c_{k-1} \oplus d_i^{j+1}$$

where $c_k$ is the $k$-th course difficulty of the overall task. Using the series-parallel training strategy, the parallel difficulty, i.e. the difficulty of the next subtask, is increased first; then the serial difficulty is increased, after which the parallel difficulty continues to increase. The course difficulty sequence of the overall task is generated in this order; for example, the sub-course generating steps are:

$$c_1 = (d_1^1, d_2^1)$$
$$c_2 = (d_1^1, d_2^2)$$
$$c_3 = (d_1^2, d_2^2)$$
$$c_4 = (d_1^2, d_2^3)$$
$$c_5 = (d_1^3, d_2^3)$$
$$c_6 = (d_1^3, d_2^4)$$
$$c_7 = (d_1^4, d_2^4)$$

Based on this, the complete course of the overall task is generated by iteration, with $4 + 4 - 1 = 7$ sub-courses of increasing difficulty in total for robot skill training.
According to the above scheme, guided by the task target, the robot trains the model through series-parallel courses of gradually increasing difficulty, learning and mastering the skill, which avoids the low success rate and convergence difficulty caused by training directly at high difficulty.
And step 304, training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
According to the above scheme, guided by the task target, the robot trains the model through multi-stage series-parallel courses of gradually increasing difficulty, learning and mastering the skill, which avoids the low success rate and convergence difficulty caused by training directly at high difficulty.
In the embodiment of the invention, each sub-course is taken as a training target, and the robot skill learning model is trained by using a reinforcement learning method.
In one possible implementation, the robot skill learning model is trained using the Soft Actor-Critic (SAC) model-free reinforcement learning algorithm.
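For reference, one SAC gradient step has roughly the following shape; this is the standard algorithm rather than patent-specific pseudocode, and the actor/critic networks, their `sample` method, and the optimizers are assumed to exist:

```python
import torch
import torch.nn.functional as F

def sac_update(batch, actor, q1, q2, q1_targ, q2_targ,
               alpha, gamma, actor_opt, critic_opt):
    s, a, r, s2 = batch                      # tensors for (s_t, a_t, r_t, s_{t+1})
    with torch.no_grad():
        a2, logp2 = actor.sample(s2)         # reparameterized action and log-prob
        q_next = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (q_next - alpha * logp2)      # soft Bellman target
    critic_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    a_new, logp = actor.sample(s)
    actor_loss = (alpha * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```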
Further, the flow of steps is shown in fig. 5, which specifically includes the following steps:
step 501, for each sub-course, a preset number of training sample sets are obtained.
It should be noted that each set of training samples includes a first environment state, action description information, a second environment state, and an action reward; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed.
Specifically, a first environment state and action description information are obtained;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
In the embodiment of the invention, the robot skill learning model outputs the action description information according to the first environment state, and controls the interaction between the robot and the environment.
Taking the robot heading skill learning as an example, first the first environment state $s_t$ is acquired and input into the current robot skill learning model, which outputs the action description information $a_t$ used to control the undulation frequency of each fin of the underwater robot, thereby guiding the underwater robot to complete the heading task.

The state $s_t$ is composed of the robot pose $(x, y, \psi)$, the speed $v$, the shooting point position and shooting angle $(p_s, \psi_s)$, and the task phase identifier $k$. The action $a_t$ consists of the undulation frequency of each fin of the robot, which is input into the fin controller to control the robot's swimming.
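For concreteness, the state and action described above could be packed as plain vectors as follows (the field layout and fin count are illustrative assumptions):

```python
import numpy as np

def build_state(pose, speed, shoot_point, shoot_angle, phase_id):
    """pose = (x, y, psi); shoot_point = (xs, ys); phase_id in {1, 2}."""
    return np.array([*pose, speed, *shoot_point, shoot_angle, phase_id],
                    dtype=np.float32)

s_t = build_state(pose=(1.0, 0.5, 0.2), speed=0.1,
                  shoot_point=(2.0, 1.0), shoot_angle=0.35, phase_id=1)
# action: one undulation frequency per fin, sent to the fin controller
a_t = np.array([1.2, 1.2, 0.8], dtype=np.float32)  # e.g. three fins, in Hz
```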
Further, acquiring a robot state and a task stage identifier corresponding to the second environment state;
an action reward is determined based on the robot state and the task phase identifier.
After the action $a_t$ has been executed for one time step, the second environment state $s_{t+1}$ is acquired, and the corresponding action reward $r_t$ is obtained from the environment. Together these compose the experience $(s_t, a_t, r_t, s_{t+1})$, which is stored into an experience pool for off-line training of the reinforcement learning algorithm.
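The experience pool mentioned above is an ordinary replay buffer; a minimal sketch with uniform sampling (class and method names are this presentation's choice):

```python
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)    # old experiences are evicted first

    def store(self, s, a, r, s2):
        self.buf.append((s, a, r, s2))       # one experience (s_t, a_t, r_t, s_{t+1})

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)
```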
Taking the robot heading skill learning as an example, the action reward is computed piecewise over the task phases:

$$r_t = \begin{cases} w_1\, r_{dist} + w_2\, r_{angle}, & k = 1 \;\text{(preparation stage)} \\ w_3\, r_{speed} + w_4\, r_{head}, & k = 2 \;\text{(shooting stage)} \\ r_{goal}, & \text{sub-course task completed} \end{cases}$$

where $w_1, \ldots, w_4$ are the weight coefficients of each part. When $k = 1$, in the preparation stage, $r_{dist}$ guides the robot to move to the shooting point position and $r_{angle}$ guides the robot to adjust to a proper shooting angle; when $k = 2$, in the shooting stage, $r_{speed}$ guides the robot to accelerate the sprint and $r_{head}$ guides the robot to remain within the range of shooting angles; when the sub-course task is completed and the goal is reached, the completion reward $r_{goal}$ is given.
According to the scheme, the action reward is designed according to the state of the robot and the task phase identifier, and efficient and accurate learning of the robot on complex skills is achieved.
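A minimal sketch of the staged reward just described; the weights, component shapes, and completion bonus are illustrative assumptions, since the patent's exact coefficients are not reproduced in the text:

```python
import math

def heading_reward(state, phase_id, v_min, w=(1.0, 0.5, 1.0, 0.5)):
    """Piecewise reward keyed on the task phase identifier k."""
    (x, y, psi), v, (xs, ys), psi_s = state
    if phase_id == 1:                        # preparation stage
        r_dist = -math.hypot(x - xs, y - ys)      # approach the shooting point
        r_angle = -abs(psi - psi_s)               # align to the shooting angle
        return w[0] * r_dist + w[1] * r_angle     # (angle wrap omitted for brevity)
    if phase_id == 2:                        # shooting stage
        r_speed = min(v / v_min, 1.0)             # accelerate the sprint
        r_head = -abs(psi - psi_s)                # stay within the angle range
        return w[2] * r_speed + w[3] * r_head
    return 10.0                              # sub-course task completed: goal bonus
```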
And 502, determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward.
And 503, updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result.
In the embodiment of the invention, the performance of the current robot skill learning model is evaluated by the performance evaluation module, and the increase of course difficulty is controlled by the course scheduling module according to the performance evaluation result.
Specifically, the performance evaluation module tests the current robot skill learning model, measuring the success rate and time efficiency with which the robot completes the heading and shooting task under the current policy network, and thereby evaluates the performance of the model trained at the current sub-course difficulty.
Step 504, if the performance evaluation result or the training time reaches the threshold, repeat the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold, obtaining the trained robot skill learning model.
Further, according to the performance evaluation result, the course scheduling module judges whether to switch to the next course difficulty: if the performance or the training time reaches a threshold value, the course difficulty is increased according to the set course, thereby controlling the progression of course difficulty, as the sketch below illustrates.
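The scheduling rule can be sketched as a loop over the generated sub-courses; the thresholds and the two callbacks are assumptions of this illustration:

```python
def run_curriculum(courses, train_for_a_while, evaluate,
                   success_threshold=0.9, step_budget=200_000):
    """Switch to the next course difficulty once performance or
    training time reaches its threshold, as described above."""
    for difficulty in courses:
        steps = 0
        while True:
            steps += train_for_a_while(difficulty)   # returns steps consumed
            if evaluate(difficulty) >= success_threshold:
                break                                # performance threshold reached
            if steps >= step_budget:
                break                                # training-time threshold reached
```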
For example, when the difficulty reaches the preset target and the robot performance evaluation passes, the trained robot skill learning model is obtained: once the course difficulty reaches the preset target, i.e. the 7th course difficulty $c_7$, and the performance of the robot skill learning model meets the requirement, the robot has achieved skill learning of the heading task.
In the embodiment of the invention, according to the goal of the heading task, the robot can train the model through multi-stage and series-parallel courses with gradually increasing difficulty, learn and master the heading skill, and avoid the problems of low success rate, difficult convergence and the like caused by direct high-difficulty training.
According to the above scheme, reinforcement learning lets the network learn a more robust model more quickly, so that the robot can learn complex skills efficiently and accurately, solving the convergence difficulty and low success rate of existing skill learning methods in the face of multi-stage complex tasks.
Further, to verify effectiveness, a heading task verification may be performed, for example, in an indoor pool of 5 m × 4 m × 1.1 m. The global visual tracking system installed at the top of the pool is connected to the console via USB; by processing images of the goal, the water polo, the robot, and the surrounding environment, the console computes the current position and heading of the robot and the positions of the goal and the water polo in real time, and from the goal and water polo positions it computes the shooting point position and shooting angle. The console thus obtains the real-time environment state, feeds it into the trained robot skill learning model to obtain the action, and sends the action to the fin controller in the robot over wireless communication as the motion control. The verification results of the robot heading task are given in fig. 6 and fig. 7. Fig. 6 exemplarily shows the position curve of the underwater robot heading experiment, and fig. 7 exemplarily shows the variation of the underwater robot heading control frequency. It can be seen that the embodiment of the invention enables the robot to move to the shooting point and adjust the shooting angle quickly, then accelerate along the shooting angle direction and push the water polo into the goal.
Based on the above technical scheme, the embodiment of the invention trains the heading policy network through multi-stage series-parallel courses of gradually increasing difficulty, solving the convergence difficulty and low success rate that existing skill learning methods easily encounter in multi-stage complex tasks, and enabling the robot, through training, to learn complex skills such as heading efficiently and accurately and push the water polo into the goal.
Based on the same inventive concept, fig. 8 exemplarily shows a robot skill learning device, which can execute the flow of the robot skill learning method according to the embodiment of the present invention.
The apparatus comprises:
an obtaining module 801, configured to obtain environment states at multiple consecutive equal-interval moments; the environment state comprises a robot state and a task stage identifier;
the processing module 802 is configured to input the plurality of continuous environment states at equal intervals to a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Further, the processing module 802 is further configured to:
acquiring a task of robot skill learning before the environment states at the continuous equal interval time are input to a trained robot skill learning model to obtain an action description information sequence of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the processing module 802 is specifically configured to:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the processing module 802 is specifically configured to:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
Further, the processing module 802 is specifically configured to:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the processing module 802 is specifically configured to:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device, which specifically includes the following components, with reference to fig. 9: a processor 901, memory 902, communication interface 903, and communication bus 904;
the processor 901, the memory 902 and the communication interface 903 complete mutual communication through the communication bus 904; the communication interface 903 is used for realizing information transmission among the devices;
the processor 901 is configured to call a computer program in the memory 902, and the processor executes the computer program to implement all the steps of the above-mentioned method for robot skill learning, for example, the processor executes the computer program to implement the following steps: acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier; inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Based on the same inventive concept, a further embodiment of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs all the steps of the above-described method of robot skill learning; for example, the processor performs the following steps when executing the computer program: acquiring environment states at a plurality of consecutive equally spaced moments, each environment state comprising a robot state and a task phase identifier; inputting the environment states at the plurality of consecutive equally spaced moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; and determining the action sequence executed by the robot according to the action description information sequence. The trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a robot skill learning apparatus, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions may be essentially or partially implemented in the form of software products, which may be stored in computer readable storage media, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a robot skill learning apparatus, or a network device, etc.) to execute the method for robot skill learning according to the embodiments or some parts of the embodiments.
In addition, in the present invention, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Moreover, in the present invention, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Furthermore, in the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of robotic skill learning, comprising:
acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier;
inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill;
determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states;
before the step of inputting the environment states of the plurality of continuous equal-interval moments into the trained robot skill learning model to obtain the action description information sequence of the robot skill, the method further comprises:
acquiring a task of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
2. The method of robotic skill learning of claim 1, wherein said dividing the difficulty of each of the N subtasks and generating M sub-courses comprises:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
3. The method of claim 1, wherein the training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model comprises:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
4. The method of robotic skill learning according to claim 3, wherein said obtaining a preset number of training sample sets comprises:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
5. The method of robot skill learning of claim 4, wherein the determining the action reward according to the second environmental status after the action corresponding to the action description information is performed comprises:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
6. An apparatus for robot skill learning, comprising:
an acquisition module, used for acquiring environment states at a plurality of consecutive equal-interval moments; the environment state comprises a robot state and a task stage identifier;
a processing module, used for inputting the environment states at the plurality of consecutive equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill learning, and for determining an action sequence executed by the robot according to the action description information sequence;
wherein the trained robot skill learning model is obtained by training on different environment states and on performance evaluation results obtained when the robot executes action sequences under the different environment states;
the processing module is further used for, before the environment states at the plurality of consecutive equal-interval moments are input into the trained robot skill learning model to obtain the action description information sequence of the robot skill learning:
acquiring a task of robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses, wherein M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
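At inference time, the two modules of claim 6 reduce to a short pipeline: sample environment states at equal intervals, run the trained model over the window, and execute the decoded actions. The sketch below assumes caller-supplied read_env_state and execute_action functions and a model that maps a state window to an action-description sequence; none of these interfaces are specified by the patent.

```python
import time
import torch

def run_skill(model, read_env_state, execute_action, window=4, dt=0.05):
    """read_env_state() returns one environment-state tensor (robot state plus
    task stage identifier); execute_action() sends one decoded action."""
    # acquisition module: environment states at consecutive equal-interval moments
    states = []
    for _ in range(window):
        states.append(read_env_state())
        time.sleep(dt)
    # processing module: the trained model maps the state window to an
    # action description information sequence
    with torch.no_grad():
        action_descriptions = model(torch.stack(states))
    # determine and execute the robot's action sequence from the descriptions
    for description in action_descriptions:
        execute_action(description)
```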
7. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any one of claims 1 to 5 are implemented when the processor executes the program.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN202111537547.1A 2021-12-16 2021-12-16 Robot skill learning method and device, electronic equipment and storage medium Active CN113919475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111537547.1A CN113919475B (en) 2021-12-16 2021-12-16 Robot skill learning method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113919475A (en) 2022-01-11
CN113919475B (en) 2022-04-08

Family

ID=79248964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111537547.1A Active CN113919475B (en) 2021-12-16 2021-12-16 Robot skill learning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113919475B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114609925B (en) * 2022-01-14 2022-12-06 中国科学院自动化研究所 Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish


Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN113168566A (en) * 2018-11-30 2021-07-23 谷歌有限责任公司 Controlling a robot by using entropy constraints
CN111461309A (en) * 2020-04-17 2020-07-28 支付宝(杭州)信息技术有限公司 Method and device for updating reinforcement learning system for realizing privacy protection
CN113156892A (en) * 2021-04-16 2021-07-23 西湖大学 Four-footed robot simulated motion control method based on deep reinforcement learning
CN113487039A (en) * 2021-06-29 2021-10-08 山东大学 Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning

Non-Patent Citations (1)

Title
Richard Li, "Towards Practical Multi-Object Manipulation using Relational Reinforcement Learning," 2020 IEEE International Conference on Robotics and Automation, Aug. 2020, pp. 4051-4056. *


Similar Documents

Publication Publication Date Title
CN111488988B Control strategy imitation learning method and device based on adversarial learning
CN108115681B Imitation learning method and device for robot, robot and storage medium
CN107102644B Underwater robot trajectory control method and control system based on deep reinforcement learning
CN109523029A Adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
US11759947B2 (en) Method for controlling a robot device and robot device controller
CN111783994A (en) Training method and device for reinforcement learning
CN113561986A Decision-making method and device for autonomous driving vehicles
CN113070878B (en) Robot control method based on impulse neural network, robot and storage medium
CN113919475B (en) Robot skill learning method and device, electronic equipment and storage medium
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
CN113641099B (en) Impedance control imitation learning training method for surpassing expert demonstration
CN114219066A Unsupervised reinforcement learning method and device based on Wasserstein distance
Gromniak et al. Deep reinforcement learning for mobile robot navigation
CN116352715A (en) Double-arm robot cooperative motion control method based on deep reinforcement learning
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN113910221B (en) Mechanical arm autonomous motion planning method, device, equipment and storage medium
Sabathiel et al. A computational model of learning to count in a multimodal, interactive environment.
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN110515297B (en) Staged motion control method based on redundant musculoskeletal system
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling
CN114571456B (en) Electric connector assembling method and system based on robot skill learning
Subramanian Task space behavior learning for humanoid robots using Gaussian mixture models
CN115496208B Unsupervised multi-agent reinforcement learning method with diversified and guided cooperation modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant