CN113919475B - Robot skill learning method and device, electronic equipment and storage medium - Google Patents
Robot skill learning method and device, electronic equipment and storage medium
Info
- Publication number
- CN113919475B (application CN202111537547.1A)
- Authority
- CN
- China
- Prior art keywords
- robot
- action
- skill learning
- description information
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
Abstract
The invention discloses a method, a device, electronic equipment and a storage medium for robot skill learning, wherein the method comprises the following steps: acquiring environment states of a plurality of continuous equal-interval moments, the environment state comprising a robot state and a task stage identifier; inputting the plurality of continuous equal-interval environment states into a trained robot skill learning model to obtain an action description information sequence of the robot skill; and determining the action sequence executed by the robot according to the action description information sequence. By inputting a plurality of continuous equal-interval environment states into the robot skill learning model and obtaining the action description information sequence of the learned skill, robot skill learning is realized, the problems of difficult convergence and low success rate that arise when facing multi-stage complex tasks are solved, robustness is improved, and efficient and accurate learning of complex robot skills is achieved.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for robot skill learning, electronic equipment and a storage medium.
Background
At present, various autonomous intelligent robots are widely applied to the fields of manufacturing, oceans, national defense and the like. With the development of robots and artificial intelligence technologies, the autonomous ability of the robots is continuously improved, and the robots can replace human beings to complete complex tasks in more fields.
As a widely applied robot skill learning approach, reinforcement learning uses the interaction between the robot and the environment to learn the mapping from states to actions, and optimizes an optimal policy network under the guidance of a reward function to guide the robot to autonomously complete a specified task. Compared with traditional control methods, existing robot skill learning methods still face many problems and challenges in practical use; in particular, when facing multi-stage complex tasks, problems such as overlong learning time, difficulty in convergence, and low success rate easily occur.
In summary, there is a need for a method for robot skill learning to solve the above problems in the prior art.
Disclosure of Invention
Due to the problems of the existing methods, the invention provides a method, a device, an electronic device and a storage medium for robot skill learning.
In a first aspect, the present invention provides a method of robot skill learning, comprising:
acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier;
inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill;
determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences under the different environment states.
Before inputting the environment states of the plurality of consecutive equal-interval moments into the trained robot skill learning model to obtain the action description information sequence of the robot skill, the method further includes:
acquiring a task of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the dividing the difficulty of each of the N subtasks and generating M sub-courses includes:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model includes:
for the 1st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the steps for the 2nd sub-course until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, and obtaining the trained robot skill learning model.
Further, the acquiring a preset number of training sample sets includes:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the determining the action reward according to the second environment state after the action corresponding to the action description information is executed includes:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
In a second aspect, the present invention provides an apparatus for robot skill learning, comprising:
the acquisition module is used for acquiring a plurality of environment states at continuous equal interval time; the environment state comprises a robot state and a task stage identifier;
the processing module is used for inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences under the different environment states.
Further, the processing module is further configured to:
acquiring a task of robot skill learning before the environment states at the continuous equal interval time are input to a trained robot skill learning model to obtain an action description information sequence of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the processing module is specifically configured to:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the processing module is specifically configured to:
for the 1st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the steps for the 2nd sub-course until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, and obtaining the trained robot skill learning model.
Further, the processing module is specifically configured to:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the processing module is specifically configured to:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for robot skill learning according to the first aspect when executing the computer program.
In a fourth aspect, the invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of robot skill learning as described in the first aspect.
According to the technical scheme, with the robot skill learning method, device, electronic equipment and storage medium provided by the invention, the action description information sequence of the robot skill is obtained by inputting a plurality of continuous equal-interval environment states into the robot skill learning model, so that robot skill learning is realized, the problems of difficult convergence and low success rate that arise when facing multi-stage complex tasks are solved, robustness is improved, and efficient and accurate learning of complex robot skills is realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a system framework for a method of robot skill learning provided by the present invention;
FIG. 2 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 3 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 4 is a schematic diagram of a series-parallel course generating method provided by the present invention;
FIG. 5 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 6 is a schematic diagram of the position curve of the underwater robot during the ball-heading experiment provided by the invention;
FIG. 7 is a schematic diagram illustrating the variation of the control frequencies of the underwater robot during the ball-heading experiment provided by the present invention;
FIG. 8 is a schematic structural diagram of a device for robot skill learning provided by the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The method for robot skill learning provided by the embodiment of the present invention may be applied to a system architecture as shown in fig. 1, where the system architecture includes a camera 100 and a robot skill learning model 200.
Specifically, the camera 100 is used to acquire the environmental status at a plurality of consecutive equally spaced times.
The environment state includes a robot state and a task phase identifier. The robot skill learning model 200 is used to obtain an action description information sequence of the robot skill after inputting a plurality of environment states at consecutive equal intervals.
Further, the action sequence executed by the robot is determined according to the action description information sequence.
It should be noted that the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences in the different environment states.
It should be noted that fig. 1 is only an example of a system architecture according to the embodiment of the present invention, and the present invention is not limited to this specifically.
Based on the above illustrated system architecture, fig. 2 is a flowchart corresponding to a method for robot skill learning according to an embodiment of the present invention, as shown in fig. 2, the method includes:
in step 201, environmental states at a plurality of consecutive equal intervals are obtained.
The environment state includes a robot state and a task phase identifier.
For example, in underwater robot heading skill learning, robot states include robot pose, robot speed, goal point position, and shooting angle.
In one possible embodiment, the robot is a simulated leopard bream underwater robot.
In step 202, the environment states of the plurality of continuous equal-interval moments are input into the trained robot skill learning model to obtain the action description information sequence of the robot skill.
And step 203, determining the action sequence executed by the robot according to the action description information sequence.
It should be noted that the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences in the different environment states.
According to the scheme, the action description information sequence of the learned skill is obtained by inputting a plurality of continuous equal-interval environment states into the robot skill learning model, so that robot skill learning is realized, the problems of difficult convergence and low success rate that arise when facing multi-stage complex tasks are solved, robustness is improved, and efficient and accurate learning of complex robot skills is realized.
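To make steps 201-203 concrete, the following is a minimal Python sketch of the run-time loop. The model interface `model.predict(...)`, the state-window size, and the helper callables `get_env_state` and `execute_action` are assumptions for illustration only and are not part of the embodiment.

```python
import numpy as np

def run_skill(model, get_env_state, execute_action, horizon, window=4):
    """Illustrative closed loop: collect several consecutive, equally spaced environment
    states, query the trained skill-learning model, and execute the returned action.
    model.predict, get_env_state, execute_action and window are assumed interfaces."""
    states = [get_env_state() for _ in range(window)]         # step 201: consecutive equal-interval states
    action_sequence = []
    for _ in range(horizon):
        action_desc = model.predict(np.concatenate(states))   # step 202: action description information
        action_sequence.append(action_desc)                   # step 203: action actually executed
        execute_action(action_desc)
        states = states[1:] + [get_env_state()]               # slide the window to the newest states
    return action_sequence
```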
Before a plurality of environment states at continuous equal intervals are input into a trained robot skill learning model to obtain an action description information sequence of the robot skill, the embodiment of the invention has the following steps as shown in fig. 3:
and 301, acquiring a task of robot skill learning.
And step 302, dividing the task into N subtasks. In the embodiment of the invention, the task is divided into a plurality of subtasks according to the logical sequence, and a target is set for each subtask.
Specifically, according to the task of robot skill learning, each subtask is parameterized and described as the target of each subtask.
In the embodiment of the invention, the target task $T$ is determined and divided, according to the logical order of completion, into $N$ phases, so that the overall task can be defined as $N$ phase subtasks, represented by the following equation: $T = \{T_1, T_2, \ldots, T_N\}$.
Further, a course target $g_i$ is set for each subtask $T_i$, $i = 1, \ldots, N$.
For example, in a robot ball-heading task, the heading task $T$ is for the underwater robot to push the water polo into a preset goal.
In particular, the heading task $T$ is described as a two-stage task consisting of a preparation stage and a shooting stage, which is divided into two subtasks expressed as follows: $T = \{T_1, T_2\}$, where $T_1$ is the preparation-stage subtask and $T_2$ is the shooting-stage subtask.
And step 303, dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers.
Specifically, a subtask target of each subtask in the N subtasks is obtained;
determining the allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining a plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
In the embodiment of the invention, the course difficulty of each subtask is divided in sequence according to the subtask target.
In particular, the course target $g_i$ of subtask $T_i$ is divided into $K_i$ course difficulties, and each course difficulty is determined as follows: $d_i^j = f(g_i, j), \; j = 1, 2, \ldots, K_i$,
where $d_i^j$ is the $j$-th course difficulty of subtask $T_i$, $g_i$ is the course target of subtask $T_i$, and $f(\cdot)$ is the difficulty-increasing function, which can be set to be linear, nonlinear, etc.
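As an illustration of the difficulty-increasing function $f$, the sketch below builds a per-subtask schedule of allowable errors. The linear and exponential forms, the tolerance values and the function name are assumptions, since the embodiment only states that $f$ may be linear or nonlinear.

```python
import numpy as np

def course_difficulties(eps_easy, eps_hard, num_courses, mode="exp"):
    """Allowable-error schedule for one subtask: from a loose tolerance down to a tight one.
    A smaller allowable error means a harder course."""
    t = np.linspace(0.0, 1.0, num_courses)
    if mode == "linear":
        return eps_easy + t * (eps_hard - eps_easy)
    return eps_easy * (eps_hard / eps_easy) ** t   # exponential interpolation (assumed form)

# e.g. 4 courses for the preparation subtask: position tolerance from 0.5 m down to 0.05 m
prep_eps = course_difficulties(0.5, 0.05, 4)
```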
And further, combining the difficulty of each subtask, and generating a complete course by utilizing a series-parallel strategy for training.
In particular, the divided course difficulties of each subtask $T_i$ are combined, and the course difficulty sequence $D = \{D_1, D_2, \ldots, D_M\}$ of the overall task is generated sequentially in a series-parallel manner.
Further, the initial course difficulty of the overall task is generated from the initial difficulties of all the subtasks, $D_1 = (d_1^1, d_2^1, \ldots, d_N^1)$, and each subsequent course difficulty of the overall task is generated from the difficulty of the previous course together with the newly raised subtask course difficulty,
where $D_m$ is the $m$-th course difficulty of the overall task generated under the series-parallel strategy.
In the embodiment of the invention, the parallel difficulty, namely the difficulty of the later subtask, is increased first; when the corresponding condition is met, the serial difficulty is increased, and then the parallel difficulty continues to increase.
Finally, a complete course of the overall task is generated by iteration, with a total of $M = \sum_{i=1}^{N} K_i - N + 1$ course difficulties for skill training.
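The series-parallel combination can be pictured with the short sketch below. The exact order in which subtask difficulties are raised is only partially specified above, so the ordering implemented here (raise the later, "parallel" subtask first, then the earlier, "serial" one) is an assumption; it does reproduce the 4 + 4 − 1 = 7 sub-courses of the example that follows.

```python
def series_parallel_courses(subtask_levels):
    """Combine per-subtask difficulty lists into one overall course-difficulty sequence.
    Every subtask starts at its easiest level; at each step the last subtask that is not
    yet at its hardest level is raised by one, until all subtasks are at maximum difficulty."""
    idx = [0] * len(subtask_levels)
    courses = [tuple(levels[i] for levels, i in zip(subtask_levels, idx))]
    while any(i < len(levels) - 1 for levels, i in zip(subtask_levels, idx)):
        # pick the last subtask that can still get harder ("parallel" before "serial")
        j = max(k for k, levels in enumerate(subtask_levels) if idx[k] < len(levels) - 1)
        idx[j] += 1
        courses.append(tuple(levels[i] for levels, i in zip(subtask_levels, idx)))
    return courses

# two subtasks with 4 difficulty levels each -> 7 overall sub-courses
curriculum = series_parallel_courses([["d1_1", "d1_2", "d1_3", "d1_4"],
                                      ["d2_1", "d2_2", "d2_3", "d2_4"]])
```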
Taking the robot ball-heading task as an example, in the preparation stage, the shooting point position and the shooting angle are determined according to the goal and water polo positions: the shooting angle is the direction from the water polo toward the goal, and the shooting point lies on the goal-water polo line at the reserved sprint distance behind the water polo, i.e. $\theta_s = \operatorname{atan2}(y_g - y_b,\; x_g - x_b)$ and $p_s = p_b - L\,\dfrac{p_g - p_b}{\lVert p_g - p_b \rVert}$,
where $p_g = (x_g, y_g)$ and $p_b = (x_b, y_b)$ are respectively the goal position and the water polo position, and $L$ is the shoot sprint distance reserved for the robot.
Further, the preparation-stage subtask is that the robot rapidly moves to the shooting point position and adjusts to a proper shooting angle; the parameterized description of the subtask is as follows: $g_1:\ \lVert x_r - (p_s, \theta_s) \rVert \le \epsilon_1$,
where $p_s$ is the shooting point position determined according to the positions of the goal and the water polo, $\theta_s$ is the shooting angle, $x_r$ is the pose (position and heading) of the robot, and $\epsilon_1$ is the allowable error.
In the embodiment of the invention, when the error between the robot and the shooting point is smaller than the allowable error and remains stable, the subtask of this stage is judged to be completed. Thus, the task target $g_1$ of the preparation stage adjusts the task difficulty by adjusting the size of $\epsilon_1$: the smaller $\epsilon_1$ is, the higher the difficulty.
In particular, during the shooting stage, the robot starts from the shooting point $p_s$ and accelerates forward while keeping its heading within the valid range of the shooting angle $\theta_s$; the parameterized description of the subtask is as follows: $g_2:\ |\psi_r - \theta_s| \le \epsilon_2 \ \text{and}\ v_r \ge v_{\min}$,
where $v_{\min}$ is the desired minimum heading speed, $v_r$ is the speed of the robot in the advancing direction, and $\epsilon_2$ is the allowable error. When the error between the heading of the robot and the shooting angle is smaller than the allowable error and remains stable, and the speed is higher than the desired minimum heading speed, the subtask of this stage is judged to be completed. Thus, the task target $g_2$ of the shooting stage adjusts the task difficulty by adjusting the size of $\epsilon_2$: the smaller $\epsilon_2$ is, the higher the difficulty.
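A compact way to read the two parameterized targets $g_1$ and $g_2$ is as completion checks; the sketch below is illustrative, and the argument names and the angle-wrapping convention are assumptions.

```python
import math

def angle_error(a, b):
    """Smallest signed difference between two headings, wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi

def preparation_done(robot_pos, robot_heading, shoot_point, shoot_angle, eps_pos, eps_ang):
    """g1: the robot is at the shooting point and roughly faces the shooting angle."""
    dist = math.hypot(robot_pos[0] - shoot_point[0], robot_pos[1] - shoot_point[1])
    return dist <= eps_pos and abs(angle_error(robot_heading, shoot_angle)) <= eps_ang

def shooting_done(robot_heading, shoot_angle, forward_speed, eps_ang, v_min):
    """g2: the heading stays near the shooting angle while the forward speed exceeds v_min."""
    return abs(angle_error(robot_heading, shoot_angle)) <= eps_ang and forward_speed >= v_min
```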
Based on this, the target of the heading task of the underwater robot is $g = \{g_1, g_2\}$.
It should be noted that, in the embodiment of the present invention, the real-time position of the underwater robot, its heading $\psi_r$, the goal position $p_g$, and the water polo position $p_b$ are acquired by a global vision system, such as a camera.
For example, the targets of the two subtasks are each divided into 4 difficulties, $d_1^1, \ldots, d_1^4$ and $d_2^1, \ldots, d_2^4$; the specific difficulties adopt a difficulty-increasing function, calculated as $d_i^j = f(g_i, j)$,
where $d_i^j$ is the $j$-th difficulty of subtask $T_i$, $g_i$ is the target of subtask $T_i$, and $f(\cdot)$ is the difficulty-increasing function; an exponential function is adopted in the embodiment of the invention.
And further, combining the difficulty of each subtask, and generating a complete course by utilizing a series-parallel strategy for training.
In particular, according to the divided difficulties of each subtask, the course difficulty sequence $D = \{D_1, \ldots, D_7\}$ of the overall task is generated sequentially in a series-parallel manner, thus obtaining 7 sub-courses.
Further, fig. 4 exemplarily shows a schematic diagram of a series-parallel course generation method for the underwater robot heading skill learning according to the embodiment of the present invention.
For example, let the 4 course difficulties of subtask $T_1$ be $d_1^1, d_1^2, d_1^3, d_1^4$, and the 4 course difficulties of subtask $T_2$ be $d_2^1, d_2^2, d_2^3, d_2^4$. The series-parallel strategy is adopted as follows: the initial course difficulty of the overall task is $D_1 = (d_1^1, d_2^1)$, and each subsequent course difficulty is generated from the previous one by raising the difficulty of a single subtask,
where $D_m$ is the $m$-th course difficulty of the overall task generated under the series-parallel training strategy.
In the embodiment of the invention, the parallel difficulty, namely the difficulty of the later subtask, is increased first; when the corresponding condition is met, the serial difficulty is increased, and then the parallel difficulty continues to increase, generating the course difficulty sequence of the overall task in order.
For example, the sub-course generating steps proceed from the initial difficulty $D_1 = (d_1^1, d_2^1)$ to the final difficulty $D_7 = (d_1^4, d_2^4)$.
Based on this, a complete course of the overall task is generated by iteration, with a total of 7 sub-courses of increasing difficulty used for robot skill training.
According to the scheme, according to the task target, the robot trains the model through the series-parallel courses with gradually increased difficulty, learns and masters skills, and solves the problems of low success rate and difficult convergence caused by direct high-difficulty training.
And step 304, training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
According to the scheme, according to the task target, the robot trains the model with gradually increased difficulty through multi-stage series-parallel courses, learns and masters skills, and solves the problems of low success rate and difficult convergence caused by direct high-difficulty training.
In the embodiment of the invention, each sub-course is taken as a training target, and the robot skill learning model is trained by using a reinforcement learning method.
In one possible implementation, the robot skill learning model is trained using the soft actor-critic (SAC) model-free reinforcement learning algorithm.
Further, the flow of steps is shown in fig. 5, which specifically includes the following steps:
Step 501, a preset number of training sample sets are acquired.
It should be noted that each set of training samples includes a first environment state, action description information, a second environment state, and an action reward; the first environment state is the environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; and the action reward is the reward value given after the action corresponding to the action description information is executed.
Specifically, a first environment state and action description information are obtained;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
In the embodiment of the invention, the robot skill learning model outputs the action description information according to the first environment state, and controls the interaction between the robot and the environment.
Taking the robot ball-heading skill learning as an example, first, a first environment state $s_t$ is obtained and input into the current robot skill learning model, and the model outputs action description information $a_t$, which is used for controlling the fluctuation frequency of each fin of the underwater robot so as to guide the underwater robot to complete the heading task.
The state $s_t$ consists of the robot pose, the robot speed, the shooting point position and shooting angle, and the task phase identifier; the action $a_t$ consists of the fluctuation frequency of each fin of the robot, and these frequencies are input into the fin controller to control the swimming of the robot.
Further, acquiring a robot state and a task stage identifier corresponding to the second environment state;
an action reward is determined based on the robot state and the task phase identifier.
After one time step, the second environment state $s_{t+1}$ is acquired and the corresponding action reward $r_t$ is obtained; together with $s_t$ and $a_t$ they form the experience $(s_t, a_t, r_t, s_{t+1})$, which is stored into an experience pool for off-line training of the reinforcement learning algorithm.
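The experience pool mentioned above can be realized, for example, as a simple replay buffer of $(s_t, a_t, r_t, s_{t+1})$ tuples; the class below is an illustrative sketch, not the embodiment's actual data structure.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

class ExperiencePool:
    """Fixed-capacity pool of (s, a, r, s') tuples sampled for off-line RL updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```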
Taking the robot ball-heading skill learning as an example, the action reward is calculated per task phase, of the general form $r_t = \begin{cases} w_1 r_{\mathrm{dist}} + w_2 r_{\mathrm{ang}}, & \text{preparation stage} \\ w_3 r_{\mathrm{speed}} + w_4 r_{\mathrm{ang}}, & \text{shooting stage} \\ r_{\mathrm{goal}}, & \text{goal reached,} \end{cases}$
where $w_1, \ldots, w_4$ are the weight coefficients of each part; when the task phase identifier indicates the preparation stage, $r_{\mathrm{dist}}$ guides the robot to move to the shooting point position and $r_{\mathrm{ang}}$ guides the robot to adjust to a proper shooting angle; when it indicates the shooting stage, $r_{\mathrm{speed}}$ guides the robot to accelerate the sprint and $r_{\mathrm{ang}}$ guides the robot to remain within the range of shooting angles; and when the sub-course task is completed and the water polo reaches the goal, the completion reward $r_{\mathrm{goal}}$ is given.
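The stage-dependent reward described above can be sketched as follows; the individual term definitions, the weights, and the completion bonus value are illustrative assumptions, since the exact formula is specific to the embodiment.

```python
import math

def action_reward(state, phase, shoot_point, shoot_angle, goal_reached,
                  w_dist=1.0, w_ang=0.5, w_speed=1.0, goal_bonus=100.0):
    """Reward shaped per task phase: approach the shooting point in the preparation stage,
    sprint while holding the shooting angle in the shooting stage, bonus when the goal is scored."""
    if goal_reached:
        return goal_bonus
    ang_err = abs((state["heading"] - shoot_angle + math.pi) % (2 * math.pi) - math.pi)
    if phase == "preparation":
        dist = math.hypot(state["x"] - shoot_point[0], state["y"] - shoot_point[1])
        return -w_dist * dist - w_ang * ang_err
    # shooting phase
    return w_speed * state["forward_speed"] - w_ang * ang_err
```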
According to the scheme, the action reward is designed according to the state of the robot and the task phase identifier, and efficient and accurate learning of the robot on complex skills is achieved.
And 502, determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward.
And 503, updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result.
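Steps 502 and 503 can be made concrete with a simplified temporal-difference critic loss. The sketch below uses PyTorch and omits the entropy term and twin critics of the full SAC objective, so it should be read as a schematic of the loss computation and parameter update, not as the embodiment's exact loss function.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State-action value network (critic) mapping (state, action) to a scalar value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def critic_update(q_net, q_target, policy, optimizer, batch, gamma=0.99):
    """One gradient step on the TD loss built from sampled (s, a, r, s') experiences."""
    s, a, r, s_next = batch                      # tensors; r and the Q outputs are shaped [batch, 1]
    with torch.no_grad():
        a_next = policy(s_next)                  # action proposed for the next state
        target = r + gamma * q_target(s_next, a_next)
    loss = nn.functional.mse_loss(q_net(s, a), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```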
In the embodiment of the invention, the performance of the current robot skill learning model is evaluated by the performance evaluation module, and the increase of the course difficulty is controlled by the course scheduling module according to the performance evaluation result.
Specifically, by testing the current robot skill learning model, the performance evaluation module measures the success rate and time efficiency with which the robot completes the ball-heading and shooting task under the current policy network, and can thereby evaluate the performance of the model trained under the current sub-course difficulty.
Further, according to the performance evaluation result, the course scheduling module judges whether to switch to the next course difficulty; if the performance or the training time reaches a threshold value, the course difficulty is increased according to the set curriculum, so that the increase of the course difficulty is controlled.
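The course scheduling module's switching rule can be summarized in a few lines; the threshold values and the function name below are assumptions for illustration.

```python
def maybe_advance_course(course_idx, num_courses, success_rate, train_steps,
                         success_threshold=0.9, max_steps=200_000):
    """Switch to the next, harder sub-course once the performance evaluation result
    or the training time reaches its threshold; report when the final course is done."""
    reached = success_rate >= success_threshold or train_steps >= max_steps
    if not reached:
        return course_idx, False          # keep training on the current sub-course
    if course_idx >= num_courses - 1:
        return course_idx, True           # Mth (final) sub-course passed: training finished
    return course_idx + 1, False          # increase the course difficulty
```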
For example, when the difficulty reaches the preset target and the robot performance evaluation passes, the trained robot skill learning model is obtained. When the course difficulty reaches the preset target, namely the 7th (final) course difficulty, and the performance of the robot skill learning model reaches the requirement, the robot realizes the skill learning of the heading task.
In the embodiment of the invention, according to the goal of the heading task, the robot can train the model through multi-stage and series-parallel courses with gradually increasing difficulty, learn and master the heading skill, and avoid the problems of low success rate, difficult convergence and the like caused by direct high-difficulty training.
According to the scheme, through reinforcement learning the network can learn a more robust model more quickly, so that the robot can efficiently and accurately learn complex skills, and the problems of difficult convergence and low success rate of existing skill learning methods in the face of multi-stage complex tasks are solved.
Further, to verify the validity of the method, a heading task verification can be performed, for example, in an indoor pool of 5 m × 4 m × 1.1 m. The global visual tracking system installed above the pool is connected to the control console through USB; by processing images of the goal, the water polo, the robot and the surrounding environment, the console calculates in real time the current position and heading of the robot and the positions of the goal and the water polo, and from the goal and water polo positions it calculates the shooting point position and the shooting angle. The console thus obtains the real-time environment state, inputs it into the trained robot skill learning model to obtain the action, and sends the action to the fin controller in the robot through wireless communication as the motion control command. The verification results of the robot heading task are given in fig. 6 and fig. 7. Fig. 6 exemplarily shows the position curve of the underwater robot during the ball-heading experiment. Fig. 7 exemplarily shows the variation of the control frequencies of the underwater robot during the ball-heading experiment. It can be seen that with the embodiment of the invention the robot moves to the shooting point and adjusts the shooting angle quickly, then accelerates in the direction of the shooting angle and pushes the water polo into the goal.
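The verification setup described above amounts to a console-side control loop. The sketch below is illustrative only; `vision`, `model`, `radio`, the `build_state` helper, and the 0.1 s period are assumptions rather than the actual experimental software.

```python
import time

def heading_control_loop(vision, model, radio, build_state, period_s=0.1, window=4):
    """Console loop: read the global camera, assemble the environment state (robot pose,
    goal and water-polo positions, task-phase identifier), query the trained model, and
    send the resulting fin fluctuation frequencies to the robot over the wireless link."""
    recent_states = []
    while True:
        observation = vision.read()                 # global visual tracking measurement
        recent_states.append(build_state(observation))
        recent_states = recent_states[-window:]     # keep the last few equally spaced states
        if len(recent_states) == window:
            fin_frequencies = model.predict(recent_states)
            radio.send(fin_frequencies)             # wireless command to the on-board fin controller
        time.sleep(period_s)
```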
Based on the technical scheme, the embodiment of the invention can train the heading strategy network by gradually increasing the difficulty through multi-stage and series-parallel courses, solves the problems of difficult convergence, low success rate and the like easily caused by the existing skill learning method in the face of multi-stage complex tasks, and enables the robot to efficiently and accurately learn the complex skills such as heading and the like through training to jack the water polo into the goal.
Based on the same inventive concept, fig. 8 exemplarily shows a device for robot skill learning, which can be a flow of a method for robot skill learning according to an embodiment of the present invention.
The apparatus, comprising:
an obtaining module 801, configured to obtain environment states at multiple consecutive equal-interval moments; the environment state comprises a robot state and a task stage identifier;
the processing module 802 is configured to input the plurality of continuous environment states at equal intervals to a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences under the different environment states.
Further, the processing module 802 is further configured to:
acquiring a task of robot skill learning before the environment states at the continuous equal interval time are input to a trained robot skill learning model to obtain an action description information sequence of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the processing module 802 is specifically configured to:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the processing module 802 is specifically configured to:
for the 1st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the steps for the 2nd sub-course until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, and obtaining the trained robot skill learning model.
Further, the processing module 802 is specifically configured to:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the processing module 802 is specifically configured to:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device, which specifically includes the following components, with reference to fig. 9: a processor 901, memory 902, communication interface 903, and communication bus 904;
the processor 901, the memory 902 and the communication interface 903 complete mutual communication through the communication bus 904; the communication interface 903 is used for realizing information transmission among the devices;
the processor 901 is configured to call a computer program in the memory 902, and the processor executes the computer program to implement all the steps of the above-mentioned method for robot skill learning, for example, the processor executes the computer program to implement the following steps: acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier; inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences under the different environment states.
Based on the same inventive concept, a further embodiment of the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs all the steps of the above-described method of robot skill learning, e.g. the processor performs the following steps when executing the computer program: acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier; inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence; the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences under the different environment states.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a robot skill learning apparatus, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions may be essentially or partially implemented in the form of software products, which may be stored in computer readable storage media, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a robot skill learning apparatus, or a network device, etc.) to execute the method for robot skill learning according to the embodiments or some parts of the embodiments.
In addition, in the present invention, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Moreover, in the present invention, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Furthermore, in the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (8)
1. A method of robotic skill learning, comprising:
acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier;
inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill;
determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences under the different environment states;
before the step of inputting the plurality of environment states at the continuous equal interval time into the trained robot skill learning model to obtain the motion description information sequence of the robot skill, the method further includes:
acquiring a task of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
2. The method of robotic skill learning of claim 1 wherein said dividing the difficulty of each of the N subtasks and generating M sub-courses comprises:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
3. The method of claim 1, wherein the training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model comprises:
for the 1st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the steps for the 2nd sub-course until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, and obtaining the trained robot skill learning model.
4. The method of robotic skill learning according to claim 3, wherein said obtaining a preset number of training sample sets comprises:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
5. The method of robot skill learning of claim 4, wherein the determining the action reward according to the second environmental status after the action corresponding to the action description information is performed comprises:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
6. An apparatus for robotic skill learning, comprising:
the acquisition module is used for acquiring a plurality of environment states at continuous equal interval time; the environment state comprises a robot state and a task stage identifier;
the processing module is used for inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained through training based on different environment states and the performance evaluation results obtained after the robot executes action sequences under the different environment states;
the processing module is further configured to obtain a task of robot skill learning before the environment states at the multiple consecutive equal-interval moments are input to a trained robot skill learning model to obtain an action description information sequence of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 5 are implemented when the processor executes the program.
8. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111537547.1A CN113919475B (en) | 2021-12-16 | 2021-12-16 | Robot skill learning method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111537547.1A CN113919475B (en) | 2021-12-16 | 2021-12-16 | Robot skill learning method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113919475A CN113919475A (en) | 2022-01-11 |
CN113919475B true CN113919475B (en) | 2022-04-08 |
Family
ID=79248964
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111537547.1A Active CN113919475B (en) | 2021-12-16 | 2021-12-16 | Robot skill learning method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113919475B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114609925B (en) * | 2022-01-14 | 2022-12-06 | 中国科学院自动化研究所 | Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111461309A (en) * | 2020-04-17 | 2020-07-28 | 支付宝(杭州)信息技术有限公司 | Method and device for updating reinforcement learning system for realizing privacy protection |
CN113168566A (en) * | 2018-11-30 | 2021-07-23 | 谷歌有限责任公司 | Controlling a robot by using entropy constraints |
CN113156892A (en) * | 2021-04-16 | 2021-07-23 | 西湖大学 | Four-footed robot simulated motion control method based on deep reinforcement learning |
CN113487039A (en) * | 2021-06-29 | 2021-10-08 | 山东大学 | Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning |
-
2021
- 2021-12-16 CN CN202111537547.1A patent/CN113919475B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113168566A (en) * | 2018-11-30 | 2021-07-23 | 谷歌有限责任公司 | Controlling a robot by using entropy constraints |
CN111461309A (en) * | 2020-04-17 | 2020-07-28 | 支付宝(杭州)信息技术有限公司 | Method and device for updating reinforcement learning system for realizing privacy protection |
CN113156892A (en) * | 2021-04-16 | 2021-07-23 | 西湖大学 | Four-footed robot simulated motion control method based on deep reinforcement learning |
CN113487039A (en) * | 2021-06-29 | 2021-10-08 | 山东大学 | Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
Towards Practical Multi-Object Manipulation using Relational Reinforcement Learning; Richard Li; 2020 IEEE International Conference on Robotics and Automation; 2020-08-31; pp. 4051-4056 *
Also Published As
Publication number | Publication date |
---|---|
CN113919475A (en) | 2022-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111488988B (en) | Control strategy simulation learning method and device based on counterstudy | |
CN108115681B (en) | Simulation learning method and device for robot, robot and storage medium | |
CN107102644B (en) | Underwater robot track control method and control system based on deep reinforcement learning | |
CN109523029A (en) | For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body | |
CN111026272B (en) | Training method and device for virtual object behavior strategy, electronic equipment and storage medium | |
US11759947B2 (en) | Method for controlling a robot device and robot device controller | |
CN111783994A (en) | Training method and device for reinforcement learning | |
CN113561986A (en) | Decision-making method and device for automatically driving automobile | |
CN113070878B (en) | Robot control method based on impulse neural network, robot and storage medium | |
CN113919475B (en) | Robot skill learning method and device, electronic equipment and storage medium | |
CN114290339B (en) | Robot realistic migration method based on reinforcement learning and residual modeling | |
CN114239974B (en) | Multi-agent position prediction method and device, electronic equipment and storage medium | |
CN113641099B (en) | Impedance control imitation learning training method for surpassing expert demonstration | |
CN114219066A (en) | Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance | |
Gromniak et al. | Deep reinforcement learning for mobile robot navigation | |
CN116352715A (en) | Double-arm robot cooperative motion control method based on deep reinforcement learning | |
CN116147627A (en) | Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation | |
CN113910221B (en) | Mechanical arm autonomous motion planning method, device, equipment and storage medium | |
Sabathiel et al. | A computational model of learning to count in a multimodal, interactive environment. | |
CN113419424A (en) | Modeling reinforcement learning robot control method and system capable of reducing over-estimation | |
CN110515297B (en) | Staged motion control method based on redundant musculoskeletal system | |
CN113485107B (en) | Reinforced learning robot control method and system based on consistency constraint modeling | |
CN114571456B (en) | Electric connector assembling method and system based on robot skill learning | |
Subramanian | Task space behavior learning for humanoid robots using Gaussian mixture models | |
CN115496208B (en) | Cooperative mode diversified and guided unsupervised multi-agent reinforcement learning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |