CN113919475B - Robot skill learning method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113919475B
CN113919475B (application number CN202111537547.1A; application publication CN113919475A)
Authority
CN
China
Prior art keywords
robot
action
skill learning
description information
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111537547.1A
Other languages
Chinese (zh)
Other versions
CN113919475A (en)
Inventor
王睿 (Wang Rui)
张天栋 (Zhang Tiandong)
王宇 (Wang Yu)
王硕 (Wang Shuo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111537547.1A priority Critical patent/CN113919475B/en
Publication of CN113919475A publication Critical patent/CN113919475A/en
Application granted granted Critical
Publication of CN113919475B publication Critical patent/CN113919475B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N 3/008 - Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour

Abstract

The invention discloses a robot skill learning method and device, electronic equipment, and a storage medium. The method comprises the following steps: acquiring environment states at a plurality of consecutive equally spaced moments, each environment state comprising a robot state and a task phase identifier; inputting the environment states at the plurality of consecutive equally spaced moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; and determining the action sequence executed by the robot according to the action description information sequence. By inputting environment states at a plurality of consecutive equally spaced moments into the robot skill learning model to obtain the action description information sequence of the learned skill, the method realizes robot skill learning, overcomes the convergence difficulty and low success rate encountered in multi-stage complex tasks, improves robustness, and achieves efficient and accurate learning of complex robot skills.

Description

Robot skill learning method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, and in particular to a robot skill learning method and device, an electronic device, and a storage medium.
Background
At present, various autonomous intelligent robots are widely applied in manufacturing, the ocean, national defense, and other fields. With the development of robotics and artificial intelligence, the autonomy of robots is continuously improving, and robots can replace humans in completing complex tasks in ever more fields.
As a widely applied robot skill learning algorithm, reinforcement learning uses the robot's interaction with the environment to learn a mapping from states to actions, optimizing a policy network under the guidance of a reward function so as to guide the robot to complete a specified task autonomously. Compared with traditional control methods, existing robot skill learning methods still face many problems and challenges in practical use; in particular, when facing multi-stage complex tasks, problems such as overlong learning time, convergence difficulty, and low success rate easily arise.
In summary, there is a need for a method for robot skill learning to solve the above problems in the prior art.
Disclosure of Invention
In view of the problems of existing methods, the invention provides a robot skill learning method and device, an electronic device, and a storage medium.
In a first aspect, the present invention provides a method of robot skill learning, comprising:
acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier;
inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill;
determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Before inputting the environment states at the plurality of consecutive equally spaced moments into the trained robot skill learning model to obtain the action description information sequence of the robot skill, the method further includes:
acquiring a task of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the dividing difficulty of each of the N subtasks and generating M subtasks includes:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model includes:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
Further, the acquiring a preset number of training sample sets includes:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the determining the action reward according to the second environment state after the action corresponding to the action description information is executed includes:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
In a second aspect, the present invention provides an apparatus for robot skill learning, comprising:
the acquisition module is used for acquiring a plurality of environment states at continuous equal interval time; the environment state comprises a robot state and a task stage identifier;
the processing module is used for inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Further, the processing module is further configured to:
acquiring a task of robot skill learning before the environment states at the continuous equal interval time are input to a trained robot skill learning model to obtain an action description information sequence of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the processing module is specifically configured to:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the processing module is specifically configured to:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
Further, the processing module is specifically configured to:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the processing module is specifically configured to:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for robot skill learning according to the first aspect when executing the computer program.
In a fourth aspect, the invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of robot skill learning as described in the first aspect.
According to the above technical solutions, the robot skill learning method and device, electronic equipment, and storage medium provided by the invention obtain the action description information sequence of the robot's learned skill by inputting environment states at a plurality of consecutive equally spaced moments into the robot skill learning model, thereby realizing robot skill learning, overcoming the convergence difficulty and low success rate encountered in multi-stage complex tasks, improving robustness, and achieving efficient and accurate learning of complex robot skills.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a system framework for a method of robot skill learning provided by the present invention;
FIG. 2 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 3 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 4 is a schematic diagram of a series-parallel course generating method provided by the present invention;
FIG. 5 is a schematic flow chart of a method for robot skill learning provided by the present invention;
FIG. 6 is a schematic diagram of a position curve of the underwater robot heading experiment provided by the present invention;
FIG. 7 is a schematic diagram illustrating the variation of the heading control frequency of the underwater robot provided by the present invention;
FIG. 8 is a schematic structural diagram of a device for robot skill learning provided by the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The method for robot skill learning provided by the embodiment of the present invention may be applied to a system architecture as shown in fig. 1, where the system architecture includes a camera 100 and a robot skill learning model 200.
Specifically, the camera 100 is used to acquire the environmental status at a plurality of consecutive equally spaced times.
The environment state includes a robot state and a task phase identifier. The robot skill learning model 200 is used to obtain an action description information sequence of the robot skill after inputting a plurality of environment states at consecutive equal intervals.
Further, the action sequence executed by the robot is determined according to the action description information sequence.
It should be noted that the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
It should be noted that fig. 1 is only an example of a system architecture according to the embodiment of the present invention, and the present invention is not limited to this specifically.
Based on the above illustrated system architecture, fig. 2 is a flowchart corresponding to a method for robot skill learning according to an embodiment of the present invention, as shown in fig. 2, the method includes:
Step 201, environment states at a plurality of consecutive equally spaced moments are acquired.
The environment state includes a robot state and a task phase identifier.
For example, in underwater robot heading skill learning, robot states include robot pose, robot speed, goal point position, and shooting angle.
In one possible embodiment, the robot is a simulated leopard bream underwater robot.
Step 202, inputting a plurality of continuous environment states at equal intervals to the trained robot skill learning model to obtain an action description information sequence of the robot skill.
And step 203, determining the action sequence executed by the robot according to the action description information sequence.
It should be noted that the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
According to the above scheme, the action description information sequence of the robot's learned skill is obtained by inputting environment states at a plurality of consecutive equally spaced moments into the robot skill learning model, thereby realizing robot skill learning, overcoming the convergence difficulty and low success rate encountered in multi-stage complex tasks, improving robustness, and achieving efficient and accurate learning of complex robot skills.
Before the environment states at a plurality of consecutive equally spaced moments are input into the trained robot skill learning model to obtain the action description information sequence of the robot skill, the embodiment of the invention performs the following steps, as shown in fig. 3:
and 301, acquiring a task of robot skill learning.
Step 302, divide the task into N subtasks.
In the embodiment of the invention, the task is divided into a plurality of subtasks according to their logical sequence, and a target is set for each subtask.
Specifically, according to the task of robot skill learning, each subtask is given a parameterized description, which serves as that subtask's target.
In the embodiment of the invention, the target task $T$ is determined and divided, according to the logical order of completion, into $N$ stages, so that the overall task $T$ can be defined as $N$ stage subtasks:

$$T = \{T_1, T_2, \ldots, T_N\}$$

Further, a course target is set for each subtask:

$$G = \{g_1, g_2, \ldots, g_N\}$$

where $g_i$ is the target of the $i$-th subtask.

For example, in a robot heading task, the heading task $T$ is for the underwater robot to push the water polo into a preset goal. Specifically, the heading task $T$ is described as a two-stage task, a preparation stage followed by a shooting stage, i.e. it is divided into two subtasks:

$$T = \{T_1, T_2\}$$
step 303, dividing the difficulty of each of the N subtasks and generating M subtasks.
M, N is a positive integer.
Specifically, a subtask target of each subtask in the N subtasks is obtained;
determining the allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining a plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
In the embodiment of the invention, the course difficulty of each subtask is divided in sequence according to the subtask target.
Specifically, a subtask $T_i$ with course target $g_i$ is divided into $m_i$ course difficulties $D_i = \{d_i^1, d_i^2, \ldots, d_i^{m_i}\}$, each course difficulty being determined as:

$$d_i^j = f(g_i, j), \quad j = 1, 2, \ldots, m_i$$

where $d_i^j$ is the $j$-th course difficulty of subtask $T_i$, $g_i$ is the course target of subtask $T_i$, and $f(\cdot)$ is the difficulty-increasing function, which can be set to be linear, nonlinear, etc.
Further, the course difficulties of the subtasks are combined, and a complete course for training is generated using a series-parallel strategy.
Specifically, from the divided course difficulties $d_i^j$ of each subtask, the course difficulty sequence of the overall task is generated sequentially in a series-parallel manner:

$$C = \{c_1, c_2, \ldots, c_M\}$$

Further, the initial course difficulty $c_1$ of the overall task is generated from the initial difficulties of all the subtasks, and every other course difficulty of the overall task is generated from the difficulty of the previous course together with one newly added subtask course difficulty, as shown in the following formula:

$$c_1 = (d_1^1, d_2^1, \ldots, d_N^1), \qquad c_k = c_{k-1} \oplus d_i^{j+1}$$

where $c_k$ is the $k$-th course difficulty of the overall task, and $c_{k-1} \oplus d_i^{j+1}$ denotes raising exactly one subtask $T_i$ from its current difficulty $d_i^j$ to $d_i^{j+1}$. Under the series-parallel strategy, the parallel difficulty, i.e. the difficulty of the next subtask, is increased first; once it has been raised, the serial difficulty is increased, after which the parallel difficulty continues to increase, the two alternating until every subtask reaches its final difficulty.

Finally, the complete course of the overall task is generated by this iteration, with $M$ difficulties in total for skill training ($M = \sum_i m_i - (N - 1)$ under the alternating scheme), as the sketch after this paragraph illustrates.
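The series-parallel combination reconstructed above can be sketched as follows, assuming the alternating scheme in which exactly one subtask's difficulty is raised per course and the "next" subtask is raised before the current one (the function name and representation are illustrative):

```python
from itertools import cycle

def series_parallel_courses(difficulties):
    """difficulties: per-subtask difficulty lists, e.g.
    [[d11, d12, d13, d14], [d21, d22, d23, d24]].
    Returns course tuples c_1..c_M, raising one subtask's difficulty
    per step until every subtask reaches its final difficulty.
    """
    idx = [0] * len(difficulties)          # current difficulty index per subtask
    courses = [tuple(d[i] for d, i in zip(difficulties, idx))]
    # round-robin over subtasks, the "next" (parallel) subtask first
    for k in cycle(range(len(difficulties))[::-1]):
        if all(i == len(d) - 1 for d, i in zip(difficulties, idx)):
            break
        if idx[k] < len(difficulties[k]) - 1:
            idx[k] += 1
            courses.append(tuple(d[i] for d, i in zip(difficulties, idx)))
    return courses

d1 = ["d11", "d12", "d13", "d14"]
d2 = ["d21", "d22", "d23", "d24"]
print(series_parallel_courses([d1, d2]))   # 7 courses for the heading task
```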
Taking the robot heading task as an example, in the preparation stage the shooting point position and the shooting angle are determined from the goal and water polo positions. With $p_g = (x_g, y_g)$ and $p_b = (x_b, y_b)$ denoting the goal position and the water polo position respectively, the shooting angle $\psi_s$ and the shooting point $p_s$ are calculated as:

$$\psi_s = \operatorname{atan2}(y_g - y_b, \; x_g - x_b)$$

$$p_s = p_b - d \, (\cos\psi_s, \; \sin\psi_s)$$

where $d$ is the shooting sprint distance reserved for the robot, so that the shooting point lies a distance $d$ behind the ball on the goal-ball line.
Further, the preparation stage subtask is for the robot to move rapidly to the shooting point position and adjust to a proper shooting angle. The parameterized description of this subtask is:

$$g_1: \; \|p - p_s\| < \epsilon_1 \;\;\text{and}\;\; |\psi - \psi_s| < \epsilon_1$$

where $p_s$ is the shooting point position determined from the goal and water polo positions, $\psi_s$ is the shooting angle, $(p, \psi)$ is the pose of the robot, and $\epsilon_1$ is the allowable error.

In the embodiment of the invention, when the error between the robot and the shooting point is smaller than the allowable error and remains stable, the subtask of this stage is judged to be completed. Thus, for the task target $g_1$ of the preparation stage, the task difficulty is adjusted by adjusting the size of $\epsilon_1$: the smaller $\epsilon_1$ is, the higher the task difficulty.
Specifically, during the shooting stage, the robot starts from the shooting point $p_s$ and accelerates forward while keeping its heading within the valid range of the shooting angle $\psi_s$. The parameterized description of this subtask is:

$$g_2: \; |\psi - \psi_s| < \epsilon_2 \;\;\text{and}\;\; v \ge v_{\min}$$

where $v_{\min}$ is the desired minimum heading speed, $v$ is the speed of the robot in the advancing direction, and $\epsilon_2$ is the allowable error. When the error between the robot's heading and the shooting angle is smaller than the allowable error and remains stable, and the speed is higher than the desired minimum heading speed, the subtask of this stage is judged to be completed. Thus, for the task target $g_2$ of the shooting stage, the task difficulty is adjusted by adjusting the size of $\epsilon_2$: the smaller $\epsilon_2$ is, the higher the task difficulty.

Based on this, the objective of the underwater robot heading task is:

$$G = \{g_1(\epsilon_1), \; g_2(\epsilon_2)\}$$
it should be noted that, in the embodiment of the present invention, the real-time position of the underwater robot is acquired by a global vision system, such as a camera
Figure 884143DEST_PATH_IMAGE045
Course of the vehicle
Figure 816196DEST_PATH_IMAGE046
Goal position
Figure 315310DEST_PATH_IMAGE030
And the position of the water ball.
For example, based on the two subtask objectives $g_1$ and $g_2$, each subtask is divided into 4 course difficulties, $\{d_1^1, \ldots, d_1^4\}$ and $\{d_2^1, \ldots, d_2^4\}$. The specific difficulties adopt the difficulty-increasing function:

$$d_i^j = f(g_i, j), \quad j = 1, \ldots, 4$$

where $d_i^j$ is the $j$-th difficulty of subtask $T_i$, $g_i$ is the target of subtask $T_i$, and $f(\cdot)$ is the difficulty-increasing function, for which an exponential function is adopted in the embodiment of the invention.

Further, the course difficulties of the subtasks are combined, and a complete course for training is generated using the series-parallel strategy.

Specifically, from the divided difficulties $d_i^j$ of each subtask, the course difficulty sequence of the overall task is generated sequentially in the series-parallel manner:

$$C = \{c_1, c_2, \ldots, c_7\}$$

thus obtaining 7 sub-courses.
Further, fig. 4 exemplarily shows a schematic diagram of a series-parallel course generation method for the underwater robot heading skill learning according to the embodiment of the present invention.
For example, suppose the 4 course difficulties of subtask $T_1$ are $\{d_1^1, d_1^2, d_1^3, d_1^4\}$ and the 4 course difficulties of subtask $T_2$ are $\{d_2^1, d_2^2, d_2^3, d_2^4\}$. The series-parallel strategy is adopted as follows:

$$c_1 = (d_1^1, d_2^1), \qquad c_k = c_{k-1} \oplus d_i^{j+1}$$

where $c_k$ is the $k$-th course difficulty of the overall task. Using the series-parallel training strategy, the parallel difficulty, i.e. the difficulty of the next subtask, is increased first; then the serial difficulty is increased, after which the parallel difficulty continues to increase. The course difficulty sequence of the overall task is generated in this order; for example, the sub-course generating steps are:

$$c_1 = (d_1^1, d_2^1)$$
$$c_2 = (d_1^1, d_2^2)$$
$$c_3 = (d_1^2, d_2^2)$$
$$c_4 = (d_1^2, d_2^3)$$
$$c_5 = (d_1^3, d_2^3)$$
$$c_6 = (d_1^3, d_2^4)$$
$$c_7 = (d_1^4, d_2^4)$$

Based on this, the complete course of the overall task is generated by iteration, with $4 + 4 - 1 = 7$ sub-courses of increasing difficulty in total for robot skill training.
According to the above scheme, guided by the task target, the robot trains the model through series-parallel courses of gradually increasing difficulty, learning and mastering the skill, which avoids the low success rate and convergence difficulty caused by training directly at high difficulty.
And step 304, training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
According to the above scheme, guided by the task target, the robot trains the model through multi-stage series-parallel courses of gradually increasing difficulty, learning and mastering the skill, which avoids the low success rate and convergence difficulty caused by training directly at high difficulty.
In the embodiment of the invention, each sub-course is taken as a training target, and the robot skill learning model is trained by using a reinforcement learning method.
In one possible implementation, the robot skill learning model is trained using the Soft Actor-Critic (SAC) model-free reinforcement learning algorithm.
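For reference, one SAC gradient step has roughly the following shape; this is the standard algorithm rather than patent-specific pseudocode, and the actor/critic networks, their `sample` method, and the optimizers are assumed to exist:

```python
import torch
import torch.nn.functional as F

def sac_update(batch, actor, q1, q2, q1_targ, q2_targ,
               alpha, gamma, actor_opt, critic_opt):
    s, a, r, s2 = batch                      # tensors for (s_t, a_t, r_t, s_{t+1})
    with torch.no_grad():
        a2, logp2 = actor.sample(s2)         # reparameterized action and log-prob
        q_next = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (q_next - alpha * logp2)      # soft Bellman target
    critic_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    a_new, logp = actor.sample(s)
    actor_loss = (alpha * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```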
Further, the flow of steps is shown in fig. 5, which specifically includes the following steps:
step 501, for each sub-course, a preset number of training sample sets are obtained.
It should be noted that each set of training samples includes a first environment state, action description information, a second environment state, and an action reward; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed.
Specifically, a first environment state and action description information are obtained;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
In the embodiment of the invention, the robot skill learning model outputs the action description information according to the first environment state, and controls the interaction between the robot and the environment.
Taking the robot heading skill learning as an example, first the first environment state $s_t$ is acquired and input into the current robot skill learning model, which outputs the action description information $a_t$ used to control the undulation frequency of each fin of the underwater robot, thereby guiding the underwater robot to complete the heading task.

The state $s_t$ is composed of the robot pose $(x, y, \psi)$, the speed $v$, the shooting point position and shooting angle $(p_s, \psi_s)$, and the task phase identifier $k$. The action $a_t$ consists of the undulation frequency of each fin of the robot, which is input into the fin controller to control the robot's swimming.
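For concreteness, the state and action described above could be packed as plain vectors as follows (the field layout and fin count are illustrative assumptions):

```python
import numpy as np

def build_state(pose, speed, shoot_point, shoot_angle, phase_id):
    """pose = (x, y, psi); shoot_point = (xs, ys); phase_id in {1, 2}."""
    return np.array([*pose, speed, *shoot_point, shoot_angle, phase_id],
                    dtype=np.float32)

s_t = build_state(pose=(1.0, 0.5, 0.2), speed=0.1,
                  shoot_point=(2.0, 1.0), shoot_angle=0.35, phase_id=1)
# action: one undulation frequency per fin, sent to the fin controller
a_t = np.array([1.2, 1.2, 0.8], dtype=np.float32)  # e.g. three fins, in Hz
```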
Further, acquiring a robot state and a task stage identifier corresponding to the second environment state;
an action reward is determined based on the robot state and the task phase identifier.
After the action $a_t$ has been executed for one time step, the second environment state $s_{t+1}$ is acquired, and the corresponding action reward $r_t$ is obtained from the environment. Together these compose the experience $(s_t, a_t, r_t, s_{t+1})$, which is stored into an experience pool for off-line training of the reinforcement learning algorithm.
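The experience pool mentioned above is an ordinary replay buffer; a minimal sketch with uniform sampling (class and method names are this presentation's choice):

```python
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)    # old experiences are evicted first

    def store(self, s, a, r, s2):
        self.buf.append((s, a, r, s2))       # one experience (s_t, a_t, r_t, s_{t+1})

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)
```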
Taking the robot heading skill learning as an example, the action reward is computed piecewise over the task phases:

$$r_t = \begin{cases} w_1\, r_{dist} + w_2\, r_{angle}, & k = 1 \;\text{(preparation stage)} \\ w_3\, r_{speed} + w_4\, r_{head}, & k = 2 \;\text{(shooting stage)} \\ r_{goal}, & \text{sub-course task completed} \end{cases}$$

where $w_1, \ldots, w_4$ are the weight coefficients of each part. When $k = 1$, in the preparation stage, $r_{dist}$ guides the robot to move to the shooting point position and $r_{angle}$ guides the robot to adjust to a proper shooting angle; when $k = 2$, in the shooting stage, $r_{speed}$ guides the robot to accelerate the sprint and $r_{head}$ guides the robot to remain within the range of shooting angles; when the sub-course task is completed and the goal is reached, the completion reward $r_{goal}$ is given.
According to the scheme, the action reward is designed according to the state of the robot and the task phase identifier, and efficient and accurate learning of the robot on complex skills is achieved.
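A minimal sketch of the staged reward just described; the weights, component shapes, and completion bonus are illustrative assumptions, since the patent's exact coefficients are not reproduced in the text:

```python
import math

def heading_reward(state, phase_id, v_min, w=(1.0, 0.5, 1.0, 0.5)):
    """Piecewise reward keyed on the task phase identifier k."""
    (x, y, psi), v, (xs, ys), psi_s = state
    if phase_id == 1:                        # preparation stage
        r_dist = -math.hypot(x - xs, y - ys)      # approach the shooting point
        r_angle = -abs(psi - psi_s)               # align to the shooting angle
        return w[0] * r_dist + w[1] * r_angle     # (angle wrap omitted for brevity)
    if phase_id == 2:                        # shooting stage
        r_speed = min(v / v_min, 1.0)             # accelerate the sprint
        r_head = -abs(psi - psi_s)                # stay within the angle range
        return w[2] * r_speed + w[3] * r_head
    return 10.0                              # sub-course task completed: goal bonus
```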
And 502, determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward.
And 503, updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result.
In the embodiment of the invention, the performance of the current robot skill learning model is evaluated by the performance evaluation module, and the increase of course difficulty is controlled by the course scheduling module according to the performance evaluation result.
Specifically, the performance evaluation module tests the current robot skill learning model, measuring the success rate and time efficiency with which the robot completes the heading and shooting task under the current policy network, and thereby evaluates the performance of the model trained at the current sub-course difficulty.
Step 504, if the performance evaluation result or the training time reaches the threshold, repeat the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold, obtaining the trained robot skill learning model.
Further, according to the performance evaluation result, the course scheduling module judges whether to switch to the next course difficulty: if the performance or the training time reaches a threshold value, the course difficulty is increased according to the set course, thereby controlling the progression of course difficulty, as the sketch below illustrates.
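The scheduling rule can be sketched as a loop over the generated sub-courses; the thresholds and the two callbacks are assumptions of this illustration:

```python
def run_curriculum(courses, train_for_a_while, evaluate,
                   success_threshold=0.9, step_budget=200_000):
    """Switch to the next course difficulty once performance or
    training time reaches its threshold, as described above."""
    for difficulty in courses:
        steps = 0
        while True:
            steps += train_for_a_while(difficulty)   # returns steps consumed
            if evaluate(difficulty) >= success_threshold:
                break                                # performance threshold reached
            if steps >= step_budget:
                break                                # training-time threshold reached
```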
For example, when the difficulty reaches the preset target and the robot performance evaluation passes, the trained robot skill learning model is obtained: once the course difficulty reaches the preset target, i.e. the 7th course difficulty $c_7$, and the performance of the robot skill learning model meets the requirement, the robot has achieved skill learning of the heading task.
In the embodiment of the invention, according to the goal of the heading task, the robot can train the model through multi-stage and series-parallel courses with gradually increasing difficulty, learn and master the heading skill, and avoid the problems of low success rate, difficult convergence and the like caused by direct high-difficulty training.
According to the above scheme, reinforcement learning lets the network learn a more robust model more quickly, so that the robot can learn complex skills efficiently and accurately, solving the convergence difficulty and low success rate of existing skill learning methods in the face of multi-stage complex tasks.
Further, to verify effectiveness, a heading task verification may be performed, for example, in an indoor pool of 5 m × 4 m × 1.1 m. The global visual tracking system installed at the top of the pool is connected to the console via USB; by processing images of the goal, the water polo, the robot, and the surrounding environment, the console computes the current position and heading of the robot and the positions of the goal and the water polo in real time, and from the goal and water polo positions it computes the shooting point position and shooting angle. The console thus obtains the real-time environment state, feeds it into the trained robot skill learning model to obtain the action, and sends the action to the fin controller in the robot over wireless communication as the motion control. The verification results of the robot heading task are given in fig. 6 and fig. 7. Fig. 6 exemplarily shows the position curve of the underwater robot heading experiment, and fig. 7 exemplarily shows the variation of the underwater robot heading control frequency. It can be seen that the embodiment of the invention enables the robot to move to the shooting point and adjust the shooting angle quickly, then accelerate along the shooting angle direction and push the water polo into the goal.
Based on the above technical scheme, the embodiment of the invention trains the heading policy network through multi-stage series-parallel courses of gradually increasing difficulty, solving the convergence difficulty and low success rate that existing skill learning methods easily encounter in multi-stage complex tasks, and enabling the robot, through training, to learn complex skills such as heading efficiently and accurately and push the water polo into the goal.
Based on the same inventive concept, fig. 8 exemplarily shows a robot skill learning device, which can execute the flow of the robot skill learning method according to the embodiment of the present invention.
The apparatus comprises:
an obtaining module 801, configured to obtain environment states at multiple consecutive equal-interval moments; the environment state comprises a robot state and a task stage identifier;
the processing module 802 is configured to input the plurality of continuous environment states at equal intervals to a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Further, the processing module 802 is further configured to:
acquiring a task of robot skill learning before the environment states at the continuous equal interval time are input to a trained robot skill learning model to obtain an action description information sequence of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
Further, the processing module 802 is specifically configured to:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
Further, the processing module 802 is specifically configured to:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
Further, the processing module 802 is specifically configured to:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
Further, the processing module 802 is specifically configured to:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device, which specifically includes the following components, with reference to fig. 9: a processor 901, memory 902, communication interface 903, and communication bus 904;
the processor 901, the memory 902 and the communication interface 903 complete mutual communication through the communication bus 904; the communication interface 903 is used for realizing information transmission among the devices;
the processor 901 is configured to call a computer program in the memory 902, and the processor executes the computer program to implement all the steps of the above-mentioned method for robot skill learning, for example, the processor executes the computer program to implement the following steps: acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier; inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
Based on the same inventive concept, a further embodiment of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs all the steps of the above-described method of robot skill learning; for example, the processor performs the following steps when executing the computer program: acquiring environment states at a plurality of consecutive equally spaced moments, each environment state comprising a robot state and a task phase identifier; inputting the environment states at the plurality of consecutive equally spaced moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill; and determining the action sequence executed by the robot according to the action description information sequence. The trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a robot skill learning apparatus, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions may be essentially or partially implemented in the form of software products, which may be stored in computer readable storage media, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a robot skill learning apparatus, or a network device, etc.) to execute the method for robot skill learning according to the embodiments or some parts of the embodiments.
In addition, in the present invention, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Moreover, in the present invention, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Furthermore, in the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of robotic skill learning, comprising:
acquiring environmental states of a plurality of continuous equal-interval moments; the environment state comprises a robot state and a task stage identifier;
inputting the environment states of the plurality of continuous equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill;
determining an action sequence executed by the robot according to the action description information sequence;
the trained robot skill learning model is obtained by training with different environment states and with the performance evaluation results obtained after the robot executes action sequences under those different environment states;
before the step of inputting the environment states of the plurality of continuous equal-interval moments into the trained robot skill learning model to obtain the action description information sequence of the robot skill, the method further comprises:
acquiring a task of the robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses; M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
2. The method of robotic skill learning of claim 1, wherein said dividing the difficulty of each of the N subtasks and generating M sub-courses comprises:
acquiring a subtask target of each subtask in the N subtasks;
determining an allowable error of each subtask according to the subtask target;
determining a difficulty sequence of the allowable error by adopting a difficulty increasing function;
dividing the difficulty of each subtask according to the difficulty sequence to obtain a plurality of subtasks with different difficulties;
and combining the plurality of subtasks with different difficulties by adopting a series-parallel strategy to obtain M sub-courses.
3. The method of claim 1, wherein the training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model comprises:
for the 1 st sub-course, the following steps are performed:
acquiring a preset number of training sample sets; each group of training samples comprises a first environment state, action description information, a second environment state and action rewards; the first environment state is an environment state before the action corresponding to the action description information is executed; the second environment state is the environment state after the action corresponding to the action description information is executed; the action reward is a reward value after the action corresponding to the action description information is executed;
determining a loss function of the robot skill learning model according to the first environment state, the action description information, the second environment state and the action reward;
updating parameters of the robot skill learning model according to the loss function and evaluating the performance of the robot skill learning model to obtain a performance evaluation result;
and if the performance evaluation result or the training time reaches the threshold value, repeating the above steps for the 2nd sub-course, and so on, until the performance evaluation result or the training time of the Mth sub-course reaches the threshold value, thereby obtaining the trained robot skill learning model.
4. The method of robotic skill learning according to claim 3, wherein said obtaining a preset number of training sample sets comprises:
acquiring the first environment state and the action description information;
and determining the action reward according to the second environment state after the action corresponding to the action description information is executed.
5. The method of robot skill learning of claim 4, wherein the determining the action reward according to the second environmental status after the action corresponding to the action description information is performed comprises:
acquiring a robot state and a task stage identifier corresponding to the second environment state;
determining the action reward according to the robot state and the task phase identifier.
6. An apparatus for robot skill learning, comprising:
an acquisition module, used for acquiring environment states at a plurality of consecutive equal-interval moments; the environment state comprises a robot state and a task stage identifier;
a processing module, used for inputting the environment states at the plurality of consecutive equal-interval moments into a trained robot skill learning model to obtain an action description information sequence of the robot skill learning, and for determining an action sequence executed by the robot according to the action description information sequence;
wherein the trained robot skill learning model is obtained by training on different environment states and on performance evaluation results obtained when the robot executes action sequences under the different environment states;
the processing module is further used for, before the environment states at the plurality of consecutive equal-interval moments are input into the trained robot skill learning model to obtain the action description information sequence of the robot skill learning:
acquiring a task of robot skill learning;
dividing the task into N subtasks;
dividing the difficulty of each of the N subtasks and generating M sub-courses, wherein M and N are positive integers;
and training the robot skill learning model according to the target of each of the M sub-courses and the difficulty of the M sub-courses in sequence to obtain the trained robot skill learning model.
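At inference time, the two modules of claim 6 reduce to a short pipeline: sample environment states at equal intervals, run the trained model over the window, and execute the decoded actions. The sketch below assumes caller-supplied read_env_state and execute_action functions and a model that maps a state window to an action-description sequence; none of these interfaces are specified by the patent.

```python
import time
import torch

def run_skill(model, read_env_state, execute_action, window=4, dt=0.05):
    """read_env_state() returns one environment-state tensor (robot state plus
    task stage identifier); execute_action() sends one decoded action."""
    # acquisition module: environment states at consecutive equal-interval moments
    states = []
    for _ in range(window):
        states.append(read_env_state())
        time.sleep(dt)
    # processing module: the trained model maps the state window to an
    # action description information sequence
    with torch.no_grad():
        action_descriptions = model(torch.stack(states))
    # determine and execute the robot's action sequence from the descriptions
    for description in action_descriptions:
        execute_action(description)
```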
7. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any one of claims 1 to 5 are implemented when the processor executes the program.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN202111537547.1A 2021-12-16 2021-12-16 Robot skill learning method and device, electronic equipment and storage medium Active CN113919475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111537547.1A CN113919475B (en) 2021-12-16 2021-12-16 Robot skill learning method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113919475A (en) 2022-01-11
CN113919475B (en) 2022-04-08

Family

ID=79248964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111537547.1A Active CN113919475B (en) 2021-12-16 2021-12-16 Robot skill learning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113919475B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114609925B (en) * 2022-01-14 2022-12-06 中国科学院自动化研究所 Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish


Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN113168566A (en) * 2018-11-30 2021-07-23 谷歌有限责任公司 Controlling a robot by using entropy constraints
CN111461309A (en) * 2020-04-17 2020-07-28 支付宝(杭州)信息技术有限公司 Method and device for updating reinforcement learning system for realizing privacy protection
CN113156892A (en) * 2021-04-16 2021-07-23 西湖大学 Four-footed robot simulated motion control method based on deep reinforcement learning
CN113487039A (en) * 2021-06-29 2021-10-08 山东大学 Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning

Non-Patent Citations (1)

Title
Richard Li, "Towards Practical Multi-Object Manipulation using Relational Reinforcement Learning," 2020 IEEE International Conference on Robotics and Automation, Aug. 2020, pp. 4051-4056. *


Similar Documents

Publication Publication Date Title
CN111488988B Control strategy imitation learning method and device based on adversarial learning
CN108115681B Imitation learning method and device for robot, robot and storage medium
CN107102644B Underwater robot trajectory control method and control system based on deep reinforcement learning
CN109523029A Adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
US11759947B2 (en) Method for controlling a robot device and robot device controller
CN111783994A (en) Training method and device for reinforcement learning
CN113561986A Decision-making method and device for autonomous driving vehicles
CN113070878B (en) Robot control method based on impulse neural network, robot and storage medium
CN113919475B (en) Robot skill learning method and device, electronic equipment and storage medium
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
CN113641099B (en) Impedance control imitation learning training method for surpassing expert demonstration
CN114219066A Unsupervised reinforcement learning method and device based on Wasserstein distance
Gromniak et al. Deep reinforcement learning for mobile robot navigation
CN116352715A (en) Double-arm robot cooperative motion control method based on deep reinforcement learning
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN113910221B (en) Mechanical arm autonomous motion planning method, device, equipment and storage medium
Sabathiel et al. A computational model of learning to count in a multimodal, interactive environment.
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN110515297B (en) Staged motion control method based on redundant musculoskeletal system
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling
CN114571456B (en) Electric connector assembling method and system based on robot skill learning
Subramanian Task space behavior learning for humanoid robots using Gaussian mixture models
CN115496208B Unsupervised multi-agent reinforcement learning method with diversified and guided cooperation modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant