CN115454096A - Robot strategy training system and training method based on curriculum reinforcement learning - Google Patents

Robot strategy training system and training method based on curriculum reinforcement learning

Info

Publication number
CN115454096A
CN115454096A (application CN202211227150.7A)
Authority
CN
China
Prior art keywords
task
robot
algorithm
training
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211227150.7A
Other languages
Chinese (zh)
Inventor
吴立刚
董博
王淼
王夏爽
姚蔚然
田昊宇
丁季时雨
孙科武
杨皙睿
孙光辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Second Research Institute Of Casic
Harbin Institute of Technology
Original Assignee
Second Research Institute Of Casic
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Second Research Institute Of Casic, Harbin Institute of Technology filed Critical Second Research Institute Of Casic
Priority to CN202211227150.7A priority Critical patent/CN115454096A/en
Publication of CN115454096A publication Critical patent/CN115454096A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0257Control of position or course in two dimensions specially adapted to land vehicles using a radar

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Electromagnetism (AREA)
  • Manipulator (AREA)

Abstract

A robot strategy training system and training method based on curriculum reinforcement learning belong to the field of autonomous decision-making and control of unmanned systems. The invention addresses the difficulty that existing methods have in obtaining good decision-making and control performance when training robot strategies. For the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a curriculum-learning-based training architecture for multi-robot joint task decision-making. Taking into account the progressive increase of task difficulty during training, a parameter autonomous generation algorithm and a target autonomous generation algorithm based on the complex-environment dynamics model are established. On this basis, a curriculum difficulty evaluation and calibration algorithm is established and fed back to the self-optimizing reinforcement learning algorithm. The method can be applied to autonomous decision-making and control of unmanned systems.

Description

Robot strategy training system and training method based on curriculum reinforcement learning
Technical Field
The invention belongs to the field of autonomous decision making and control of unmanned systems, and particularly relates to a robot strategy training system and a robot strategy training method based on curriculum reinforcement learning.
Background
Autonomous decision-making for multiple robots has been a research hotspot in recent years and is widely applied in fields such as the military and industry. The training of an autonomous decision strategy is usually achieved by machine learning. Curriculum learning builds on reinforcement learning and borrows the human idea of progressing from easy to difficult: the model first learns easy samples, and the sample difficulty is then gradually increased, which yields faster training and better training results. The core of curriculum learning lies in the autonomous generation of training tasks and the autonomous ranking of task difficulty. Research on autonomous task generation is still limited; proposed methods include training a progressively generalizing problem solver in an unsupervised manner, and taking the final task as a template, setting a parameter vector and adjusting the parameters to obtain intermediate tasks. However, the existing autonomous task generation methods are not very effective at generating training tasks for robot strategies. The mainstream methods for autonomous ranking of task difficulty either only reorder the samples of the final task without changing the task itself, change certain aspects of the MDP to create intermediate tasks with different MDP structures, or incorporate human judgements of task difficulty into the ranking through a human-in-the-loop method. However, the ranking results obtained with the existing autonomous ranking methods are not accurate enough, and some of these methods are coupled with task generation, so they are not suitable for robot strategy training.
In summary, the existing autonomous task generation methods and autonomous ranking methods perform poorly for robot strategy training, and it is difficult to obtain good decision-making and control results with them.
Disclosure of Invention
The invention aims to provide a robot strategy training system and training method based on curriculum reinforcement learning, in order to solve the problem that existing methods struggle to obtain good decision-making and control results when training robot strategies, a problem that stems from the poor effectiveness of existing autonomous task generation methods at generating robot strategy training tasks and from the poor accuracy of the ranking results obtained by existing autonomous ranking methods.
The technical scheme adopted by the invention for solving the technical problems is as follows:
based on one aspect of the invention, a robot strategy training system based on curriculum reinforcement learning comprises an algorithm operation container module, a training curriculum generation module and a feedback evaluation module, wherein:
the training course generation module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the difficulty of the tasks from easy to difficult through a neural network to obtain courses;
the algorithm operation container module is used for configuring operation containers for a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm so as to carry out self-optimization reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to courses obtained by the training course generation module;
the feedback evaluation module is used for carrying out robot self-organization reinforcement training according to the training error of the robot, outputting the score of the robot for the task execution condition according to the robot self-organization reinforcement training result, and feeding the score of the robot for the task execution condition back to the algorithm operation container module to guide self-optimization reinforcement learning algorithm training.
Further, the target identification algorithm is a YOLOv3 algorithm, the robot path planning algorithm is an artificial potential field algorithm, and the game countermeasure decision algorithm is a PPO algorithm.
Based on another aspect of the invention, a robot strategy training method based on curriculum reinforcement learning specifically comprises the following steps:
step one, performing real scene three-dimensional detection reconstruction and task scene intelligent environment autonomous generation by using a task generator;
step two, sequencing the task scenes from easy to difficult by using a task sequencer to obtain training courses;
thirdly, performing self-optimized reinforcement learning training on a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in the algorithm operation container module based on the training courses generated in the second step;
and step four, the feedback evaluation module outputs the scores of the robot for the task execution conditions according to the reinforcement learning training results in the step three, and then feeds the scores of the robot for the task execution conditions back to the algorithm operation container module to guide the training of the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm.
Further, the real scene three-dimensional detection reconstruction specifically comprises the following processes:
step 1, obtaining depth image
Shooting depth images of the same scene under different angles and illumination intensities through a vision camera and a laser radar;
step 2, preprocessing of depth image
Denoising the depth image by Gaussian filtering, and repairing the denoised depth image by a DeepFillv2 algorithm to obtain a recovered depth image, namely a preprocessed depth image;
step 3, calculating point cloud data by the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image in the world coordinate system by using the calculated conversion relation; carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
step 4, point cloud registration
Matching and superposing distortion compensated point cloud coordinates corresponding to multiple frames of preprocessed depth images under different angles and illumination into a world coordinate system according to the translation vector and the rotation matrix of each frame by taking the public part of a scene as a reference to obtain a point cloud space after registration;
step 5, fusing point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the point cloud space after registration into cubes by using the grid, namely dividing the point cloud space after registration into voxels; simulating a surface by assigning a distance field value to each voxel;
step 6, surface Generation
And (5) processing the result obtained in the step (5) by adopting an MC algorithm to generate a three-dimensional surface, namely the task scene map.
Further, the specific process of step 6 is:
The eight neighboring samples of the data field are stored at the eight vertices of a voxel. After a potential value T is selected, for an edge of a boundary voxel, when one endpoint is larger than T and the other endpoint is smaller than T, a vertex of the isosurface exists on that edge. After all twelve edges in the voxel are traversed, the intersections of the twelve edges with the isosurface in the voxel are obtained, and the triangular surface patches in the voxel are constructed; these triangular surface patches divide the voxel into two regions, one above and one below the potential value. All the triangular surface patches in the voxel are connected to form the isosurface of the voxel, the isosurfaces of all the voxels are combined to form a complete three-dimensional surface, and the formed complete three-dimensional surface is used as the task scene map.
Further, the specific process of the autonomous generation of the task scene intelligent environment is as follows:
step 1) task scene segmentation generation
Constructing an adjacency matrix of voxels around each point in the point cloud by using each voxel obtained after segmentation in the step 5, and weighting the edges of the targets according to the adjacency matrix to complete the separation of overlapped targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of each object;
performing data association on the segmented object 3D point cloud model and a task scene map, judging each segmented object type, and adding the object type into a model base;
dividing non-ground point cloud clusters into different types of point cloud clusters to construct an integral three-dimensional semantic map;
step 2) automatic generation of task targets
The specific process of the step 2) is as follows:
the method comprises the following steps of S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
s2, according to the size of the two-dimensional map, representing the position of a random point by generating random seeds, and inputting the generated random seeds into a Voronoi-Dirichlet mosaic algorithm;
s3, carrying out Dirony triangulation by the Voronoi-Dirichlet mosaic algorithm based on the input random seeds, and dividing the two-dimensional map into polygons by using the vertical bisector of the side of each triangle obtained by the triangulation, namely obtaining a Voronoi diagram;
s4, randomly selecting a polygon as an obstacle area, and then randomly selecting a blank vertex as the position of an obstacle point, a threat point or a reward point model in a model library;
s5, placing a polygonal prism as an obstacle at a position corresponding to an obstacle polygon in the three-dimensional map, and placing an obstacle point, a threat point or a reward point model on the ground at a position corresponding to the selected blank vertex, namely completing the mapping from the two-dimensional map to the three-dimensional task scene map;
s6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, if not, correcting and placing the position of the model which does not meet the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
s7, randomly generating the number of the robots and the initial positions of the robots to complete the generation of random game confrontation tasks;
step 3) verification of robot dynamics model
Setting initial kinematic parameters of a dynamic model of the robot unmanned system, selecting a key frame according to the lowest sampling frequency of each sensor in an IMU, a laser radar and a vision camera carried by the robot unmanned system to integrate IMU data, and obtaining state quantity increment between adjacent key frames;
and predicting error data at the next moment according to the state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between a robot dynamics model and an actual robot system, and performing iterative compensation on the initial kinematic parameters to realize dynamics modeling of the robot and verification of the dynamics parameters.
Further, the task sequencer is implemented by using a RankNet network.
Further, the working principle of the RankNet network is as follows:
A feature vector $x$ composed of the task scene and the robot dynamics model is input to the RankNet network, which maps the input feature vector to a real-valued score through a scoring function $f:\; x \mapsto s = f(x)$.
The input feature vector corresponding to task $U_i$ is denoted $x_i$, and the input feature vector corresponding to task $U_j$ is denoted $x_j$. The RankNet network performs a forward pass on $x_i$ and $x_j$ respectively, obtaining the difficulty score $s_i = f(x_i)$ corresponding to $x_i$ and the difficulty score $s_j = f(x_j)$ corresponding to $x_j$.
Let $U_i \rhd U_j$ denote that task $U_i$ scores higher than task $U_j$. The predicted relevance probability $P_{ij}$ that task $U_i$ scores higher than task $U_j$ is:

$$P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}$$

where the parameter $\sigma$ is a constant and $e$ is the base of the natural logarithm.
$S_{ij}$ denotes the ground-truth difficulty comparison between task $U_i$ and task $U_j$:

$$S_{ij} = \begin{cases} \;\;\,1, & \text{task } U_i \text{ is more difficult than task } U_j \\ \;\;\,0, & \text{the two tasks are equally difficult} \\ -1, & \text{task } U_j \text{ is more difficult than task } U_i \end{cases}$$

The true relevance probability $\bar{P}_{ij}$ is:

$$\bar{P}_{ij} = \tfrac{1}{2}\left(1 + S_{ij}\right)$$

RankNet thus compares the difficulty of tasks probabilistically: instead of directly judging which of task $U_i$ and task $U_j$ is more difficult, it states that the probability that task $U_i$ is more difficult than task $U_j$ is $P_{ij}$, and it takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
Further, the RankNet network model is trained by adopting a pairwise method.
Further, performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation; the specific process comprises the following steps:
calculating, for each raw point cloud datum acquired by the laser radar in a frame, the time difference relative to the laser point cloud data at the initial moment of the frame; calculating the motion information of the robot by using an IMU (inertial measurement unit); calculating, according to the time difference and the motion information, the transformation matrix of the laser radar coordinate system at the acquisition moment of each raw laser point relative to the laser radar coordinate system at the initial moment of the frame; and multiplying the transformation matrix by the corresponding coordinates of the raw laser point to obtain the distortion-compensated laser point cloud coordinates.
The invention has the beneficial effects that:
Aiming at the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a curriculum-learning-based training architecture for multi-robot joint task decision-making. Taking into account the progressive increase of task difficulty during training, a parameter autonomous generation algorithm and a target autonomous generation algorithm based on the complex-environment dynamics model are established. On this basis, a curriculum difficulty evaluation and calibration algorithm is established and fed back to the self-optimizing reinforcement learning algorithm. Meanwhile, the invention saves a large amount of labor cost, enables autonomous training of a multi-robot system in a complex environment, and solves the problem that existing methods find it difficult to obtain good decision-making and control results when training robot strategies.
Drawings
FIG. 1 is an algorithm architecture diagram for curriculum reinforcement learning;
FIG. 2 is a flowchart of the work of the training course generation module;
FIG. 3 is a schematic diagram of a scene autonomous detection system comprised of a vision camera, a lidar and an IMU;
FIG. 4 is a schematic diagram of the Voronoi-Dirichlet mosaic algorithm segmentation and model addition.
Detailed Description
First embodiment: this embodiment is described with reference to FIG. 1. This embodiment provides a robot strategy training system based on curriculum reinforcement learning. The system comprises an algorithm operation container module, a training course generation module and a feedback evaluation module, wherein:
the training course generation module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks according to difficulty through the neural network to obtain courses and provide a training scene for the reinforcement learning algorithm;
the algorithm operation container module is used for configuring operation containers for a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm so as to carry out self-optimization reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to courses obtained by the training course generation module;
the feedback evaluation module is used for carrying out robot self-organization reinforcement training according to the training error of the robot, outputting the score of the robot for the task execution condition according to the robot self-organization reinforcement training result, and feeding back the score of the robot for the task execution condition to the algorithm operation container module to guide self-optimization reinforcement learning algorithm training.
The robot self-organization reinforcement training is the training of the method for selecting the evaluation-index weights: under different scenes and different tasks, several different task objectives are used for evaluation, and a weighted sum of them is finally taken as the score of the task execution; the robot self-organization reinforcement training refers to the training of these weights.
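As a minimal sketch of this weighted scoring, the following Python snippet combines several per-objective metrics into a single execution score; the metric names and initial weight values are illustrative assumptions (the weights themselves are what the self-organization reinforcement training would adjust), not values specified by the invention.

```python
import numpy as np

def task_score(metrics, weights):
    """Weighted sum of normalized per-objective metrics -> scalar execution score."""
    keys = sorted(metrics)
    m = np.array([metrics[k] for k in keys])
    w = np.array([weights[k] for k in keys])
    return float(m @ (w / w.sum()))          # normalize weights so the score stays in [0, 1]

# hypothetical objectives for one episode, each already normalized to [0, 1]
score = task_score({"completion": 0.8, "time": 0.6, "collision_free": 1.0},
                   {"completion": 0.5, "time": 0.2, "collision_free": 0.3})
```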
The curriculum learning method proposed by the invention can learn strategies that use only local information (i.e., each robot's own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular inter-robot communication method, and is applicable not only to cooperative interaction but also to competitive or mixed interaction environments (mixed cooperative-competitive environments) involving physical and informational behaviors.
Second embodiment: this embodiment differs from the first embodiment in that the target recognition algorithm is the YOLOv3 algorithm, the robot path planning algorithm is the artificial potential field algorithm, and the game countermeasure decision algorithm is the PPO algorithm.
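For illustration, a minimal Python sketch of one artificial potential field planning step is given below; the gains, influence radius and step size are assumed placeholder values, and the sketch is not tied to the YOLOv3 or PPO components.

```python
import numpy as np

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=2.0, step=0.05):
    """One artificial-potential-field update: attraction toward the goal plus
    repulsion from every obstacle closer than the influence radius d0."""
    force = k_att * (goal - pos)                          # attractive term
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:                                 # only nearby obstacles repel
            force += k_rep * (1.0 / d - 1.0 / d0) / d**3 * diff
    return pos + step * force / (np.linalg.norm(force) + 1e-9)

# usage: iterate until the robot is close enough to the goal
pos, goal = np.array([0.0, 0.0]), np.array([5.0, 5.0])
obstacles = [np.array([2.5, 2.6])]
for _ in range(500):
    pos = apf_step(pos, goal, obstacles)
    if np.linalg.norm(goal - pos) < 0.1:
        break
```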
Third embodiment: the robot strategy training method based on curriculum reinforcement learning of this embodiment specifically includes the following steps:
firstly, a task generator is utilized to carry out real scene three-dimensional detection reconstruction and intelligent environment autonomous generation of a complex task scene;
this embodiment will be described with reference to fig. 2. The real scene three-dimensional detection reconstruction method comprises the following specific processes:
step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination intensities through a vision camera and a laser radar;
step 2, preprocessing of depth image
Denoising the depth image by Gaussian filtering, and repairing the denoised depth image by a DeepFillv2 algorithm to obtain a recovered depth image, namely a preprocessed depth image;
step 3, calculating point cloud data by the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image in the world coordinate system by using the calculated conversion relation; carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
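A minimal sketch of step 3 is shown below: a depth image is back-projected through the pinhole imaging model and transformed into the world coordinate system. The intrinsic parameters and the camera pose used here are placeholders standing in for a calibrated sensor, not values given in the text.

```python
import numpy as np

def depth_to_world_points(depth, fx, fy, cx, cy, T_world_cam):
    """Back-project an (H, W) depth image (meters) to an (N, 3) point cloud
    expressed in the world coordinate system."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - cx) * z / fx                         # pinhole back-projection
    y = (v.ravel() - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    return (T_world_cam @ pts_cam.T).T[:, :3]             # camera frame -> world frame

# hypothetical intrinsics and an identity camera pose, for illustration only
K = dict(fx=525.0, fy=525.0, cx=319.5, cy=239.5)
points = depth_to_world_points(np.random.rand(480, 640), T_world_cam=np.eye(4), **K)
```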
carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation; the specific process comprises the following steps:
calculating, for each raw point cloud datum acquired by the laser radar in a frame, the time difference relative to the laser point cloud data at the initial moment of the frame; calculating the motion information of the robot by using an IMU (inertial measurement unit); respectively calculating, according to the time difference and the motion information, the transformation matrix of the laser radar coordinate system at the acquisition moment of each raw laser point relative to the laser radar coordinate system at the initial moment of the frame; and multiplying the transformation matrix by the corresponding coordinates of the raw laser point to obtain the distortion-compensated laser point cloud coordinates;
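The following sketch illustrates this distortion compensation under a simplifying constant-velocity assumption: the IMU-derived planar velocity and yaw rate are held constant over one sweep, and each point is transformed back to the laser radar frame at the start of the frame according to its time offset. This is an illustrative approximation of the per-point transformation matrices described above.

```python
import numpy as np

def deskew_scan(points, t_offsets, lin_vel, yaw_rate):
    """Transform every (x, y, z) point back into the lidar frame at the start of the
    scan, assuming constant planar velocity and yaw rate over the sweep."""
    out = np.empty_like(points)
    for i, (p, dt) in enumerate(zip(points, t_offsets)):
        yaw = yaw_rate * dt                               # rotation accumulated since frame start
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        t = lin_vel * dt                                  # translation accumulated since frame start
        out[i] = R @ p + t                                # pose at acquisition time -> frame-start frame
    return out

# usage on synthetic data: 1000 points spread over a 0.1 s sweep
deskewed = deskew_scan(np.random.rand(1000, 3), np.linspace(0.0, 0.1, 1000),
                       np.array([1.0, 0.0, 0.0]), 0.2)
```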
step 4, point cloud registration
Matching and superposing distortion compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination intensities to a world coordinate system according to the translation vector and the rotation matrix of each frame by taking the public part of the scene as a reference to obtain a point cloud space after registration;
in the whole mapping and registration process, a single-frame image is transformed by a translation vector and a rotation matrix and then constructed in a world coordinate system, so that the construction of the map is realized.
The translation vector and the rotation matrix of each frame are processed by the correction module. As shown in FIG. 3, the correction module corrects the global error with a relatively independent loop detection module. The correction module judges whether a loop closure has occurred by detecting the degree of similarity between images: a bag-of-words model is used to describe image similarity, and the TF-IDF algorithm is used to calculate the keyframe similarity. The loop detection module then obtains loop candidate frames according to the keyframe similarity and judges the continuity of the loop candidate frames. Finally, the accumulated scale error and the rotation-translation error are corrected according to the similarity transformation relation of the image point cloud space, the repeated information of the map is fused, and closed-loop fusion is achieved.
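The bag-of-words loop detection can be illustrated with a small numpy sketch: each keyframe is described by a visual-word histogram, the histograms are weighted by TF-IDF, and cosine similarity between weighted descriptors yields loop-closure candidates. The vocabulary representation and the similarity threshold are illustrative assumptions.

```python
import numpy as np

def tfidf_matrix(word_counts):
    """word_counts: (num_keyframes, vocab_size) visual-word histogram per keyframe."""
    tf = word_counts / np.maximum(word_counts.sum(axis=1, keepdims=True), 1)
    df = (word_counts > 0).sum(axis=0)                    # keyframes containing each word
    idf = np.log(word_counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def loop_candidates(word_counts, query_idx, threshold=0.75):
    """Return indices of earlier keyframes whose TF-IDF descriptor is similar to the query."""
    w = tfidf_matrix(word_counts)
    q = w[query_idx]
    sims = w @ q / (np.linalg.norm(w, axis=1) * np.linalg.norm(q) + 1e-12)
    return [i for i in range(query_idx) if sims[i] > threshold]
```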
Step 5, fusing point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as the origin, and dividing the registered point cloud space into small cubes with the grid, namely dividing the registered point cloud space into voxels; the surface is simulated by assigning a signed distance field (SDF) value to each voxel;
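A compact sketch of the fusion in step 5, assuming a truncated signed distance representation: each voxel center is projected into the current depth image, a truncated signed distance is computed, and a per-voxel weighted running average is maintained. The truncation distance and the flattened-array layout are illustrative choices.

```python
import numpy as np

def integrate_depth(tsdf, weight, centers, depth, fx, fy, cx, cy, T_cam_world, trunc=0.1):
    """Update flattened per-voxel TSDF/weight arrays with one depth image."""
    pts = (T_cam_world @ np.c_[centers, np.ones(len(centers))].T).T[:, :3]   # world -> camera
    z = pts[:, 2]
    h, w = depth.shape
    in_front = z > 1e-6
    u = np.zeros(len(z), dtype=int)
    v = np.zeros(len(z), dtype=int)
    u[in_front] = np.round(pts[in_front, 0] * fx / z[in_front] + cx).astype(int)
    v[in_front] = np.round(pts[in_front, 1] * fy / z[in_front] + cy).astype(int)
    ok = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = depth[v[ok], u[ok]] - z[ok]                     # distance from voxel to observed surface
    keep = sdf > -trunc                                   # skip voxels far behind the surface
    idx = np.where(ok)[0][keep]
    d = np.clip(sdf[keep], -trunc, trunc) / trunc         # truncated, normalized to [-1, 1]
    w_new = weight[idx] + 1.0
    tsdf[idx] = (tsdf[idx] * weight[idx] + d) / w_new     # per-voxel running average
    weight[idx] = w_new
    return tsdf, weight
```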
step 6, surface Generation
Processing the result obtained in step 5 by adopting the MC algorithm to generate a three-dimensional surface, namely obtaining the task scene map;
the specific process of step 6 is as follows:
the eight neighboring samples of the data field are stored at the eight vertices of a voxel, and a proper constant is selected as the potential value T; for an edge of a boundary voxel, when one endpoint is greater than T and the other endpoint is less than T, a vertex of the isosurface exists on that edge; traversing all twelve edges in the voxel gives the intersections of the twelve edges with the isosurface in the voxel, from which the triangular surface patches in the voxel are constructed; these triangular surface patches divide the voxel into two regions, one above and one below the potential value; all the triangular surface patches in the voxel are connected to form the isosurface of the voxel, the isosurfaces of all the voxels are combined to form a complete three-dimensional surface, and the formed complete three-dimensional surface is used as the task scene map;
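If an off-the-shelf implementation is acceptable, the surface generation of step 6 can be reproduced with the marching cubes routine in scikit-image, as sketched below; the spherical test volume merely stands in for the fused distance field, and the isovalue plays the role of the potential value T.

```python
import numpy as np
from skimage import measure

# stand-in distance field: a sphere of radius 20 voxels inside a 64^3 grid
grid = np.linspace(-32, 32, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
volume = np.sqrt(x**2 + y**2 + z**2) - 20.0

# extract the zero isosurface (the potential value T of the text) as a triangle mesh
verts, faces, normals, values = measure.marching_cubes(volume, level=0.0)
print(verts.shape, faces.shape)   # vertex coordinates and triangle indices of the mesh
```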
the specific process of the intelligent environment autonomous generation of the complex task scene is as follows:
the method comprises the steps of firstly, obtaining a three-dimensional semantic map and a corresponding target model through segmentation and identification of a task scene map, constructing a model base, then, automatically generating a series of barrier, threat and reward elements in the task scene map according to a certain rule to obtain a plurality of tasks, and finally, realizing construction of a robot dynamic model by utilizing various sensors carried on a robot.
Step 1) task scene segmentation generation
Constructing an adjacency matrix of voxels around each point in the point cloud by using each voxel obtained after segmentation in the step 5, and weighting the edges of the targets according to the adjacency matrix to complete the separation of overlapped targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of each object;
performing data association on the segmented object 3D point cloud model and a task scene map, judging each segmented object type, and adding the object type into a model base;
finally, laser 3D point cloud object segmentation is carried out, non-ground point cloud clustering is segmented into point cloud clusters of different categories, and then an integral three-dimensional semantic map is constructed;
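As one possible realization of the non-ground clustering step, the sketch below uses DBSCAN from scikit-learn after a crude height-based ground removal; the eps and min_samples values are illustrative, and the semantic class of each cluster would come from the recognition stage rather than from the clustering itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_non_ground(points, ground_z=0.05, eps=0.3, min_samples=10):
    """Split an (N, 3) point cloud into per-object clusters after removing
    near-ground points; DBSCAN label -1 marks noise and is dropped."""
    non_ground = points[points[:, 2] > ground_z]
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(non_ground)
    return {int(k): non_ground[labels == k] for k in set(labels) if k != -1}

# usage: the resulting clusters would be passed to the recognition stage for semantic labeling
clusters = cluster_non_ground(np.random.rand(5000, 3) * [10, 10, 2])
```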
step 2) automatic generation of task targets
The core of automatic task target generation is procedural generation: all models in the model library are grouped into several abstraction layers, such as obstacles, rewards and threats, and the generation rule of each abstraction layer is defined according to the dependencies among the different abstraction layers.
The specific process of step 2) is as follows (a code sketch of the S1-S7 procedure is given after step S7):
the method comprises the following steps of S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
s2, according to the size of the two-dimensional map, representing the positions of a series of random points by generating a certain number of random seeds, and inputting the generated random seeds into a Voronoi-Dirichlet mosaic algorithm;
s3, performing Dirony triangulation by a Voronoi-Dirichlet mosaic algorithm based on input random seeds, and dividing the two-dimensional map into polygons by using the vertical bisectors of the sides of each triangle obtained by the triangulation, namely obtaining a Voronoi diagram;
s4, randomly selecting polygons with a certain proportion as obstacle areas, and then randomly selecting blank vertexes as positions of the obstacle point, threat point or reward point models in the model library;
s5, placing a polygonal prism as a barrier at a position corresponding to a barrier polygon in the three-dimensional map, and placing a barrier point, a threat point or a reward point model on the ground at a position corresponding to the selected blank vertex, namely completing the mapping from the two-dimensional map to the three-dimensional task scene map, as shown in FIG. 4;
s6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, otherwise, correcting and placing the position of the model which does not meet the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
the constraint conditions of the scene are as follows: the model and the model in the three-dimensional task scene map after the model is added and the model and the barrier are not covered and overlapped, and a closed area is formed;
and S7, randomly generating the number of the robots and the initial positions of the robots to complete the generation of random game confrontation tasks.
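The sketch referred to above illustrates steps S2-S4 with SciPy's Voronoi implementation: random seed points are generated according to the map size, the Voronoi diagram is built, a fraction of the bounded regions is selected as obstacle areas, and a few vertices are selected as candidate obstacle/threat/reward point positions. The seed count, obstacle ratio and item count are illustrative assumptions, and the constraint check of step S6 is omitted.

```python
import numpy as np
from scipy.spatial import Voronoi

def generate_task_layout(map_w, map_h, n_seeds=40, obstacle_ratio=0.2, n_items=5, seed=None):
    rng = np.random.default_rng(seed)
    pts = rng.uniform([0.0, 0.0], [map_w, map_h], size=(n_seeds, 2))   # S2: random seed points
    vor = Voronoi(pts)                                                 # S3: Voronoi diagram
    closed = [r for r in vor.regions if r and -1 not in r]             # bounded polygons only
    n_obs = max(1, int(obstacle_ratio * len(closed)))                  # S4: obstacle regions
    obstacle_polys = [vor.vertices[closed[i]]
                      for i in rng.choice(len(closed), n_obs, replace=False)]
    inside = ((vor.vertices[:, 0] >= 0) & (vor.vertices[:, 0] <= map_w)
              & (vor.vertices[:, 1] >= 0) & (vor.vertices[:, 1] <= map_h))
    free = vor.vertices[inside]                                        # candidate item positions
    items = free[rng.choice(len(free), min(n_items, len(free)), replace=False)]
    return obstacle_polys, items                                       # overlap check (S6) not done here

# usage: obstacle polygon outlines and item positions for a 50 m x 50 m map
polys, item_positions = generate_task_layout(50.0, 50.0, seed=0)
```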
Step 3) verification of robot dynamics model
Firstly, selecting a proper dynamics simplified model of a typical robot unmanned system, setting initial kinematic parameters of the dynamics model of the robot unmanned system, and selecting a key frame to integrate IMU data according to the lowest sampling frequency of each sensor in an IMU, a laser radar and a vision camera carried by the robot unmanned system to obtain state quantity increment between adjacent key frames;
and predicting error data at the next moment according to the state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between a robot dynamics model and an actual robot system, and performing iterative compensation on initial kinematic parameters to realize accurate modeling of the dynamics of the robot and accurate verification of the dynamics parameters.
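A simplified sketch of the IMU integration between two keyframes mentioned in step 3) is given below: gyroscope and accelerometer samples are integrated at the IMU rate into rotation, velocity and position increments. Gravity compensation, bias estimation and noise handling are deliberately omitted, which is an assumption of this sketch rather than part of the described method.

```python
import numpy as np

def imu_increment(gyro, accel, dt):
    """Integrate (N, 3) gyroscope [rad/s] and accelerometer [m/s^2] samples, each held
    for dt seconds, into rotation, velocity and position increments between keyframes."""
    R = np.eye(3)          # rotation increment
    dv = np.zeros(3)       # velocity increment
    dp = np.zeros(3)       # position increment
    for w, a in zip(gyro, accel):
        dp = dp + dv * dt + 0.5 * (R @ a) * dt ** 2
        dv = dv + (R @ a) * dt
        wx, wy, wz = w * dt
        skew = np.array([[0.0, -wz,  wy],
                         [ wz, 0.0, -wx],
                         [-wy,  wx, 0.0]])
        R = R @ (np.eye(3) + skew)                        # first-order rotation update
    return R, dv, dp

# usage on synthetic 200 Hz samples between two keyframes (gravity not removed here)
R, dv, dp = imu_increment(np.zeros((200, 3)), np.tile([0.0, 0.0, 9.81], (200, 1)), 1 / 200)
```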
Step two, sequencing the task scenes from easy to difficult by using a task sequencer to obtain training courses;
the task sequencer is implemented by using a RankNet network.
The robot model and the task scene are taken as the input of the neural network, and the difficulty score of the task is taken as the output; the labels of the sample set are given by a weighted sum of the curriculum difficulty score evaluated with the A-star algorithm and the curriculum difficulty score evaluated by human intuition. The task scenes sorted from easy to difficult are taken as the curriculum. The ranking of task difficulty can be realized by constructing a RankNet with the following framework.
The working principle of the RankNet network is as follows:
A feature vector $x$ composed of the task scene and the robot dynamics model is input to the RankNet network, which maps the input feature vector to a real-valued score through a scoring function $f:\; x \mapsto s = f(x)$.
The input feature vector corresponding to task $U_i$ is denoted $x_i$, and the input feature vector corresponding to task $U_j$ is denoted $x_j$. The RankNet network performs a forward pass on $x_i$ and $x_j$ respectively, obtaining the difficulty score $s_i = f(x_i)$ corresponding to $x_i$ and the difficulty score $s_j = f(x_j)$ corresponding to $x_j$.
Let $U_i \rhd U_j$ denote that task $U_i$ scores higher than task $U_j$. The predicted relevance probability $P_{ij}$ that task $U_i$ scores higher than task $U_j$ is:

$$P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}$$

This probability is a sigmoid function whose shape is determined by the parameter $\sigma$; $\sigma$ is a constant whose value is chosen from experience and the actual situation, and $e$ is the base of the natural logarithm.
$S_{ij}$ denotes the ground-truth difficulty comparison between task $U_i$ and task $U_j$:

$$S_{ij} = \begin{cases} \;\;\,1, & \text{task } U_i \text{ is more difficult than task } U_j \\ \;\;\,0, & \text{the two tasks are equally difficult} \\ -1, & \text{task } U_j \text{ is more difficult than task } U_i \end{cases}$$

The true relevance probability $\bar{P}_{ij}$ is:

$$\bar{P}_{ij} = \tfrac{1}{2}\left(1 + S_{ij}\right)$$

RankNet thus compares the difficulty of tasks probabilistically: instead of directly judging which of task $U_i$ and task $U_j$ is more difficult, it states that the probability that task $U_i$ is more difficult than task $U_j$ is $P_{ij}$, and it takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
The RankNet network model is trained by a pairwise method.
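A minimal PyTorch sketch of pairwise RankNet training consistent with the formulas above is given below: a shared scoring network f produces s_i and s_j, the predicted probability is the sigmoid of σ(s_i − s_j), and the binary cross-entropy against the target (1 + S_ij)/2 is minimized. The network width, the feature dimension and σ = 1 are illustrative choices, not values specified by the invention.

```python
import torch
import torch.nn as nn

class RankNet(nn.Module):
    def __init__(self, in_dim, hidden=64, sigma=1.0):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.sigma = sigma

    def forward(self, x_i, x_j):
        s_i, s_j = self.f(x_i), self.f(x_j)               # difficulty scores
        return torch.sigmoid(self.sigma * (s_i - s_j))    # predicted probability P_ij

def train_step(model, opt, x_i, x_j, S_ij):
    """One pairwise update; S_ij in {-1, 0, 1}, target probability = (1 + S_ij) / 2."""
    p_ij = model(x_i, x_j)
    target = (1.0 + S_ij) / 2.0
    loss = nn.functional.binary_cross_entropy(p_ij, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage with hypothetical 32-dimensional task feature vectors
model = RankNet(in_dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, opt, torch.randn(8, 32), torch.randn(8, 32), torch.ones(8, 1))
```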
Step three, carrying out self-optimized reinforcement learning training on a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in an algorithm operation container module based on the training courses generated in the step two;
and step four, the feedback evaluation module outputs the scores of the robot for the task execution conditions according to the reinforcement learning training results in the step three, and then feeds the scores of the robot for the task execution conditions back to the algorithm running container module to guide the training of the target recognition algorithm, the robot path planning algorithm and the game confrontation decision algorithm.
The above-described calculation examples of the present invention are merely to describe the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications can be made on the basis of the foregoing description, and it is not intended to exhaust all of the embodiments, and all obvious variations and modifications which fall within the scope of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A robot strategy training system based on curriculum reinforcement learning is characterized in that the system comprises an algorithm operation container module, a training curriculum generation module and a feedback evaluation module, wherein:
the training course generation module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the difficulty of the tasks from easy to difficult through a neural network to obtain courses;
the algorithm operation container module is used for configuring operation containers for a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm so as to carry out self-optimization reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to courses obtained by the training course generation module;
the feedback evaluation module is used for carrying out robot self-organization reinforcement training according to the training error of the robot, outputting the score of the robot for the task execution condition according to the robot self-organization reinforcement training result, and feeding back the score of the robot for the task execution condition to the algorithm operation container module to guide self-optimization reinforcement learning algorithm training.
2. The course reinforcement learning-based robot strategy training system as claimed in claim 1, wherein the target recognition algorithm is a YOLOv3 algorithm, the robot path planning algorithm is an artificial potential field algorithm, and the game countermeasure decision-making algorithm is a PPO algorithm.
3. The training method of the course reinforcement learning-based robot strategy training system according to claim 1, wherein the method specifically comprises the following steps:
firstly, a task generator is utilized to carry out real scene three-dimensional detection reconstruction and task scene intelligent environment autonomous generation;
step two, sequencing the task scenes from easy to difficult by using a task sequencer to obtain training courses;
thirdly, performing self-optimized reinforcement learning training on a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in the algorithm operation container module based on the training courses generated in the second step;
and step four, the feedback evaluation module outputs the scores of the robot for the task execution conditions according to the reinforcement learning training results in the step three, and then feeds the scores of the robot for the task execution conditions back to the algorithm running container module to guide the training of the target recognition algorithm, the robot path planning algorithm and the game confrontation decision algorithm.
4. The training method of the course reinforcement learning-based robot strategy training system according to claim 3, wherein the real scene three-dimensional detection and reconstruction comprises the following specific processes:
step 1, obtaining depth image
Shooting depth images of the same scene under different angles and illumination intensities through a vision camera and a laser radar;
step 2, preprocessing of depth image
Denoising the depth image by Gaussian filtering, and repairing the denoised depth image by a DeepFillv2 algorithm to obtain a recovered depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image in the world coordinate system by using the calculated conversion relation; carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
step 4, point cloud registration
Matching and superposing distortion compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination intensities to a world coordinate system according to the translation vector and the rotation matrix of each frame by taking the public part of the scene as a reference to obtain a point cloud space after registration;
step 5, fusing point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the point cloud space after registration into cubes by using the grid, namely dividing the point cloud space after registration into voxels; simulating a surface by assigning a distance field value to each voxel;
step 6, surface Generation
And (5) processing the result obtained in the step (5) by adopting an MC algorithm to generate a three-dimensional surface, namely the task scene map.
5. The training method of the course reinforcement learning-based robot strategy training system according to claim 4, wherein the specific process of step 6 is as follows:
the eight neighboring samples of the data field are stored at the eight vertices of a voxel; after a potential value T is selected, for an edge of a boundary voxel, when one endpoint is larger than T and the other endpoint is smaller than T, a vertex of the isosurface exists on that edge; after all twelve edges in the voxel are traversed, the intersections of the twelve edges with the isosurface in the voxel are obtained and the triangular surface patches in the voxel are constructed; these triangular surface patches divide the voxel into two regions, one above and one below the potential value; all the triangular surface patches in the voxel are connected to form the isosurface of the voxel, the isosurfaces of all the voxels are combined to form a complete three-dimensional surface, and the formed complete three-dimensional surface is used as the task scene map.
6. The training method of the course reinforcement learning-based robot strategy training system as claimed in claim 5, wherein the specific process of the autonomous generation of the task scenario intelligent environment is as follows:
step 1) task scene segmentation generation
Constructing an adjacency matrix of voxels around each point in the point cloud by using each voxel obtained after segmentation in the step 5, and weighting the edges of the targets according to the adjacency matrix to complete the separation of overlapped targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of each object;
performing data association on the segmented object 3D point cloud model and a task scene map, judging each segmented object type, and adding the object type into a model base;
dividing non-ground point cloud clusters into point cloud clusters of different categories, namely constructing an integral three-dimensional semantic map;
step 2) automatic generation of task targets
The specific process of the step 2) is as follows:
s1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
s2, according to the size of the two-dimensional map, representing the position of a random point by generating random seeds, and inputting the generated random seeds into a Voronoi-Dirichlet mosaic algorithm;
s3, carrying out Dirony triangulation by the Voronoi-Dirichlet mosaic algorithm based on the input random seeds, and dividing the two-dimensional map into polygons by using the vertical bisector of the side of each triangle obtained by the triangulation, namely obtaining a Voronoi diagram;
s4, randomly selecting a polygon as an obstacle area, and randomly selecting a blank vertex as the position of an obstacle point, a threat point or a reward point model in a model library;
s5, placing a polygonal prism as an obstacle at a position corresponding to an obstacle polygon in the three-dimensional map, and placing an obstacle point, a threat point or a reward point model on the ground at a position corresponding to the selected blank vertex, namely completing the mapping from the two-dimensional map to the three-dimensional task scene map;
s6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, if not, correcting and placing the position of the model which does not meet the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
s7, randomly generating the number of the robots and the initial positions of the robots to complete the generation of random game confrontation tasks;
step 3) verification of robot dynamics model
Setting initial kinematic parameters of a dynamic model of the robot unmanned system, selecting a key frame according to the lowest sampling frequency of each sensor in an IMU, a laser radar and a vision camera carried by the robot unmanned system to integrate IMU data, and obtaining state quantity increment between adjacent key frames;
and predicting error data at the next moment according to the state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between a robot dynamics model and an actual robot system, and performing iterative compensation on the initial kinematic parameters to realize dynamics modeling of the robot and verification of the dynamics parameters.
7. The training method of course reinforcement learning-based robot strategy training system as claimed in claim 6, wherein said task sequencer is implemented by using RankNet network.
8. The training method of the course reinforcement learning-based robot strategy training system according to claim 7, wherein the RankNet network works according to the following principle:
a feature vector $x$ composed of the task scene and the robot dynamics model is input to the RankNet network, which maps the input feature vector to a real-valued score through a scoring function $f:\; x \mapsto s = f(x)$;
the input feature vector corresponding to task $U_i$ is denoted $x_i$, and the input feature vector corresponding to task $U_j$ is denoted $x_j$; the RankNet network performs a forward pass on $x_i$ and $x_j$ respectively, obtaining the difficulty score $s_i = f(x_i)$ corresponding to $x_i$ and the difficulty score $s_j = f(x_j)$ corresponding to $x_j$;
with $U_i \rhd U_j$ denoting that task $U_i$ scores higher than task $U_j$, the predicted relevance probability $P_{ij}$ that task $U_i$ scores higher than task $U_j$ is:

$$P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}$$

where the parameter $\sigma$ is a constant and $e$ is the base of the natural logarithm;
with $S_{ij}$ denoting the ground-truth difficulty comparison between task $U_i$ and task $U_j$:

$$S_{ij} = \begin{cases} \;\;\,1, & \text{task } U_i \text{ is more difficult than task } U_j \\ \;\;\,0, & \text{the two tasks are equally difficult} \\ -1, & \text{task } U_j \text{ is more difficult than task } U_i \end{cases}$$

the true relevance probability $\bar{P}_{ij}$ is:

$$\bar{P}_{ij} = \tfrac{1}{2}\left(1 + S_{ij}\right)$$

the RankNet compares the difficulty between tasks probabilistically: instead of directly judging which of task $U_i$ and task $U_j$ is more difficult, it states that the probability that task $U_i$ is more difficult than task $U_j$ is $P_{ij}$, and it takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
9. The training method of the robot strategy training system based on curriculum reinforcement learning according to claim 8, wherein the RankNet network model is trained by using a pairwise method.
10. The training method of the course reinforcement learning-based robot strategy training system according to claim 9, wherein the obtained point cloud data is subjected to distortion compensation to obtain distortion-compensated point cloud data; the specific process comprises the following steps:
calculating, for each raw point cloud datum acquired by the laser radar in a frame, the time difference relative to the laser point cloud data at the initial moment of the frame; calculating the motion information of the robot by using an IMU (inertial measurement unit); respectively calculating, according to the time difference and the motion information, the transformation matrix of the laser radar coordinate system at the acquisition moment of each raw laser point relative to the laser radar coordinate system at the initial moment of the frame; and multiplying the transformation matrix by the corresponding coordinates of the raw laser point to obtain the distortion-compensated laser point cloud coordinates.
CN202211227150.7A 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning Pending CN115454096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211227150.7A CN115454096A (en) 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211227150.7A CN115454096A (en) 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning

Publications (1)

Publication Number Publication Date
CN115454096A true CN115454096A (en) 2022-12-09

Family

ID=84309007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211227150.7A Pending CN115454096A (en) 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning

Country Status (1)

Country Link
CN (1) CN115454096A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118182538A (en) * 2024-05-17 2024-06-14 北京理工大学前沿技术研究院 Unprotected left-turn scene decision planning method and system based on course reinforcement learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110327624A (en) * 2019-07-03 2019-10-15 广州多益网络股份有限公司 A kind of game follower method and system based on course intensified learning
CN112633466A (en) * 2020-10-28 2021-04-09 华南理工大学 Memory-keeping course learning method facing difficult exploration environment
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method
CN114290339A (en) * 2022-03-09 2022-04-08 南京大学 Robot reality migration system and method based on reinforcement learning and residual modeling
CN114529010A (en) * 2022-01-28 2022-05-24 广州杰赛科技股份有限公司 Robot autonomous learning method, device, equipment and storage medium
CN114578860A (en) * 2022-03-28 2022-06-03 中国人民解放军国防科技大学 Large-scale unmanned aerial vehicle cluster flight method based on deep reinforcement learning
CN114741886A (en) * 2022-04-18 2022-07-12 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN114952828A (en) * 2022-05-09 2022-08-30 华中科技大学 Mechanical arm motion planning method and system based on deep reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110327624A (en) * 2019-07-03 2019-10-15 广州多益网络股份有限公司 A kind of game follower method and system based on course intensified learning
CN112633466A (en) * 2020-10-28 2021-04-09 华南理工大学 Memory-keeping course learning method facing difficult exploration environment
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method
CN114529010A (en) * 2022-01-28 2022-05-24 广州杰赛科技股份有限公司 Robot autonomous learning method, device, equipment and storage medium
CN114290339A (en) * 2022-03-09 2022-04-08 南京大学 Robot reality migration system and method based on reinforcement learning and residual modeling
CN114578860A (en) * 2022-03-28 2022-06-03 中国人民解放军国防科技大学 Large-scale unmanned aerial vehicle cluster flight method based on deep reinforcement learning
CN114741886A (en) * 2022-04-18 2022-07-12 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN114952828A (en) * 2022-05-09 2022-08-30 华中科技大学 Mechanical arm motion planning method and system based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAOYU TIAN 等: "Task-Extended Utility Tensor Method for Decentralized Multi-Vehicle Mission Planning", 《IEEE》, 31 October 2023 (2023-10-31) *
林一炯: "Research on Efficient Autonomous Learning Methods for Robots Based on Deep Reinforcement Learning" (in Chinese), 《CNKI》, 31 July 2020 (2020-07-31) *
胡欢: "Research on Robot Control Problems Based on Deep Reinforcement Learning" (in Chinese), 《CNKI》, 31 December 2021 (2021-12-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118182538A (en) * 2024-05-17 2024-06-14 北京理工大学前沿技术研究院 Unprotected left-turn scene decision planning method and system based on course reinforcement learning

Similar Documents

Publication Publication Date Title
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
US20230161352A1 (en) Dynamic obstacle avoidance method based on real-time local grid map construction
Paton et al. Bridging the appearance gap: Multi-experience localization for long-term visual teach and repeat
Thorpe et al. Vision and navigation for the Carnegie Mellon Navlab
CN111429514A (en) Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
Tovar et al. Planning exploration strategies for simultaneous localization and mapping
CN111486855A (en) Indoor two-dimensional semantic grid map construction method with object navigation points
Hardouin et al. Next-Best-View planning for surface reconstruction of large-scale 3D environments with multiple UAVs
CN112950645B (en) Image semantic segmentation method based on multitask deep learning
CN105069843A (en) Rapid extraction method for dense point cloud oriented toward city three-dimensional modeling
CN111998862B (en) BNN-based dense binocular SLAM method
Shan et al. LiDAR-based stable navigable region detection for unmanned surface vehicles
Jiao et al. 2-entity random sample consensus for robust visual localization: Framework, methods, and verifications
Li et al. Learning view and target invariant visual servoing for navigation
CN117214904A (en) Intelligent fish identification monitoring method and system based on multi-sensor data
CN115454096A (en) Robot strategy training system and training method based on curriculum reinforcement learning
Short et al. Abio-inspiredalgorithminimage-based pathplanning and localization using visual features and maps
Zhou et al. Place recognition and navigation of outdoor mobile robots based on random Forest learning with a 3D LiDAR
Giordano et al. 3D structure identification from image moments
CN115690343A (en) Robot laser radar scanning and mapping method based on visual following
Yan et al. Mobile robot 3D map building and path planning based on multi–sensor data fusion
Liu et al. Laser 3D tightly coupled mapping method based on visual information
Lee et al. Road following in an unstructured desert environment based on the EM (expectation-maximization) algorithm
Manderson et al. Gaze selection for enhanced visual odometry during navigation
Guo et al. 3D object detection and tracking based on streaming data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination