CN115454096A - Robot strategy training system and training method based on curriculum reinforcement learning - Google Patents

Robot strategy training system and training method based on curriculum reinforcement learning

Info

Publication number
CN115454096A
CN115454096A (application CN202211227150.7A)
Authority
CN
China
Prior art keywords
task
robot
algorithm
training
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211227150.7A
Other languages
Chinese (zh)
Inventor
吴立刚
董博
王淼
王夏爽
姚蔚然
田昊宇
丁季时雨
孙科武
杨皙睿
孙光辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Second Research Institute Of Casic
Harbin Institute of Technology
Original Assignee
Second Research Institute Of Casic
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Second Research Institute Of Casic, Harbin Institute of Technology filed Critical Second Research Institute Of Casic
Priority to CN202211227150.7A priority Critical patent/CN115454096A/en
Publication of CN115454096A publication Critical patent/CN115454096A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0257Control of position or course in two dimensions specially adapted to land vehicles using a radar

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Electromagnetism (AREA)
  • Manipulator (AREA)

Abstract

A robot strategy training system and training method based on curriculum reinforcement learning belong to the field of autonomous decision-making and control of unmanned systems. The invention addresses the difficulty that existing methods have in obtaining good decision-making and control performance when training robot strategies. For the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a curriculum-learning-based training architecture for multi-robot joint task decision-making. Taking into account the progressive increase of task difficulty during training, a parameter autonomous generation algorithm and a target autonomous generation algorithm based on the complex-environment dynamics model are established. On this basis, a curriculum difficulty evaluation and calibration algorithm is established and fed back to the self-optimizing reinforcement learning algorithm. The method can be applied to autonomous decision-making and control of unmanned systems.

Description

Robot strategy training system and training method based on curriculum reinforcement learning
Technical Field
The invention belongs to the field of autonomous decision making and control of unmanned systems, and particularly relates to a robot strategy training system and a robot strategy training method based on curriculum reinforcement learning.
Background
Autonomous decision-making for multiple robots has been a research hotspot in recent years and is widely applied in fields such as the military and industry. The training of an autonomous decision strategy is usually achieved by machine learning. Curriculum learning builds on reinforcement learning and borrows the human idea of progressing from easy to difficult: the model first learns easy samples, and the sample difficulty is then gradually increased, which yields faster training and better training results. The core of curriculum learning lies in the autonomous generation of training tasks and the autonomous ranking of task difficulty. Research on autonomous task generation is still limited; proposed methods include training a progressively generalizing problem solver in an unsupervised manner, and taking the final task as a template, setting a parameter vector and adjusting the parameters to obtain intermediate tasks. However, the existing autonomous task generation methods are not very effective at generating training tasks for robot strategies. The mainstream methods for autonomous ranking of task difficulty either only reorder the samples of the final task without changing the task itself, change certain aspects of the MDP to create intermediate tasks with different MDP structures, or incorporate human judgements of task difficulty into the ranking through a human-in-the-loop method. However, the ranking results obtained with the existing autonomous ranking methods are not accurate enough, and some of these methods are coupled with task generation, so they are not suitable for robot strategy training.
In summary, the existing autonomous task generation methods and autonomous ranking methods perform poorly for robot strategy training, and it is difficult to obtain good decision-making and control results with them.
Disclosure of Invention
The invention aims to provide a robot strategy training system and training method based on curriculum reinforcement learning, in order to solve the problem that existing methods struggle to obtain good decision-making and control results when training robot strategies, a problem that stems from the poor effectiveness of existing autonomous task generation methods at generating robot strategy training tasks and from the poor accuracy of the ranking results obtained by existing autonomous ranking methods.
The technical scheme adopted by the invention for solving the technical problems is as follows:
based on one aspect of the invention, a robot strategy training system based on curriculum reinforcement learning comprises an algorithm operation container module, a training curriculum generation module and a feedback evaluation module, wherein:
the training course generation module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the difficulty of the tasks from easy to difficult through a neural network to obtain courses;
the algorithm operation container module is used for configuring operation containers for a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm so as to carry out self-optimization reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to courses obtained by the training course generation module;
the feedback evaluation module is used for carrying out robot self-organization reinforcement training according to the training error of the robot, outputting the score of the robot for the task execution condition according to the robot self-organization reinforcement training result, and feeding the score of the robot for the task execution condition back to the algorithm operation container module to guide self-optimization reinforcement learning algorithm training.
Further, the target identification algorithm is a YOLOv3 algorithm, the robot path planning algorithm is an artificial potential field algorithm, and the game countermeasure decision algorithm is a PPO algorithm.
Based on another aspect of the invention, a robot strategy training method based on curriculum reinforcement learning specifically comprises the following steps:
step one, performing real scene three-dimensional detection reconstruction and task scene intelligent environment autonomous generation by using a task generator;
step two, sequencing the task scenes from easy to difficult by using a task sequencer to obtain training courses;
thirdly, performing self-optimized reinforcement learning training on a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in the algorithm operation container module based on the training courses generated in the second step;
and step four, the feedback evaluation module outputs the scores of the robot for the task execution conditions according to the reinforcement learning training results in the step three, and then feeds the scores of the robot for the task execution conditions back to the algorithm operation container module to guide the training of the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm.
Further, the real scene three-dimensional detection reconstruction specifically comprises the following processes:
step 1, obtaining depth image
Shooting depth images of the same scene under different angles and illumination intensities through a vision camera and a laser radar;
step 2, preprocessing of depth image
Denoising the depth image by Gaussian filtering, and repairing the denoised depth image by a DeepFillv2 algorithm to obtain a recovered depth image, namely a preprocessed depth image;
step 3, calculating point cloud data by the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image in the world coordinate system by using the calculated conversion relation; carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
step 4, point cloud registration
Matching and superposing distortion compensated point cloud coordinates corresponding to multiple frames of preprocessed depth images under different angles and illumination into a world coordinate system according to the translation vector and the rotation matrix of each frame by taking the public part of a scene as a reference to obtain a point cloud space after registration;
step 5, fusing point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the point cloud space after registration into cubes by using the grid, namely dividing the point cloud space after registration into voxels; simulating a surface by assigning a distance field value to each voxel;
step 6, surface Generation
And (5) processing the result obtained in the step (5) by adopting an MC algorithm to generate a three-dimensional surface, namely the task scene map.
Further, the specific process of step 6 is:
The eight neighboring samples of the data field are stored at the eight vertices of a voxel. After a potential value T is selected, for an edge of a boundary voxel, when one endpoint is larger than T and the other endpoint is smaller than T, a vertex of the isosurface exists on that edge. After all twelve edges in the voxel are traversed, the intersections of the twelve edges with the isosurface in the voxel are obtained, and the triangular surface patches in the voxel are constructed; these triangular surface patches divide the voxel into two regions, one above and one below the potential value. All the triangular surface patches in the voxel are connected to form the isosurface of the voxel, the isosurfaces of all the voxels are combined to form a complete three-dimensional surface, and the formed complete three-dimensional surface is used as the task scene map.
Further, the specific process of the autonomous generation of the task scene intelligent environment is as follows:
step 1) task scene segmentation generation
Constructing an adjacency matrix of voxels around each point in the point cloud by using each voxel obtained after segmentation in the step 5, and weighting the edges of the targets according to the adjacency matrix to complete the separation of overlapped targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of each object;
performing data association on the segmented object 3D point cloud model and a task scene map, judging each segmented object type, and adding the object type into a model base;
dividing non-ground point cloud clusters into different types of point cloud clusters to construct an integral three-dimensional semantic map;
step 2) automatic generation of task targets
The specific process of the step 2) is as follows:
the method comprises the following steps of S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
s2, according to the size of the two-dimensional map, representing the position of a random point by generating random seeds, and inputting the generated random seeds into a Voronoi-Dirichlet mosaic algorithm;
s3, carrying out Dirony triangulation by the Voronoi-Dirichlet mosaic algorithm based on the input random seeds, and dividing the two-dimensional map into polygons by using the vertical bisector of the side of each triangle obtained by the triangulation, namely obtaining a Voronoi diagram;
s4, randomly selecting a polygon as an obstacle area, and then randomly selecting a blank vertex as the position of an obstacle point, a threat point or a reward point model in a model library;
s5, placing a polygonal prism as an obstacle at a position corresponding to an obstacle polygon in the three-dimensional map, and placing an obstacle point, a threat point or a reward point model on the ground at a position corresponding to the selected blank vertex, namely completing the mapping from the two-dimensional map to the three-dimensional task scene map;
s6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, if not, correcting and placing the position of the model which does not meet the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
s7, randomly generating the number of the robots and the initial positions of the robots to complete the generation of random game confrontation tasks;
step 3) verification of robot dynamics model
Setting initial kinematic parameters of a dynamic model of the robot unmanned system, selecting a key frame according to the lowest sampling frequency of each sensor in an IMU, a laser radar and a vision camera carried by the robot unmanned system to integrate IMU data, and obtaining state quantity increment between adjacent key frames;
and predicting error data at the next moment according to the state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between a robot dynamics model and an actual robot system, and performing iterative compensation on the initial kinematic parameters to realize dynamics modeling of the robot and verification of the dynamics parameters.
Further, the task sequencer is implemented by using a RankNet network.
Further, the working principle of the RankNet network is as follows:
A feature vector $x$ composed of the task scene and the robot dynamics model is input to the RankNet network, which maps the input feature vector to a real-valued score through a scoring function $f:\; x \mapsto s = f(x)$.
The input feature vector corresponding to task $U_i$ is denoted $x_i$, and the input feature vector corresponding to task $U_j$ is denoted $x_j$. The RankNet network performs a forward pass on $x_i$ and $x_j$ respectively, obtaining the difficulty score $s_i = f(x_i)$ corresponding to $x_i$ and the difficulty score $s_j = f(x_j)$ corresponding to $x_j$.
Let $U_i \rhd U_j$ denote that task $U_i$ scores higher than task $U_j$. The predicted relevance probability $P_{ij}$ that task $U_i$ scores higher than task $U_j$ is:

$$P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}$$

where the parameter $\sigma$ is a constant and $e$ is the base of the natural logarithm.
$S_{ij}$ denotes the ground-truth difficulty comparison between task $U_i$ and task $U_j$:

$$S_{ij} = \begin{cases} \;\;\,1, & \text{task } U_i \text{ is more difficult than task } U_j \\ \;\;\,0, & \text{the two tasks are equally difficult} \\ -1, & \text{task } U_j \text{ is more difficult than task } U_i \end{cases}$$

The true relevance probability $\bar{P}_{ij}$ is:

$$\bar{P}_{ij} = \tfrac{1}{2}\left(1 + S_{ij}\right)$$

RankNet thus compares the difficulty of tasks probabilistically: instead of directly judging which of task $U_i$ and task $U_j$ is more difficult, it states that the probability that task $U_i$ is more difficult than task $U_j$ is $P_{ij}$, and it takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
Further, the RankNet network model is trained by adopting a pairwise method.
Further, performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation; the specific process comprises the following steps:
calculating, for each raw point cloud datum acquired by the laser radar in a frame, the time difference relative to the laser point cloud data at the initial moment of the frame; calculating the motion information of the robot by using an IMU (inertial measurement unit); calculating, according to the time difference and the motion information, the transformation matrix of the laser radar coordinate system at the acquisition moment of each raw laser point relative to the laser radar coordinate system at the initial moment of the frame; and multiplying the transformation matrix by the corresponding coordinates of the raw laser point to obtain the distortion-compensated laser point cloud coordinates.
The invention has the beneficial effects that:
Aiming at the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a curriculum-learning-based training architecture for multi-robot joint task decision-making. Taking into account the progressive increase of task difficulty during training, a parameter autonomous generation algorithm and a target autonomous generation algorithm based on the complex-environment dynamics model are established. On this basis, a curriculum difficulty evaluation and calibration algorithm is established and fed back to the self-optimizing reinforcement learning algorithm. Meanwhile, the invention saves a large amount of labor cost, enables autonomous training of a multi-robot system in a complex environment, and solves the problem that existing methods find it difficult to obtain good decision-making and control results when training robot strategies.
Drawings
FIG. 1 is an algorithm architecture diagram for curriculum reinforcement learning;
FIG. 2 is a flowchart of the work of the training course generation module;
FIG. 3 is a schematic diagram of a scene autonomous detection system comprised of a vision camera, a lidar and an IMU;
FIG. 4 is a schematic diagram of the Voronoi-Dirichlet mosaic algorithm segmentation and model addition.
Detailed Description
First embodiment: this embodiment is described with reference to FIG. 1. This embodiment provides a robot strategy training system based on curriculum reinforcement learning. The system comprises an algorithm operation container module, a training course generation module and a feedback evaluation module, wherein:
the training course generation module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks according to difficulty through the neural network to obtain courses and provide a training scene for the reinforcement learning algorithm;
the algorithm operation container module is used for configuring operation containers for a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm so as to carry out self-optimization reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to courses obtained by the training course generation module;
the feedback evaluation module is used for carrying out robot self-organization reinforcement training according to the training error of the robot, outputting the score of the robot for the task execution condition according to the robot self-organization reinforcement training result, and feeding back the score of the robot for the task execution condition to the algorithm operation container module to guide self-optimization reinforcement learning algorithm training.
The robot self-organization reinforcement training is the training of the method for selecting the evaluation-index weights: under different scenes and different tasks, several different task objectives are used for evaluation, and a weighted sum of them is finally taken as the score of the task execution; the robot self-organization reinforcement training refers to the training of these weights.
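As a minimal sketch of this weighted scoring, the following Python snippet combines several per-objective metrics into a single execution score; the metric names and initial weight values are illustrative assumptions (the weights themselves are what the self-organization reinforcement training would adjust), not values specified by the invention.

```python
import numpy as np

def task_score(metrics, weights):
    """Weighted sum of normalized per-objective metrics -> scalar execution score."""
    keys = sorted(metrics)
    m = np.array([metrics[k] for k in keys])
    w = np.array([weights[k] for k in keys])
    return float(m @ (w / w.sum()))          # normalize weights so the score stays in [0, 1]

# hypothetical objectives for one episode, each already normalized to [0, 1]
score = task_score({"completion": 0.8, "time": 0.6, "collision_free": 1.0},
                   {"completion": 0.5, "time": 0.2, "collision_free": 0.3})
```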
The curriculum learning method proposed by the invention can learn strategies that use only local information (i.e., each robot's own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular inter-robot communication method, and is applicable not only to cooperative interaction but also to competitive or mixed interaction environments (mixed cooperative-competitive environments) involving physical and informational behaviors.
Second embodiment: this embodiment differs from the first embodiment in that the target recognition algorithm is the YOLOv3 algorithm, the robot path planning algorithm is the artificial potential field algorithm, and the game countermeasure decision algorithm is the PPO algorithm.
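For illustration, a minimal Python sketch of one artificial potential field planning step is given below; the gains, influence radius and step size are assumed placeholder values, and the sketch is not tied to the YOLOv3 or PPO components.

```python
import numpy as np

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=2.0, step=0.05):
    """One artificial-potential-field update: attraction toward the goal plus
    repulsion from every obstacle closer than the influence radius d0."""
    force = k_att * (goal - pos)                          # attractive term
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:                                 # only nearby obstacles repel
            force += k_rep * (1.0 / d - 1.0 / d0) / d**3 * diff
    return pos + step * force / (np.linalg.norm(force) + 1e-9)

# usage: iterate until the robot is close enough to the goal
pos, goal = np.array([0.0, 0.0]), np.array([5.0, 5.0])
obstacles = [np.array([2.5, 2.6])]
for _ in range(500):
    pos = apf_step(pos, goal, obstacles)
    if np.linalg.norm(goal - pos) < 0.1:
        break
```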
Third embodiment: the robot strategy training method based on curriculum reinforcement learning of this embodiment specifically includes the following steps:
firstly, a task generator is utilized to carry out real scene three-dimensional detection reconstruction and intelligent environment autonomous generation of a complex task scene;
this embodiment will be described with reference to fig. 2. The real scene three-dimensional detection reconstruction method comprises the following specific processes:
step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination intensities through a vision camera and a laser radar;
step 2, preprocessing of depth image
Denoising the depth image by Gaussian filtering, and repairing the denoised depth image by a DeepFillv2 algorithm to obtain a recovered depth image, namely a preprocessed depth image;
step 3, calculating point cloud data by the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image in the world coordinate system by using the calculated conversion relation; carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
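A minimal sketch of step 3 is shown below: a depth image is back-projected through the pinhole imaging model and transformed into the world coordinate system. The intrinsic parameters and the camera pose used here are placeholders standing in for a calibrated sensor, not values given in the text.

```python
import numpy as np

def depth_to_world_points(depth, fx, fy, cx, cy, T_world_cam):
    """Back-project an (H, W) depth image (meters) to an (N, 3) point cloud
    expressed in the world coordinate system."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - cx) * z / fx                         # pinhole back-projection
    y = (v.ravel() - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    return (T_world_cam @ pts_cam.T).T[:, :3]             # camera frame -> world frame

# hypothetical intrinsics and an identity camera pose, for illustration only
K = dict(fx=525.0, fy=525.0, cx=319.5, cy=239.5)
points = depth_to_world_points(np.random.rand(480, 640), T_world_cam=np.eye(4), **K)
```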
carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation; the specific process comprises the following steps:
calculating, for each raw point cloud datum acquired by the laser radar in a frame, the time difference relative to the laser point cloud data at the initial moment of the frame; calculating the motion information of the robot by using an IMU (inertial measurement unit); respectively calculating, according to the time difference and the motion information, the transformation matrix of the laser radar coordinate system at the acquisition moment of each raw laser point relative to the laser radar coordinate system at the initial moment of the frame; and multiplying the transformation matrix by the corresponding coordinates of the raw laser point to obtain the distortion-compensated laser point cloud coordinates;
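The following sketch illustrates this distortion compensation under a simplifying constant-velocity assumption: the IMU-derived planar velocity and yaw rate are held constant over one sweep, and each point is transformed back to the laser radar frame at the start of the frame according to its time offset. This is an illustrative approximation of the per-point transformation matrices described above.

```python
import numpy as np

def deskew_scan(points, t_offsets, lin_vel, yaw_rate):
    """Transform every (x, y, z) point back into the lidar frame at the start of the
    scan, assuming constant planar velocity and yaw rate over the sweep."""
    out = np.empty_like(points)
    for i, (p, dt) in enumerate(zip(points, t_offsets)):
        yaw = yaw_rate * dt                               # rotation accumulated since frame start
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        t = lin_vel * dt                                  # translation accumulated since frame start
        out[i] = R @ p + t                                # pose at acquisition time -> frame-start frame
    return out

# usage on synthetic data: 1000 points spread over a 0.1 s sweep
deskewed = deskew_scan(np.random.rand(1000, 3), np.linspace(0.0, 0.1, 1000),
                       np.array([1.0, 0.0, 0.0]), 0.2)
```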
step 4, point cloud registration
Matching and superposing distortion compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination intensities to a world coordinate system according to the translation vector and the rotation matrix of each frame by taking the public part of the scene as a reference to obtain a point cloud space after registration;
in the whole mapping and registration process, a single-frame image is transformed by a translation vector and a rotation matrix and then constructed in a world coordinate system, so that the construction of the map is realized.
The translation vector and the rotation matrix of each frame are processed by the correction module. As shown in FIG. 3, the correction module corrects the global error with a relatively independent loop detection module. The correction module judges whether a loop closure has occurred by detecting the degree of similarity between images: a bag-of-words model is used to describe image similarity, and the TF-IDF algorithm is used to calculate the keyframe similarity. The loop detection module then obtains loop candidate frames according to the keyframe similarity and judges the continuity of the loop candidate frames. Finally, the accumulated scale error and the rotation-translation error are corrected according to the similarity transformation relation of the image point cloud space, the repeated information of the map is fused, and closed-loop fusion is achieved.
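The bag-of-words loop detection can be illustrated with a small numpy sketch: each keyframe is described by a visual-word histogram, the histograms are weighted by TF-IDF, and cosine similarity between weighted descriptors yields loop-closure candidates. The vocabulary representation and the similarity threshold are illustrative assumptions.

```python
import numpy as np

def tfidf_matrix(word_counts):
    """word_counts: (num_keyframes, vocab_size) visual-word histogram per keyframe."""
    tf = word_counts / np.maximum(word_counts.sum(axis=1, keepdims=True), 1)
    df = (word_counts > 0).sum(axis=0)                    # keyframes containing each word
    idf = np.log(word_counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def loop_candidates(word_counts, query_idx, threshold=0.75):
    """Return indices of earlier keyframes whose TF-IDF descriptor is similar to the query."""
    w = tfidf_matrix(word_counts)
    q = w[query_idx]
    sims = w @ q / (np.linalg.norm(w, axis=1) * np.linalg.norm(q) + 1e-12)
    return [i for i in range(query_idx) if sims[i] > threshold]
```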
Step 5, fusing point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as the origin, and dividing the registered point cloud space into small cubes with the grid, namely dividing the registered point cloud space into voxels; the surface is simulated by assigning a signed distance field (SDF) value to each voxel;
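A compact sketch of the fusion in step 5, assuming a truncated signed distance representation: each voxel center is projected into the current depth image, a truncated signed distance is computed, and a per-voxel weighted running average is maintained. The truncation distance and the flattened-array layout are illustrative choices.

```python
import numpy as np

def integrate_depth(tsdf, weight, centers, depth, fx, fy, cx, cy, T_cam_world, trunc=0.1):
    """Update flattened per-voxel TSDF/weight arrays with one depth image."""
    pts = (T_cam_world @ np.c_[centers, np.ones(len(centers))].T).T[:, :3]   # world -> camera
    z = pts[:, 2]
    h, w = depth.shape
    in_front = z > 1e-6
    u = np.zeros(len(z), dtype=int)
    v = np.zeros(len(z), dtype=int)
    u[in_front] = np.round(pts[in_front, 0] * fx / z[in_front] + cx).astype(int)
    v[in_front] = np.round(pts[in_front, 1] * fy / z[in_front] + cy).astype(int)
    ok = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = depth[v[ok], u[ok]] - z[ok]                     # distance from voxel to observed surface
    keep = sdf > -trunc                                   # skip voxels far behind the surface
    idx = np.where(ok)[0][keep]
    d = np.clip(sdf[keep], -trunc, trunc) / trunc         # truncated, normalized to [-1, 1]
    w_new = weight[idx] + 1.0
    tsdf[idx] = (tsdf[idx] * weight[idx] + d) / w_new     # per-voxel running average
    weight[idx] = w_new
    return tsdf, weight
```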
step 6, surface Generation
Processing the result obtained in step 5 by adopting the MC algorithm to generate a three-dimensional surface, namely obtaining the task scene map;
the specific process of step 6 is as follows:
the eight neighboring samples of the data field are stored at the eight vertices of a voxel, and a proper constant is selected as the potential value T; for an edge of a boundary voxel, when one endpoint is greater than T and the other endpoint is less than T, a vertex of the isosurface exists on that edge; traversing all twelve edges in the voxel gives the intersections of the twelve edges with the isosurface in the voxel, from which the triangular surface patches in the voxel are constructed; these triangular surface patches divide the voxel into two regions, one above and one below the potential value; all the triangular surface patches in the voxel are connected to form the isosurface of the voxel, the isosurfaces of all the voxels are combined to form a complete three-dimensional surface, and the formed complete three-dimensional surface is used as the task scene map;
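If an off-the-shelf implementation is acceptable, the surface generation of step 6 can be reproduced with the marching cubes routine in scikit-image, as sketched below; the spherical test volume merely stands in for the fused distance field, and the isovalue plays the role of the potential value T.

```python
import numpy as np
from skimage import measure

# stand-in distance field: a sphere of radius 20 voxels inside a 64^3 grid
grid = np.linspace(-32, 32, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
volume = np.sqrt(x**2 + y**2 + z**2) - 20.0

# extract the zero isosurface (the potential value T of the text) as a triangle mesh
verts, faces, normals, values = measure.marching_cubes(volume, level=0.0)
print(verts.shape, faces.shape)   # vertex coordinates and triangle indices of the mesh
```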
the specific process of the intelligent environment autonomous generation of the complex task scene is as follows:
the method comprises the steps of firstly, obtaining a three-dimensional semantic map and a corresponding target model through segmentation and identification of a task scene map, constructing a model base, then, automatically generating a series of barrier, threat and reward elements in the task scene map according to a certain rule to obtain a plurality of tasks, and finally, realizing construction of a robot dynamic model by utilizing various sensors carried on a robot.
Step 1) task scene segmentation generation
Constructing an adjacency matrix of voxels around each point in the point cloud by using each voxel obtained after segmentation in the step 5, and weighting the edges of the targets according to the adjacency matrix to complete the separation of overlapped targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of each object;
performing data association on the segmented object 3D point cloud model and a task scene map, judging each segmented object type, and adding the object type into a model base;
finally, laser 3D point cloud object segmentation is carried out, non-ground point cloud clustering is segmented into point cloud clusters of different categories, and then an integral three-dimensional semantic map is constructed;
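As one possible realization of the non-ground clustering step, the sketch below uses DBSCAN from scikit-learn after a crude height-based ground removal; the eps and min_samples values are illustrative, and the semantic class of each cluster would come from the recognition stage rather than from the clustering itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_non_ground(points, ground_z=0.05, eps=0.3, min_samples=10):
    """Split an (N, 3) point cloud into per-object clusters after removing
    near-ground points; DBSCAN label -1 marks noise and is dropped."""
    non_ground = points[points[:, 2] > ground_z]
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(non_ground)
    return {int(k): non_ground[labels == k] for k in set(labels) if k != -1}

# usage: the resulting clusters would be passed to the recognition stage for semantic labeling
clusters = cluster_non_ground(np.random.rand(5000, 3) * [10, 10, 2])
```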
step 2) automatic generation of task targets
The core of automatic task target generation is procedural generation: all models in the model library are grouped into several abstraction layers, such as obstacles, rewards and threats, and the generation rule of each abstraction layer is defined according to the dependencies among the different abstraction layers.
The specific process of step 2) is as follows (a code sketch of the S1-S7 procedure is given after step S7):
the method comprises the following steps of S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
s2, according to the size of the two-dimensional map, representing the positions of a series of random points by generating a certain number of random seeds, and inputting the generated random seeds into a Voronoi-Dirichlet mosaic algorithm;
s3, performing Dirony triangulation by a Voronoi-Dirichlet mosaic algorithm based on input random seeds, and dividing the two-dimensional map into polygons by using the vertical bisectors of the sides of each triangle obtained by the triangulation, namely obtaining a Voronoi diagram;
s4, randomly selecting polygons with a certain proportion as obstacle areas, and then randomly selecting blank vertexes as positions of the obstacle point, threat point or reward point models in the model library;
s5, placing a polygonal prism as a barrier at a position corresponding to a barrier polygon in the three-dimensional map, and placing a barrier point, a threat point or a reward point model on the ground at a position corresponding to the selected blank vertex, namely completing the mapping from the two-dimensional map to the three-dimensional task scene map, as shown in FIG. 4;
s6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, otherwise, correcting and placing the position of the model which does not meet the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
the constraint conditions of the scene are as follows: the model and the model in the three-dimensional task scene map after the model is added and the model and the barrier are not covered and overlapped, and a closed area is formed;
and S7, randomly generating the number of the robots and the initial positions of the robots to complete the generation of random game confrontation tasks.
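The sketch referred to above illustrates steps S2-S4 with SciPy's Voronoi implementation: random seed points are generated according to the map size, the Voronoi diagram is built, a fraction of the bounded regions is selected as obstacle areas, and a few vertices are selected as candidate obstacle/threat/reward point positions. The seed count, obstacle ratio and item count are illustrative assumptions, and the constraint check of step S6 is omitted.

```python
import numpy as np
from scipy.spatial import Voronoi

def generate_task_layout(map_w, map_h, n_seeds=40, obstacle_ratio=0.2, n_items=5, seed=None):
    rng = np.random.default_rng(seed)
    pts = rng.uniform([0.0, 0.0], [map_w, map_h], size=(n_seeds, 2))   # S2: random seed points
    vor = Voronoi(pts)                                                 # S3: Voronoi diagram
    closed = [r for r in vor.regions if r and -1 not in r]             # bounded polygons only
    n_obs = max(1, int(obstacle_ratio * len(closed)))                  # S4: obstacle regions
    obstacle_polys = [vor.vertices[closed[i]]
                      for i in rng.choice(len(closed), n_obs, replace=False)]
    inside = ((vor.vertices[:, 0] >= 0) & (vor.vertices[:, 0] <= map_w)
              & (vor.vertices[:, 1] >= 0) & (vor.vertices[:, 1] <= map_h))
    free = vor.vertices[inside]                                        # candidate item positions
    items = free[rng.choice(len(free), min(n_items, len(free)), replace=False)]
    return obstacle_polys, items                                       # overlap check (S6) not done here

# usage: obstacle polygon outlines and item positions for a 50 m x 50 m map
polys, item_positions = generate_task_layout(50.0, 50.0, seed=0)
```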
Step 3) verification of robot dynamics model
Firstly, selecting a proper dynamics simplified model of a typical robot unmanned system, setting initial kinematic parameters of the dynamics model of the robot unmanned system, and selecting a key frame to integrate IMU data according to the lowest sampling frequency of each sensor in an IMU, a laser radar and a vision camera carried by the robot unmanned system to obtain state quantity increment between adjacent key frames;
and predicting error data at the next moment according to the state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between a robot dynamics model and an actual robot system, and performing iterative compensation on initial kinematic parameters to realize accurate modeling of the dynamics of the robot and accurate verification of the dynamics parameters.
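A simplified sketch of the IMU integration between two keyframes mentioned in step 3) is given below: gyroscope and accelerometer samples are integrated at the IMU rate into rotation, velocity and position increments. Gravity compensation, bias estimation and noise handling are deliberately omitted, which is an assumption of this sketch rather than part of the described method.

```python
import numpy as np

def imu_increment(gyro, accel, dt):
    """Integrate (N, 3) gyroscope [rad/s] and accelerometer [m/s^2] samples, each held
    for dt seconds, into rotation, velocity and position increments between keyframes."""
    R = np.eye(3)          # rotation increment
    dv = np.zeros(3)       # velocity increment
    dp = np.zeros(3)       # position increment
    for w, a in zip(gyro, accel):
        dp = dp + dv * dt + 0.5 * (R @ a) * dt ** 2
        dv = dv + (R @ a) * dt
        wx, wy, wz = w * dt
        skew = np.array([[0.0, -wz,  wy],
                         [ wz, 0.0, -wx],
                         [-wy,  wx, 0.0]])
        R = R @ (np.eye(3) + skew)                        # first-order rotation update
    return R, dv, dp

# usage on synthetic 200 Hz samples between two keyframes (gravity not removed here)
R, dv, dp = imu_increment(np.zeros((200, 3)), np.tile([0.0, 0.0, 9.81], (200, 1)), 1 / 200)
```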
Step two, sequencing the task scenes from easy to difficult by using a task sequencer to obtain training courses;
the task sequencer is implemented by using a RankNet network.
The robot model and the task scene are taken as the input of the neural network, and the difficulty score of the task is taken as the output; the labels of the sample set are given by a weighted sum of the curriculum difficulty score evaluated with the A-star algorithm and the curriculum difficulty score evaluated by human intuition. The task scenes sorted from easy to difficult are taken as the curriculum. The ranking of task difficulty can be realized by constructing a RankNet with the following framework.
The working principle of the RankNet network is as follows:
A feature vector $x$ composed of the task scene and the robot dynamics model is input to the RankNet network, which maps the input feature vector to a real-valued score through a scoring function $f:\; x \mapsto s = f(x)$.
The input feature vector corresponding to task $U_i$ is denoted $x_i$, and the input feature vector corresponding to task $U_j$ is denoted $x_j$. The RankNet network performs a forward pass on $x_i$ and $x_j$ respectively, obtaining the difficulty score $s_i = f(x_i)$ corresponding to $x_i$ and the difficulty score $s_j = f(x_j)$ corresponding to $x_j$.
Let $U_i \rhd U_j$ denote that task $U_i$ scores higher than task $U_j$. The predicted relevance probability $P_{ij}$ that task $U_i$ scores higher than task $U_j$ is:

$$P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}$$

This probability is a sigmoid function whose shape is determined by the parameter $\sigma$; $\sigma$ is a constant whose value is chosen from experience and the actual situation, and $e$ is the base of the natural logarithm.
$S_{ij}$ denotes the ground-truth difficulty comparison between task $U_i$ and task $U_j$:

$$S_{ij} = \begin{cases} \;\;\,1, & \text{task } U_i \text{ is more difficult than task } U_j \\ \;\;\,0, & \text{the two tasks are equally difficult} \\ -1, & \text{task } U_j \text{ is more difficult than task } U_i \end{cases}$$

The true relevance probability $\bar{P}_{ij}$ is:

$$\bar{P}_{ij} = \tfrac{1}{2}\left(1 + S_{ij}\right)$$

RankNet thus compares the difficulty of tasks probabilistically: instead of directly judging which of task $U_i$ and task $U_j$ is more difficult, it states that the probability that task $U_i$ is more difficult than task $U_j$ is $P_{ij}$, and it takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
The RankNet network model is trained by a pairwise method.
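A minimal PyTorch sketch of pairwise RankNet training consistent with the formulas above is given below: a shared scoring network f produces s_i and s_j, the predicted probability is the sigmoid of σ(s_i − s_j), and the binary cross-entropy against the target (1 + S_ij)/2 is minimized. The network width, the feature dimension and σ = 1 are illustrative choices, not values specified by the invention.

```python
import torch
import torch.nn as nn

class RankNet(nn.Module):
    def __init__(self, in_dim, hidden=64, sigma=1.0):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.sigma = sigma

    def forward(self, x_i, x_j):
        s_i, s_j = self.f(x_i), self.f(x_j)               # difficulty scores
        return torch.sigmoid(self.sigma * (s_i - s_j))    # predicted probability P_ij

def train_step(model, opt, x_i, x_j, S_ij):
    """One pairwise update; S_ij in {-1, 0, 1}, target probability = (1 + S_ij) / 2."""
    p_ij = model(x_i, x_j)
    target = (1.0 + S_ij) / 2.0
    loss = nn.functional.binary_cross_entropy(p_ij, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage with hypothetical 32-dimensional task feature vectors
model = RankNet(in_dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, opt, torch.randn(8, 32), torch.randn(8, 32), torch.ones(8, 1))
```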
Step three, carrying out self-optimized reinforcement learning training on a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in an algorithm operation container module based on the training courses generated in the step two;
and step four, the feedback evaluation module outputs the scores of the robot for the task execution conditions according to the reinforcement learning training results in the step three, and then feeds the scores of the robot for the task execution conditions back to the algorithm running container module to guide the training of the target recognition algorithm, the robot path planning algorithm and the game confrontation decision algorithm.
The above-described calculation examples of the present invention are merely to describe the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications can be made on the basis of the foregoing description, and it is not intended to exhaust all of the embodiments, and all obvious variations and modifications which fall within the scope of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A robot strategy training system based on curriculum reinforcement learning is characterized in that the system comprises an algorithm operation container module, a training curriculum generation module and a feedback evaluation module, wherein:
the training course generation module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the difficulty of the tasks from easy to difficult through a neural network to obtain courses;
the algorithm operation container module is used for configuring operation containers for a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm so as to carry out self-optimization reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to courses obtained by the training course generation module;
the feedback evaluation module is used for carrying out robot self-organization reinforcement training according to the training error of the robot, outputting the score of the robot for the task execution condition according to the robot self-organization reinforcement training result, and feeding back the score of the robot for the task execution condition to the algorithm operation container module to guide self-optimization reinforcement learning algorithm training.
2. The course reinforcement learning-based robot strategy training system as claimed in claim 1, wherein the target recognition algorithm is a YOLOv3 algorithm, the robot path planning algorithm is an artificial potential field algorithm, and the game countermeasure decision-making algorithm is a PPO algorithm.
3. The training method of the course reinforcement learning-based robot strategy training system according to claim 1, wherein the method specifically comprises the following steps:
firstly, a task generator is utilized to carry out real scene three-dimensional detection reconstruction and task scene intelligent environment autonomous generation;
step two, sequencing the task scenes from easy to difficult by using a task sequencer to obtain training courses;
thirdly, performing self-optimized reinforcement learning training on a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in the algorithm operation container module based on the training courses generated in the second step;
and step four, the feedback evaluation module outputs the scores of the robot for the task execution conditions according to the reinforcement learning training results in the step three, and then feeds the scores of the robot for the task execution conditions back to the algorithm running container module to guide the training of the target recognition algorithm, the robot path planning algorithm and the game confrontation decision algorithm.
4. The training method of the course reinforcement learning-based robot strategy training system according to claim 3, wherein the real scene three-dimensional detection and reconstruction comprises the following specific processes:
step 1, obtaining depth image
Shooting depth images of the same scene under different angles and illumination intensities through a vision camera and a laser radar;
step 2, preprocessing of depth image
Denoising the depth image by Gaussian filtering, and repairing the denoised depth image by a DeepFillv2 algorithm to obtain a recovered depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image in the world coordinate system by using the calculated conversion relation; carrying out distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
step 4, point cloud registration
Matching and superposing distortion compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination intensities to a world coordinate system according to the translation vector and the rotation matrix of each frame by taking the public part of the scene as a reference to obtain a point cloud space after registration;
step 5, fusing point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the point cloud space after registration into cubes by using the grid, namely dividing the point cloud space after registration into voxels; simulating a surface by assigning a distance field value to each voxel;
step 6, surface Generation
And (5) processing the result obtained in the step (5) by adopting an MC algorithm to generate a three-dimensional surface, namely the task scene map.
5. The training method of the course reinforcement learning-based robot strategy training system according to claim 4, wherein the specific process of step 6 is as follows:
the eight neighboring samples of the data field are stored at the eight vertices of a voxel; after a potential value T is selected, for an edge of a boundary voxel, when one endpoint is larger than T and the other endpoint is smaller than T, a vertex of the isosurface exists on that edge; after all twelve edges in the voxel are traversed, the intersections of the twelve edges with the isosurface in the voxel are obtained and the triangular surface patches in the voxel are constructed; these triangular surface patches divide the voxel into two regions, one above and one below the potential value; all the triangular surface patches in the voxel are connected to form the isosurface of the voxel, the isosurfaces of all the voxels are combined to form a complete three-dimensional surface, and the formed complete three-dimensional surface is used as the task scene map.
6. The training method of the course reinforcement learning-based robot strategy training system as claimed in claim 5, wherein the specific process of the autonomous generation of the task scenario intelligent environment is as follows:
step 1) task scene segmentation generation
Constructing an adjacency matrix of voxels around each point in the point cloud by using each voxel obtained after segmentation in the step 5, and weighting the edges of the targets according to the adjacency matrix to complete the separation of overlapped targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of each object;
performing data association on the segmented object 3D point cloud model and a task scene map, judging each segmented object type, and adding the object type into a model base;
dividing non-ground point cloud clusters into point cloud clusters of different categories, namely constructing an integral three-dimensional semantic map;
step 2) automatic generation of task targets
The specific process of the step 2) is as follows:
s1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
s2, according to the size of the two-dimensional map, representing the position of a random point by generating random seeds, and inputting the generated random seeds into a Voronoi-Dirichlet mosaic algorithm;
s3, carrying out Dirony triangulation by the Voronoi-Dirichlet mosaic algorithm based on the input random seeds, and dividing the two-dimensional map into polygons by using the vertical bisector of the side of each triangle obtained by the triangulation, namely obtaining a Voronoi diagram;
s4, randomly selecting a polygon as an obstacle area, and randomly selecting a blank vertex as the position of an obstacle point, a threat point or a reward point model in a model library;
s5, placing a polygonal prism as an obstacle at a position corresponding to an obstacle polygon in the three-dimensional map, and placing an obstacle point, a threat point or a reward point model on the ground at a position corresponding to the selected blank vertex, namely completing the mapping from the two-dimensional map to the three-dimensional task scene map;
s6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, if not, correcting and placing the position of the model which does not meet the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
s7, randomly generating the number of the robots and the initial positions of the robots to complete the generation of random game confrontation tasks;
step 3) verification of robot dynamics model
Setting initial kinematic parameters of a dynamic model of the robot unmanned system, selecting a key frame according to the lowest sampling frequency of each sensor in an IMU, a laser radar and a vision camera carried by the robot unmanned system to integrate IMU data, and obtaining state quantity increment between adjacent key frames;
and predicting error data at the next moment according to the state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between a robot dynamics model and an actual robot system, and performing iterative compensation on the initial kinematic parameters to realize dynamics modeling of the robot and verification of the dynamics parameters.
7. The training method of course reinforcement learning-based robot strategy training system as claimed in claim 6, wherein said task sequencer is implemented by using RankNet network.
8. The training method of the course reinforcement learning-based robot strategy training system according to claim 7, wherein the RankNet network works according to the following principle:
a feature vector $x$ composed of the task scene and the robot dynamics model is input to the RankNet network, which maps the input feature vector to a real-valued score through a scoring function $f:\; x \mapsto s = f(x)$;
the input feature vector corresponding to task $U_i$ is denoted $x_i$, and the input feature vector corresponding to task $U_j$ is denoted $x_j$; the RankNet network performs a forward pass on $x_i$ and $x_j$ respectively, obtaining the difficulty score $s_i = f(x_i)$ corresponding to $x_i$ and the difficulty score $s_j = f(x_j)$ corresponding to $x_j$;
with $U_i \rhd U_j$ denoting that task $U_i$ scores higher than task $U_j$, the predicted relevance probability $P_{ij}$ that task $U_i$ scores higher than task $U_j$ is:

$$P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}}$$

where the parameter $\sigma$ is a constant and $e$ is the base of the natural logarithm;
with $S_{ij}$ denoting the ground-truth difficulty comparison between task $U_i$ and task $U_j$:

$$S_{ij} = \begin{cases} \;\;\,1, & \text{task } U_i \text{ is more difficult than task } U_j \\ \;\;\,0, & \text{the two tasks are equally difficult} \\ -1, & \text{task } U_j \text{ is more difficult than task } U_i \end{cases}$$

the true relevance probability $\bar{P}_{ij}$ is:

$$\bar{P}_{ij} = \tfrac{1}{2}\left(1 + S_{ij}\right)$$

the RankNet compares the difficulty between tasks probabilistically: instead of directly judging which of task $U_i$ and task $U_j$ is more difficult, it states that the probability that task $U_i$ is more difficult than task $U_j$ is $P_{ij}$, and it takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
9. The training method of the robot strategy training system based on curriculum reinforcement learning according to claim 8, wherein the RankNet network model is trained by using a pairwise method.
10. The training method of the course reinforcement learning-based robot strategy training system according to claim 9, wherein the obtained point cloud data is subjected to distortion compensation to obtain distortion-compensated point cloud data; the specific process comprises the following steps:
calculating, for each raw point cloud datum acquired by the laser radar in a frame, the time difference relative to the laser point cloud data at the initial moment of the frame; calculating the motion information of the robot by using an IMU (inertial measurement unit); respectively calculating, according to the time difference and the motion information, the transformation matrix of the laser radar coordinate system at the acquisition moment of each raw laser point relative to the laser radar coordinate system at the initial moment of the frame; and multiplying the transformation matrix by the corresponding coordinates of the raw laser point to obtain the distortion-compensated laser point cloud coordinates.
CN202211227150.7A 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning Pending CN115454096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211227150.7A CN115454096A (en) 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211227150.7A CN115454096A (en) 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning

Publications (1)

Publication Number Publication Date
CN115454096A true CN115454096A (en) 2022-12-09

Family

ID=84309007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211227150.7A Pending CN115454096A (en) 2022-10-09 2022-10-09 Robot strategy training system and training method based on curriculum reinforcement learning

Country Status (1)

Country Link
CN (1) CN115454096A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118182538A (en) * 2024-05-17 2024-06-14 北京理工大学前沿技术研究院 Unprotected left-turn scene decision planning method and system based on course reinforcement learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110327624A (en) * 2019-07-03 2019-10-15 广州多益网络股份有限公司 A kind of game follower method and system based on course intensified learning
CN112633466A (en) * 2020-10-28 2021-04-09 华南理工大学 Memory-keeping course learning method facing difficult exploration environment
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method
CN114290339A (en) * 2022-03-09 2022-04-08 南京大学 Robot reality migration system and method based on reinforcement learning and residual modeling
CN114529010A (en) * 2022-01-28 2022-05-24 广州杰赛科技股份有限公司 Robot autonomous learning method, device, equipment and storage medium
CN114578860A (en) * 2022-03-28 2022-06-03 中国人民解放军国防科技大学 Large-scale unmanned aerial vehicle cluster flight method based on deep reinforcement learning
CN114741886A (en) * 2022-04-18 2022-07-12 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN114952828A (en) * 2022-05-09 2022-08-30 华中科技大学 Mechanical arm motion planning method and system based on deep reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110327624A (en) * 2019-07-03 2019-10-15 广州多益网络股份有限公司 A kind of game follower method and system based on course intensified learning
CN112633466A (en) * 2020-10-28 2021-04-09 华南理工大学 Memory-keeping course learning method facing difficult exploration environment
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method
CN114529010A (en) * 2022-01-28 2022-05-24 广州杰赛科技股份有限公司 Robot autonomous learning method, device, equipment and storage medium
CN114290339A (en) * 2022-03-09 2022-04-08 南京大学 Robot reality migration system and method based on reinforcement learning and residual modeling
CN114578860A (en) * 2022-03-28 2022-06-03 中国人民解放军国防科技大学 Large-scale unmanned aerial vehicle cluster flight method based on deep reinforcement learning
CN114741886A (en) * 2022-04-18 2022-07-12 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN114952828A (en) * 2022-05-09 2022-08-30 华中科技大学 Mechanical arm motion planning method and system based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAOYU TIAN 等: "Task-Extended Utility Tensor Method for Decentralized Multi-Vehicle Mission Planning", 《IEEE》, 31 October 2023 (2023-10-31) *
林一炯: "Research on Efficient Autonomous Learning Methods for Robots Based on Deep Reinforcement Learning" (in Chinese), 《CNKI》, 31 July 2020 (2020-07-31) *
胡欢: "Research on Robot Control Problems Based on Deep Reinforcement Learning" (in Chinese), 《CNKI》, 31 December 2021 (2021-12-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118182538A (en) * 2024-05-17 2024-06-14 北京理工大学前沿技术研究院 Unprotected left-turn scene decision planning method and system based on course reinforcement learning

Similar Documents

Publication Publication Date Title
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
US20230161352A1 (en) Dynamic obstacle avoidance method based on real-time local grid map construction
Paton et al. Bridging the appearance gap: Multi-experience localization for long-term visual teach and repeat
Thorpe et al. Vision and navigation for the Carnegie Mellon Navlab
CN111429514A (en) Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
Tovar et al. Planning exploration strategies for simultaneous localization and mapping
CN111486855A (en) Indoor two-dimensional semantic grid map construction method with object navigation points
Hardouin et al. Next-Best-View planning for surface reconstruction of large-scale 3D environments with multiple UAVs
CN112950645B (en) Image semantic segmentation method based on multitask deep learning
CN105069843A (en) Rapid extraction method for dense point cloud oriented toward city three-dimensional modeling
CN111998862B (en) BNN-based dense binocular SLAM method
Shan et al. LiDAR-based stable navigable region detection for unmanned surface vehicles
Jiao et al. 2-entity random sample consensus for robust visual localization: Framework, methods, and verifications
Li et al. Learning view and target invariant visual servoing for navigation
CN117214904A (en) Intelligent fish identification monitoring method and system based on multi-sensor data
CN115454096A (en) Robot strategy training system and training method based on curriculum reinforcement learning
Short et al. Abio-inspiredalgorithminimage-based pathplanning and localization using visual features and maps
Zhou et al. Place recognition and navigation of outdoor mobile robots based on random Forest learning with a 3D LiDAR
Giordano et al. 3D structure identification from image moments
CN115690343A (en) Robot laser radar scanning and mapping method based on visual following
Yan et al. Mobile robot 3D map building and path planning based on multi–sensor data fusion
Liu et al. Laser 3D tightly coupled mapping method based on visual information
Lee et al. Road following in an unstructured desert environment based on the EM (expectation-maximization) algorithm
Manderson et al. Gaze selection for enhanced visual odometry during navigation
Guo et al. 3D object detection and tracking based on streaming data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination