CN115454096B - Course reinforcement learning-based robot strategy training system and training method - Google Patents
- Publication number
- CN115454096B (application CN202211227150.7A)
- Authority
- CN
- China
- Prior art keywords
- task
- algorithm
- training
- robot
- point cloud
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0246—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0257—Control of position or course in two dimensions specially adapted to land vehicles using a radar
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- Aviation & Aerospace Engineering (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Electromagnetism (AREA)
- Manipulator (AREA)
Abstract
A robot strategy training system and training method based on course reinforcement learning, belonging to the field of autonomous decision-making and control of unmanned systems. The method addresses the problem that existing methods struggle to achieve good decision-making and control performance when training strategies for robots. For the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a course-learning-based training framework for multi-robot joint task decision-making. Considering the gradual progression of task difficulty during training, a parameter auto-generation algorithm and a target auto-generation algorithm are established on the basis of the complex-environment dynamics model. On this basis, a course difficulty evaluation and calibration algorithm is established, and its output is fed back to the self-optimizing reinforcement learning algorithm. The method can be applied to autonomous decision-making and control of unmanned systems.
Description
Technical Field
The invention belongs to the field of autonomous decision making and control of unmanned systems, and particularly relates to a robot strategy training system and a training method based on course reinforcement learning.
Background
Autonomous decision-making for multiple robots has been a popular research topic among scholars in recent years, with wide applications in military, industrial and other fields. Training of autonomous decision strategies is usually achieved through machine learning. Course learning (i.e., curriculum learning) builds on reinforcement learning by borrowing the human idea of learning from easy to difficult: the model first learns easy samples and the sample difficulty is then gradually increased, which yields faster training and better training results. The core of course learning is the autonomous generation of training tasks and the autonomous ordering of task difficulty. Research on autonomous task generation is currently limited; proposed methods either train a progressively generalized problem solver in an unsupervised manner, or take the final task as a template, set a parameter vector, and adjust the parameters to obtain intermediate tasks. However, existing autonomous task generation methods are not effective at generating training tasks for robot strategies. Mainstream approaches to the autonomous ordering of task difficulty either only reorder samples of the final task without changing the task itself, change certain aspects of the MDP to create intermediate tasks with different MDP structures, or, taking into account human assessment of task difficulty, use a human-in-the-loop method for ordering. However, the ordering results obtained by existing autonomous ordering methods are not accurate, and some of these methods are coupled with task generation and are not suitable for robot strategy training.
In summary, existing autonomous task generation methods and autonomous ordering methods perform poorly in strategy training for robots, and it is difficult for them to obtain good decision-making and control results.
Disclosure of Invention
The invention aims to solve the problem that existing methods struggle to achieve good decision-making and control performance in robot strategy training, which stems from the poor effectiveness of existing autonomous task generation methods at producing training tasks for robot strategies and the poor accuracy of the ordering results produced by existing autonomous ordering methods.
The technical solution adopted by the invention to solve the above technical problem is as follows:
According to one aspect of the invention, a course reinforcement learning-based robotic strategy training system comprises an algorithm running container module, a training course generating module and a feedback evaluating module, wherein:
The training course generating module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks from easy to difficult in difficulty through the neural network to obtain courses;
The algorithm running container module is used for configuring a running container for the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm, so as to perform self-optimized reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to the courses obtained by the training course generating module;
The feedback evaluation module is used for carrying out self-organizing reinforcement training of the robot according to the training error of the robot, outputting the score of the robot on the task execution condition according to the self-organizing reinforcement training result of the robot, and feeding back the score of the robot on the task execution condition to the algorithm operation container module to guide the self-optimizing reinforcement learning algorithm training.
Further, the target recognition algorithm is YOLOv algorithm, the robot path planning algorithm is artificial potential field algorithm, and the game countermeasure decision algorithm is PPO algorithm.
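By way of illustration only, the following is a minimal sketch of how the three modules could interact in one training loop; the class and method names (TaskGenerator-style interfaces, train_on, evaluate, etc.) are hypothetical and are not prescribed by the invention.

```python
# Minimal sketch of the curriculum training loop, assuming hypothetical
# module interfaces; the patent does not prescribe concrete APIs.
class CurriculumTrainer:
    def __init__(self, task_generator, task_comparator, algo_container, feedback_module):
        self.task_generator = task_generator    # autonomously generates course task scenes
        self.task_comparator = task_comparator  # ranks tasks from easy to difficult
        self.algo_container = algo_container    # trains recognition / planning / decision algorithms
        self.feedback_module = feedback_module  # scores task execution and feeds the score back

    def run(self, n_tasks, n_rounds):
        tasks = [self.task_generator.generate() for _ in range(n_tasks)]
        curriculum = self.task_comparator.sort_by_difficulty(tasks)   # easy tasks first
        for task in curriculum:
            for _ in range(n_rounds):
                result = self.algo_container.train_on(task)           # self-optimizing RL step
                score = self.feedback_module.evaluate(result)         # weighted multi-objective score
                self.algo_container.update_from_feedback(score)       # score guides further training
```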
Based on another aspect of the invention, a robot strategy training method based on course reinforcement learning specifically comprises the following steps:
Step one, performing three-dimensional detection reconstruction of a real scene and autonomous generation of an intelligent environment of the task scene by using a task generator;
Step two, sequencing task scenes from easy to difficult by using a task sequencer to obtain training courses;
step three, performing self-optimized reinforcement learning training by an algorithm running a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in a container module based on the training courses generated in the step two;
And step four, a feedback evaluation module outputs the scores of the robots on the task execution conditions according to the reinforcement learning training results in the step three, and feeds back the scores of the robots on the task execution conditions to an algorithm operation container module to guide the training of a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm.
Further, the specific process of three-dimensional detection reconstruction of the real scene is as follows:
Step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination by using a vision camera and a laser radar;
Step 2, preprocessing of depth image
Denoising the depth image by using Gaussian filtering, and repairing the denoised depth image by using DEEPFILLV algorithm to obtain a restored depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image under the world coordinate system by using the calculated conversion relation; performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
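As an illustration of the conversion from a preprocessed depth image to world-frame point cloud data, the sketch below applies the pinhole-camera (imaging) model; the intrinsic parameters fx, fy, cx, cy and the extrinsic matrix T_world_cam are assumed to come from calibration and are not specified by the invention.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, T_world_cam=np.eye(4)):
    """Back-project a preprocessed depth image (meters) to world-frame points.

    Illustrative sketch of the pixel -> camera -> world conversion described
    above; intrinsics and the extrinsic matrix come from camera calibration.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx                                  # pixel -> camera frame
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]         # camera -> world frame
    return pts_world[z.reshape(-1) > 0]                    # drop invalid (zero-depth) pixels
```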
Step 4, point cloud registration
The common part of the scene is used as a reference, and the distortion-compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination are matched and overlapped into a world coordinate system according to the translation vector and the rotation matrix of each frame, so that a registered point cloud space is obtained;
step 5, fusion of point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the registered point cloud space into cubes by utilizing the grid, namely dividing the registered point cloud space into voxels; simulating a surface by assigning distance field values to respective voxels;
Step 6, surface generation
And (5) processing the result obtained in the step (5) by adopting an MC algorithm to generate a three-dimensional surface, and obtaining the task scene map.
Further, the specific process of the step 6 is as follows:
Eight adjacent data values in the data field are stored at the eight vertices of a voxel. A potential value T is selected; for the two endpoints of an edge of a boundary voxel, when one endpoint value is larger than T and the other is smaller than T, a vertex of the isosurface lies on that edge. All twelve edges of the voxel are traversed to obtain the intersection points of the voxel's edges with the isosurface, and triangular patches are constructed inside the voxel. All triangular patches in the voxel divide it into two regions, inside and outside the isosurface. Connecting all triangular patches in the voxel forms the voxel's isosurface, and combining the isosurfaces of all voxels forms a complete three-dimensional surface, which is taken as the task scene map.
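The iso-surface extraction of step 6 corresponds to the classic marching-cubes procedure; as a hedged illustration, an equivalent result can be obtained with an off-the-shelf implementation such as scikit-image (the voxel size and iso-value below are illustrative assumptions, not values fixed by the invention).

```python
import numpy as np
from skimage import measure

def extract_surface(sdf_volume, iso_value=0.0, voxel_size=0.05):
    """Extract a triangulated iso-surface from the fused voxel distance field."""
    # marching_cubes returns vertices, triangle faces, normals and values
    verts, faces, normals, _ = measure.marching_cubes(sdf_volume, level=iso_value)
    return verts * voxel_size, faces          # vertices scaled back to metric units
```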
Further, the specific process of autonomous generation of the task scene intelligent environment is as follows:
Step 1) task scene segmentation generation
Constructing an adjacency matrix of the voxels around each point in the point cloud by utilizing the voxels obtained after the division in step 5, weighting the edges between targets according to the adjacency matrix, and completing the separation of overlapping targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of all objects;
Carrying out data association on the segmented object 3D point cloud model and the task scene map, judging each segmented target category, and adding the target category into a model library;
Clustering the non-ground point cloud into point cloud clusters of different categories, namely constructing an overall three-dimensional semantic map;
Step 2) task goal automation generation
The specific process of the step 2) is as follows:
S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
S2, representing the positions of random points by generating random seeds according to the size of the two-dimensional map, and inputting the generated random seeds into the Voronoi-Dirichlet tessellation algorithm;
S3, performing Delaunay triangulation on the input random seeds using the Voronoi-Dirichlet tessellation algorithm, and dividing the two-dimensional map into polygons by the perpendicular bisectors of the sides of the triangles obtained by the triangulation, to obtain a Voronoi diagram (an illustrative sketch of this partition is given after this list);
s4, randomly selecting a polygon as an obstacle area, and randomly selecting blank vertexes as positions of obstacle points, threat points or rewarding point models in a model library;
S5, in the three-dimensional map, a polygonal prism is placed at a position corresponding to the obstacle polygon to serve as an obstacle, and an obstacle point, a threat point or a rewarding point model is placed on the ground at a position corresponding to the selected blank vertex, so that mapping from the two-dimensional map to the three-dimensional task scene map is completed;
Step S6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, and if not, correcting the position of the model which is placed without meeting the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
S7, randomly generating the number of robots and the initial positions of the robots, and completing the generation of the random game countermeasure task;
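A minimal sketch of the random-seed Voronoi partition used in steps S2-S4 is given below, using scipy's Voronoi implementation; the seed count and obstacle ratio are illustrative parameters, not values fixed by the invention.

```python
import numpy as np
from scipy.spatial import Voronoi

def generate_voronoi_tasks(map_w, map_h, n_seeds=40, obstacle_ratio=0.2, rng=None):
    rng = rng or np.random.default_rng()
    seeds = rng.uniform([0, 0], [map_w, map_h], size=(n_seeds, 2))         # S2: random seeds
    vor = Voronoi(seeds)                                                   # S3: Voronoi diagram
    n_obstacles = int(obstacle_ratio * n_seeds)
    obstacle_cells = rng.choice(n_seeds, size=n_obstacles, replace=False)  # S4: obstacle regions
    free_vertices = [v for v in vor.vertices
                     if 0 <= v[0] <= map_w and 0 <= v[1] <= map_h]         # candidate obstacle/threat/reward points
    return vor, obstacle_cells, free_vertices
```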
step 3) robot dynamics model verification
Setting initial kinematic parameters of a dynamic model of the unmanned robot system, selecting key frames to integrate IMU data according to the lowest sampling frequency of each sensor in the IMU, the laser radar and the vision camera carried by the unmanned robot system, and obtaining state quantity increment between adjacent key frames;
And predicting error data at the next moment according to state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between the robot dynamics model and an actual robot system, and realizing dynamics modeling and verification of the dynamics parameters of the robot through iterative compensation of initial kinematics parameters.
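The following is a simplified, illustrative sketch of accumulating IMU measurements between two keyframes into a state-quantity increment; a full preintegration scheme would additionally handle gravity, sensor biases and noise, which are omitted here.

```python
import numpy as np

def preintegrate_imu(accels, gyros, dts):
    """accels, gyros: (N, 3) body-frame measurements; dts: (N,) sample intervals."""
    delta_R = np.eye(3)          # rotation increment between the two keyframes
    delta_v = np.zeros(3)        # velocity increment
    delta_p = np.zeros(3)        # position increment
    for a, w, dt in zip(accels, gyros, dts):
        delta_p += delta_v * dt + 0.5 * (delta_R @ a) * dt**2
        delta_v += (delta_R @ a) * dt
        # first-order update of the rotation increment from the gyro rate
        wx = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
        delta_R = delta_R @ (np.eye(3) + wx * dt)
    return delta_R, delta_v, delta_p
```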
Further, the task sequencer is implemented using a RankNet network.
Further, the operating principle of the RankNet network is as follows:
A feature vector x composed of the task scene and the robot dynamics model is input into the RankNet network, which maps the input feature vector to a real number s = f(x).
The input feature vector corresponding to task U_i is denoted x_i and the input feature vector corresponding to task U_j is denoted x_j; the RankNet network performs forward computation on x_i and x_j separately to obtain a difficulty score s_i = f(x_i) corresponding to x_i and a difficulty score s_j = f(x_j) corresponding to x_j;
U_i ⊳ U_j is used to indicate that task U_i is scored higher (more difficult) than task U_j; the predicted relevance probability P_ij that task U_i scores higher than task U_j is:
P_ij = 1 / (1 + e^(-σ(s_i - s_j)))
where the parameter σ is a constant and e is the base of the natural logarithm;
The difficulty comparison of task U_i and task U_j is represented by S_ij, where S_ij = 1 if task U_i is more difficult than task U_j, S_ij = 0 if the two tasks are equally difficult, and S_ij = -1 if task U_j is more difficult than task U_i;
The true relevance probability P̄_ij is:
P̄_ij = (1 + S_ij) / 2
The RankNet network compares task difficulty probabilistically: rather than directly judging whether task U_i is more difficult than task U_j, it states that task U_i is more difficult than task U_j with probability P_ij, and takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
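A minimal RankNet sketch consistent with the formulas above is given below: f maps a task feature vector to a scalar difficulty score, and pairs are trained with the cross-entropy between the predicted and true relevance probabilities. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RankNet(nn.Module):
    def __init__(self, in_dim, hidden=64, sigma=1.0):
        super().__init__()
        # f: task feature vector -> scalar difficulty score
        self.f = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.sigma = sigma

    def pair_loss(self, x_i, x_j, S_ij):
        """x_i, x_j: (N, in_dim) feature batches; S_ij: (N, 1) float tensor in {-1, 0, 1}."""
        s_i, s_j = self.f(x_i), self.f(x_j)                  # difficulty scores
        P_ij = torch.sigmoid(self.sigma * (s_i - s_j))       # predicted probability U_i harder than U_j
        P_bar = 0.5 * (1.0 + S_ij)                           # true relevance probability
        return nn.functional.binary_cross_entropy(P_ij, P_bar)
```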
Further, the RankNet network model is trained using a pairwise (pair-based) training method.
Further, the obtained point cloud data is subjected to distortion compensation, and point cloud data after distortion compensation is obtained; the specific process is as follows:
For original laser point cloud data in a frame acquired by using a laser radar, calculating the time difference of each acquired original point cloud data relative to the laser point cloud data at the initial moment in the frame, calculating the motion information of the robot by using an IMU, respectively calculating a transformation matrix of a laser radar coordinate system at the acquisition moment of each original laser point cloud relative to a laser radar coordinate system at the initial moment of the frame according to the time difference and the motion information, and multiplying the transformation matrix by the coordinates of the corresponding original laser point cloud to obtain the laser point cloud coordinates after distortion compensation.
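An illustrative sketch of the per-point motion-distortion compensation is given below; pose_at is a hypothetical helper that returns the 4x4 transform of the lidar at a given time offset relative to the start of the frame, computed from the IMU-derived motion.

```python
import numpy as np

def compensate_distortion(points, timestamps, frame_start_time, pose_at):
    """Re-express each raw lidar point in the lidar frame at the start of the scan."""
    compensated = []
    for p, t in zip(points, timestamps):
        dt = t - frame_start_time              # time offset of this point inside the frame
        T = pose_at(dt)                        # lidar pose at capture time w.r.t. frame start
        p_h = np.append(p, 1.0)                # homogeneous coordinates
        compensated.append((T @ p_h)[:3])      # transform the point back to the frame-start pose
    return np.asarray(compensated)
```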
The beneficial effects of the invention are as follows:
For the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a course-learning-based training framework for multi-robot joint task decision-making. Considering the gradual progression of task difficulty during training, a parameter auto-generation algorithm and a target auto-generation algorithm are established on the basis of the complex-environment dynamics model. On this basis, a course difficulty evaluation and calibration algorithm is established, and its output is fed back to the self-optimizing reinforcement learning algorithm. At the same time, the invention saves a great deal of labor cost, enables autonomous training of a multi-robot system in a complex environment, and solves the problem that existing methods struggle to achieve good decision-making and control performance when training robot strategies.
Drawings
FIG. 1 is an algorithmic framework diagram for course reinforcement learning;
FIG. 2 is a workflow diagram of a workout generation module;
FIG. 3 is a schematic diagram of a scene autonomous detection system comprised of a vision camera, lidar and an IMU;
FIG. 4 is a schematic diagram of Voronoi-Dirichlet tessellation segmentation and model addition.
Detailed Description
First embodiment: this embodiment is described with reference to FIG. 1. The robot strategy training system based on course reinforcement learning according to this embodiment comprises an algorithm running container module, a training course generating module and a feedback evaluation module, wherein:
The training course generating module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks from easy to difficult in difficulty through the neural network, obtaining courses and providing training scenes for the reinforcement learning algorithm;
The algorithm running container module is used for configuring a running container for the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm, so as to perform self-optimized reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to the courses obtained by the training course generating module;
The feedback evaluation module is used for carrying out self-organizing reinforcement training of the robot according to the training error of the robot, outputting the score of the robot on the task execution condition according to the self-organizing reinforcement training result of the robot, and feeding back the score of the robot on the task execution condition to the algorithm operation container module to guide the self-optimizing reinforcement learning algorithm training.
The robot's self-organizing reinforcement training is the training of a method for selecting evaluation-index weights: under different scenes and different tasks, several task objectives are evaluated separately, and a weighted sum of these evaluations is finally taken as the score of the task execution; the self-organizing reinforcement training is the training of these weights.
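As a hedged illustration of the weighted-sum scoring performed by the feedback evaluation module, the sketch below combines several per-objective evaluations with weights; the objective names and weight values are examples, not terms defined by the invention.

```python
def task_execution_score(metrics, weights):
    """metrics, weights: dicts keyed by objective name; weights are assumed to sum to 1."""
    return sum(weights[k] * metrics[k] for k in metrics)

# Example usage with illustrative objectives and weights
score = task_execution_score(
    metrics={"goal_reached": 1.0, "collision_free": 0.8, "time_efficiency": 0.6},
    weights={"goal_reached": 0.5, "collision_free": 0.3, "time_efficiency": 0.2},
)
```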
The course learning method provided by the invention can learn strategies that use only local information (i.e., each robot's own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular structure for the communication method between robots, and is applicable not only to cooperative interaction but also to competitive or mixed interaction environments (i.e., environments mixing cooperation and competition) involving physical and informational behaviors.
The second embodiment: this embodiment differs from the first embodiment in that the target recognition algorithm is the YOLOv algorithm, the robot path planning algorithm is the artificial potential field algorithm, and the game countermeasure decision algorithm is the PPO algorithm.
The third embodiment of the present invention relates to a robot strategy training method based on course reinforcement learning, the method specifically comprising the following steps:
step one, performing three-dimensional detection reconstruction of a real scene and autonomous generation of an intelligent environment of a complex task scene by using a task generator;
this embodiment will be described with reference to fig. 2. The specific process of the three-dimensional detection reconstruction of the real scene is as follows:
Step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination by using a vision camera and a laser radar;
Step 2, preprocessing of depth image
Denoising the depth image by using Gaussian filtering, and repairing the denoised depth image by using DEEPFILLV algorithm to obtain a restored depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image under the world coordinate system by using the calculated conversion relation; performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation; the specific process is as follows:
for original laser point cloud data in a frame acquired by using a laser radar, calculating the time difference of each acquired original point cloud data relative to the laser point cloud data at the initial moment in the frame, calculating the motion information of the robot by using an IMU, respectively calculating a transformation matrix of a laser radar coordinate system at the acquisition moment of each original laser point cloud relative to a laser radar coordinate system at the initial moment of the frame according to the time difference and the motion information, and multiplying the transformation matrix by the corresponding original laser point cloud coordinates to obtain the laser point cloud coordinates after distortion compensation;
Step 4, point cloud registration
The common part of the scene is used as a reference, and the distortion-compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination are matched and overlapped into a world coordinate system according to the translation vector and the rotation matrix of each frame, so that a registered point cloud space is obtained;
Throughout the map construction and registration process, each single-frame image is transformed by its translation vector and rotation matrix and then placed into the world coordinate system, thereby realizing map construction.
The translation vector and rotation matrix of each frame are processed by the correction module. As shown in fig. 3, the correction module corrects the global error using a relatively independent loop detection module. The correction module detects the similarity of the images to judge whether to loop, describes the image similarity by using a word bag model, calculates the similarity of key frames by using a TF-IDF algorithm, then the loop detection module acquires loop candidate frames according to the similarity of the key frames, judges the continuity of the loop candidate frames, and finally corrects accumulated scale errors and rotation translation errors according to the similarity transformation relation of the image point cloud space, fuses map repeated information, and realizes closed loop fusion.
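A simplified sketch of the bag-of-words / TF-IDF keyframe similarity used by the loop-detection step is shown below; in a real system the "words" are IDs from a trained visual vocabulary tree, which is abstracted away here.

```python
import numpy as np

def tfidf_similarity(words_a, words_b, idf):
    """words_*: lists of visual-word IDs for two keyframes; idf: dict word -> inverse document frequency."""
    def tfidf_vec(words):
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0.0) + 1.0
        n = float(len(words))
        return {w: (c / n) * idf.get(w, 0.0) for w, c in counts.items()}

    va, vb = tfidf_vec(words_a), tfidf_vec(words_b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)                 # cosine similarity of TF-IDF vectors
    norm = np.sqrt(sum(v * v for v in va.values())) * np.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm > 0 else 0.0
```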
Step 5, fusion of point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as the origin, and dividing the registered point cloud space into small cubes using the grid, i.e., dividing the registered point cloud space into voxels; the surface is approximated by assigning a signed distance field (SDF) value to each voxel;
Step 6, surface generation
Processing the result obtained in the step 5 by adopting an MC algorithm to generate a three-dimensional surface, and obtaining a task scene map;
the specific process of the step6 is as follows:
Eight adjacent data values in the data field are stored at the eight vertices of a voxel. A suitable constant is selected as the potential value T; for the two endpoints of an edge of a boundary voxel, when one endpoint value is larger than T and the other is smaller than T, a vertex of the isosurface lies on that edge. All twelve edges of the voxel are traversed to obtain the intersection points of the voxel's edges with the isosurface, and triangular patches are constructed inside the voxel. All triangular patches in the voxel divide it into two regions, inside and outside the isosurface. Connecting all triangular patches in the voxel forms the voxel's isosurface, and combining the isosurfaces of all voxels forms a complete three-dimensional surface, which is taken as the task scene map;
the specific process of the intelligent environment autonomous generation of the complex task scene is as follows:
the method comprises the steps of automatically generating a complex task scene intelligent environment, firstly obtaining a three-dimensional semantic map and a corresponding target model through segmentation and identification of a task scene map, constructing a model library, then automatically generating a series of obstacle, threat and rewarding elements in the task scene map according to a certain rule to obtain a plurality of tasks, and finally realizing the construction of a robot dynamics model by utilizing each sensor carried on a robot.
Step 1) task scene segmentation generation
Constructing an adjacency matrix of the voxels around each point in the point cloud by utilizing the voxels obtained after the division in step 5, weighting the edges between targets according to the adjacency matrix, and completing the separation of overlapping targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of all objects;
Carrying out data association on the segmented object 3D point cloud model and the task scene map, judging each segmented target category, and adding the target category into a model library;
Finally, laser 3D point cloud object segmentation is performed, and the non-ground point cloud is clustered into point cloud clusters of different categories, i.e., an overall three-dimensional semantic map is constructed;
Step 2) task goal automation generation
The core of automatic task-target generation is procedural generation: the models in the model library are classified into several different abstraction layers, such as obstacles, rewards and threats, and the generation rules of each abstraction layer are formulated according to the dependency relationships among the different abstraction layers.
The specific process of the step 2) is as follows:
S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
S2, representing the positions of a series of random points by generating a certain number of random seeds according to the size of the two-dimensional map, and inputting the generated random seeds into the Voronoi-Dirichlet tessellation algorithm;
S3, performing Delaunay triangulation on the input random seeds using the Voronoi-Dirichlet tessellation algorithm, and dividing the two-dimensional map into polygons by the perpendicular bisectors of the sides of the triangles obtained by the triangulation, to obtain a Voronoi diagram;
s4, randomly selecting a polygon with a certain proportion as an obstacle area, and randomly selecting blank vertexes as positions of obstacle points, threat points or rewarding point models in a model library;
Step S5, in the three-dimensional map, a polygonal prism is placed at a position corresponding to the obstacle polygon to serve as an obstacle, and an obstacle point, a threat point or a rewarding point model is placed on the ground at a position corresponding to the selected blank vertex, so that the mapping from the two-dimensional map to the three-dimensional task scene map is completed, and the map is shown in fig. 4;
Step S6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, and if not, correcting the position of the model which is placed without meeting the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
The constraint conditions of the scene are as follows: the model and the model, the model and the barrier are not covered or overlapped in the three-dimensional task scene map after the model is added, and a closed area is formed;
and S7, randomly generating the number of robots and the initial positions of the robots, and completing the generation of the random game countermeasure task.
Step 3) robot dynamics model verification
Firstly, a suitable simplified dynamics model of a typical unmanned aerial vehicle system is selected, the initial kinematic parameters of the dynamics model are set, keyframes are selected according to the lowest sampling frequency among the IMU, laser radar and vision camera carried by the system, and the IMU data are integrated to obtain the state-quantity increments between adjacent keyframes;
And predicting error data at the next moment according to state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between the robot dynamics model and an actual robot system, and realizing accurate modeling of dynamics of the robot and accurate verification of the dynamics parameters through iterative compensation of initial kinematics parameters.
Step two, sequencing task scenes from easy to difficult by using a task sequencer to obtain training courses;
The task sequencer is implemented using a RankNet network.
The robot model and the task scene are taken as the inputs of the neural network, and the difficulty score of the task as its output; the sample set is labeled with a weighted sum of the course difficulty score estimated by the A* algorithm and the course difficulty score estimated by human intuition. The courses are obtained from the task scenes ordered from easy to difficult. RankNet can be constructed with the following framework to realize the ordering of task difficulty:
the operating principle of the RankNet network is as follows:
A feature vector x composed of the task scene and the robot dynamics model is input into the RankNet network, which maps the input feature vector to a real number s = f(x).
The input feature vector corresponding to task U_i is denoted x_i and the input feature vector corresponding to task U_j is denoted x_j; the RankNet network performs forward computation on x_i and x_j separately to obtain a difficulty score s_i = f(x_i) corresponding to x_i and a difficulty score s_j = f(x_j) corresponding to x_j;
U_i ⊳ U_j is used to indicate that task U_i is scored higher (more difficult) than task U_j; the predicted relevance probability P_ij that task U_i scores higher than task U_j is:
P_ij = 1 / (1 + e^(-σ(s_i - s_j)))
This probability is a sigmoid function whose shape is determined by the parameter σ; σ is a constant whose value is chosen according to experience and the actual situation, and e is the base of the natural logarithm;
The difficulty comparison of task U_i and task U_j is represented by S_ij, where S_ij = 1 if task U_i is more difficult than task U_j, S_ij = 0 if the two tasks are equally difficult, and S_ij = -1 if task U_j is more difficult than task U_i;
The true relevance probability P̄_ij is:
P̄_ij = (1 + S_ij) / 2
The RankNet network compares task difficulty probabilistically: rather than directly judging whether task U_i is more difficult than task U_j, it states that task U_i is more difficult than task U_j with probability P_ij, and takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
The RankNet network model is trained using a pairwise (pair-based) training method.
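As a hedged illustration of how the difficulty labels described above could be formed (a weighted sum of the course difficulty estimated by the A* algorithm and the difficulty estimated by human intuition), consider the sketch below; the weight value is an illustrative assumption, not a value fixed by the invention.

```python
def difficulty_label(astar_difficulty, human_score, w_astar=0.6):
    """Combine an A*-based difficulty estimate (e.g., normalized planned path length
    in the task scene) with a human intuition score into one training label."""
    return w_astar * astar_difficulty + (1.0 - w_astar) * human_score
```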
Step three, performing self-optimized reinforcement learning training by an algorithm running a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in a container module based on the training courses generated in the step two;
And step four, a feedback evaluation module outputs the scores of the robots on the task execution conditions according to the reinforcement learning training results in the step three, and feeds back the scores of the robots on the task execution conditions to an algorithm operation container module to guide the training of a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm.
The above examples of the present invention are only for describing the calculation model and calculation flow of the present invention in detail, and are not limiting of the embodiments of the present invention. Other variations and modifications of the above description will be apparent to those of ordinary skill in the art, and it is not intended to be exhaustive of all embodiments, all of which are within the scope of the invention.
Claims (7)
1. The training method of the robot strategy training system based on course reinforcement learning comprises an algorithm running container module, a training course generating module and a feedback evaluating module, wherein: the training course generating module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks from easy to difficult in difficulty through the neural network to obtain courses; the algorithm running container module is used for configuring a running container for the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm, so as to perform self-optimized reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to the courses obtained by the training course generating module; the feedback evaluation module is used for carrying out self-organizing reinforcement training of the robot according to the training error of the robot, outputting the score of the robot on the task execution condition according to the self-organizing reinforcement training result of the robot, and feeding back the score of the robot on the task execution condition to the algorithm operation container module to guide the self-optimizing reinforcement learning algorithm training; the target recognition algorithm is YOLOv algorithm, the robot path planning algorithm is artificial potential field algorithm, and the game countermeasure decision algorithm is PPO algorithm; the method is characterized by comprising the following steps:
Step one, performing three-dimensional detection reconstruction of a real scene and autonomous generation of an intelligent environment of the task scene by using a task generator;
The specific process of the three-dimensional detection reconstruction of the real scene is as follows:
Step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination by using a vision camera and a laser radar;
Step 2, preprocessing of depth image
Denoising the depth image by using Gaussian filtering, and repairing the denoised depth image by using DEEPFILLV algorithm to obtain a restored depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image under the world coordinate system by using the calculated conversion relation; performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
Step 4, point cloud registration
The common part of the scene is used as a reference, and the distortion-compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination are matched and overlapped into a world coordinate system according to the translation vector and the rotation matrix of each frame, so that a registered point cloud space is obtained;
step 5, fusion of point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the registered point cloud space into cubes by utilizing the grid, namely dividing the registered point cloud space into voxels; simulating a surface by assigning distance field values to respective voxels;
Step 6, surface generation
Processing the result obtained in the step 5 by adopting an MC algorithm to generate a three-dimensional surface, and obtaining a task scene map;
Step two, sequencing task scenes from easy to difficult by using a task sequencer to obtain training courses;
step three, performing self-optimized reinforcement learning training by an algorithm running a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in a container module based on the training courses generated in the step two;
And step four, a feedback evaluation module outputs the scores of the robots on the task execution conditions according to the reinforcement learning training results in the step three, and feeds back the scores of the robots on the task execution conditions to an algorithm operation container module to guide the training of a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm.
2. The training method of the robot strategy training system based on course reinforcement learning according to claim 1, wherein the specific process of the step 6 is as follows:
and respectively storing eight adjacent data in the data field at eight vertexes of a voxel, selecting potential values T for two endpoints of one edge on a boundary voxel, when one endpoint is larger than T and the other endpoint is smaller than T, then, one vertex of an isosurface exists on the edge, traversing all twelve edges in the voxel to obtain the intersection point of twelve edges in the voxel and the isosurface, constructing triangular patches in the voxel, dividing the voxel into two areas which are respectively arranged in the isosurface and outside the isosurface by all triangular patches in the voxel, connecting all triangular patches in the voxel to form the isosurface of the voxel, combining the isosurfaces of all voxels to form a complete three-dimensional surface, and taking the complete three-dimensional surface as a task scene map.
3. The training method of the robot strategy training system based on course reinforcement learning according to claim 2, wherein the specific process of autonomous generation of the task scene intelligent environment is as follows:
Step 1) task scene segmentation generation
Constructing an adjacency matrix of the voxels around each point in the point cloud by utilizing the voxels obtained after the division in step 5, weighting the edges between targets according to the adjacency matrix, and completing the separation of overlapping targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of all objects;
Carrying out data association on the segmented object 3D point cloud model and the task scene map, judging each segmented target category, and adding the target category into a model library;
Clustering the non-ground point cloud into point cloud clusters of different categories, namely constructing an overall three-dimensional semantic map;
Step 2) task goal automation generation
The specific process of the step 2) is as follows:
S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
S2, representing the positions of random points by generating random seeds according to the size of the two-dimensional map, and inputting the generated random seeds into the Voronoi-Dirichlet tessellation algorithm;
S3, performing Delaunay triangulation on the input random seeds using the Voronoi-Dirichlet tessellation algorithm, and dividing the two-dimensional map into polygons by the perpendicular bisectors of the sides of the triangles obtained by the triangulation, to obtain a Voronoi diagram;
s4, randomly selecting a polygon as an obstacle area, and randomly selecting blank vertexes as positions of obstacle points, threat points or rewarding point models in a model library;
S5, in the three-dimensional map, a polygonal prism is placed at a position corresponding to the obstacle polygon to serve as an obstacle, and an obstacle point, a threat point or a rewarding point model is placed on the ground at a position corresponding to the selected blank vertex, so that mapping from the two-dimensional map to the three-dimensional task scene map is completed;
Step S6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, and if not, correcting the position of the model which is placed without meeting the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
S7, randomly generating the number of robots and the initial positions of the robots, and completing the generation of the random game countermeasure task;
step 3) robot dynamics model verification
Setting initial kinematic parameters of a dynamic model of the unmanned robot system, selecting key frames to integrate IMU data according to the lowest sampling frequency of each sensor in the IMU, the laser radar and the vision camera carried by the unmanned robot system, and obtaining state quantity increment between adjacent key frames;
And predicting error data at the next moment according to state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between the robot dynamics model and an actual robot system, and realizing dynamics modeling and verification of the dynamics parameters of the robot through iterative compensation of initial kinematics parameters.
4. The training method of the robot strategy training system based on course reinforcement learning of claim 3, wherein said task sequencer is implemented using a RankNet network.
5. The training method of the robot strategy training system based on course reinforcement learning according to claim 4, wherein the operating principle of the rank net network is as follows:
A feature vector x composed of the task scene and the robot dynamics model is input into the RankNet network, which maps the input feature vector to a real number s = f(x).
The input feature vector corresponding to task U_i is denoted x_i and the input feature vector corresponding to task U_j is denoted x_j; the RankNet network performs forward computation on x_i and x_j separately to obtain a difficulty score s_i = f(x_i) corresponding to x_i and a difficulty score s_j = f(x_j) corresponding to x_j;
U_i ⊳ U_j is used to indicate that task U_i is scored higher (more difficult) than task U_j; the predicted relevance probability P_ij that task U_i scores higher than task U_j is:
P_ij = 1 / (1 + e^(-σ(s_i - s_j)))
where the parameter σ is a constant and e is the base of the natural logarithm;
The difficulty comparison of task U_i and task U_j is represented by S_ij, where S_ij = 1 if task U_i is more difficult than task U_j, S_ij = 0 if the two tasks are equally difficult, and S_ij = -1 if task U_j is more difficult than task U_i;
The true relevance probability P̄_ij is:
P̄_ij = (1 + S_ij) / 2
The RankNet network compares task difficulty probabilistically: rather than directly judging whether task U_i is more difficult than task U_j, it states that task U_i is more difficult than task U_j with probability P_ij, and takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
6. The training method of the robot strategy training system based on course reinforcement learning of claim 5, wherein the RankNet network model is trained using a pairwise (pair-based) method.
7. The training method of the robot strategy training system based on course reinforcement learning according to claim 6, wherein the obtained point cloud data is subjected to distortion compensation to obtain the point cloud data after the distortion compensation; the specific process is as follows:
For original laser point cloud data in a frame acquired by using a laser radar, calculating the time difference of each acquired original point cloud data relative to the laser point cloud data at the initial moment in the frame, calculating the motion information of the robot by using an IMU, respectively calculating a transformation matrix of a laser radar coordinate system at the acquisition moment of each original laser point cloud relative to a laser radar coordinate system at the initial moment of the frame according to the time difference and the motion information, and multiplying the transformation matrix by the coordinates of the corresponding original laser point cloud to obtain the laser point cloud coordinates after distortion compensation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211227150.7A CN115454096B (en) | 2022-10-09 | 2022-10-09 | Course reinforcement learning-based robot strategy training system and training method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211227150.7A CN115454096B (en) | 2022-10-09 | 2022-10-09 | Course reinforcement learning-based robot strategy training system and training method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115454096A (en) | 2022-12-09
CN115454096B true CN115454096B (en) | 2024-07-19 |
Family
ID=84309007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211227150.7A Active CN115454096B (en) | 2022-10-09 | 2022-10-09 | Course reinforcement learning-based robot strategy training system and training method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115454096B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118182538B (en) * | 2024-05-17 | 2024-08-13 | 北京理工大学前沿技术研究院 | Unprotected left-turn scene decision planning method and system based on course reinforcement learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | A kind of game follower method and system based on course intensified learning |
CN112633466A (en) * | 2020-10-28 | 2021-04-09 | 华南理工大学 | Memory-keeping course learning method facing difficult exploration environment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113110592B (en) * | 2021-04-23 | 2022-09-23 | 南京大学 | Unmanned aerial vehicle obstacle avoidance and path planning method |
CN114529010A (en) * | 2022-01-28 | 2022-05-24 | 广州杰赛科技股份有限公司 | Robot autonomous learning method, device, equipment and storage medium |
CN114290339B (en) * | 2022-03-09 | 2022-06-21 | 南京大学 | Robot realistic migration method based on reinforcement learning and residual modeling |
CN114578860B (en) * | 2022-03-28 | 2024-10-18 | 中国人民解放军国防科技大学 | Large-scale unmanned aerial vehicle cluster flight method based on deep reinforcement learning |
CN114741886B (en) * | 2022-04-18 | 2022-11-22 | 中国人民解放军军事科学院战略评估咨询中心 | Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation |
CN114952828B (en) * | 2022-05-09 | 2024-06-14 | 华中科技大学 | Mechanical arm motion planning method and system based on deep reinforcement learning |
-
2022
- 2022-10-09 CN CN202211227150.7A patent/CN115454096B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | A kind of game follower method and system based on course intensified learning |
CN112633466A (en) * | 2020-10-28 | 2021-04-09 | 华南理工大学 | Memory-keeping course learning method facing difficult exploration environment |
Non-Patent Citations (1)
Title |
---|
Task-Extended Utility Tensor Method for Decentralized Multi-Vehicle Mission Planning; Haoyu Tian et al.; IEEE; 2023-10-31; full text *
Also Published As
Publication number | Publication date |
---|---|
CN115454096A (en) | 2022-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114384920B (en) | Dynamic obstacle avoidance method based on real-time construction of local grid map | |
CN111486855B (en) | Indoor two-dimensional semantic grid map construction method with object navigation points | |
CN110956651B (en) | Terrain semantic perception method based on fusion of vision and vibrotactile sense | |
EP3405845B1 (en) | Object-focused active three-dimensional reconstruction | |
Tovar et al. | Planning exploration strategies for simultaneous localization and mapping | |
CN111429514A (en) | Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds | |
Hardouin et al. | Next-Best-View planning for surface reconstruction of large-scale 3D environments with multiple UAVs | |
CN109800689A (en) | A kind of method for tracking target based on space-time characteristic fusion study | |
CN112578673B (en) | Perception decision and tracking control method for multi-sensor fusion of formula-free racing car | |
CN109829476B (en) | End-to-end three-dimensional object detection method based on YOLO | |
CN111709988A (en) | Method and device for determining characteristic information of object, electronic equipment and storage medium | |
CN117214904A (en) | Intelligent fish identification monitoring method and system based on multi-sensor data | |
CN115454096B (en) | Course reinforcement learning-based robot strategy training system and training method | |
CN117769724A (en) | Synthetic dataset creation using deep-learned object detection and classification | |
Short et al. | Abio-inspiredalgorithminimage-based pathplanning and localization using visual features and maps | |
WO2023242223A1 (en) | Motion prediction for mobile agents | |
CN116109047A (en) | Intelligent scheduling method based on three-dimensional intelligent detection | |
Cardoso et al. | A large-scale mapping method based on deep neural networks applied to self-driving car localization | |
CN118314180A (en) | Point cloud matching method and system based on derivative-free optimization | |
Tallavajhula | Lidar Simulation for Robotic Application Development: Modeling and Evaluation. | |
CN115690343A (en) | Robot laser radar scanning and mapping method based on visual following | |
CN115345281A (en) | Depth reinforcement learning acceleration training method for unmanned aerial vehicle image navigation | |
Lee et al. | Road following in an unstructured desert environment based on the EM (expectation-maximization) algorithm | |
Yan et al. | Mobile robot 3D map building and path planning based on multi–sensor data fusion | |
Mosalam et al. | Artificial Intelligence in Vision-Based Structural Health Monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |