CN115454096B - Course reinforcement learning-based robot strategy training system and training method - Google Patents
- Publication number
- CN115454096B (application CN202211227150.7A)
- Authority
- CN
- China
- Prior art keywords
- task
- algorithm
- training
- robot
- point cloud
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0246—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0257—Control of position or course in two dimensions specially adapted to land vehicles using a radar
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- Aviation & Aerospace Engineering (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Electromagnetism (AREA)
- Manipulator (AREA)
Abstract
A robot strategy training system and training method based on course reinforcement learning, belonging to the field of autonomous decision-making and control of unmanned systems. The method addresses the problem that existing methods struggle to achieve good decision-making and control performance when training strategies for robots. For the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a course-learning-based training framework for multi-robot joint task decision-making. Considering the gradual progression of task difficulty during training, a parameter auto-generation algorithm and a target auto-generation algorithm are established on the basis of the complex-environment dynamics model. On this basis, a course difficulty evaluation and calibration algorithm is established, and its output is fed back to the self-optimizing reinforcement learning algorithm. The method can be applied to autonomous decision-making and control of unmanned systems.
Description
Technical Field
The invention belongs to the field of autonomous decision making and control of unmanned systems, and particularly relates to a robot strategy training system and a training method based on course reinforcement learning.
Background
Autonomous decision-making for multiple robots has been a popular research topic among scholars in recent years, with wide applications in military, industrial and other fields. Training of autonomous decision strategies is usually achieved through machine learning. Course learning (i.e., curriculum learning) builds on reinforcement learning by borrowing the human idea of learning from easy to difficult: the model first learns easy samples and the sample difficulty is then gradually increased, which yields faster training and better training results. The core of course learning is the autonomous generation of training tasks and the autonomous ordering of task difficulty. Research on autonomous task generation is currently limited; proposed methods either train a progressively generalized problem solver in an unsupervised manner, or take the final task as a template, set a parameter vector, and adjust the parameters to obtain intermediate tasks. However, existing autonomous task generation methods are not effective at generating training tasks for robot strategies. Mainstream approaches to the autonomous ordering of task difficulty either only reorder samples of the final task without changing the task itself, change certain aspects of the MDP to create intermediate tasks with different MDP structures, or, taking into account human assessment of task difficulty, use a human-in-the-loop method for ordering. However, the ordering results obtained by existing autonomous ordering methods are not accurate, and some of these methods are coupled with task generation and are not suitable for robot strategy training.
In summary, existing autonomous task generation methods and autonomous ordering methods perform poorly in strategy training for robots, and it is difficult for them to obtain good decision-making and control results.
Disclosure of Invention
The invention aims to solve the problem that existing methods struggle to achieve good decision-making and control performance in robot strategy training, which stems from the poor effectiveness of existing autonomous task generation methods at producing training tasks for robot strategies and the poor accuracy of the ordering results produced by existing autonomous ordering methods.
The technical solution adopted by the invention to solve the above technical problem is as follows:
According to one aspect of the invention, a course reinforcement learning-based robotic strategy training system comprises an algorithm running container module, a training course generating module and a feedback evaluating module, wherein:
The training course generating module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks from easy to difficult in difficulty through the neural network to obtain courses;
The algorithm running container module is used for configuring a running container for the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm, so as to perform self-optimized reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to the courses obtained by the training course generating module;
The feedback evaluation module is used for carrying out self-organizing reinforcement training of the robot according to the training error of the robot, outputting the score of the robot on the task execution condition according to the self-organizing reinforcement training result of the robot, and feeding back the score of the robot on the task execution condition to the algorithm operation container module to guide the self-optimizing reinforcement learning algorithm training.
Further, the target recognition algorithm is YOLOv algorithm, the robot path planning algorithm is artificial potential field algorithm, and the game countermeasure decision algorithm is PPO algorithm.
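By way of illustration only, the following is a minimal sketch of how the three modules could interact in one training loop; the class and method names (TaskGenerator-style interfaces, train_on, evaluate, etc.) are hypothetical and are not prescribed by the invention.

```python
# Minimal sketch of the curriculum training loop, assuming hypothetical
# module interfaces; the patent does not prescribe concrete APIs.
class CurriculumTrainer:
    def __init__(self, task_generator, task_comparator, algo_container, feedback_module):
        self.task_generator = task_generator    # autonomously generates course task scenes
        self.task_comparator = task_comparator  # ranks tasks from easy to difficult
        self.algo_container = algo_container    # trains recognition / planning / decision algorithms
        self.feedback_module = feedback_module  # scores task execution and feeds the score back

    def run(self, n_tasks, n_rounds):
        tasks = [self.task_generator.generate() for _ in range(n_tasks)]
        curriculum = self.task_comparator.sort_by_difficulty(tasks)   # easy tasks first
        for task in curriculum:
            for _ in range(n_rounds):
                result = self.algo_container.train_on(task)           # self-optimizing RL step
                score = self.feedback_module.evaluate(result)         # weighted multi-objective score
                self.algo_container.update_from_feedback(score)       # score guides further training
```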
Based on another aspect of the invention, a robot strategy training method based on course reinforcement learning specifically comprises the following steps:
Step one, performing three-dimensional detection reconstruction of a real scene and autonomous generation of an intelligent environment of the task scene by using a task generator;
Step two, sequencing task scenes from easy to difficult by using a task sequencer to obtain training courses;
step three, performing self-optimized reinforcement learning training by an algorithm running a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in a container module based on the training courses generated in the step two;
And step four, a feedback evaluation module outputs the scores of the robots on the task execution conditions according to the reinforcement learning training results in the step three, and feeds back the scores of the robots on the task execution conditions to an algorithm operation container module to guide the training of a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm.
Further, the specific process of three-dimensional detection reconstruction of the real scene is as follows:
Step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination by using a vision camera and a laser radar;
Step 2, preprocessing of depth image
Denoising the depth image by using Gaussian filtering, and repairing the denoised depth image by using DEEPFILLV algorithm to obtain a restored depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image under the world coordinate system by using the calculated conversion relation; performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
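As an illustration of the conversion from a preprocessed depth image to world-frame point cloud data, the sketch below applies the pinhole-camera (imaging) model; the intrinsic parameters fx, fy, cx, cy and the extrinsic matrix T_world_cam are assumed to come from calibration and are not specified by the invention.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, T_world_cam=np.eye(4)):
    """Back-project a preprocessed depth image (meters) to world-frame points.

    Illustrative sketch of the pixel -> camera -> world conversion described
    above; intrinsics and the extrinsic matrix come from camera calibration.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx                                  # pixel -> camera frame
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]         # camera -> world frame
    return pts_world[z.reshape(-1) > 0]                    # drop invalid (zero-depth) pixels
```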
Step 4, point cloud registration
The common part of the scene is used as a reference, and the distortion-compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination are matched and overlapped into a world coordinate system according to the translation vector and the rotation matrix of each frame, so that a registered point cloud space is obtained;
step 5, fusion of point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the registered point cloud space into cubes by utilizing the grid, namely dividing the registered point cloud space into voxels; simulating a surface by assigning distance field values to respective voxels;
Step 6, surface generation
And (5) processing the result obtained in the step (5) by adopting an MC algorithm to generate a three-dimensional surface, and obtaining the task scene map.
Further, the specific process of the step 6 is as follows:
Eight adjacent data values in the data field are stored at the eight vertices of a voxel. A potential value T is selected; for the two endpoints of an edge of a boundary voxel, when one endpoint value is larger than T and the other is smaller than T, a vertex of the isosurface lies on that edge. All twelve edges of the voxel are traversed to obtain the intersection points of the voxel's edges with the isosurface, and triangular patches are constructed inside the voxel. All triangular patches in the voxel divide it into two regions, inside and outside the isosurface. Connecting all triangular patches in the voxel forms the voxel's isosurface, and combining the isosurfaces of all voxels forms a complete three-dimensional surface, which is taken as the task scene map.
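The iso-surface extraction of step 6 corresponds to the classic marching-cubes procedure; as a hedged illustration, an equivalent result can be obtained with an off-the-shelf implementation such as scikit-image (the voxel size and iso-value below are illustrative assumptions, not values fixed by the invention).

```python
import numpy as np
from skimage import measure

def extract_surface(sdf_volume, iso_value=0.0, voxel_size=0.05):
    """Extract a triangulated iso-surface from the fused voxel distance field."""
    # marching_cubes returns vertices, triangle faces, normals and values
    verts, faces, normals, _ = measure.marching_cubes(sdf_volume, level=iso_value)
    return verts * voxel_size, faces          # vertices scaled back to metric units
```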
Further, the specific process of autonomous generation of the task scene intelligent environment is as follows:
Step 1) task scene segmentation generation
Constructing an adjacency matrix of the voxels around each point in the point cloud by utilizing the voxels obtained after the division in step 5, weighting the edges between targets according to the adjacency matrix, and completing the separation of overlapping targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of all objects;
Carrying out data association on the segmented object 3D point cloud model and the task scene map, judging each segmented target category, and adding the target category into a model library;
Clustering the non-ground point cloud into point cloud clusters of different categories, namely constructing an overall three-dimensional semantic map;
Step 2) task goal automation generation
The specific process of the step 2) is as follows:
S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
S2, representing the positions of random points by generating random seeds according to the size of the two-dimensional map, and inputting the generated random seeds into the Voronoi-Dirichlet tessellation algorithm;
S3, performing Delaunay triangulation on the input random seeds using the Voronoi-Dirichlet tessellation algorithm, and dividing the two-dimensional map into polygons by the perpendicular bisectors of the sides of the triangles obtained by the triangulation, to obtain a Voronoi diagram (an illustrative sketch of this partition is given after this list);
s4, randomly selecting a polygon as an obstacle area, and randomly selecting blank vertexes as positions of obstacle points, threat points or rewarding point models in a model library;
S5, in the three-dimensional map, a polygonal prism is placed at a position corresponding to the obstacle polygon to serve as an obstacle, and an obstacle point, a threat point or a rewarding point model is placed on the ground at a position corresponding to the selected blank vertex, so that mapping from the two-dimensional map to the three-dimensional task scene map is completed;
Step S6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, and if not, correcting the position of the model which is placed without meeting the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
S7, randomly generating the number of robots and the initial positions of the robots, and completing the generation of the random game countermeasure task;
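A minimal sketch of the random-seed Voronoi partition used in steps S2-S4 is given below, using scipy's Voronoi implementation; the seed count and obstacle ratio are illustrative parameters, not values fixed by the invention.

```python
import numpy as np
from scipy.spatial import Voronoi

def generate_voronoi_tasks(map_w, map_h, n_seeds=40, obstacle_ratio=0.2, rng=None):
    rng = rng or np.random.default_rng()
    seeds = rng.uniform([0, 0], [map_w, map_h], size=(n_seeds, 2))         # S2: random seeds
    vor = Voronoi(seeds)                                                   # S3: Voronoi diagram
    n_obstacles = int(obstacle_ratio * n_seeds)
    obstacle_cells = rng.choice(n_seeds, size=n_obstacles, replace=False)  # S4: obstacle regions
    free_vertices = [v for v in vor.vertices
                     if 0 <= v[0] <= map_w and 0 <= v[1] <= map_h]         # candidate obstacle/threat/reward points
    return vor, obstacle_cells, free_vertices
```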
step 3) robot dynamics model verification
Setting initial kinematic parameters of a dynamic model of the unmanned robot system, selecting key frames to integrate IMU data according to the lowest sampling frequency of each sensor in the IMU, the laser radar and the vision camera carried by the unmanned robot system, and obtaining state quantity increment between adjacent key frames;
And predicting error data at the next moment according to state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between the robot dynamics model and an actual robot system, and realizing dynamics modeling and verification of the dynamics parameters of the robot through iterative compensation of initial kinematics parameters.
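The following is a simplified, illustrative sketch of accumulating IMU measurements between two keyframes into a state-quantity increment; a full preintegration scheme would additionally handle gravity, sensor biases and noise, which are omitted here.

```python
import numpy as np

def preintegrate_imu(accels, gyros, dts):
    """accels, gyros: (N, 3) body-frame measurements; dts: (N,) sample intervals."""
    delta_R = np.eye(3)          # rotation increment between the two keyframes
    delta_v = np.zeros(3)        # velocity increment
    delta_p = np.zeros(3)        # position increment
    for a, w, dt in zip(accels, gyros, dts):
        delta_p += delta_v * dt + 0.5 * (delta_R @ a) * dt**2
        delta_v += (delta_R @ a) * dt
        # first-order update of the rotation increment from the gyro rate
        wx = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
        delta_R = delta_R @ (np.eye(3) + wx * dt)
    return delta_R, delta_v, delta_p
```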
Further, the task sequencer is implemented using a RankNet network.
Further, the operating principle of the RankNet network is as follows:
A feature vector x composed of the task scene and the robot dynamics model is input into the RankNet network, which maps the input feature vector to a real number s = f(x).
The input feature vector corresponding to task U_i is denoted x_i and the input feature vector corresponding to task U_j is denoted x_j; the RankNet network performs forward computation on x_i and x_j separately to obtain a difficulty score s_i = f(x_i) corresponding to x_i and a difficulty score s_j = f(x_j) corresponding to x_j;
U_i ⊳ U_j is used to indicate that task U_i is scored higher (more difficult) than task U_j; the predicted relevance probability P_ij that task U_i scores higher than task U_j is:
P_ij = 1 / (1 + e^(-σ(s_i - s_j)))
where the parameter σ is a constant and e is the base of the natural logarithm;
The difficulty comparison of task U_i and task U_j is represented by S_ij, where S_ij = 1 if task U_i is more difficult than task U_j, S_ij = 0 if the two tasks are equally difficult, and S_ij = -1 if task U_j is more difficult than task U_i;
The true relevance probability P̄_ij is:
P̄_ij = (1 + S_ij) / 2
The RankNet network compares task difficulty probabilistically: rather than directly judging whether task U_i is more difficult than task U_j, it states that task U_i is more difficult than task U_j with probability P_ij, and takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
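A minimal RankNet sketch consistent with the formulas above is given below: f maps a task feature vector to a scalar difficulty score, and pairs are trained with the cross-entropy between the predicted and true relevance probabilities. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RankNet(nn.Module):
    def __init__(self, in_dim, hidden=64, sigma=1.0):
        super().__init__()
        # f: task feature vector -> scalar difficulty score
        self.f = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.sigma = sigma

    def pair_loss(self, x_i, x_j, S_ij):
        """x_i, x_j: (N, in_dim) feature batches; S_ij: (N, 1) float tensor in {-1, 0, 1}."""
        s_i, s_j = self.f(x_i), self.f(x_j)                  # difficulty scores
        P_ij = torch.sigmoid(self.sigma * (s_i - s_j))       # predicted probability U_i harder than U_j
        P_bar = 0.5 * (1.0 + S_ij)                           # true relevance probability
        return nn.functional.binary_cross_entropy(P_ij, P_bar)
```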
Further, the RankNet network model is trained using a pairwise (pair-based) training method.
Further, the obtained point cloud data is subjected to distortion compensation, and point cloud data after distortion compensation is obtained; the specific process is as follows:
For original laser point cloud data in a frame acquired by using a laser radar, calculating the time difference of each acquired original point cloud data relative to the laser point cloud data at the initial moment in the frame, calculating the motion information of the robot by using an IMU, respectively calculating a transformation matrix of a laser radar coordinate system at the acquisition moment of each original laser point cloud relative to a laser radar coordinate system at the initial moment of the frame according to the time difference and the motion information, and multiplying the transformation matrix by the coordinates of the corresponding original laser point cloud to obtain the laser point cloud coordinates after distortion compensation.
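An illustrative sketch of the per-point motion-distortion compensation is given below; pose_at is a hypothetical helper that returns the 4x4 transform of the lidar at a given time offset relative to the start of the frame, computed from the IMU-derived motion.

```python
import numpy as np

def compensate_distortion(points, timestamps, frame_start_time, pose_at):
    """Re-express each raw lidar point in the lidar frame at the start of the scan."""
    compensated = []
    for p, t in zip(points, timestamps):
        dt = t - frame_start_time              # time offset of this point inside the frame
        T = pose_at(dt)                        # lidar pose at capture time w.r.t. frame start
        p_h = np.append(p, 1.0)                # homogeneous coordinates
        compensated.append((T @ p_h)[:3])      # transform the point back to the frame-start pose
    return np.asarray(compensated)
```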
The beneficial effects of the invention are as follows:
For the different task modes of heterogeneous multi-robot teams, the invention takes a dynamics model of the complex environment as input and constructs a course-learning-based training framework for multi-robot joint task decision-making. Considering the gradual progression of task difficulty during training, a parameter auto-generation algorithm and a target auto-generation algorithm are established on the basis of the complex-environment dynamics model. On this basis, a course difficulty evaluation and calibration algorithm is established, and its output is fed back to the self-optimizing reinforcement learning algorithm. At the same time, the invention saves a great deal of labor cost, enables autonomous training of a multi-robot system in a complex environment, and solves the problem that existing methods struggle to achieve good decision-making and control performance when training robot strategies.
Drawings
FIG. 1 is an algorithmic framework diagram for course reinforcement learning;
FIG. 2 is a workflow diagram of a workout generation module;
FIG. 3 is a schematic diagram of a scene autonomous detection system comprised of a vision camera, lidar and an IMU;
FIG. 4 is a schematic diagram of Voronoi-Dirichlet tessellation segmentation and model addition.
Detailed Description
First embodiment: this embodiment is described with reference to FIG. 1. The robot strategy training system based on course reinforcement learning according to this embodiment comprises an algorithm running container module, a training course generating module and a feedback evaluation module, wherein:
The training course generating module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks from easy to difficult in difficulty through the neural network, obtaining courses and providing training scenes for the reinforcement learning algorithm;
The algorithm running container module is used for configuring a running container for the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm, so as to perform self-optimized reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to the courses obtained by the training course generating module;
The feedback evaluation module is used for carrying out self-organizing reinforcement training of the robot according to the training error of the robot, outputting the score of the robot on the task execution condition according to the self-organizing reinforcement training result of the robot, and feeding back the score of the robot on the task execution condition to the algorithm operation container module to guide the self-optimizing reinforcement learning algorithm training.
The robot's self-organizing reinforcement training is the training of a method for selecting evaluation-index weights: under different scenes and different tasks, several task objectives are evaluated separately, and a weighted sum of these evaluations is finally taken as the score of the task execution; the self-organizing reinforcement training is the training of these weights.
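As a hedged illustration of the weighted-sum scoring performed by the feedback evaluation module, the sketch below combines several per-objective evaluations with weights; the objective names and weight values are examples, not terms defined by the invention.

```python
def task_execution_score(metrics, weights):
    """metrics, weights: dicts keyed by objective name; weights are assumed to sum to 1."""
    return sum(weights[k] * metrics[k] for k in metrics)

# Example usage with illustrative objectives and weights
score = task_execution_score(
    metrics={"goal_reached": 1.0, "collision_free": 0.8, "time_efficiency": 0.6},
    weights={"goal_reached": 0.5, "collision_free": 0.3, "time_efficiency": 0.2},
)
```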
The course learning method provided by the invention can learn strategies that use only local information (i.e., each robot's own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular structure for the communication method between robots, and is applicable not only to cooperative interaction but also to competitive or mixed interaction environments (i.e., environments mixing cooperation and competition) involving physical and informational behaviors.
The second embodiment: this embodiment differs from the first embodiment in that the target recognition algorithm is the YOLOv algorithm, the robot path planning algorithm is the artificial potential field algorithm, and the game countermeasure decision algorithm is the PPO algorithm.
The third embodiment of the present invention relates to a robot strategy training method based on course reinforcement learning, the method specifically comprising the following steps:
step one, performing three-dimensional detection reconstruction of a real scene and autonomous generation of an intelligent environment of a complex task scene by using a task generator;
this embodiment will be described with reference to fig. 2. The specific process of the three-dimensional detection reconstruction of the real scene is as follows:
Step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination by using a vision camera and a laser radar;
Step 2, preprocessing of depth image
Denoising the depth image by using Gaussian filtering, and repairing the denoised depth image by using DEEPFILLV algorithm to obtain a restored depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image under the world coordinate system by using the calculated conversion relation; performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation; the specific process is as follows:
for original laser point cloud data in a frame acquired by using a laser radar, calculating the time difference of each acquired original point cloud data relative to the laser point cloud data at the initial moment in the frame, calculating the motion information of the robot by using an IMU, respectively calculating a transformation matrix of a laser radar coordinate system at the acquisition moment of each original laser point cloud relative to a laser radar coordinate system at the initial moment of the frame according to the time difference and the motion information, and multiplying the transformation matrix by the corresponding original laser point cloud coordinates to obtain the laser point cloud coordinates after distortion compensation;
Step 4, point cloud registration
The common part of the scene is used as a reference, and the distortion-compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination are matched and overlapped into a world coordinate system according to the translation vector and the rotation matrix of each frame, so that a registered point cloud space is obtained;
Throughout the map construction and registration process, each single-frame image is transformed by its translation vector and rotation matrix and then placed into the world coordinate system, thereby realizing map construction.
The translation vector and rotation matrix of each frame are processed by the correction module. As shown in fig. 3, the correction module corrects the global error using a relatively independent loop detection module. The correction module detects the similarity of the images to judge whether to loop, describes the image similarity by using a word bag model, calculates the similarity of key frames by using a TF-IDF algorithm, then the loop detection module acquires loop candidate frames according to the similarity of the key frames, judges the continuity of the loop candidate frames, and finally corrects accumulated scale errors and rotation translation errors according to the similarity transformation relation of the image point cloud space, fuses map repeated information, and realizes closed loop fusion.
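A simplified sketch of the bag-of-words / TF-IDF keyframe similarity used by the loop-detection step is shown below; in a real system the "words" are IDs from a trained visual vocabulary tree, which is abstracted away here.

```python
import numpy as np

def tfidf_similarity(words_a, words_b, idf):
    """words_*: lists of visual-word IDs for two keyframes; idf: dict word -> inverse document frequency."""
    def tfidf_vec(words):
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0.0) + 1.0
        n = float(len(words))
        return {w: (c / n) * idf.get(w, 0.0) for w, c in counts.items()}

    va, vb = tfidf_vec(words_a), tfidf_vec(words_b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)                 # cosine similarity of TF-IDF vectors
    norm = np.sqrt(sum(v * v for v in va.values())) * np.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm > 0 else 0.0
```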
Step 5, fusion of point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as the origin, and dividing the registered point cloud space into small cubes using the grid, i.e., dividing the registered point cloud space into voxels; the surface is approximated by assigning a signed distance field (SDF) value to each voxel;
Step 6, surface generation
Processing the result obtained in the step 5 by adopting an MC algorithm to generate a three-dimensional surface, and obtaining a task scene map;
the specific process of the step6 is as follows:
Eight adjacent data values in the data field are stored at the eight vertices of a voxel. A suitable constant is selected as the potential value T; for the two endpoints of an edge of a boundary voxel, when one endpoint value is larger than T and the other is smaller than T, a vertex of the isosurface lies on that edge. All twelve edges of the voxel are traversed to obtain the intersection points of the voxel's edges with the isosurface, and triangular patches are constructed inside the voxel. All triangular patches in the voxel divide it into two regions, inside and outside the isosurface. Connecting all triangular patches in the voxel forms the voxel's isosurface, and combining the isosurfaces of all voxels forms a complete three-dimensional surface, which is taken as the task scene map;
the specific process of the intelligent environment autonomous generation of the complex task scene is as follows:
the method comprises the steps of automatically generating a complex task scene intelligent environment, firstly obtaining a three-dimensional semantic map and a corresponding target model through segmentation and identification of a task scene map, constructing a model library, then automatically generating a series of obstacle, threat and rewarding elements in the task scene map according to a certain rule to obtain a plurality of tasks, and finally realizing the construction of a robot dynamics model by utilizing each sensor carried on a robot.
Step 1) task scene segmentation generation
Constructing an adjacency matrix of the voxels around each point in the point cloud by utilizing the voxels obtained after the division in step 5, weighting the edges between targets according to the adjacency matrix, and completing the separation of overlapping targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of all objects;
Carrying out data association on the segmented object 3D point cloud model and the task scene map, judging each segmented target category, and adding the target category into a model library;
Finally, laser 3D point cloud object segmentation is performed, and the non-ground point cloud is clustered into point cloud clusters of different categories, i.e., an overall three-dimensional semantic map is constructed;
Step 2) task goal automation generation
The core of automatic task-target generation is procedural generation: the models in the model library are classified into several different abstraction layers, such as obstacles, rewards and threats, and the generation rules of each abstraction layer are formulated according to the dependency relationships among the different abstraction layers.
The specific process of the step 2) is as follows:
S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
S2, representing the positions of a series of random points by generating a certain number of random seeds according to the size of the two-dimensional map, and inputting the generated random seeds into the Voronoi-Dirichlet tessellation algorithm;
S3, performing Delaunay triangulation on the input random seeds using the Voronoi-Dirichlet tessellation algorithm, and dividing the two-dimensional map into polygons by the perpendicular bisectors of the sides of the triangles obtained by the triangulation, to obtain a Voronoi diagram;
s4, randomly selecting a polygon with a certain proportion as an obstacle area, and randomly selecting blank vertexes as positions of obstacle points, threat points or rewarding point models in a model library;
Step S5, in the three-dimensional map, a polygonal prism is placed at a position corresponding to the obstacle polygon to serve as an obstacle, and an obstacle point, a threat point or a rewarding point model is placed on the ground at a position corresponding to the selected blank vertex, so that the mapping from the two-dimensional map to the three-dimensional task scene map is completed, and the map is shown in fig. 4;
Step S6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, and if not, correcting the position of the model which is placed without meeting the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
The constraint conditions of the scene are as follows: the model and the model, the model and the barrier are not covered or overlapped in the three-dimensional task scene map after the model is added, and a closed area is formed;
and S7, randomly generating the number of robots and the initial positions of the robots, and completing the generation of the random game countermeasure task.
Step 3) robot dynamics model verification
Firstly, a suitable simplified dynamics model of a typical unmanned aerial vehicle system is selected, the initial kinematic parameters of the dynamics model are set, keyframes are selected according to the lowest sampling frequency among the IMU, laser radar and vision camera carried by the system, and the IMU data are integrated to obtain the state-quantity increments between adjacent keyframes;
And predicting error data at the next moment according to state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between the robot dynamics model and an actual robot system, and realizing accurate modeling of dynamics of the robot and accurate verification of the dynamics parameters through iterative compensation of initial kinematics parameters.
Step two, sequencing task scenes from easy to difficult by using a task sequencer to obtain training courses;
The task sequencer is implemented using a RankNet network.
The robot model and the task scene are taken as the inputs of the neural network, and the difficulty score of the task as its output; the sample set is labeled with a weighted sum of the course difficulty score estimated by the A* algorithm and the course difficulty score estimated by human intuition. The courses are obtained from the task scenes ordered from easy to difficult. RankNet can be constructed with the following framework to realize the ordering of task difficulty:
the operating principle of the RankNet network is as follows:
A feature vector x composed of the task scene and the robot dynamics model is input into the RankNet network, which maps the input feature vector to a real number s = f(x).
The input feature vector corresponding to task U_i is denoted x_i and the input feature vector corresponding to task U_j is denoted x_j; the RankNet network performs forward computation on x_i and x_j separately to obtain a difficulty score s_i = f(x_i) corresponding to x_i and a difficulty score s_j = f(x_j) corresponding to x_j;
U_i ⊳ U_j is used to indicate that task U_i is scored higher (more difficult) than task U_j; the predicted relevance probability P_ij that task U_i scores higher than task U_j is:
P_ij = 1 / (1 + e^(-σ(s_i - s_j)))
This probability is a sigmoid function whose shape is determined by the parameter σ; σ is a constant whose value is chosen according to experience and the actual situation, and e is the base of the natural logarithm;
The difficulty comparison of task U_i and task U_j is represented by S_ij, where S_ij = 1 if task U_i is more difficult than task U_j, S_ij = 0 if the two tasks are equally difficult, and S_ij = -1 if task U_j is more difficult than task U_i;
The true relevance probability P̄_ij is:
P̄_ij = (1 + S_ij) / 2
The RankNet network compares task difficulty probabilistically: rather than directly judging whether task U_i is more difficult than task U_j, it states that task U_i is more difficult than task U_j with probability P_ij, and takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
The RankNet network model is trained using a pairwise (pair-based) training method.
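As a hedged illustration of how the difficulty labels described above could be formed (a weighted sum of the course difficulty estimated by the A* algorithm and the difficulty estimated by human intuition), consider the sketch below; the weight value is an illustrative assumption, not a value fixed by the invention.

```python
def difficulty_label(astar_difficulty, human_score, w_astar=0.6):
    """Combine an A*-based difficulty estimate (e.g., normalized planned path length
    in the task scene) with a human intuition score into one training label."""
    return w_astar * astar_difficulty + (1.0 - w_astar) * human_score
```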
Step three, performing self-optimized reinforcement learning training by an algorithm running a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in a container module based on the training courses generated in the step two;
And step four, a feedback evaluation module outputs the scores of the robots on the task execution conditions according to the reinforcement learning training results in the step three, and feeds back the scores of the robots on the task execution conditions to an algorithm operation container module to guide the training of a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm.
The above examples of the present invention are only for describing the calculation model and calculation flow of the present invention in detail, and are not limiting of the embodiments of the present invention. Other variations and modifications of the above description will be apparent to those of ordinary skill in the art, and it is not intended to be exhaustive of all embodiments, all of which are within the scope of the invention.
Claims (7)
1. The training method of the robot strategy training system based on course reinforcement learning comprises an algorithm running container module, a training course generating module and a feedback evaluating module, wherein: the training course generating module is divided into a task generator and a task comparator, and the task generator is used for autonomously generating a course task scene; the task comparator is used for sequencing the tasks from easy to difficult in difficulty through the neural network to obtain courses; the algorithm running container module is used for configuring a running container for the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm, so as to perform self-optimized reinforcement learning algorithm training on the target recognition algorithm, the robot path planning algorithm and the game countermeasure decision algorithm according to the courses obtained by the training course generating module; the feedback evaluation module is used for carrying out self-organizing reinforcement training of the robot according to the training error of the robot, outputting the score of the robot on the task execution condition according to the self-organizing reinforcement training result of the robot, and feeding back the score of the robot on the task execution condition to the algorithm operation container module to guide the self-optimizing reinforcement learning algorithm training; the target recognition algorithm is YOLOv algorithm, the robot path planning algorithm is artificial potential field algorithm, and the game countermeasure decision algorithm is PPO algorithm; the method is characterized by comprising the following steps:
Step one, performing three-dimensional detection reconstruction of a real scene and autonomous generation of an intelligent environment of the task scene by using a task generator;
The specific process of the three-dimensional detection reconstruction of the real scene is as follows:
Step 1, depth image acquisition
Shooting depth images of the same scene under different angles and illumination by using a vision camera and a laser radar;
Step 2, preprocessing of depth image
Denoising the depth image by using Gaussian filtering, and repairing the denoised depth image by using DEEPFILLV algorithm to obtain a restored depth image, namely a preprocessed depth image;
step 3, calculating point cloud data from the preprocessed depth image
Calculating a conversion relation between a world coordinate system and an image pixel coordinate system according to an imaging principle, and obtaining point cloud data of the preprocessed depth image under the world coordinate system by using the calculated conversion relation; performing distortion compensation on the obtained point cloud data to obtain point cloud data after distortion compensation;
Step 4, point cloud registration
The common part of the scene is used as a reference, and the distortion-compensated point cloud coordinates corresponding to the multi-frame preprocessed depth images under different angles and illumination are matched and overlapped into a world coordinate system according to the translation vector and the rotation matrix of each frame, so that a registered point cloud space is obtained;
step 5, fusion of point cloud data after registration
Constructing a volume grid by taking the initial position of the sensor as an origin, and dividing the registered point cloud space into cubes by utilizing the grid, namely dividing the registered point cloud space into voxels; simulating a surface by assigning distance field values to respective voxels;
Step 6, surface generation
Processing the result obtained in the step 5 by adopting an MC algorithm to generate a three-dimensional surface, and obtaining a task scene map;
Step two, sequencing task scenes from easy to difficult by using a task sequencer to obtain training courses;
step three, performing self-optimized reinforcement learning training by an algorithm running a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm in a container module based on the training courses generated in the step two;
And step four, a feedback evaluation module outputs the scores of the robots on the task execution conditions according to the reinforcement learning training results in the step three, and feeds back the scores of the robots on the task execution conditions to an algorithm operation container module to guide the training of a target recognition algorithm, a robot path planning algorithm and a game countermeasure decision algorithm.
2. The training method of the robot strategy training system based on course reinforcement learning according to claim 1, wherein the specific process of the step 6 is as follows:
and respectively storing eight adjacent data in the data field at eight vertexes of a voxel, selecting potential values T for two endpoints of one edge on a boundary voxel, when one endpoint is larger than T and the other endpoint is smaller than T, then, one vertex of an isosurface exists on the edge, traversing all twelve edges in the voxel to obtain the intersection point of twelve edges in the voxel and the isosurface, constructing triangular patches in the voxel, dividing the voxel into two areas which are respectively arranged in the isosurface and outside the isosurface by all triangular patches in the voxel, connecting all triangular patches in the voxel to form the isosurface of the voxel, combining the isosurfaces of all voxels to form a complete three-dimensional surface, and taking the complete three-dimensional surface as a task scene map.
3. The training method of the robot strategy training system based on course reinforcement learning according to claim 2, wherein the specific process of autonomous generation of the task scene intelligent environment is as follows:
Step 1) task scene segmentation generation
Constructing an adjacency matrix of the voxels around each point in the point cloud by utilizing the voxels obtained after the division in step 5, weighting the edges between targets according to the adjacency matrix, and completing the separation of overlapping targets in the point cloud, namely segmenting the whole point cloud into 3D point cloud models of all objects;
Carrying out data association on the segmented object 3D point cloud model and the task scene map, judging each segmented target category, and adding the target category into a model library;
Clustering the non-ground point cloud into point cloud clusters of different categories, namely constructing an overall three-dimensional semantic map;
Step 2) task goal automation generation
The specific process of the step 2) is as follows:
S1, projecting a three-dimensional semantic map onto a horizontal ground to obtain a two-dimensional map;
S2, representing the positions of random points by generating random seeds according to the size of the two-dimensional map, and inputting the generated random seeds into the Voronoi-Dirichlet tessellation algorithm;
S3, performing Delaunay triangulation on the input random seeds using the Voronoi-Dirichlet tessellation algorithm, and dividing the two-dimensional map into polygons by the perpendicular bisectors of the sides of the triangles obtained by the triangulation, to obtain a Voronoi diagram;
s4, randomly selecting a polygon as an obstacle area, and randomly selecting blank vertexes as positions of obstacle points, threat points or rewarding point models in a model library;
S5, in the three-dimensional map, a polygonal prism is placed at a position corresponding to the obstacle polygon to serve as an obstacle, and an obstacle point, a threat point or a rewarding point model is placed on the ground at a position corresponding to the selected blank vertex, so that mapping from the two-dimensional map to the three-dimensional task scene map is completed;
Step S6, checking whether the three-dimensional task scene map added with the model meets the constraint condition of the scene, if so, directly executing the step S7, and if not, correcting the position of the model which is placed without meeting the constraint until the three-dimensional task scene map added with the model meets the constraint condition of the scene, and then executing the step S7;
S7, randomly generating the number of robots and the initial positions of the robots, and completing the generation of the random game countermeasure task;
step 3) robot dynamics model verification
Setting initial kinematic parameters of a dynamic model of the unmanned robot system, selecting key frames to integrate IMU data according to the lowest sampling frequency of each sensor in the IMU, the laser radar and the vision camera carried by the unmanned robot system, and obtaining state quantity increment between adjacent key frames;
And predicting error data at the next moment according to state quantity increment between adjacent key frames at the current moment, performing error compensation according to the predicted error data, comparing the difference between the robot dynamics model and an actual robot system, and realizing dynamics modeling and verification of the dynamics parameters of the robot through iterative compensation of initial kinematics parameters.
4. The training method of the robot strategy training system based on course reinforcement learning of claim 3, wherein said task sequencer is implemented using a RankNet network.
5. The training method of the robot strategy training system based on course reinforcement learning according to claim 4, wherein the operating principle of the rank net network is as follows:
A feature vector x composed of the task scene and the robot dynamics model is input into the RankNet network, which maps the input feature vector to a real number s = f(x).
The input feature vector corresponding to task U_i is denoted x_i and the input feature vector corresponding to task U_j is denoted x_j; the RankNet network performs forward computation on x_i and x_j separately to obtain a difficulty score s_i = f(x_i) corresponding to x_i and a difficulty score s_j = f(x_j) corresponding to x_j;
U_i ⊳ U_j is used to indicate that task U_i is scored higher (more difficult) than task U_j; the predicted relevance probability P_ij that task U_i scores higher than task U_j is:
P_ij = 1 / (1 + e^(-σ(s_i - s_j)))
where the parameter σ is a constant and e is the base of the natural logarithm;
The difficulty comparison of task U_i and task U_j is represented by S_ij, where S_ij = 1 if task U_i is more difficult than task U_j, S_ij = 0 if the two tasks are equally difficult, and S_ij = -1 if task U_j is more difficult than task U_i;
The true relevance probability P̄_ij is:
P̄_ij = (1 + S_ij) / 2
The RankNet network compares task difficulty probabilistically: rather than directly judging whether task U_i is more difficult than task U_j, it states that task U_i is more difficult than task U_j with probability P_ij, and takes minimizing the difference between the predicted relevance probability and the true relevance probability as the optimization objective.
6. The training method of the robot strategy training system based on course reinforcement learning of claim 5, wherein the RankNet network model is trained using a pairwise (pair-based) method.
7. The training method of the robot strategy training system based on course reinforcement learning according to claim 6, wherein the obtained point cloud data is subjected to distortion compensation to obtain the point cloud data after the distortion compensation; the specific process is as follows:
For original laser point cloud data in a frame acquired by using a laser radar, calculating the time difference of each acquired original point cloud data relative to the laser point cloud data at the initial moment in the frame, calculating the motion information of the robot by using an IMU, respectively calculating a transformation matrix of a laser radar coordinate system at the acquisition moment of each original laser point cloud relative to a laser radar coordinate system at the initial moment of the frame according to the time difference and the motion information, and multiplying the transformation matrix by the coordinates of the corresponding original laser point cloud to obtain the laser point cloud coordinates after distortion compensation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211227150.7A CN115454096B (en) | 2022-10-09 | 2022-10-09 | Course reinforcement learning-based robot strategy training system and training method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211227150.7A CN115454096B (en) | 2022-10-09 | 2022-10-09 | Course reinforcement learning-based robot strategy training system and training method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115454096A (en) | 2022-12-09
CN115454096B true CN115454096B (en) | 2024-07-19 |
Family
ID=84309007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211227150.7A Active CN115454096B (en) | 2022-10-09 | 2022-10-09 | Course reinforcement learning-based robot strategy training system and training method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115454096B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118182538B (en) * | 2024-05-17 | 2024-08-13 | 北京理工大学前沿技术研究院 | Unprotected left-turn scene decision planning method and system based on course reinforcement learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | A kind of game follower method and system based on course intensified learning |
CN112633466A (en) * | 2020-10-28 | 2021-04-09 | 华南理工大学 | Memory-keeping course learning method facing difficult exploration environment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113110592B (en) * | 2021-04-23 | 2022-09-23 | 南京大学 | Unmanned aerial vehicle obstacle avoidance and path planning method |
CN114529010A (en) * | 2022-01-28 | 2022-05-24 | 广州杰赛科技股份有限公司 | Robot autonomous learning method, device, equipment and storage medium |
CN114290339B (en) * | 2022-03-09 | 2022-06-21 | 南京大学 | Robot realistic migration method based on reinforcement learning and residual modeling |
CN114578860B (en) * | 2022-03-28 | 2024-10-18 | 中国人民解放军国防科技大学 | Large-scale unmanned aerial vehicle cluster flight method based on deep reinforcement learning |
CN114741886B (en) * | 2022-04-18 | 2022-11-22 | 中国人民解放军军事科学院战略评估咨询中心 | Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation |
CN114952828B (en) * | 2022-05-09 | 2024-06-14 | 华中科技大学 | Mechanical arm motion planning method and system based on deep reinforcement learning |
-
2022
- 2022-10-09 CN CN202211227150.7A patent/CN115454096B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | A kind of game follower method and system based on course intensified learning |
CN112633466A (en) * | 2020-10-28 | 2021-04-09 | 华南理工大学 | Memory-keeping course learning method facing difficult exploration environment |
Non-Patent Citations (1)
Title |
---|
Task-Extended Utility Tensor Method for Decentralized Multi-Vehicle Mission Planning; Haoyu Tian et al.; IEEE; 2023-10-31; full text *
Also Published As
Publication number | Publication date |
---|---|
CN115454096A (en) | 2022-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114384920B (en) | Dynamic obstacle avoidance method based on real-time construction of local grid map | |
CN111486855B (en) | Indoor two-dimensional semantic grid map construction method with object navigation points | |
CN110956651B (en) | Terrain semantic perception method based on fusion of vision and vibrotactile sense | |
EP3405845B1 (en) | Object-focused active three-dimensional reconstruction | |
Tovar et al. | Planning exploration strategies for simultaneous localization and mapping | |
CN111429514A (en) | Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds | |
Hardouin et al. | Next-Best-View planning for surface reconstruction of large-scale 3D environments with multiple UAVs | |
CN109800689A (en) | A kind of method for tracking target based on space-time characteristic fusion study | |
CN112578673B (en) | Perception decision and tracking control method for multi-sensor fusion of formula-free racing car | |
CN109829476B (en) | End-to-end three-dimensional object detection method based on YOLO | |
CN111709988A (en) | Method and device for determining characteristic information of object, electronic equipment and storage medium | |
CN117214904A (en) | Intelligent fish identification monitoring method and system based on multi-sensor data | |
CN115454096B (en) | Course reinforcement learning-based robot strategy training system and training method | |
CN117769724A (en) | Synthetic dataset creation using deep-learned object detection and classification | |
Short et al. | Abio-inspiredalgorithminimage-based pathplanning and localization using visual features and maps | |
WO2023242223A1 (en) | Motion prediction for mobile agents | |
CN116109047A (en) | Intelligent scheduling method based on three-dimensional intelligent detection | |
Cardoso et al. | A large-scale mapping method based on deep neural networks applied to self-driving car localization | |
CN118314180A (en) | Point cloud matching method and system based on derivative-free optimization | |
Tallavajhula | Lidar Simulation for Robotic Application Development: Modeling and Evaluation. | |
CN115690343A (en) | Robot laser radar scanning and mapping method based on visual following | |
CN115345281A (en) | Depth reinforcement learning acceleration training method for unmanned aerial vehicle image navigation | |
Lee et al. | Road following in an unstructured desert environment based on the EM (expectation-maximization) algorithm | |
Yan et al. | Mobile robot 3D map building and path planning based on multi–sensor data fusion | |
Mosalam et al. | Artificial Intelligence in Vision-Based Structural Health Monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |