CN111080659A - Environmental semantic perception method based on visual information - Google Patents

Environmental semantic perception method based on visual information

Info

Publication number
CN111080659A
CN111080659A (Application CN201911317441.3A)
Authority
CN
China
Prior art keywords
semantic
information
point cloud
map
camera
Prior art date
Legal status
Pending
Application number
CN201911317441.3A
Other languages
Chinese (zh)
Inventor
白成超
郭继峰
郑红星
刘天航
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201911317441.3A
Publication of CN111080659A
Legal status: Pending (current)

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T7/00 Image analysis > G06T7/10 Segmentation; Edge detection > G06T7/11 Region-based segmentation
    • G06T7/00 Image analysis > G06T7/50 Depth or shape recovery
    • G06T7/00 Image analysis > G06T7/90 Determination of colour characteristics
    • G06T2207/00 Indexing scheme for image analysis or image enhancement > G06T2207/10 Image acquisition modality > G06T2207/10024 Color image
    • G06T2207/10 Image acquisition modality > G06T2207/10028 Range image; Depth image; 3D point clouds

Abstract

The invention provides an environmental semantic perception method based on visual information, which comprises the following steps: acquiring environmental image information with a Kinect V1.0 camera to obtain a registered color image and depth image; on the basis of the registered color and depth images, solving the three-dimensional pose of the camera from the ORB feature points extracted in each frame through the ORB_SLAM2 process to obtain camera pose information; performing semantic segmentation on each frame of image to generate semantic color information; synchronously generating a point cloud from the input depth map and the camera intrinsic matrix; registering the semantic color information into the point cloud to obtain a local semantic point cloud result; fusing the camera pose information with the local semantic point cloud result to obtain new global semantic point cloud information; and expressing the fused global semantic point cloud information with an octree map to obtain the final three-dimensional octree semantic map. The invention provides deeper human-like understanding for the environment detection of extraterrestrial celestial body rovers.

Description

Environmental semantic perception method based on visual information
Technical Field
The invention relates to an environmental semantic perception method based on visual information, and belongs to the technical field of artificial intelligence information.
Background
Effective perception of the environment by a mobile platform can be achieved through simultaneous localization and mapping: obstacle information in the environment is obtained and, at the same time, the relative relationship between the platform and the environment is recovered, which is a key step toward platform autonomy. However, as platforms and detection payloads continue to develop, more task scenarios and requirements appear, and recognizing only the appearance and geometric characteristics of targets cannot solve the problems actually encountered. During the inspection of extraterrestrial celestial bodies, two patches of terrain with similar appearance can both be reconstructed in three dimensions by traditional recognition alone, but the difference between them is hard to distinguish. For earlier detection tasks it was enough to recognize whether an obstacle lay ahead and whether it could be passed, but such prior cognition is far from sufficient; as detection time and scale increase, the environment needs to be understood at the semantic level, that is, not only whether a target exists but also what it is, which is the core of rover intelligence. The same problem matters in autonomous driving research: vehicles, pedestrians, roadblocks and other highly random targets are encountered while driving. Suppose a collision cannot be avoided and only the existence of an obstacle ahead is recognized, without the obstacle being judged effectively; the significance of semantic understanding then stands out, because once it is understood that one obstacle is a person and the other is a pile of grass, the correct decision can easily be made. As this example shows, semantic awareness of the environment determines the correctness and effectiveness of task execution. Furthermore, semantic cognition of the environment is closer to the way humans understand their surroundings, and research on this problem is gradually becoming a hotspot in the field.
Semantic segmentation can be understood as dividing the image input into different semantically interpretable classes; a common segmentation architecture is the convolutional neural network. In 2017, researchers presented a deep-learning-based approach to semantic segmentation and proposed a solution named DeepLab, which mainly consists of a deep convolutional network, atrous (dilated) convolution and a fully connected conditional random field. Atrous filtering effectively controls the resolution at which feature responses are computed and enlarges the receptive field of the filters so that more semantic information can be fused; multi-scale segmentation of targets is achieved with atrous spatial pyramid pooling (ASPP); and finally, combining the deep convolutional neural network with a probabilistic graphical model improves the precise localization of target boundaries. To address the fact that existing semantic segmentation methods do not use neural network parameters efficiently, Chaurasia et al. exploited encoder representations to achieve efficient semantic segmentation and proposed the LinkNet solution, which allows training without a significant increase in parameters. Zhao et al. provided a real-time semantic segmentation framework for high-resolution images, the image cascade network (ICNet), which quickly achieves high-quality semantic segmentation by introducing a cascade feature fusion unit. Schneider et al. proposed a new multimodal convolutional neural network architecture for semantic segmentation and target detection that uses complementary inputs in addition to color information; the advantage of this combined model is mid-level fusion, which lets the network exploit cross-modal interdependence. To address scene parsing in an unrestricted open-vocabulary environment, Zhao et al. proposed the pyramid scene parsing network (PSPNet), which realizes global semantic understanding through region-based semantic fusion; the results show that PSPNet provides a very good framework for pixel-level prediction. In addition, U-Net, SegNet, DeconvNet, RefineNet, PixelNet and other methods also show good segmentation results, and scholars have proposed end-to-end segmentation models and adversarial-training-based approaches, opening new directions for subsequent research.
Based on the above background investigation and analysis, the demand for environmental semantic perception is growing steadily and indicates the direction of future development. The invention therefore provides a new semantic perception method on the basis of existing perception technology, providing support for the environmental semantic understanding of rovers.
Disclosure of Invention
The invention provides an environmental semantic perception method based on visual information, aiming to solve the insufficient deep semantic understanding capability of existing environment perception while providing reliable environment perception information for the subsequent planning and control stage.
An environmental semantic perception method based on visual information, the perception method comprising the following steps:
Step one: acquiring environmental image information with a Kinect V1.0 camera to obtain a registered color image and depth image, and simultaneously executing step two and step three;
Step two: on the basis of the registered color and depth images, solving the three-dimensional pose of the camera from the ORB feature points extracted in each frame through the ORB_SLAM2 process to obtain camera pose information, and then executing step five;
Step three: based on the published color image, performing semantic segmentation on each frame of image to generate semantic color information, and synchronously generating a point cloud from the input depth map and the camera intrinsic matrix;
Step four: registering the semantic color information generated in step three into the point cloud generated in step three to obtain a local semantic point cloud result;
Step five: fusing the camera pose information obtained in step two with the local semantic point cloud result generated in step four to obtain new global semantic point cloud information;
Step six: expressing the fused global semantic point cloud information obtained in step five with an octree map to obtain the final three-dimensional octree semantic map.
Further, in the second step, specifically, the registered color map and depth map are obtained by combining OpenNI and OpenCV.
Further, in step two, specifically, the three main parallel threads of ORB_SLAM2 are as follows:
localizing the camera pose of each frame through features matched in the local map, using motion-only BA (Bundle Adjustment) to minimize the reprojection error;
managing and optimizing the local map based on local BA;
performing loop detection and correcting the accumulated drift based on pose graph optimization.
Further, in step two, specifically, the ORB_SLAM2 process is optimized by bundle adjustment.
Further, in step three, specifically, the point cloud is a three-dimensional point cloud.
Further, in step three, specifically, the pyramid scene parsing network PSPNet is adopted as the model implementing the semantic segmentation network.
Further, in step five, specifically, the fusion of the camera pose information with the local semantic point cloud result generated in step four adopts a maximum-confidence fusion mode.
Further, in step six, specifically, when the point cloud is inserted into the three-dimensional map, the points are first filtered through a voxel filter to down-sample them; then the points are inserted into an Octomap, and free space is cleared by ray casting so as to update the internal nodes of the Octomap, i.e., the voxels at lower resolution; finally, the updated Octomap is organized for visualization.
The main advantages of the invention are:
the invention realizes semantic reconstruction of the detection environment based on visual information, has the capabilities of synchronous three-dimensional reconstruction, semantic segmentation and spatial semantic representation, provides deeper human-like understanding for environment detection of the extraterrestrial celestial body inspection tour, and provides more reliable information input for task planning and decision analysis. The invention belongs to the direction of artificial intelligence information technology, and improves the high-level semantic understanding capability compared with the prior art.
Drawings
FIG. 1 is a framework diagram of the RGB-D semantic SLAM implementation of the environmental semantic perception method based on visual information according to the present invention;
FIG. 2 is a schematic diagram of the input information conversion process;
FIG. 3 is the ORB_SLAM2 implementation framework;
FIG. 4 is a schematic diagram of an octree map;
FIG. 5 is a schematic diagram of the pyramid scene parsing network framework;
FIG. 6 is a diagram of the reconstruction results in the dataset environment based on ORB_SLAM2;
FIG. 7 is a diagram illustrating the RGBD-based three-dimensional semantic mapping result;
FIG. 8 is a diagram of the final result and semantic labels;
FIG. 9 is a diagram illustrating the indoor environment mapping result based on ORB_SLAM2;
FIG. 10 is a diagram illustrating the RGBD-based three-dimensional semantic mapping result;
FIG. 11 is a diagram of the final result and semantic labels.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides an embodiment of an environmental semantic perception method based on visual information, where the perception method includes the following steps:
Step one: acquiring environmental image information with a Kinect V1.0 camera to obtain a registered color image and depth image, and simultaneously executing step two and step three;
Step two: on the basis of the registered color and depth images, solving the three-dimensional pose of the camera from the ORB feature points extracted in each frame through the ORB_SLAM2 process (simultaneous localization and mapping) to obtain camera pose information, and then executing step five;
Step three: based on the published color image, performing semantic segmentation on each frame of image to generate semantic color information, and synchronously generating a point cloud from the input depth map and the camera intrinsic matrix;
Step four: registering the semantic color information generated in step three into the point cloud generated in step three to obtain a local semantic point cloud result;
Step five: fusing the camera pose information obtained in step two with the local semantic point cloud result generated in step four to obtain new global semantic point cloud information;
Step six: expressing the fused global semantic point cloud information obtained in step five with an octree map to obtain the final three-dimensional octree semantic map.
Specifically, the invention uses a depth camera to realize pose estimation, semantic segmentation and global/local semantic reconstruction of visual point cloud information, improving the capability to understand environmental information. The rover's environment perception is thus not limited to geometric three-dimensional understanding; it also gains an understanding of the semantic attributes of obstacles, which benefits task execution and path planning and greatly improves the intelligence of the platform.
Referring to fig. 2, in the present preferred embodiment, in step two, specifically, the color map and the depth map after registration are obtained by combining OpenNI and OpenCV.
Referring to fig. 3, in the present preferred embodiment, in step two, specifically, the three main parallel threads of ORB_SLAM2 are as follows:
localizing the camera pose of each frame through features matched in the local map, using motion-only BA to minimize the reprojection error;
managing and optimizing the local map based on local BA;
performing loop detection and correcting the accumulated drift based on pose graph optimization.
In the preferred embodiment of this section, in step two, specifically, the ORB_SLAM2 process is optimized by bundle adjustment.
Specifically, referring to figs. 2-3, regarding ORB_SLAM2: in 2017 Mur-Artal et al. proposed an open-source SLAM solution applicable to monocular, stereo and RGB-D cameras, namely ORB_SLAM2. Compared with the earlier monocular ORB_SLAM system, its application range is wider and no longer limited to monocular vision, and the whole framework includes loop closure detection, relocalization and map reuse. Second, introducing bundle adjustment (BA) optimization at the back end yields higher accuracy than real-time methods based on iterative closest point (ICP) or photometric and depth error minimization. Third, by using both near and far stereo point matches together with monocular observations, the final accuracy is better than direct stereo matching. Fourth, a lightweight localization mode is provided in which the visual odometry tracks unreconstructed regions and matches against map points that allow zero-drift localization, effectively solving the localization problem when mapping is not possible. The system has already been applied in various scenarios, such as handheld environment reconstruction devices, UAV environment reconstruction and autonomous driving of unmanned vehicles in large-scale environments. The invention therefore uses ORB_SLAM2 as the back end of the SLAM to solve the camera pose: the SLAM system maintains accurate global localization over long time scales, its requirements on the operating environment are modest, and real-time operation can be achieved on a CPU.
1) System input
The input to the system is the color and depth images collected by the camera. For each frame, a set of feature points is extracted; for the Kinect V1.0 camera adopted by the invention, 1000 points are extracted from each 640 × 480 image. For image acquisition, a combination of OpenNI and OpenCV is adopted, because OpenCV cannot drive the sensor directly and the image format delivered by OpenNI cannot be processed directly; the operation flow is shown in fig. 2. The availability of the sensor device is detected through OpenNI, the data streams are collected, and the streams are converted into a form usable by OpenCV, i.e., a picture format on which subsequent operations can be performed. The obtained image information is stored as color pictures and depth pictures.
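As an illustration of this acquisition path, the sketch below uses OpenCV's OpenNI capture backend to grab registered color and depth frames from a Kinect-class sensor. It is a minimal example under the assumption that OpenCV was built with OpenNI support; the depth-visualization scaling is arbitrary, and this is not the patent's own acquisition code.

```python
import cv2

# Open the Kinect V1.0 through the OpenNI backend (requires OpenCV built with OpenNI).
cap = cv2.VideoCapture(cv2.CAP_OPENNI)
# Ask the driver to register the depth map to the color image.
cap.set(cv2.CAP_PROP_OPENNI_REGISTRATION, 1)

while cap.grab():
    ok_d, depth = cap.retrieve(flag=cv2.CAP_OPENNI_DEPTH_MAP)  # 16-bit depth in mm, 640x480
    ok_c, color = cap.retrieve(flag=cv2.CAP_OPENNI_BGR_IMAGE)  # 8-bit BGR color, 640x480
    if not (ok_d and ok_c):
        continue
    cv2.imshow("color", color)
    cv2.imshow("depth", (depth / 4500.0 * 255).astype("uint8"))  # crude scaling for display
    if cv2.waitKey(1) == 27:   # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```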
2) System architecture and operation
In operation, the system has three main parallel threads: first, the camera pose of each frame is localized through features matched in the local map, using motion-only BA to minimize the reprojection error; second, the local map is managed and optimized based on local BA; third, loop detection is performed and the accumulated drift is corrected by pose graph optimization. After that, a fourth thread can run a full BA optimization to give the optimal structure and motion solution. In addition, a place recognition module based on DBoW2 is embedded for relocalization in case of tracking failure or for reinitialization in already reconstructed scenes. The system also maintains a covisibility graph, i.e., a graph connecting any two keyframes that observe common points, together with a minimum spanning tree connecting all keyframes; these graph structures allow the retrieval of local windows of keyframes so that tracking and local mapping operate locally. The system uses the same ORB features for the tracking, mapping and recognition tasks; they are robust to rotation and scale and invariant to camera auto-gain, auto-exposure and illumination changes. They are also fast to read, extract and match, which is an advantage for real-time operation.
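To make the feature pipeline concrete, the sketch below extracts ORB features (1000 per frame, matching the configuration described above) and matches two frames with a Hamming-distance brute-force matcher. It is an illustrative OpenCV snippet, not the ORB_SLAM2 implementation itself.

```python
import cv2

def extract_and_match(img1, img2, n_features=1000):
    """Extract ORB keypoints/descriptors from two grayscale frames and match them."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # ORB descriptors are binary, so Hamming distance is the appropriate metric.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return kp1, kp2, matches

# Example usage with two consecutive grayscale frames:
# kp1, kp2, matches = extract_and_match(prev_gray, curr_gray)
```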
3) Bundle Adjustment (BA) optimization
The map point three-dimensional coordinates $X_{w,j} \in \mathbb{R}^3$ and keyframe poses $T_{iw} \in SE(3)$, where $w$ denotes the world frame, are optimized by minimizing the sum of reprojection errors with respect to the matched keypoints $x_{i,j} \in \mathbb{R}^2$. The error term for the observation of map point $j$ in keyframe $i$ is:

$e_{i,j} = x_{i,j} - \pi_i(T_{iw}, X_{w,j})$ (1)

where $\pi_i$ is the projection equation:

$\pi_i(T_{iw}, X_{w,j}) = \left[\, f_{i,u}\,\dfrac{x_{i,j}}{z_{i,j}} + c_{i,u}, \;\; f_{i,v}\,\dfrac{y_{i,j}}{z_{i,j}} + c_{i,v} \,\right]^{T}$ (2)

$[\, x_{i,j} \;\; y_{i,j} \;\; z_{i,j} \,]^{T} = R_{iw} X_{w,j} + t_{iw}$ (3)

where $R_{iw} \in SO(3)$ and $t_{iw} \in \mathbb{R}^3$ are the rotation and translation parts of $T_{iw}$, and $(f_{i,u}, f_{i,v})$ and $(c_{i,u}, c_{i,v})$ are the camera intrinsic parameters associated with keyframe $i$. The cost function to be minimized is:

$C = \sum_{i,j} \rho_h\!\left( e_{i,j}^{T}\, \Omega_{i,j}^{-1}\, e_{i,j} \right)$ (4)

where $\rho_h$ is the Huber robust kernel function and $\Omega_{i,j} = \sigma_{i,j}^{2} I_{2 \times 2}$ is the covariance matrix related to the scale at which the keypoint was detected. For a full BA, all points and keyframes are optimized, with the first keyframe fixed as the origin. In local BA, all points contained in the local area are optimized while the subset of keyframes is fixed. In pose graph optimization or motion-only BA, all points are fixed and only the camera poses are optimized. Pose graph optimization under the SE(3) constraint is given below.

First, given a pose graph with binary edges, the error of an edge is defined as:

$e_{i,j} = \log_{SE(3)}\!\left( T_{ij}\, T_{jw}\, T_{wi} \right)$ (5)

where $T_{ij}$ is the relative constraint of the edge; $\log_{SE(3)}$ maps the composed transform into the tangent space, so the error is a vector in $\mathbb{R}^6$. The goal is to optimize the keyframe poses in SE(3) space by minimizing the cost function:

$C = \sum_{i,j} e_{i,j}^{T}\, \Lambda_{i,j}\, e_{i,j}$ (6)

in the formula, $\Lambda_{i,j}$ is the information matrix of the edge. Although this method is a rough approximation of a full BA, it converges faster and better than BA.
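The reprojection error and robust cost of equations (1)-(4) can be written down directly. The sketch below evaluates them with NumPy for a single keyframe; it is an illustration of the cost being minimized, not the optimizer actually used inside ORB_SLAM2, and the default Huber threshold is a typical choice rather than a value specified by the patent.

```python
import numpy as np

def reprojection_error(X_w, T_iw, K, x_obs):
    """e = x_obs - pi(T_iw, X_w) for one map point observed in keyframe i.

    X_w   : (3,) map point in the world frame
    T_iw  : (4, 4) pose transforming world coordinates into the keyframe's camera frame
    K     : (3, 3) intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    x_obs : (2,) observed keypoint in pixels
    """
    p_cam = T_iw[:3, :3] @ X_w + T_iw[:3, 3]          # equation (3)
    x, y, z = p_cam
    proj = np.array([K[0, 0] * x / z + K[0, 2],        # equation (2)
                     K[1, 1] * y / z + K[1, 2]])
    return x_obs - proj                                # equation (1)

def huber_cost(errors, sigmas, delta=np.sqrt(5.991)):
    """Sum of Huber-robustified squared errors, equation (4), with Omega = sigma^2 * I."""
    cost = 0.0
    for e, s in zip(errors, sigmas):
        r2 = float(e @ e) / s**2                       # e^T Omega^{-1} e
        cost += r2 if r2 <= delta**2 else 2 * delta * np.sqrt(r2) - delta**2
    return cost
```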
In this preferred embodiment, in step three, specifically, the point cloud is a three-dimensional point cloud.
Specifically, before insertion into the three-dimensional map, the environmental structure information is stored in the form of a point cloud for message passing. A point cloud is a set of unordered points, each containing the coordinates of the point in some reference frame. The depth image is first registered to the reference frame of the color image. Then the real-world coordinates of each pixel are computed from its position on the image, its depth and the camera intrinsics, generating the point cloud information.
In the pinhole camera model, given a pixel with pixel coordinates $(x, y)$ and depth $d$, its real-world coordinates $(X, Y, Z)$ in the camera optical-center coordinate system are computed by:

$X = \dfrac{(x - c_x)\, d}{f_x}, \qquad Y = \dfrac{(y - c_y)\, d}{f_y}, \qquad Z = d$

where $f_x$, $f_y$ are the focal lengths of the camera and $(c_x, c_y)$ are the pixel coordinates of the optical-axis center on the image. In addition to position and RGB information, semantic information is also stored in the point cloud; different point types are used for different semantic fusion methods. Since the invention adopts maximum-confidence fusion to realize three-dimensional semantic reconstruction, the point cloud data structure contains three-dimensional position information, RGB color information, semantic color information and semantic confidence information.
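A minimal NumPy sketch of this back-projection, producing the point type described above (position, RGB color, semantic color, semantic confidence), is given below. The flat array layout is an assumption made for illustration, not the patent's actual data structure.

```python
import numpy as np

def depth_to_semantic_cloud(depth, rgb, sem_color, sem_conf, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project a registered depth image into a semantic point cloud.

    depth     : (H, W) uint16 depth in millimetres (hence depth_scale to metres)
    rgb       : (H, W, 3) color image registered to the depth frame
    sem_color : (H, W, 3) per-pixel semantic color from the segmentation network
    sem_conf  : (H, W) per-pixel semantic confidence
    Returns an (N, 10) array: [X, Y, Z, R, G, B, Rs, Gs, Bs, conf] per valid point.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64) * depth_scale
    valid = z > 0                                  # drop pixels with no depth reading
    x = (u - cx) * z / fx                          # X = (x - cx) d / fx
    y = (v - cy) * z / fy                          # Y = (y - cy) d / fy
    pts = np.stack([x[valid], y[valid], z[valid]], axis=1)
    return np.hstack([pts,
                      rgb[valid].astype(np.float64),
                      sem_color[valid].astype(np.float64),
                      sem_conf[valid][:, None]])
```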
Referring to fig. 5, in the preferred embodiment of this section, in step three, specifically, the pyramid scene parsing network PSPNet is used as the model implementing the semantic segmentation network.
Specifically, the main purpose of semantic segmentation is to distinguish the semantic content of an image. Compared with target recognition and localization, semantic segmentation is closer to real applications: target recognition indicates whether the object to be recognized exists in the image, localization gives the relative spatial relationship of the recognized objects, and semantic segmentation distinguishes the environment semantically, giving an understanding of every frame of the image. Semantic-level environment perception is what practical applications need most, because semantic cognition combined with prior knowledge allows the attributes of the environment to be judged better, planning constraints to be considered from more aspects, and a safer, more optimal trajectory to be obtained.
In recent years, with the rise of artificial intelligence technology, semantic segmentation has received more and more attention. Combined with neural networks, it has taken effect in many fields, such as intelligent robots, autonomous driving and medical imaging, providing support for the high-level understanding of different task scenarios and for the conversion from measured information to abstract semantic understanding. An extraterrestrial celestial body rover also needs this capability in order to carry out inspection tasks autonomously: while detecting the obstacles ahead, it should know what the obstacles are, know the current terrain, and know whether it is unsuitable to proceed.
At present, mature deep networks such as AlexNet, VGG-16, GoogLeNet and ResNet achieve good results in image semantic segmentation. The invention adopts the pyramid scene parsing network (PSPNet) as the model implementing the CNN semantic segmentation network. FIG. 5 shows the structure of the network model: the input is the collected color image of the scene and the output is a score map containing the category information. To implement this process, the input image is first processed with ResNet to generate a feature map; second, pyramid pooling is performed on the generated feature map to obtain feature maps at different resolutions; then each pooled feature map is convolved and the results are stacked together with the upsampled feature map to form the final feature map; finally, the category score map is obtained through a convolution.
When the method is implemented on the unmanned vehicle platform, the image acquired by the Kinect V1.0 is first resized to the input size of the CNN semantic segmentation network. A Softmax activation function is applied when mapping to the class output, so that a probability distribution is generated (the scores sum to 1). Then, following the semantic fusion method, the semantic label with the highest probability is selected for each pixel, and this probability is called the semantic confidence of the associated semantic category label. Finally, the semantic labels are decoded into RGB colors according to the color map, giving the semantic information and its representation.
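The post-processing just described (softmax, per-pixel argmax, confidence, color decoding) can be sketched as follows. The score-map shape and the color palette are assumptions for illustration, and the network itself (PSPNet inference) is treated as a black box here.

```python
import numpy as np

def decode_semantics(score_map, palette):
    """Turn a raw CNN score map into semantic colors and confidences.

    score_map : (C, H, W) raw class scores from the segmentation network
    palette   : (C, 3) RGB color assigned to each of the C semantic classes
    Returns (sem_color, sem_conf): (H, W, 3) colors and (H, W) confidences in [0, 1].
    """
    # Softmax over the class axis so every pixel carries a probability distribution.
    scores = score_map - score_map.max(axis=0, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=0, keepdims=True)

    labels = probs.argmax(axis=0)              # most probable class per pixel
    sem_conf = probs.max(axis=0)               # its probability = semantic confidence
    sem_color = palette[labels]                # decode class index to RGB color
    return sem_color, sem_conf
```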
In this preferred embodiment of the present invention, in the fifth step, specifically, the fusion of the camera pose information with the local semantic point cloud result generated in the fourth step adopts the maximum-confidence fusion mode.
Specifically, semantic segmentation of each frame yields the semantic label of every pixel in that frame; in a continuously moving environment, the semantic values at consecutive moments must be fused to achieve global semantic understanding. When point cloud fusion is executed, the invention adopts maximum-confidence fusion: the fusion involves the highest-confidence semantic color produced by the CNN semantic segmentation network together with the confidence stored in the generated point cloud, and the same information is also stored in every voxel of the Octomap. When a point cloud is inserted into the Octomap and a voxel receives a new measurement, the two pieces of semantic information are fused together.
If the two semantic colors are the same, the semantic color is kept and the confidence becomes the average of the two confidences. If the two semantic colors differ, the semantics with the higher confidence are retained, and the confidence is penalized by a factor of 0.9 for the inconsistency. This also ensures that the semantic information can still be updated even when its confidence is already very high. The advantage of the method is that only one piece of semantic information is stored, which improves memory efficiency. The pseudo code is shown in Table 1:
Table 1: Semantic fusion, maximum-confidence fusion (the pseudo code is given as a figure in the original document and is not reproduced here).
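In place of the pseudo code, a short sketch of the fusion rule described above is given, assuming the 0.9 penalty is a multiplicative factor applied to the winning confidence; this interpretation and the function signature are assumptions, not the patent's exact pseudo code.

```python
def fuse_semantics(color_a, conf_a, color_b, conf_b, penalty=0.9):
    """Maximum-confidence fusion of two semantic measurements for one voxel.

    Each measurement is (RGB semantic color, confidence). Returns the fused pair.
    """
    if color_a == color_b:
        # Same class: keep the color, average the confidences.
        return color_a, 0.5 * (conf_a + conf_b)
    # Different classes: keep the more confident one, penalize for the disagreement
    # so that even a high-confidence label can still be revised later.
    if conf_a >= conf_b:
        return color_a, conf_a * penalty
    return color_b, conf_b * penalty

# Example: fuse_semantics((0, 255, 0), 0.8, (255, 0, 0), 0.6) -> ((0, 255, 0), 0.72)
```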
Referring to fig. 4, in the present preferred embodiment, in step six, specifically, when a point cloud is inserted into the three-dimensional map, the points are first filtered through a voxel filter to down-sample them; then the points are inserted into the Octomap, and free space within a certain range is cleared by ray casting so as to update the internal nodes of the Octomap, i.e., the voxels at lower resolution; finally, the updated Octomap is organized for visualization.
Specifically, regarding the octree map:
the three-dimensional reconstruction terrain can be represented in various forms, the three-dimensional reconstruction terrain can be divided into a measurement map and a topological map, and in order to effectively improve the map representation in a large-scale environment, the Octomap is used as the three-dimensional map representation. Octmap represents a large bounded space as an octree occupying a grid (voxel). Each node in the octree represents a voxel of a particular size, depending on its level in the tree. Each parent node of the octree is subdivided into 8 children nodes until the best resolution is reached. An illustration of an octree is shown in FIG. 4. Thus, a three-dimensional map of a large scale can be efficiently stored in the memory.
Octomap models the sensor with hit and miss probabilities and updates the occupancy of voxels probabilistically from successive measurements. Testing showed that a resolution of 2 cm is suitable for the invention: it provides good detail for characterizing the environment while maintaining the real-time efficiency of map insertion. In addition, Octomap can distinguish free space from unknown space.
Regarding inserting the point cloud into the map:
when inserting a point cloud in a three-dimensional map, the points are first filtered by a voxel filter to down-sample the points. These points are then inserted into Octomap. And the free space within a certain range is eliminated by utilizing ray projection. And then updating the internal nodes of the Octomap, namely the voxels with lower resolution. And finally, sorting the updated Octomap to realize visualization.
The voxel filter is used to down-sample the point cloud. Its principle is to retain only one point per voxel of a given size (the resolution). Since only one point is needed to update an octree node, the resolution of the voxel filter is set to the same value as the octree resolution. Such a filter greatly improves performance because it reduces the number of points, especially points far from the sensor, for which ray casting is time-consuming. For a Kinect V1.0 with an image size of 640 × 480, 307200 points would need to be inserted; after voxel filtering, between 15000 and 60000 points remain depending on their distances, which greatly reduces storage and improves the utilization of effective points.
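A simple NumPy voxel-grid filter that keeps one point per occupied voxel, as described above, might look like the following; production systems typically use PCL's VoxelGrid or Octomap's own tools, so this is only an illustrative sketch.

```python
import numpy as np

def voxel_filter(points, resolution=0.02):
    """Keep one point per occupied voxel of side `resolution` (metres).

    points : (N, D) array whose first three columns are X, Y, Z; extra columns
             (color, semantics, confidence) are carried along with the kept point.
    """
    voxel_idx = np.floor(points[:, :3] / resolution).astype(np.int64)
    # np.unique over rows gives the index of the first point seen in each voxel.
    _, keep = np.unique(voxel_idx, axis=0, return_index=True)
    return points[np.sort(keep)]

# A 640x480 Kinect frame (~307200 points) typically shrinks to a few tens of
# thousands of points at a 2 cm resolution, in line with the text above.
```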
When the point cloud is inserted into the Octomap, only the voxels at the highest resolution (the leaf nodes) are updated: their occupancy probabilities, RGB colors, semantic colors and confidences. The semantic color and confidence are updated according to the maximum-confidence semantic fusion method. Considering the limited measurement range and efficiency of the depth camera, only points within a certain distance of the origin (the optical center of the camera) are inserted; this maximum range is set to 5 meters in the invention. For the occupancy probability, following the derivation in the octree mapping literature, given the observations $z_1, \dots, z_T$ up to time $T$, the occupancy recorded by the $n$-th leaf node is:

$P(n \mid z_{1:T}) = \left[ 1 + \dfrac{1 - P(n \mid z_T)}{P(n \mid z_T)} \cdot \dfrac{1 - P(n \mid z_{1:T-1})}{P(n \mid z_{1:T-1})} \cdot \dfrac{P(n)}{1 - P(n)} \right]^{-1}$
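This recursive update is usually carried out in log-odds form, which turns the product into a sum. A small sketch of that equivalent update is shown below; the clamping bounds are a common practical choice, not values specified by the patent.

```python
import math

def log_odds(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def update_occupancy(p_prior_1_to_t1, p_meas_t, p_init=0.5, l_min=-2.0, l_max=3.5):
    """One recursive occupancy update P(n | z_{1:T}) in log-odds form.

    p_prior_1_to_t1 : P(n | z_{1:T-1}), the voxel's occupancy before this measurement
    p_meas_t        : P(n | z_T), the inverse sensor model for the new measurement
    p_init          : prior P(n), usually 0.5 so its log-odds term vanishes
    """
    l = log_odds(p_prior_1_to_t1) + log_odds(p_meas_t) - log_odds(p_init)
    l = max(l_min, min(l_max, l))            # clamping keeps the map updatable
    return 1.0 - 1.0 / (1.0 + math.exp(l))   # back to a probability

# Example: a voxel at 0.6 hit twice with a 0.7 "hit" model climbs to roughly 0.89.
```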
to clear free space, when a point is inserted in Octomap, ray casting may be performed to clear all voxels on the straight line between the origin and the end point. When the endpoint is far from the origin, this can be a very expensive operation because many octree searches are performed. In order to eliminate the necessary free space while maintaining reasonable operating efficiency, the present invention only projects light to a limited extent.
Color and semantic information at lower resolutions are then obtained by updating the internal nodes of the octree: the occupancy probability of a parent node is set to the maximum over its eight child nodes, its color is set to the average of its children, and its semantic information is the fusion of the children's semantics.
Finally, identical child nodes can be pruned in the Octomap to reduce the size of the map data. In the Octomap source implementation, children are pruned if they all have the same occupancy. Since semantic information must be stored on the leaf nodes, a node's children are pruned here only if they all have the same occupancy probability, the same semantic color and the same semantic confidence; in practice the probability of children being pruned is therefore low.
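A compact sketch of these internal-node rules (maximum occupancy, mean color, fused semantics, and the stricter pruning condition) is given below; the node representation is an assumption made for illustration, not Octomap's actual data structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    occupancy: float                    # occupancy probability
    color: Tuple[float, float, float]   # RGB color
    sem_color: Tuple[int, int, int]     # semantic color (class)
    sem_conf: float                     # semantic confidence
    children: List["Node"] = field(default_factory=list)

def update_parent(parent: Node) -> None:
    """Propagate child information to an internal node as described in the text."""
    kids = parent.children
    parent.occupancy = max(k.occupancy for k in kids)                 # max of children
    parent.color = tuple(sum(k.color[i] for k in kids) / len(kids)    # mean color
                         for i in range(3))
    # Semantics of the parent = maximum-confidence fusion of the children's semantics.
    sem_color, sem_conf = kids[0].sem_color, kids[0].sem_conf
    for k in kids[1:]:
        if k.sem_color == sem_color:
            sem_conf = 0.5 * (sem_conf + k.sem_conf)             # agree: average
        elif k.sem_conf > sem_conf:
            sem_color, sem_conf = k.sem_color, 0.9 * k.sem_conf  # disagree: keep stronger, penalize
        else:
            sem_conf *= 0.9
    parent.sem_color, parent.sem_conf = sem_color, sem_conf

def can_prune(parent: Node) -> bool:
    """Children may be collapsed only if occupancy, semantic color and confidence all match."""
    kids = parent.children
    return len(kids) == 8 and all(
        k.occupancy == kids[0].occupancy
        and k.sem_color == kids[0].sem_color
        and k.sem_conf == kids[0].sem_conf
        for k in kids)
```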
The specific embodiment of the invention:
(1) validating parameter settings
Based on the above method, algorithm verification was completed in two environments. The simulation verification is performed on the ADE20K dataset released by MIT, which provides a good test reference for scene perception and semantic understanding; the complex-environment test is carried out in a laboratory environment containing people, tables, chairs, cabinets, books and so on, for which the ISAP laboratory of Harbin Institute of Technology was selected.
Meanwhile, a whale XQ unmanned vehicle platform is selected as the experimental test platform, carrying a Kinect V1.0 depth vision camera with intrinsics $f_x = 517.306408$, $f_y = 516.469215$, $c_x = 318.643040$, $c_y = 255.313989$, tangential distortion coefficients $k_1 = 0.262383$, $k_2 = -0.953104$, and radial distortion coefficients $p_1 = -0.005358$, $p_2 = 0.002628$, $p_3 = 1.163314$, from which the effective depth range of the camera can be calculated:

[Formula for the effective depth range given as a figure in the original document; not reproduced here.]
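For reference, these intrinsics can be packed into an OpenCV camera matrix and distortion vector as sketched below. The mapping of the listed coefficients onto OpenCV's (k1, k2, p1, p2, k3) ordering is an assumption made for illustration, since naming conventions differ from the text above.

```python
import numpy as np
import cv2

# Intrinsic matrix of the Kinect V1.0 used in the tests (values from the text above).
K = np.array([[517.306408, 0.0,        318.643040],
              [0.0,        516.469215, 255.313989],
              [0.0,        0.0,        1.0]])

# Distortion vector; the assignment to OpenCV's (k1, k2, p1, p2, k3) slots is an
# assumption for this sketch, since the text labels the coefficients differently.
dist = np.array([0.262383, -0.953104, -0.005358, 0.002628, 1.163314])

def undistort(image):
    """Remove lens distortion from a raw 640x480 Kinect color frame."""
    return cv2.undistort(image, K, dist)
```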
in the process of physical testing, the acquisition frequency of a color image and a depth image of the Kinect V1.0 camera is 30Hz, the acquisition frequency of the vibration sensor is 100Hz, the frequency of a feature vector is 1.6Hz, and the running frequency of ORB _ SLAM2 is 15 Hz.
(2) Test results
Open data set testing:
semantic reconstruction test analysis under the environment of a public data set is given, and ORB _ SLAM2 sparse environment reconstruction and dense three-dimensional semantic reconstruction based on the method provided by the invention are respectively completed based on an ADE20K data set, wherein a test result based on ORB _ SLAM2 is given in FIG. 6, sparse point cloud reconstruction is given on the left side in the graph, only the change trend of the environment can be seen approximately, the graph is usually used for auxiliary navigation and provides detection feedback of front obstacles for a platform, and an image schematic and feature point detection result of a key frame is given on the right side.
FIG. 7 then shows the RGBD-based three-dimensional semantic mapping result: 52 s of video data from the dataset were collected, and the mapping frequency is 0.9 Hz. The left side shows the semantic map built during the traverse and the right side shows images from the corresponding data; with the parameters set in the experiment, the test environment is reconstructed reasonably. Combined with FIG. 8, the semantic judgment of the environment is basically consistent with reality: for example, the colors of typical scene elements such as the ground, tables, chairs and walls in the reconstructed map agree with the semantic labels, and the green track in the figure is the trajectory of the camera motion. It should be noted that judging the semantic labeling accuracy globally would require both the true semantic information of the reconstructed point cloud and the estimated point cloud semantics during testing; however, measuring such ground truth is very difficult, and the point cloud selection cannot be guaranteed to be identical between experiments. The method therefore judges correctness from the semantic color information, and in the subsequent physical tests the reliability of the labels is fed back indirectly through the planning success rate of the experimental platform based on the semantic information. Some datasets do provide ground-truth values for comparison, but such data are very limited, and the ADE20K data used for verification here do not provide them, so the judgment is still made by comparing the semantic color of the point cloud with the actual values; since the cognitive capability is ultimately intended for physical application, this judgment approach is workable.
Physical environment testing:
the invention carries out physical test in a complex laboratory environment, walks for a circle around a laboratory walkway, acquires 84s of video data, and places a lawn in a scene for distinguishing terrains made of different materials, wherein the scene comprises common articles such as floors, walls, people, tables, chairs, curtains, glass, lockers, bags, garbage cans and the like. Consistent with the data set testing thought, firstly, a sparse reconstruction test based on the ORB _ SLAM2 is given, as shown in fig. 9, the left side shows an environment three-dimensional point cloud reconstruction, the right side shows a part of key frame and feature point detection results in the process, and the method can also show only the rough shape of the environment, and is difficult to represent information such as the specific shape of an object in the real environment, but for a patrol task, the environment is unknown, the information of each frame is important for both scientific detection and navigation, and the environment which runs outside the ground can be expressed abundantly in practice, so that the method based on the invention is used for carrying out an experiment on the dense three-dimensional semantic reconstruction in the environment.
FIG. 10 shows the RGBD-based three-dimensional semantic mapping result; the mapping frequency of this process is 0.4 Hz. The left side of the figure is the semantic mapping result of the test, with different colors representing different semantic information, and the right side shows images acquired around the ISAP laboratory during the test. The traversed environment contains many object types and can be regarded as complex scene processing.
The final semantic mapping result in the indoor complex environment is shown in FIG. 11, where the green track is the actual traveled path. Referring to the semantic labels shown on the right, the experiment demonstrates semantic recognition and reconstruction of 16 object classes such as walls, floors, people, doors, glass windows, storage cabinets, boxes, chairs, curtains and grass. Comparison with the actually measured environment shows a good semantic mapping effect for the traversed environment. Compared with the reconstruction based on ORB_SLAM2, although the rough environment shapes are similar and both indicate whether obstacles exist, the semantic map carries more meaning: for example, a laboratory student sitting close to the wall on the left side of the figure is effectively segmented from the environment and represented with a distinct color value in the semantic map, whereas the former can only report that some object exists. Meanwhile, since the main purpose of the invention is to improve the terrain perception capability of the rover, five classes of objects in the test environment (walls, floors, grass, storage cabinets and doors) were selected and 1000 point cloud points were randomly sampled; by comparing the predicted semantic value and the true semantic value of each point, a statistical result of the labeling accuracy is given, with the statistic defined by the following formula:
$\text{accuracy} = \dfrac{N_{\text{correct}}}{N_{\text{total}}} \times 100\%$

where $N_{\text{correct}}$ is the number of sampled points whose predicted semantic value equals the true semantic value and $N_{\text{total}}$ is the total number of sampled points of that class.
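The statistic can be computed directly from the sampled points, as in the brief sketch below (the array names are illustrative).

```python
import numpy as np

def labeling_accuracy(pred_labels, true_labels):
    """Percentage of sampled points whose predicted semantic label equals the ground-truth label."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return 100.0 * float(np.mean(pred == true))

# Usage: for each of the five selected classes, pass the 1000 sampled points'
# predicted and true semantic values to obtain the per-class accuracy in percent.
```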
the results are shown in table 2:
Table 2: Semantic reconstruction label accuracy (the per-class values are given as a figure in the original document and are not reproduced here).
The labeling accuracy of all five object classes is above 90%, and the selected classes are ones that are usually easy to confuse. Compared with the conditions of extraterrestrial celestial body inspection, where the objects and environment are relatively simple and the driving speed is low, a cognitive result of even higher accuracy can be expected. Semantic mapping can therefore provide rich semantic information along the inspection path and give the rover the ability to understand its environment, so that it can plan an optimal inspection path and execute autonomously assigned tasks purposefully. This is certainly the development trend of future rovers, and the research of the invention can provide a useful reference for it.

Claims (8)

1. An environmental semantic perception method based on visual information is characterized by comprising the following steps:
step one: acquiring environmental image information with a Kinect V1.0 camera to obtain a registered color image and depth image, and simultaneously executing step two and step three;
step two: on the basis of the registered color and depth images, solving the three-dimensional pose of the camera from the ORB feature points extracted in each frame through the ORB_SLAM2 process to obtain camera pose information, and then executing step five;
step three: based on the published color image, performing semantic segmentation on each frame of image to generate semantic color information, and synchronously generating a point cloud from the input depth map and the camera intrinsic matrix;
step four: registering semantic color information generated in the third step into the point cloud generated in the third step to obtain a local semantic point cloud result;
step five: fusing the camera pose information obtained in the step two with the local semantic point cloud result generated in the step four to obtain new global semantic point cloud information;
step six: and expressing the fused global semantic point cloud information obtained in the fifth step by using an octree map to obtain a final three-dimensional octree semantic map.
2. The method as claimed in claim 1, wherein in step two, the color map and the depth map after registration are obtained by combining OpenNI and OpenCV.
3. The method for semantic perception of an environment based on visual information as claimed in claim 1, wherein in step two, specifically, the three main parallel threads of ORB_SLAM2 are as follows:
the camera pose of each frame is localized through features matched in the local map, using motion-only BA to minimize the reprojection error;
the management and optimization of the local map are realized based on the local BA;
and performing loop detection, and correcting the accumulated drift based on pose graph optimization.
4. The method as claimed in claim 1, wherein in step two, specifically, the ORB_SLAM2 process is optimized by bundle adjustment.
5. The method for sensing environmental semantics based on visual information according to claim 1, wherein in step three, specifically, the point cloud is a three-dimensional point cloud.
6. The method for sensing environmental semantics based on visual information according to claim 1, wherein in step three, specifically, the pyramid scene parsing network PSPNet is adopted as the model implementing the semantic segmentation network.
7. The visual information-based environment semantic perception method according to claim 1, wherein in the fifth step, a maximum confidence fusion mode is adopted as a fusion mode in which the camera pose information is fused with the local semantic point cloud result generated in the fourth step.
8. The method for sensing environmental semantics based on visual information according to claim 1, wherein in step six, specifically, when the point cloud is inserted into the three-dimensional map, the points are first filtered through a voxel filter to down-sample them; then the points are inserted into an Octomap, and free space is cleared by ray casting so as to update the internal nodes of the Octomap, i.e., the voxels at lower resolution; finally, the updated Octomap is organized for visualization.
CN201911317441.3A 2019-12-19 2019-12-19 Environmental semantic perception method based on visual information Pending CN111080659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911317441.3A CN111080659A (en) 2019-12-19 2019-12-19 Environmental semantic perception method based on visual information


Publications (1)

Publication Number Publication Date
CN111080659A (en) 2020-04-28

Family

ID=70315751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911317441.3A Pending CN111080659A (en) 2019-12-19 2019-12-19 Environmental semantic perception method based on visual information

Country Status (1)

Country Link
CN (1) CN111080659A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160071278A1 (en) * 2013-06-21 2016-03-10 National University Of Ireland, Maynooth Method for Mapping an Environment
CN106204705A (en) * 2016-07-05 2016-12-07 长安大学 A kind of 3D point cloud segmentation method based on multi-line laser radar
CN108052672A (en) * 2017-12-29 2018-05-18 北京师范大学 Promote structural knowledge map construction system and method using group study behavior
CN109117718A (en) * 2018-07-02 2019-01-01 东南大学 A kind of semantic map structuring of three-dimensional towards road scene and storage method
CN109615698A (en) * 2018-12-03 2019-04-12 哈尔滨工业大学(深圳) Multiple no-manned plane SLAM map blending algorithm based on the detection of mutual winding
CN109697753A (en) * 2018-12-10 2019-04-30 智灵飞(北京)科技有限公司 A kind of no-manned plane three-dimensional method for reconstructing, unmanned plane based on RGB-D SLAM
CN110243370A (en) * 2019-05-16 2019-09-17 西安理工大学 A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning
CN110956651A (en) * 2019-12-16 2020-04-03 哈尔滨工业大学 Terrain semantic perception method based on fusion of vision and vibrotactile sense

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
TONG LIU et al.: "A robust fusion method for RGB-D SLAM", 2013 Chinese Automation Congress *
常思雨: "Research on Semantic Maps Based on Visual SLAM" (基于视觉SLAM的语义地图研究), China Master's Theses Full-text Database, Information Science and Technology *
张震 et al.: "An RGB-D SLAM Algorithm Combining ORB Features and a Visual Dictionary" (一种结合ORB特征和视觉词典的RGB-D SLAM算法), Computer Engineering and Applications *
胡正乙 et al.: "A Real-time 3D Reconstruction Algorithm for Indoor Scenes Based on RGB-D" (基于RGB-D的室内场景实时三维重建算法), Journal of Northeastern University (Natural Science) *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111563442B (en) * 2020-04-29 2023-05-02 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111583322A (en) * 2020-05-09 2020-08-25 北京华严互娱科技有限公司 Depth learning-based 2D image scene depth prediction and semantic segmentation method and system
CN111583331A (en) * 2020-05-12 2020-08-25 北京轩宇空间科技有限公司 Method and apparatus for simultaneous localization and mapping
CN111583331B (en) * 2020-05-12 2023-09-01 北京轩宇空间科技有限公司 Method and device for simultaneous localization and mapping
CN111667523B (en) * 2020-06-08 2023-10-31 深圳阿米嘎嘎科技有限公司 Multi-mode multi-source-based deep data refining method and system
CN111667523A (en) * 2020-06-08 2020-09-15 深圳阿米嘎嘎科技有限公司 Multi-mode multi-source based deep data refining method and system
CN111784709A (en) * 2020-07-07 2020-10-16 北京字节跳动网络技术有限公司 Image processing method, image processing device, electronic equipment and computer readable medium
CN111784709B (en) * 2020-07-07 2023-02-17 北京字节跳动网络技术有限公司 Image processing method, image processing device, electronic equipment and computer readable medium
CN111968129B (en) * 2020-07-15 2023-11-07 上海交通大学 Instant positioning and map construction system and method with semantic perception
CN111968129A (en) * 2020-07-15 2020-11-20 上海交通大学 Instant positioning and map construction system and method with semantic perception
CN111950426A (en) * 2020-08-06 2020-11-17 东软睿驰汽车技术(沈阳)有限公司 Target detection method and device and delivery vehicle
CN112198859A (en) * 2020-09-07 2021-01-08 西安交通大学 Method, system and device for testing automatic driving vehicle in vehicle ring under mixed scene
CN112000130A (en) * 2020-09-07 2020-11-27 哈尔滨工业大学 Unmanned aerial vehicle's multimachine cooperation high accuracy is built and is drawn positioning system
CN112198859B (en) * 2020-09-07 2022-02-11 西安交通大学 Method, system and device for testing automatic driving vehicle in vehicle ring under mixed scene
US11315271B2 (en) * 2020-09-30 2022-04-26 Tsinghua University Point cloud intensity completion method and system based on semantic segmentation
CN112233124A (en) * 2020-10-14 2021-01-15 华东交通大学 Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning
CN112233124B (en) * 2020-10-14 2022-05-17 华东交通大学 Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning
CN112258633B (en) * 2020-10-23 2023-02-28 华中科技大学鄂州工业技术研究院 SLAM technology-based scene high-precision reconstruction method and device
CN112258633A (en) * 2020-10-23 2021-01-22 华中科技大学鄂州工业技术研究院 High-precision scene reconstruction method and device based on SLAM technology
CN112258618B (en) * 2020-11-04 2021-05-14 中国科学院空天信息创新研究院 Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
CN112258618A (en) * 2020-11-04 2021-01-22 中国科学院空天信息创新研究院 Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
CN112348921B (en) * 2020-11-05 2024-03-29 上海汽车集团股份有限公司 Drawing construction method and system based on visual semantic point cloud
CN112348921A (en) * 2020-11-05 2021-02-09 上海汽车集团股份有限公司 Mapping method and system based on visual semantic point cloud
CN112419461A (en) * 2020-11-16 2021-02-26 北京理工大学 Collaborative unmanned system joint semantic mapping method
CN112509061A (en) * 2020-12-14 2021-03-16 济南浪潮高新科技投资发展有限公司 Multi-camera visual positioning method, system, electronic device and medium
CN112509061B (en) * 2020-12-14 2024-03-22 山东浪潮科学研究院有限公司 Multi-camera visual positioning method, system, electronic device and medium
CN112857314A (en) * 2020-12-30 2021-05-28 惠州学院 Bimodal terrain identification method, hardware system and sensor installation method thereof
CN112884802A (en) * 2021-02-24 2021-06-01 电子科技大学 Anti-attack method based on generation
CN112927211B (en) * 2021-03-09 2023-08-25 电子科技大学 Universal attack countermeasure method based on depth three-dimensional detector, storage medium and terminal
CN112927211A (en) * 2021-03-09 2021-06-08 电子科技大学 Universal anti-attack method based on depth three-dimensional detector, storage medium and terminal
CN113313824A (en) * 2021-04-13 2021-08-27 中山大学 Three-dimensional semantic map construction method
CN113313824B (en) * 2021-04-13 2024-03-15 中山大学 Three-dimensional semantic map construction method
CN113393522A (en) * 2021-05-27 2021-09-14 湖南大学 6D pose estimation method based on monocular RGB camera regression depth information
CN113743413A (en) * 2021-07-30 2021-12-03 的卢技术有限公司 Visual SLAM method and system combining image semantic information
CN113743413B (en) * 2021-07-30 2023-12-01 的卢技术有限公司 Visual SLAM method and system combining image semantic information
CN114356078A (en) * 2021-12-15 2022-04-15 之江实验室 Method and device for detecting human intention based on gazing target and electronic equipment
CN114356078B (en) * 2021-12-15 2024-03-19 之江实验室 Person intention detection method and device based on fixation target and electronic equipment

Similar Documents

Publication Publication Date Title
CN111080659A (en) Environmental semantic perception method based on visual information
CN110956651B (en) Terrain semantic perception method based on fusion of vision and vibrotactile sense
Chen et al. Suma++: Efficient lidar-based semantic slam
Kim et al. Remove, then revert: Static point cloud map construction using multiresolution range images
US11030525B2 (en) Systems and methods for deep localization and segmentation with a 3D semantic map
Zhang et al. Instance segmentation of lidar point clouds
Wojek et al. Monocular visual scene understanding: Understanding multi-object traffic scenes
Sekkat et al. SynWoodScape: Synthetic surround-view fisheye camera dataset for autonomous driving
CN109509230A (en) A kind of SLAM method applied to more camera lens combined type panorama cameras
Paz et al. Probabilistic semantic mapping for urban autonomous driving applications
CN111210518A (en) Topological map generation method based on visual fusion landmark
Jeong et al. Multimodal sensor-based semantic 3D mapping for a large-scale environment
CN110781262A (en) Semantic map construction method based on visual SLAM
CN110728751A (en) Construction method of indoor 3D point cloud semantic map
CN113516664A (en) Visual SLAM method based on semantic segmentation dynamic points
Budvytis et al. Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression
CN115388902A (en) Indoor positioning method and system, AR indoor positioning navigation method and system
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
Bieder et al. Exploiting multi-layer grid maps for surround-view semantic segmentation of sparse lidar data
Nagy et al. 3D CNN based phantom object removing from mobile laser scanning data
Khoche et al. Semantic 3d grid maps for autonomous driving
Li et al. Multi-modal neural feature fusion for automatic driving through perception-aware path planning
Zhang et al. Front vehicle detection based on multi-sensor fusion for autonomous vehicle
CN116664851A (en) Automatic driving data extraction method based on artificial intelligence
CN116597122A (en) Data labeling method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200428