CN110930519A - Semantic ORB-SLAM sensing method and device based on environment understanding - Google Patents

Semantic ORB-SLAM sensing method and device based on environment understanding

Info

Publication number
CN110930519A
Authority
CN
China
Prior art keywords
frame
sequence
orb
key frame
slam
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911113708.7A
Other languages
Chinese (zh)
Other versions
CN110930519B (en)
Inventor
柯晶晶
周广兵
蒙仕格
郑辉
林飞堞
陈惠纲
王珏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Robotics Innovation Research Institute
Original Assignee
South China Robotics Innovation Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Robotics Innovation Research Institute filed Critical South China Robotics Innovation Research Institute
Priority to CN201911113708.7A
Publication of CN110930519A
Application granted
Publication of CN110930519B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/003 Navigation within 3D models or images
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic ORB-SLAM perception method and device based on environment understanding, wherein the method comprises the following steps: inputting sequence frames into an ORB-SLAM front-end Tracking thread for key frame extraction processing to acquire key frame data; inputting the key frame data into an adjacent key frame graph optimization thread for key frame data optimization processing to acquire graph-optimized key frame data; calculating an error value between the graph-optimized key frame data and generating a candidate set based on the error value; and performing closed-loop correction processing on the candidate set based on global map optimization and loop fusion, and performing synchronous positioning and map construction based on the correction result. In the embodiment of the invention, the robot's environment perception is markedly improved, and the robot can obtain higher-level cognitive information about a scene, which provides a more natural application mode for application fields including robot navigation, augmented reality and autonomous driving.

Description

Semantic ORB-SLAM sensing method and device based on environment understanding
Technical Field
The invention relates to the technical field of intelligent robot perception, in particular to a semantic ORB-SLAM perception method and device based on environment understanding.
Background
Simultaneous Localization and Mapping (SLAM) is the basis on which a mobile robot achieves autonomous navigation in an unknown environment, and is one of the preconditions for autonomy and intelligence. At present, visual SLAM can perform real-time positioning and three-dimensional map construction in a static environment within a certain range; however, the map generated by traditional visual SLAM contains only simple geometric information (points, lines and the like) or low-level pixel information (color, brightness and the like), and no semantic information. Although such simple geometric and pixel-level information can support autonomous navigation of a robot in a single environment, it cannot satisfy the requirements of a mobile robot performing higher-level tasks.
Patent CN201811514700 discloses a visual SLAM method based on ORB features, which merely replaces traditional SIFT feature extraction with ORB features in the front-end stage and judges feature matches by Hamming distance; this reduces the amount of computation to a certain extent and improves the real-time performance of visual SLAM. In the back-end module, a graph optimization idea is adopted, and the accuracy of loop detection is improved by a point cloud fusion optimization scheme that combines local and global loops.
However, although replacing traditional SIFT feature extraction with ORB features effectively improves computation speed, this visual SLAM method works only in static scenes or scenes with a small number of dynamic objects: if a large number of feature points fall on dynamic objects, the SLAM tracking and positioning result drifts with the motion of those objects, which greatly degrades the robot's mapping and positioning accuracy and may even cause pose calculation to fail. Moreover, most of the pixel information in the original picture is discarded while feature points are generated, so effective semantic information is lacking, which seriously limits the robot's understanding of its perceived environment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a semantic ORB-SLAM perception method and device based on environment understanding, which markedly improve the robot's environment perception, enable the robot to obtain higher-level cognitive information about a scene, and provide a more natural application mode for application fields including robot navigation, augmented reality and autonomous driving.
In order to solve the above technical problem, an embodiment of the present invention provides a semantic ORB-SLAM sensing method based on environment understanding, where the method includes:
inputting the sequence frame into an ORB-SLAM front end Tracking thread to carry out key frame extraction processing, and acquiring key frame data;
inputting the key frame data into an adjacent key frame graph optimization thread to perform key frame data optimization processing, and acquiring graph-optimized key frame data;
calculating an error value between the graph optimized key frame data, and generating a candidate set based on the error value;
and performing closed-loop correction processing on the candidate set based on global map optimization and loop fusion, and performing synchronous positioning and map construction based on a correction result.
Optionally, the inputting the sequence frame into an ORB-SLAM front end Tracking thread to perform key frame extraction processing, and acquiring key frame data includes:
an ORB-SLAM front end Tracking thread adopts an interframe difference method to carry out dynamic background removal processing on an input sequence frame, and a sequence frame with a dynamic background removed is obtained;
establishing a mapping relation between the sequence frame with the dynamic background removed and the object characteristic points, and acquiring a sequence frame in the mapping relation with the object characteristic points;
carrying out ORB feature extraction processing on the sequence frame in mapping relation with the object feature points to obtain ORB features of the sequence frame;
matching ORB characteristics of the current frame with ORB characteristics of the previous frame to obtain matched characteristic point pairs;
performing pose estimation and repositioning processing based on the matching feature point pairs to obtain pose estimation and repositioning results;
and performing pose estimation and repositioning processing according to the matched adjacent sequence frames to obtain pose optimization of adjacent frames, and acquiring a key frame sequence based on the pose optimization of the adjacent frames.
Optionally, the performing, by the ORB-SLAM front end Tracking thread, dynamic background removal processing on the input sequence frame by using an inter-frame difference method to obtain a sequence frame with a dynamic background removed includes:
performing difference operation on adjacent frames in the continuous time interval in the sequence frames, and performing change detection by using strong correlation of the adjacent frames in the sequence frames to obtain a moving target;
and based on the selected threshold value, eliminating the dynamic background of the moving target in the sequence frame, and acquiring the sequence frame with the dynamic background removed.
Optionally, the establishing a mapping relationship between the sequence frame without the dynamic background and the object feature point to obtain a sequence frame in a mapping relationship with the object feature point includes:
according to the image points observed by the current dynamic-background-removed sequence frame, finding the next dynamic-background-removed sequence frame that observes those image points, and using it as an adjacent sequence frame of the current dynamic-background-removed sequence frame;
taking the sequence frame of the current frame without the dynamic background as a root node, and taking the adjacent sequence frame as a child node to generate a node tree;
and constructing a mapping relation between the sequence frame without the dynamic background and the object characteristic points based on the node tree, and acquiring the sequence frame in the mapping relation with the object characteristic points.
Optionally, the performing pose estimation and repositioning processing based on the matching feature point pairs includes:
and calculating the relative displacement of the sequence frame of the current frame and the sequence frame of the previous frame by utilizing the minimized reprojection error according to the matched feature point pairs.
Optionally, the method further includes:
when pose estimation and repositioning processing fail based on the matched feature point pairs, obtaining the sequence frame most similar to the current frame based on the mapping relation with the object feature points;
obtaining the ORB features of the most similar sequence frame, and matching the ORB features of the current frame's sequence frame with them to obtain first matching feature point pairs;
and performing pose estimation and repositioning calculation again by using the first matching feature point pairs to obtain a pose estimation and repositioning result.
Optionally, the obtaining a sequence of key frames based on pose optimization of the adjacent frames includes:
calculating the minimum re-projection error between the adjacent frames, and establishing a common view based on the minimum re-projection error;
and extracting the sequence frame in the common view as a key sequence frame.
Optionally, the inputting the key frame data into an adjacent key frame graph optimization thread to perform key frame data optimization processing, and acquiring the graph-optimized key frame data, includes:
inputting the key frame data into the adjacent key frame graph optimization thread, and then sequentially performing redundant point elimination processing, semantic extraction processing, new image point creation processing and adjacent frame optimization processing on the key frame data to obtain the graph-optimized key frame data.
Optionally, the semantic extraction processing is performed on the key frame data after the redundant point elimination processing, and includes:
performing object detection on the key frame data subjected to the redundant point elimination processing based on a YOLO-v3 algorithm to obtain an object detection result;
performing semantic association processing on the object detection result by using a conditional random field to obtain combined object class probability and scene context information;
correcting and optimizing the combined object class probability and scene context information to generate a temporary object information candidate set;
judging whether the temporary object information in the temporary object information candidate set is a new object or an existing object, searching each point information of each temporary object information in the temporary object information candidate set in a corresponding neighborhood thereof, and acquiring a three-dimensional point closest to the point;
and calculating the Euclidean distance between the point and the three-dimensional point; if the distance is smaller than a preset threshold value, the point and the three-dimensional point are considered the same point.
In addition, an embodiment of the present invention further provides a semantic ORB-SLAM sensing apparatus based on environment understanding, where the apparatus includes:
the key frame extraction module: used for inputting sequence frames into the ORB-SLAM front-end Tracking thread for key frame extraction processing to obtain key frame data;
the key frame optimization module: used for inputting the key frame data into the adjacent key frame graph optimization thread for key frame data optimization processing to obtain graph-optimized key frame data;
the error calculation module: used for calculating an error value between the graph-optimized key frame data and generating a candidate set based on the error value;
the synchronous positioning and map building module: used for performing closed-loop correction processing on the candidate set based on global map optimization and loop fusion, and performing synchronous positioning and map construction based on the correction result.
In the embodiment of the invention, aiming at the defects that traditional visual ORB-SLAM is easily disturbed by dynamic targets during feature extraction and that the extracted feature points contain only color, brightness and geometric information while lacking object-level environment semantic information, the ORB-SLAM front-end Tracking thread first performs a difference operation on adjacent frames of the sequence using the inter-frame difference method, sets a threshold and eliminates dynamic objects; the mapping relation between sequence frames and object feature points is then constructed and ORB feature extraction performed, and the object environment information extracted by deep-learning semantics is integrated into the ORB-SLAM system, realizing a semantic ORB-SLAM perception method with environment understanding that is stable in performance, resistant to environmental interference, accurate in matching and deeper in environment understanding. The robot's environment perception is markedly improved, higher-level cognitive information about the scene can be obtained, and a more natural application mode is provided for application fields including robot navigation, augmented reality and autonomous driving.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a semantic ORB-SLAM perception method based on environment understanding in an embodiment of the present invention;
fig. 2 is a schematic structural composition diagram of a semantic ORB-SLAM perception device based on environment understanding in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
Referring to fig. 1, fig. 1 is a flowchart illustrating a semantic ORB-SLAM sensing method based on environment understanding according to an embodiment of the present invention.
As shown in fig. 1, a semantic ORB-SLAM perception method based on environment understanding, the method comprising:
s11, inputting the sequence frame into an ORB-SLAM front end Tracking thread to perform key frame extraction processing, and acquiring key frame data;
in the specific implementation process of the present invention, the inputting the sequence frame into the ORB-SLAM front end Tracking thread for key frame extraction processing to obtain key frame data includes: an ORB-SLAM front end Tracking thread adopts an interframe difference method to carry out dynamic background removal processing on an input sequence frame, and a sequence frame with a dynamic background removed is obtained; establishing a mapping relation between the sequence frame with the dynamic background removed and the object characteristic points, and acquiring a sequence frame in the mapping relation with the object characteristic points; carrying out ORB feature extraction processing on the sequence frame in mapping relation with the object feature points to obtain ORB features of the sequence frame; matching ORB characteristics of the current frame with ORB characteristics of the previous frame to obtain matched characteristic point pairs; performing pose estimation and repositioning processing based on the matching feature point pairs to obtain pose estimation and repositioning results; and performing pose estimation and repositioning processing according to the matched adjacent sequence frames to obtain pose optimization of adjacent frames, and acquiring a key frame sequence based on the pose optimization of the adjacent frames.
Further, the ORB-SLAM front end Tracking thread performs dynamic background removal processing on the input sequence frame by using an inter-frame difference method to obtain a sequence frame with a dynamic background removed, including: performing difference operation on adjacent frames in the continuous time interval in the sequence frames, and performing change detection by using strong correlation of the adjacent frames in the sequence frames to obtain a moving target; and based on the selected threshold value, eliminating the dynamic background of the moving target in the sequence frame, and acquiring the sequence frame with the dynamic background removed.
Further, the establishing a mapping relation between the dynamic-background-removed sequence frames and the object feature points to obtain the sequence frames in mapping relation with the object feature points includes: according to the image points observed by the current dynamic-background-removed sequence frame, finding the next dynamic-background-removed sequence frame that observes those image points, and using it as an adjacent sequence frame of the current dynamic-background-removed sequence frame; taking the current dynamic-background-removed sequence frame as a root node and the adjacent sequence frames as child nodes to generate a node tree; and constructing the mapping relation between the dynamic-background-removed sequence frames and the object feature points based on the node tree, and acquiring the sequence frames in mapping relation with the object feature points.
Further, the performing pose estimation and repositioning processing based on the matching feature point pairs includes: and calculating the relative displacement of the sequence frame of the current frame and the sequence frame of the previous frame by utilizing the minimized reprojection error according to the matched feature point pairs.
Further, the method further comprises: when pose estimation and repositioning processing fail based on the matched feature point pairs, obtaining the sequence frame most similar to the current frame based on the mapping relation with the object feature points; obtaining the ORB features of the most similar sequence frame, and matching the ORB features of the current frame's sequence frame with them to obtain first matching feature point pairs; and performing pose estimation and repositioning calculation again by using the first matching feature point pairs to obtain a pose estimation and repositioning result.
Further, the obtaining of the key frame sequence based on the pose optimization of the adjacent frames comprises: calculating the minimum re-projection error between the adjacent frames, and establishing a common view based on the minimum re-projection error; and extracting the sequence frame in the common view as a key sequence frame.
Specifically, in the ORB-SLAM front-end Tracking thread, the sequence frames are input and dynamic background removal is performed first, eliminating noise interference and the influence of dynamic objects on the subsequent feature point extraction and matching process. Using the inter-frame difference method, adjacent frames within a continuous time interval of the sequence are extracted for the difference operation, and change detection is performed using the strong correlation of adjacent frames, so that moving targets are detected; the moving regions in the sequence frames are then removed by selecting a threshold. Among the sequence frames, the change between the k-th frame f_k(x, y) and the (k+1)-th frame f_{k+1}(x, y) can be represented by the binarized difference value D(x, y) as follows:
D(x, y) = 1 if |f_{k+1}(x, y) - f_k(x, y)| > T, and D(x, y) = 0 otherwise;
where T is the selected binarization difference threshold; the '1' regions of the binarized difference consist of pixels whose gray values change between the two frames, usually comprising moving objects and noise, and the '0' regions consist of pixels whose gray values do not change.
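For illustration only, the inter-frame difference step above can be sketched in Python with OpenCV; the threshold value T = 25 and the way the moving regions are masked out are assumptions, since the text only states that a threshold is selected:

```python
import cv2

def remove_dynamic_background(prev_gray, curr_gray, T=25):
    """Binarized inter-frame difference D(x, y) as defined above.

    Pixels whose gray value changes by more than the threshold T between
    frame k and frame k+1 are marked 1 (moving object or noise); pixels
    whose gray value does not change are marked 0.
    """
    diff = cv2.absdiff(curr_gray, prev_gray)             # |f_{k+1} - f_k|
    _, moving = cv2.threshold(diff, T, 1, cv2.THRESH_BINARY)
    static = curr_gray * (1 - moving)                    # keep only the static background
    return static, moving
```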
In the front-end Tracking thread, in order to integrate the extracted semantic information into the ORB-SLAM framework, a mapping relation between the dynamic-background-removed sequence frames and the object feature points needs to be established. In ORB-SLAM, each dynamic-background-removed sequence frame stores the image points it observes, and each image point in turn stores the dynamic-background-removed sequence frames that observe it; a spanning tree of ORB-SLAM is established from this relation between frames and image points. To construct the spanning tree, the image points observed by the current dynamic-background-removed sequence frame are first used to find the other dynamic-background-removed sequence frames observing them; such a frame is an adjacent sequence frame of the current frame and shares a large number of image points with it. At the same time, the current dynamic-background-removed sequence frames hold their map points, and each map point holds its associated dynamic-background-removed sequence frames. A spanning tree can therefore be generated with the current dynamic-background-removed sequence frame as the root node and the adjacent sequence frames as child nodes; within it, the relation between a child node and its parent node is determined by the number of common map points. Through the spanning tree, the current dynamic-background-removed sequence frame can find its adjacent sequence frames and thus more associated image points. The mapping relation between the dynamic-background-removed sequence frames and the objects is established as follows:
each object OiComprises the following steps:
point cloud data which are contained in an object under a world coordinate system and are obtained through calculation according to camera projection; the number of object classes and the probability of the corresponding object classes, wherein the probability is iteratively updated through an iterative Bayesian process; observing a set of keyframes for the object; the class to which the object belongs corresponds to the class of the object with the highest probability; the number of times the object is observed.
The color image corresponding to the dynamic-background-removed sequence frame is used for object detection; the depth image corresponding to the frame is used to generate object point cloud data; and the frame itself is stored as object observation information. Based on the mapping between image points and dynamic-background-removed sequence frames, once the relation between objects and frames has been constructed, each dynamic-background-removed sequence frame can find its associated objects, and each object can find its associated dynamic-background-removed sequence frames.
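A minimal sketch of the per-object record O_i described above, including the iterative Bayesian update of the class probabilities; all field and method names are illustrative, not the patent's actual data structures:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SemanticObject:
    points_world: np.ndarray                         # point cloud in world coordinates
    class_probs: dict = field(default_factory=dict)  # class name -> probability
    keyframes: set = field(default_factory=set)      # key frames observing the object
    observations: int = 0                            # number of times observed

    @property
    def label(self):
        # The class the object belongs to is the one with the highest probability.
        return max(self.class_probs, key=self.class_probs.get)

    def bayes_update(self, detection_probs):
        """Iteratively fuse a new detection's class distribution (Bayes rule)."""
        for c, p in detection_probs.items():
            self.class_probs[c] = self.class_probs.get(c, 1e-3) * p
        z = sum(self.class_probs.values())           # normalize back to a distribution
        for c in self.class_probs:
            self.class_probs[c] /= z
        self.observations += 1
```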
ORB feature extraction is performed on the sequence frames in mapping relation with the object feature points; extracting ORB feature points in place of SIFT feature points effectively reduces the amount of computation and speeds up operation.
In the ORB-SLAM front-end Tracking thread, pose estimation and repositioning are performed with respect to the previous frame's dynamic-background-removed sequence frame: the ORB features of the current frame's sequence frame are matched with those of the previous frame's sequence frame to obtain matched feature point pairs, and the relative displacement between the current frame and the previous frame is then calculated by minimizing the reprojection error over the currently matched feature point pairs. If tracking and positioning fail, the sequence frame most similar to the current frame is found through scene recognition, the current frame is matched against it to obtain matched image points, and the pose of the current frame is recalculated from these matched image points.
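The pose estimation step, which computes the relative displacement by minimizing the reprojection error over the matched feature point pairs, can be approximated with OpenCV's RANSAC PnP solver. This is a sketch under the assumption that the map points seen in the previous frame are available in 3-D; it is not the patent's exact solver:

```python
import cv2
import numpy as np

def estimate_relative_pose(pts3d_prev, pts2d_curr, K):
    """pts3d_prev: Nx3 map points seen in the previous frame;
    pts2d_curr: Nx2 matched ORB keypoints in the current frame;
    K: 3x3 camera intrinsic matrix."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d_prev.astype(np.float64),
        pts2d_curr.astype(np.float64),
        K, None)                      # None: assume undistorted images
    if not ok:
        return None                   # tracking lost -> fall back to relocalization
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> rotation matrix
    return R, tvec, inliers
```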
Generally, two adjacent frames can simultaneously observe a portion of the same image points. The minimum reprojection error between two adjacent frames is calculated; the smaller the reprojection error, the stronger the correlation between the two adjacent sequence frames. A corresponding preset threshold is therefore set and the projection error compared against it: the error must be smaller than or equal to the preset threshold, otherwise the corresponding adjacent sequence frames are removed. On this premise a common view can be established, pose optimization between adjacent frames is formed, and the sequence frames in the common view are obtained as the key sequence frames.
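A sketch of the common-view construction: adjacent frames are linked only when their reprojection error stays within the preset threshold, and the frames retained in the view are taken as key frames; the reprojection_error function is a hypothetical caller-supplied helper:

```python
def build_common_view(frames, reprojection_error, err_threshold):
    """frames: time-ordered sequence frames; reprojection_error(a, b):
    hypothetical helper returning the minimum reprojection error between
    two adjacent frames."""
    edges = []
    for a, b in zip(frames, frames[1:]):
        if reprojection_error(a, b) <= err_threshold:
            edges.append((a, b))      # strongly correlated adjacent frames
        # pairs above the threshold are removed from the common view
    keyframes = {f for edge in edges for f in edge}
    return edges, keyframes
```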
S12: inputting the key frame data into an adjacent key frame graph optimization thread to perform key frame data optimization processing, and acquiring graph-optimized key frame data;
In a specific implementation process of the present invention, the inputting the key frame data into an adjacent key frame graph optimization thread to perform key frame data optimization processing, and acquiring the graph-optimized key frame data, includes: inputting the key frame data into the adjacent key frame graph optimization thread, and then sequentially performing redundant point elimination processing, semantic extraction processing, new image point creation processing and adjacent frame optimization processing on the key frame data to obtain the graph-optimized key frame data.
Further, the semantic extraction processing performed on the key frame data after the redundant point elimination processing includes: performing object detection on the key frame data after redundant point elimination based on the YOLO-v3 algorithm to obtain object detection results; performing semantic association processing on the object detection results using a conditional random field to combine object class probability and scene context information; correcting and optimizing the combined object class probability and scene context information to generate a temporary object information candidate set; judging whether each piece of temporary object information in the candidate set is a new object or an existing object by searching, for each point of the temporary object information, its corresponding neighborhood to acquire the closest three-dimensional point; and calculating the Euclidean distance between the point and that three-dimensional point, the two being considered the same point if the distance is smaller than a preset threshold.
Specifically, after the key frames are obtained, they are input into the adjacent key frame graph optimization thread, redundant points are removed, and a designed semantic extraction algorithm realizes the graph optimization process between adjacent key frames. The designed semantic extraction algorithm comprises object detection, object semantic association, temporary object generation, object association, object model updating and other functions: object detection is responsible for extracting object information from the images using a deep learning network; the extracted object information undergoes semantic association, through which the detected objects are corrected and optimized and stored in a temporary object information set; object association and updating is responsible for associating the temporary object information with existing object information in the object database according to the mapping relation among key frames, object information and map points, and for updating and fusing the temporary object information into the corresponding object information.
Here, the YOLO-v3 algorithm is used for object detection: it divides each picture into N × N grid cells, performs the object detection operation only once per cell, and finally fuses the detection results together.
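For illustration, YOLO-v3 detection of this kind can be run through OpenCV's DNN module; the configuration and weight file paths, the 416 × 416 input size and the confidence threshold below are assumptions:

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # assumed paths

def detect_objects(bgr_image, conf_threshold=0.5):
    h, w = bgr_image.shape[:2]
    blob = cv2.dnn.blobFromImage(bgr_image, 1 / 255.0, (416, 416), swapRB=True)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    detections = []
    for out in outputs:               # one output tensor per YOLO detection scale
        for det in out:               # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            cls = int(np.argmax(scores))
            if float(scores[cls]) > conf_threshold:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                detections.append((cls, float(scores[cls]),
                                   cx - bw / 2, cy - bh / 2, bw, bh))
    return detections                 # fused results from all grid scales
```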
Semantic detection is performed on the key frames with the YOLO-v3 algorithm, the objects extracted by deep learning detection are further semantically associated using a conditional random field, and detection classification accuracy is improved by combining object class probability with scene context information. The energy equation of the designed conditional random field combining object class probability and context information is:
P(x) = (1/Z) exp(-E(x));
E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_P(x_i, x_j);
where x is the random variable of the object class; i and j range from 1 to k, with k the number of objects detected in the image; Z is a normalization factor ensuring that the result is a probability; E(x) is the energy function of the conditional random field; the unary potential function ψ_u describes the probability of the label category of a random field graph node, and the binary potential function ψ_P characterizes the correlation between the nodes of the random field graph.
The unary potential function ψ_u is as follows:
ψ_u(x_i) = -log p(x_i);
The binary potential function ψ_P is as follows:
ψ_P(x_i, x_j) = μ(x_i, x_j) Σ_m ω_m k_m(f_i, f_j);
where p(x_i) is the probability distribution over the class of the i-th object given by the YOLO-v3 model, ω_m are the linear combination weights over the kernel functions k_m of the node features f_i and f_j, and μ is the label compatibility function, representing the likelihood of different classes occurring simultaneously within a neighborhood.
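A numeric sketch of evaluating the energy E(x) above for a candidate labeling, assuming the reconstructed pairwise form ψ_P(x_i, x_j) = μ(x_i, x_j) Σ_m ω_m k_m(f_i, f_j); the argument layouts are illustrative:

```python
import numpy as np

def crf_energy(probs, mu, omega, kernels):
    """probs: k x C matrix of per-object class probabilities from YOLO-v3;
    mu: C x C label compatibility matrix; omega: M kernel weights;
    kernels: M x k x k array with kernels[m, i, j] = k_m(f_i, f_j)."""
    x = probs.argmax(axis=1)                          # candidate labeling x
    k = len(x)
    energy = -np.log(probs[np.arange(k), x]).sum()    # sum_i psi_u(x_i)
    for i in range(k):
        for j in range(i + 1, k):                     # sum over i < j
            pair = sum(omega[m] * kernels[m, i, j] for m in range(len(omega)))
            energy += mu[x[i], x[j]] * pair           # psi_P(x_i, x_j)
    return energy
```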
Semantic association is performed on the detected objects through the conditional random field, the detection results are corrected and optimized, and a temporary object information candidate set is generated; each temporary object is then judged to determine whether it is a new object or an object already present in the candidate set. For the data of each candidate object, each point of the temporary object is searched within its neighborhood, the three-dimensional point closest to it is found in the candidate object's point cloud data, and the Euclidean distance between the two points is calculated; if the distance is smaller than the set threshold, the two points are considered the same point.
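The object association test, a nearest 3-D point search plus a Euclidean distance threshold, might look as follows; the 0.05 m threshold and the majority-overlap fusion rule are assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def is_same_object(temp_points, candidate_points, dist_threshold=0.05):
    """For every point of the temporary object, find the closest 3-D point
    in the candidate object's point cloud; points closer than the preset
    threshold are treated as the same point."""
    tree = cKDTree(candidate_points)     # accelerates nearest-neighbour search
    dists, _ = tree.query(temp_points)   # Euclidean distance to the closest point
    same = dists < dist_threshold
    return bool(same.mean() > 0.5)       # assumed rule: mostly-overlapping points
```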
S13: calculating an error value between the graph optimized key frame data, and generating a candidate set based on the error value;
in the specific implementation process of the invention, the error value between the optimized key frame data of the graph is calculated, and the candidate set can be generated according to the error value.
S14: and performing closed-loop correction processing on the candidate set based on global map optimization and loop fusion, and performing synchronous positioning and map construction based on a correction result.
In the specific implementation process of the method, closed-loop correction processing is carried out on a candidate set through global map optimization and loop-back fusion; closed loop detection is realized, the positioning precision is improved, and errors are reduced; and synchronous positioning and map construction are carried out based on the correction result.
In the embodiment of the invention, aiming at the defects that traditional visual ORB-SLAM is easily disturbed by dynamic targets during feature extraction and that the extracted feature points contain only color, brightness and geometric information while lacking object-level environment semantic information, the ORB-SLAM front-end Tracking thread first performs a difference operation on adjacent frames of the sequence using the inter-frame difference method, sets a threshold and eliminates dynamic objects; the mapping relation between sequence frames and object feature points is then constructed and ORB feature extraction performed, and the object environment information extracted by deep-learning semantics is integrated into the ORB-SLAM system, realizing a semantic ORB-SLAM perception method with environment understanding that is stable in performance, resistant to environmental interference, accurate in matching and deeper in environment understanding. The robot's environment perception is markedly improved, higher-level cognitive information about the scene can be obtained, and a more natural application mode is provided for application fields including robot navigation, augmented reality and autonomous driving.
Examples
Referring to fig. 2, fig. 2 is a schematic structural composition diagram of a semantic ORB-SLAM sensing apparatus based on environment understanding according to an embodiment of the present invention.
As shown in fig. 2, a semantic ORB-SLAM aware device based on environment understanding, the device comprising:
the key frame extraction module 21: used for inputting sequence frames into the ORB-SLAM front-end Tracking thread for key frame extraction processing to obtain key frame data;
in the specific implementation process of the present invention, the inputting the sequence frame into the ORB-SLAM front end Tracking thread for key frame extraction processing to obtain key frame data includes: an ORB-SLAM front end Tracking thread adopts an interframe difference method to carry out dynamic background removal processing on an input sequence frame, and a sequence frame with a dynamic background removed is obtained; establishing a mapping relation between the sequence frame with the dynamic background removed and the object characteristic points, and acquiring a sequence frame in the mapping relation with the object characteristic points; carrying out ORB feature extraction processing on the sequence frame in mapping relation with the object feature points to obtain ORB features of the sequence frame; matching ORB characteristics of the current frame with ORB characteristics of the previous frame to obtain matched characteristic point pairs; performing pose estimation and repositioning processing based on the matching feature point pairs to obtain pose estimation and repositioning results; and performing pose estimation and repositioning processing according to the matched adjacent sequence frames to obtain pose optimization of adjacent frames, and acquiring a key frame sequence based on the pose optimization of the adjacent frames.
Further, the ORB-SLAM front end Tracking thread performs dynamic background removal processing on the input sequence frame by using an inter-frame difference method to obtain a sequence frame with a dynamic background removed, including: performing difference operation on adjacent frames in the continuous time interval in the sequence frames, and performing change detection by using strong correlation of the adjacent frames in the sequence frames to obtain a moving target; and based on the selected threshold value, eliminating the dynamic background of the moving target in the sequence frame, and acquiring the sequence frame with the dynamic background removed.
Further, the establishing a mapping relation between the dynamic-background-removed sequence frames and the object feature points to obtain the sequence frames in mapping relation with the object feature points includes: according to the image points observed by the current dynamic-background-removed sequence frame, finding the next dynamic-background-removed sequence frame that observes those image points, and using it as an adjacent sequence frame of the current dynamic-background-removed sequence frame; taking the current dynamic-background-removed sequence frame as a root node and the adjacent sequence frames as child nodes to generate a node tree; and constructing the mapping relation between the dynamic-background-removed sequence frames and the object feature points based on the node tree, and acquiring the sequence frames in mapping relation with the object feature points.
Further, the performing pose estimation and repositioning processing based on the matching feature point pairs includes: and calculating the relative displacement of the sequence frame of the current frame and the sequence frame of the previous frame by utilizing the minimized reprojection error according to the matched feature point pairs.
Further, the method further comprises: when pose estimation and repositioning processing fail based on the matched feature point pairs, obtaining the sequence frame most similar to the current frame based on the mapping relation with the object feature points; obtaining the ORB features of the most similar sequence frame, and matching the ORB features of the current frame's sequence frame with them to obtain first matching feature point pairs; and performing pose estimation and repositioning calculation again by using the first matching feature point pairs to obtain a pose estimation and repositioning result.
Further, the obtaining of the key frame sequence based on the pose optimization of the adjacent frames comprises: calculating the minimum re-projection error between the adjacent frames, and establishing a common view based on the minimum re-projection error; and extracting the sequence frame in the common view as a key sequence frame.
Specifically, in the ORB-SLAM front-end Tracking thread, the sequence frames are input and dynamic background removal is performed first, eliminating noise interference and the influence of dynamic objects on the subsequent feature point extraction and matching process. Using the inter-frame difference method, adjacent frames within a continuous time interval of the sequence are extracted for the difference operation, and change detection is performed using the strong correlation of adjacent frames, so that moving targets are detected; the moving regions in the sequence frames are then removed by selecting a threshold. Among the sequence frames, the change between the k-th frame f_k(x, y) and the (k+1)-th frame f_{k+1}(x, y) can be represented by the binarized difference value D(x, y) as follows:
D(x, y) = 1 if |f_{k+1}(x, y) - f_k(x, y)| > T, and D(x, y) = 0 otherwise;
where T is the selected binarization difference threshold; the '1' regions of the binarized difference consist of pixels whose gray values change between the two frames, usually comprising moving objects and noise, and the '0' regions consist of pixels whose gray values do not change.
In the front-end Tracking thread, in order to integrate the extracted semantic information into the ORB-SLAM framework, a mapping relation between the dynamic-background-removed sequence frames and the object feature points needs to be established. In ORB-SLAM, each dynamic-background-removed sequence frame stores the image points it observes, and each image point in turn stores the dynamic-background-removed sequence frames that observe it; a spanning tree of ORB-SLAM is established from this relation between frames and image points. To construct the spanning tree, the image points observed by the current dynamic-background-removed sequence frame are first used to find the other dynamic-background-removed sequence frames observing them; such a frame is an adjacent sequence frame of the current frame and shares a large number of image points with it. At the same time, the current dynamic-background-removed sequence frames hold their map points, and each map point holds its associated dynamic-background-removed sequence frames. A spanning tree can therefore be generated with the current dynamic-background-removed sequence frame as the root node and the adjacent sequence frames as child nodes; within it, the relation between a child node and its parent node is determined by the number of common map points. Through the spanning tree, the current dynamic-background-removed sequence frame can find its adjacent sequence frames and thus more associated image points. The mapping relation between the dynamic-background-removed sequence frames and the objects is established as follows:
each object OiComprises the following steps:
point cloud data which are contained in an object under a world coordinate system and are obtained through calculation according to camera projection; the number of object classes and the probability of the corresponding object classes, wherein the probability is iteratively updated through an iterative Bayesian process; observing a set of keyframes for the object; the class to which the object belongs corresponds to the class of the object with the highest probability; the number of times the object is observed.
The color image corresponding to the dynamic-background-removed sequence frame is used for object detection; the depth image corresponding to the frame is used to generate object point cloud data; and the frame itself is stored as object observation information. Based on the mapping between image points and dynamic-background-removed sequence frames, once the relation between objects and frames has been constructed, each dynamic-background-removed sequence frame can find its associated objects, and each object can find its associated dynamic-background-removed sequence frames.
ORB feature extraction is performed on the sequence frames in mapping relation with the object feature points; extracting ORB feature points in place of SIFT feature points effectively reduces the amount of computation and speeds up operation.
In the ORB-SLAM front-end Tracking thread, pose estimation and repositioning are performed with respect to the previous frame's dynamic-background-removed sequence frame: the ORB features of the current frame's sequence frame are matched with those of the previous frame's sequence frame to obtain matched feature point pairs, and the relative displacement between the current frame and the previous frame is then calculated by minimizing the reprojection error over the currently matched feature point pairs. If tracking and positioning fail, the sequence frame most similar to the current frame is found through scene recognition, the current frame is matched against it to obtain matched image points, and the pose of the current frame is recalculated from these matched image points.
Generally, two adjacent frames can simultaneously observe a portion of the same image points. The minimum reprojection error between two adjacent frames is calculated; the smaller the reprojection error, the stronger the correlation between the two adjacent sequence frames. A corresponding preset threshold is therefore set and the projection error compared against it: the error must be smaller than or equal to the preset threshold, otherwise the corresponding adjacent sequence frames are removed. On this premise a common view can be established, pose optimization between adjacent frames is formed, and the sequence frames in the common view are obtained as the key sequence frames.
the key frame optimization module 22: used for inputting the key frame data into the adjacent key frame graph optimization thread for key frame data optimization processing to obtain graph-optimized key frame data;
In a specific implementation process of the present invention, the inputting the key frame data into an adjacent key frame graph optimization thread to perform key frame data optimization processing, and acquiring the graph-optimized key frame data, includes: inputting the key frame data into the adjacent key frame graph optimization thread, and then sequentially performing redundant point elimination processing, semantic extraction processing, new image point creation processing and adjacent frame optimization processing on the key frame data to obtain the graph-optimized key frame data.
Further, the semantic extraction processing performed on the key frame data after the redundant point elimination processing includes: performing object detection on the key frame data after redundant point elimination based on the YOLO-v3 algorithm to obtain object detection results; performing semantic association processing on the object detection results using a conditional random field to combine object class probability and scene context information; correcting and optimizing the combined object class probability and scene context information to generate a temporary object information candidate set; judging whether each piece of temporary object information in the candidate set is a new object or an existing object by searching, for each point of the temporary object information, its corresponding neighborhood to acquire the closest three-dimensional point; and calculating the Euclidean distance between the point and that three-dimensional point, the two being considered the same point if the distance is smaller than a preset threshold.
Specifically, after the key frames are obtained, they are input into the adjacent key frame graph optimization thread, redundant points are removed, and a designed semantic extraction algorithm realizes the graph optimization process between adjacent key frames. The designed semantic extraction algorithm comprises object detection, object semantic association, temporary object generation, object association, object model updating and other functions: object detection is responsible for extracting object information from the images using a deep learning network; the extracted object information is associated with corresponding semantics via semantic labels and then corrected and optimized through semantic association, so that the extracted detected objects are more accurate and reliable, and stored in a temporary object information set; object association and updating is responsible for associating the temporary object information with existing object information in the object database according to the mapping relation among key frames, object information and map points, and for updating and fusing the temporary object information into the corresponding object information.
Object detection is based on the YOLO algorithm: each picture is divided into N × N grid cells, the object detection operation is performed only once per cell, and the detection results are finally fused together; this design of YOLO avoids the problem of duplicate detection.
Semantic detection is performed on the key frames with the YOLO algorithm, the objects extracted by deep learning detection are further semantically associated using a conditional random field, and detection classification accuracy is improved by combining object class probability with scene context information. The energy equation of the designed conditional random field combining object class probability and context information is:
P(x) = (1/Z) exp(-E(x));
E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_P(x_i, x_j);
where x is the random variable of the object class; i and j range from 1 to k, with k the number of objects detected in the image; Z is a normalization factor ensuring that the result of the computation is a probability; E(x) is the energy function of the conditional random field; the unary potential function ψ_u describes the probability of the label category of a random field graph node, and the binary potential function ψ_P characterizes the correlation between the nodes of the random field graph.
The unary potential function ψ_u is as follows:
ψ_u(x_i) = -log p(x_i);
The binary potential function ψ_P is as follows:
ψ_P(x_i, x_j) = μ(x_i, x_j) Σ_m ω_m k_m(f_i, f_j);
wherein, p (x)i) Representing the probability distribution, ω, of the class to which the ith object belongs given by the YOLO modelmIs a linear combination weight and μ is a tag compatibility function, representing the likelihood of the simultaneous occurrence of different classes within the neighborhood.
Semantic association is performed on the detected objects through the conditional random field, the detection results are corrected and optimized, and a temporary object information candidate set is generated; each temporary object is then judged to determine whether it is a new object or an object already present in the candidate set. For the data of each candidate object, each point of the temporary object is searched within its neighborhood, the three-dimensional point closest to it is found in the candidate object's point cloud data, and the Euclidean distance between the two points is calculated; if the distance is smaller than the set threshold, the two points are considered the same point.
the error calculation module 23: used for calculating an error value between the graph-optimized key frame data and generating a candidate set based on the error value;
in the specific implementation process of the invention, the error value between the optimized key frame data of the graph is calculated, and the candidate set can be generated according to the error value.
the synchronous positioning and map building module 24: used for performing closed-loop correction processing on the candidate set based on global map optimization and loop fusion, and performing synchronous positioning and map construction based on the correction result.
In the specific implementation process of the method, closed-loop correction processing is carried out on a candidate set through global map optimization and loop-back fusion; closed loop detection is realized, the positioning precision is improved, and errors are reduced; and synchronous positioning and map construction are carried out based on the correction result.
In the embodiment of the invention, aiming at the defects that traditional visual ORB-SLAM is easily disturbed by dynamic targets during feature extraction and that the extracted feature points contain only color, brightness and geometric information while lacking object-level environment semantic information, the ORB-SLAM front-end Tracking thread first performs a difference operation on adjacent frames of the sequence using the inter-frame difference method, sets a threshold and eliminates dynamic objects; the mapping relation between sequence frames and object feature points is then constructed and ORB feature extraction performed, and the object environment information extracted by deep-learning semantics is integrated into the ORB-SLAM system, realizing a semantic ORB-SLAM perception method with environment understanding that is stable in performance, resistant to environmental interference, accurate in matching and deeper in environment understanding. The robot's environment perception is markedly improved, higher-level cognitive information about the scene can be obtained, and a more natural application mode is provided for application fields including robot navigation, augmented reality and autonomous driving.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
In addition, the semantic ORB-SLAM sensing method and apparatus based on environment understanding provided by the embodiments of the present invention are described in detail above. A specific example has been used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A semantic ORB-SLAM perception method based on environmental understanding, the method comprising:
inputting the sequence frame into an ORB-SLAM front end Tracking thread to carry out key frame extraction processing, and acquiring key frame data;
inputting the key frame data into an adjacent key frame graph optimization thread for key frame data optimization processing, and acquiring graph-optimized key frame data;
calculating an error value between the graph-optimized key frame data, and generating a candidate set based on the error value;
and performing closed-loop correction processing on the candidate set based on global map optimization and loop fusion, and performing synchronous positioning and map construction based on a correction result.
2. The semantic ORB-SLAM sensing method of claim 1, wherein the inputting of the sequence frames into an ORB-SLAM front end Tracking thread for key frame extraction processing to obtain key frame data comprises:
an ORB-SLAM front end Tracking thread adopts an interframe difference method to carry out dynamic background removal processing on an input sequence frame, and a sequence frame with a dynamic background removed is obtained;
establishing a mapping relation between the sequence frame with the dynamic background removed and the object characteristic points, and acquiring a sequence frame in the mapping relation with the object characteristic points;
carrying out ORB feature extraction processing on the sequence frame in mapping relation with the object feature points to obtain ORB features of the sequence frame;
matching the ORB features of the current frame with the ORB features of the previous frame to obtain matched feature point pairs;
performing pose estimation and repositioning processing based on the matched feature point pairs to obtain pose estimation and repositioning results;
and performing pose optimization of the adjacent frames according to the matched adjacent sequence frames, and acquiring a key frame sequence based on the pose optimization of the adjacent frames.
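The matching step of claim 2 could look roughly like the following sketch, which pairs ORB descriptors of the previous and current frames by Hamming distance. The Lowe ratio test is an added robustness heuristic of this sketch, not a step the claim specifies.

    import cv2

    def match_orb(desc_prev, desc_curr, ratio=0.75):
        # Brute-force Hamming matching with k=2 nearest neighbours, then a
        # ratio test to keep only unambiguous matched feature point pairs.
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
        good = []
        for pair in matcher.knnMatch(desc_prev, desc_curr, k=2):
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append(pair[0])
        return good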
3. The semantic ORB-SLAM sensing method of claim 2, wherein the ORB-SLAM front end Tracking thread performs dynamic background removal processing on the input sequence frame by using an inter-frame difference method to obtain a sequence frame with a dynamic background removed, comprising:
performing a difference operation on adjacent frames within a continuous time interval of the sequence frames, and performing change detection by using the strong correlation between adjacent frames of the sequence to obtain a moving target;
and based on the selected threshold value, eliminating the dynamic background of the moving target in the sequence frame, and acquiring the sequence frame with the dynamic background removed.
4. The semantic ORB-SLAM perception method according to claim 2, wherein the establishing a mapping relationship between the sequence frames with the dynamic background removed and object feature points to obtain the sequence frames with the mapping relationship with the object feature points comprises:
determining the image points observed in the sequence frame of the current frame with the dynamic background removed, and taking a subsequent sequence frame with the dynamic background removed that observes the same image points as an adjacent sequence frame of the current frame;
taking the sequence frame of the current frame without the dynamic background as a root node, and taking the adjacent sequence frame as a child node to generate a node tree;
and constructing a mapping relation between the sequence frame without the dynamic background and the object characteristic points based on the node tree, and acquiring the sequence frame in the mapping relation with the object characteristic points.
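One plausible reading of the node tree in claim 4 is sketched below: the current frame becomes the root, and frames observing enough of the same points become its children. The FrameNode layout and the min_shared threshold are hypothetical, introduced only for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class FrameNode:
        # A sequence frame together with the ids of the feature points it observes.
        frame_id: int
        observed_points: set
        children: list = field(default_factory=list)

    def build_frame_tree(current, neighbours, min_shared=20):
        # Root the tree at the current frame; frames that share at least
        # min_shared observed points become its child (adjacent) frames.
        root = FrameNode(current.frame_id, set(current.observed_points))
        for nb in neighbours:
            if len(root.observed_points & set(nb.observed_points)) >= min_shared:
                root.children.append(FrameNode(nb.frame_id, set(nb.observed_points)))
        return root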
5. The semantic ORB-SLAM perception method of claim 2, wherein the pose estimation and repositioning based on the matching feature point pairs comprises:
and calculating the relative displacement between the sequence frame of the current frame and the sequence frame of the previous frame from the matched feature point pairs by minimizing the reprojection error.
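A reprojection-error pose estimate of the kind named in claim 5 can be sketched with OpenCV's PnP solver, which minimizes reprojection error over 3-D/2-D correspondences. The RANSAC variant and the zero-distortion assumption are choices of this sketch, not requirements of the claim.

    import cv2
    import numpy as np

    def estimate_relative_pose(points_3d, points_2d, K):
        # points_3d: map points matched from the previous frame (N x 3);
        # points_2d: their pixel locations in the current frame (N x 2);
        # K: 3 x 3 camera intrinsic matrix. Distortion is assumed zero here.
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            points_3d.astype(np.float32), points_2d.astype(np.float32), K, None)
        if not ok:
            return None  # caller falls back to the repositioning branch
        R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
        return R, tvec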
6. The semantic ORB-SLAM perception method of claim 2, wherein the method further comprises:
when pose estimation and repositioning based on the matched feature point pairs fails, obtaining the sequence frame most similar to the sequence frame of the current frame based on the mapping relation with the object feature points;
obtaining the ORB features of the most similar sequence frame, and matching the ORB features of the sequence frame of the current frame with them to obtain first matching feature point pairs;
and performing the pose estimation and repositioning calculation again by using the first matching feature point pairs to obtain a pose estimation and repositioning result.
7. The semantic ORB-SLAM perception method of claim 2, wherein the obtaining a sequence of key frames based on pose optimization of the neighboring frames comprises:
calculating the minimum re-projection error between the adjacent frames, and establishing a common view based on the minimum re-projection error;
and extracting the sequence frame in the common view as a key sequence frame.
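The common view of claim 7 corresponds to what ORB-SLAM calls a covisibility graph. The toy sketch below connects frames by the number of points they co-observe, a common proxy, whereas the claim itself derives the graph from the minimum reprojection error; the data layout and threshold are hypothetical.

    def build_common_view(frames, min_shared=15):
        # frames: {frame_id: set of observed point ids} (hypothetical layout).
        # Connect two frames when they co-observe at least min_shared points;
        # frames appearing in the graph are kept as key sequence frames.
        edges, keyframes = [], set()
        ids = sorted(frames)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if len(frames[a] & frames[b]) >= min_shared:
                    edges.append((a, b))
                    keyframes.update((a, b))
        return edges, sorted(keyframes)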
8. The semantic ORB-SLAM sensing method of claim 1, wherein the entering the keyframe data into an adjacent keyframe graph optimization thread for keyframe data optimization processing to obtain graph-optimized keyframe data comprises:
and inputting the key frame data into the adjacent key frame graph optimization thread, and then sequentially performing redundant point elimination, semantic extraction, new map point creation and adjacent frame optimization on the key frame data to obtain the graph-optimized key frame data.
9. The semantic ORB-SLAM sensing method of claim 8, wherein the semantic extraction processing of the key frame data after the redundant point elimination processing comprises:
performing object detection on the key frame data subjected to the redundant point elimination processing based on a YOLO-v3 algorithm to obtain an object detection result;
performing semantic association processing on the object detection result by using a conditional random field to obtain combined object class probability and scene context information;
correcting and optimizing the combined object class probability and scene context information to generate a temporary object information candidate set;
judging whether the temporary object information in the temporary object information candidate set is a new object or an already existing object; searching, for each point of each piece of temporary object information in the candidate set, within its corresponding neighborhood, and acquiring the closest three-dimensional point;
and calculating the Euclidean distance between the point and that three-dimensional point, and if the Euclidean distance is smaller than a preset threshold value, regarding the two as the same point.
10. A semantic ORB-SLAM aware apparatus based on environmental understanding, the apparatus comprising:
the key frame extraction module: used for inputting the sequence frames into the ORB-SLAM front end Tracking thread for key frame extraction processing to obtain key frame data;
a key frame optimization module: used for inputting the key frame data into the adjacent key frame graph optimization thread for key frame data optimization processing to obtain graph-optimized key frame data;
an error calculation module: used for calculating error values between the graph-optimized key frame data and generating a candidate set based on the error values;
and a synchronous positioning and map building module: used for performing closed-loop correction processing on the candidate set based on global map optimization and loop fusion, and performing synchronous positioning and map construction based on the correction result.
CN201911113708.7A 2019-11-14 2019-11-14 Semantic ORB-SLAM sensing method and device based on environment understanding Active CN110930519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911113708.7A CN110930519B (en) 2019-11-14 2019-11-14 Semantic ORB-SLAM sensing method and device based on environment understanding

Publications (2)

Publication Number Publication Date
CN110930519A true CN110930519A (en) 2020-03-27
CN110930519B CN110930519B (en) 2023-06-20

Family

ID=69852948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911113708.7A Active CN110930519B (en) 2019-11-14 2019-11-14 Semantic ORB-SLAM sensing method and device based on environment understanding

Country Status (1)

Country Link
CN (1) CN110930519B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106373141A (en) * 2016-09-14 2017-02-01 上海航天控制技术研究所 Tracking system and tracking method of relative movement angle and angular velocity of slowly rotating space fragment
CN110125928A (en) * 2019-03-27 2019-08-16 浙江工业大学 A kind of binocular inertial navigation SLAM system carrying out characteristic matching based on before and after frames
CN110363816A (en) * 2019-06-25 2019-10-22 广东工业大学 A kind of mobile robot environment semanteme based on deep learning builds drawing method
CN110378345A (en) * 2019-06-04 2019-10-25 广东工业大学 Dynamic scene SLAM method based on YOLACT example parted pattern
CN110378997A (en) * 2019-06-04 2019-10-25 广东工业大学 A kind of dynamic scene based on ORB-SLAM2 builds figure and localization method
US20200218929A1 (en) * 2017-09-22 2020-07-09 Huawei Technologies Co., Ltd. Visual slam method and apparatus based on point and line features

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375869A (en) * 2022-10-25 2022-11-22 杭州华橙软件技术有限公司 Robot repositioning method, robot and computer-readable storage medium
CN115375869B (en) * 2022-10-25 2023-02-10 杭州华橙软件技术有限公司 Robot repositioning method, robot and computer-readable storage medium

Also Published As

Publication number Publication date
CN110930519B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN110335319B (en) Semantic-driven camera positioning and map reconstruction method and system
CN111060115B (en) Visual SLAM method and system based on image edge features
CN106909877B (en) Visual simultaneous mapping and positioning method based on dotted line comprehensive characteristics
CN111724439B (en) Visual positioning method and device under dynamic scene
CN110781262B (en) Semantic map construction method based on visual SLAM
CN111080659A (en) Environmental semantic perception method based on visual information
CN109584302B (en) Camera pose optimization method, camera pose optimization device, electronic equipment and computer readable medium
CN109974743B (en) Visual odometer based on GMS feature matching and sliding window pose graph optimization
CN109815847B (en) Visual SLAM method based on semantic constraint
CN111462207A (en) RGB-D simultaneous positioning and map creation method integrating direct method and feature method
CN107369183A (en) Towards the MAR Tracing Registration method and system based on figure optimization SLAM
CN111899334A (en) Visual synchronous positioning and map building method and device based on point-line characteristics
CN110852241B (en) Small target detection method applied to nursing robot
KR20200063368A (en) Unsupervised stereo matching apparatus and method using confidential correspondence consistency
Xue et al. Boundary-induced and scene-aggregated network for monocular depth prediction
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
CN111709317B (en) Pedestrian re-identification method based on multi-scale features under saliency model
Pu et al. Visual SLAM integration with semantic segmentation and deep learning: A review
Zhang et al. Improved feature point extraction method of ORB-SLAM2 dense map
CN110930519B (en) Semantic ORB-SLAM sensing method and device based on environment understanding
CN117036653A (en) Point cloud segmentation method and system based on super voxel clustering
CN113570713B (en) Semantic map construction method and device for dynamic environment
Tao et al. 3d semantic vslam of indoor environment based on mask scoring rcnn
Sun et al. Kinect depth recovery via the cooperative profit random forest algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: A semantic ORB-SLAM perception method and device based on environmental understanding
Effective date of registration: 20231130
Granted publication date: 20230620
Pledgee: Guangdong Shunde Rural Commercial Bank Co.,Ltd. science and technology innovation sub branch
Pledgor: SOUTH CHINA ROBOTICS INNOVATION Research Institute
Registration number: Y2023980068232