CN114359493B - Method and system for generating three-dimensional semantic map for unmanned ship - Google Patents

Method and system for generating three-dimensional semantic map for unmanned ship

Info

Publication number: CN114359493B
Application number: CN202111559071.1A
Authority: CN (China)
Prior art keywords: key frame, key, map, frame, frames
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN114359493A
Inventors: 李波 (Li Bo), 陈小乔 (Chen Xiaoqiao)
Current and original assignee: 709th Research Institute of CSIC
Application filed by 709th Research Institute of CSIC; priority to CN202111559071.1A
Publication of application CN114359493A, followed by grant and publication of CN114359493B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for generating a three-dimensional semantic map for an unmanned ship, comprising the following steps: processing the image frames acquired by the unmanned ship during navigation with the ORB-SLAM algorithm to obtain key frames and an initialized three-dimensional point cloud map; performing instance segmentation on the obtained key frames with the Mask R-CNN algorithm to obtain all object instances in each key frame, and semantically annotating those instances; adding the obtained key frames to the initialized three-dimensional point cloud map to update it; matching the feature points and object instances extracted from each key frame against those of the nearest key frame to obtain key frames carrying object instance semantic annotation information; and associating the semantic annotation information carried by each key frame with that key frame's coordinates on the three-dimensional point cloud map to obtain the three-dimensional semantic map. The invention improves the accuracy with which the unmanned ship recognizes the boundaries of photographed objects.

Description

Method and system for generating three-dimensional semantic map for unmanned ship
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method and a system for generating a three-dimensional semantic map by an unmanned ship.
Background
In recent years, the unmanned ship market has been growing continuously, and with the development of the underlying technologies, autonomous navigation serves as the guarantee of safe sailing. For an unmanned ship to realize practical applications, one key problem is autonomous positioning and environmental perception. In many unmanned ship application scenarios, such as path planning, environmental awareness, and obstacle avoidance, localization and mapping are regarded as prerequisites. A visual SLAM (Simultaneous Localization and Mapping) algorithm, with a camera as the visual sensor, can build a map of the environment during motion without prior environmental information while simultaneously performing autonomous positioning.
The ORB-SLAM algorithm is a monocular visual SLAM algorithm based entirely on sparse feature points. It comprises visual odometry, tracking, and loop detection, and its core is the use of ORB (Oriented FAST and Rotated BRIEF) features throughout the whole visual SLAM pipeline. However, the map constructed by the ORB-SLAM algorithm cannot provide semantic information about objects in the surrounding environment, so the image information acquired by the visual sensor is not fully utilized. The ORB-SLAM algorithm therefore only solves the autonomous navigation problem of the unmanned ship; it cannot intelligently recognize the objects photographed by the visual sensor during sailing.
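For illustration, the sparse ORB feature extraction at the heart of ORB-SLAM can be reproduced with OpenCV. This is a minimal sketch assuming a single grayscale camera frame (the file name is a placeholder), not the patent's implementation:

```python
import cv2

# One frame from the ship's camera (path is a placeholder).
frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

# ORB = Oriented FAST keypoints + Rotated BRIEF binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)
keypoints, descriptors = orb.detectAndCompute(frame, None)

print(f"extracted {len(keypoints)} ORB features")
```

These keypoint/descriptor pairs are what the tracking thread matches from frame to frame.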
The Mask R-CNN algorithm is a highly representative deep-learning-based instance segmentation algorithm of recent years and can be applied to target detection, instance segmentation, target keypoint detection, and other fields. However, a large number of objects have complex and irregular boundary shapes, and accurate prediction of those shapes strongly affects the segmentation of the whole instance, so the Mask R-CNN algorithm falls slightly short in the accuracy of object instance boundary recognition.
In summary, although the existing ORB-SLAM method realizes autonomous navigation of the unmanned ship, it cannot identify the types of objects photographed during sailing, and the existing Mask R-CNN algorithm is slightly deficient in accurately recognizing object boundaries. Providing a method and a system by which an unmanned ship can generate a three-dimensional semantic map, and thereby improving the efficiency of tasks such as navigation and exploration, is thus an urgent task in the current unmanned ship technology field.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method and a system for an unmanned ship to generate a three-dimensional semantic map, solving the technical problem that, in the prior art, object boundaries cannot be accurately identified during the navigation of an unmanned ship.
In order to achieve the purpose, the invention provides a method for generating a three-dimensional semantic map by an unmanned ship, which comprises the following steps:
(1) Acquiring a key frame: continuously acquiring image frames by using a visual sensor in the process of sailing of the unmanned ship to obtain an image frame sequence, processing the continuous image frames in the image frame sequence by adopting an ORB-SLAM algorithm to obtain an initialized three-dimensional point cloud map, simultaneously processing the extracted feature points of the continuous image frames to obtain continuous key frames, and respectively performing the step (2) and the step (3) below;
(2) Key frame instance segmentation step: performing instance segmentation on each key frame in the continuous key frames by using the Mask R-CNN algorithm to obtain the object instances contained in each key frame, performing semantic annotation on the object instances, and turning to step (4);
(3) Updating the three-dimensional point cloud map: adding the continuous key frames into the initialized three-dimensional point cloud map to update the three-dimensional point cloud map, and turning to the step (4);
(4) Feature fusion step: matching the feature points and object instances extracted from each key frame in the continuous key frames against the feature points and object instances of the previous key frame, and performing local nonlinear optimization on the pose estimation result of each key frame by combining the feature point and object instance matching results, so as to obtain the object instance semantic annotation information carried by each key frame in the continuous key frames;
(5) Constructing a three-dimensional semantic map: and associating the object instance semantic annotation information carried by each key frame in the continuous key frames with the coordinates of each key frame on the three-dimensional point cloud map, so as to realize semantic annotation on the three-dimensional point cloud map and obtain the three-dimensional semantic map.
In an optional embodiment, in the step of obtaining the key frame, the feature points of the continuous image frames are extracted and processed to obtain the continuous key frame, and the specific process includes:
extracting feature points of each image frame by adopting the ORB-SLAM algorithm, and matching the feature points extracted from each image frame with the feature points of the previous image frame to obtain a pose estimation result of the visual sensor at the time each image frame was acquired; if matching with the previous image frame fails, the image frame is matched with the previous key frame to obtain its pose estimation result; if matching with the previous key frame also fails, the image frame is globally relocated to obtain its pose estimation result; after the pose estimation result of each image frame is obtained, each image frame is screened to obtain the key frames.
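An illustrative sketch of this frame-to-frame matching and pose recovery (not the patent's code; it assumes a known camera intrinsic matrix K and two grayscale frames):

```python
import cv2
import numpy as np

def estimate_pose(prev_frame, cur_frame, K):
    """Match ORB features between two frames and recover relative pose."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(prev_frame, None)
    kp2, des2 = orb.detectAndCompute(cur_frame, None)

    # Hamming distance is the natural metric for binary BRIEF descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    if len(matches) < 8:
        return None  # matching failed: fall back to key frame / relocation

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Essential matrix + decomposition yield the relative pose (up to scale).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```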
In an optional embodiment, the global relocation specifically includes:
generating a key frame database from the acquired continuous key frames, matching the image frame against all key frames in the key frame database, and determining the pose estimation result of the image frame.
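A hedged sketch of such database matching (a real system would use a bag-of-words index such as DBoW2; brute-force descriptor matching stands in for it here, and the keyframe_db layout is hypothetical):

```python
import cv2

def relocalize(frame_descriptors, keyframe_db):
    """Find the key frame whose ORB descriptors best match a lost frame."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    best_kf, best_score = None, 0
    for kf in keyframe_db:
        matches = matcher.match(frame_descriptors, kf.descriptors)
        # Score by the number of good (low-distance) matches.
        score = sum(1 for m in matches if m.distance < 50)
        if score > best_score:
            best_kf, best_score = kf, score
    return best_kf  # the pose is then re-estimated against this key frame
```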
In an optional embodiment, the screening of each image frame to obtain a key frame specifically includes:
selecting the image frames meeting the following conditions as key frames (a sketch of this decision rule follows the list):
the image frames are spaced at least 20 image frames from the last globally repositioned image frame;
the mapping thread is idle, or more than 20 image frames have passed since the last key frame was inserted;
the image frame tracks at least 50 map points;
the number of map points matched by the image frame cannot exceed 90 percent of the number of map points matched by the reference key frame.
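Read as a single predicate, the four conditions above can be sketched as follows (field names are hypothetical; the thresholds are those stated in the list):

```python
def is_key_frame(frame, last_reloc_id, last_kf_id, mapping_idle, ref_kf):
    """Illustrative key-frame test mirroring the four conditions above."""
    if frame.id - last_reloc_id < 20:                               # condition 1
        return False
    if not (mapping_idle or frame.id - last_kf_id > 20):            # condition 2
        return False
    if frame.tracked_map_points < 50:                               # condition 3
        return False
    if frame.matched_map_points > 0.9 * ref_kf.matched_map_points:  # condition 4
        return False
    return True
```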
In an optional embodiment, with respect to the mapping thread being idle:
if the mapping thread is busy when a key frame is to be added, a signal is sent to the mapping thread to stop its current task so that the key frame can be processed in time; the mapping thread is the thread in the ORB-SLAM algorithm used for constructing the navigation map of the unmanned ship.
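A minimal sketch of this interrupt mechanism (illustrative only; ORB-SLAM itself is written in C++ and uses its own stop flags, and the task.steps()/step.run() interface is a hypothetical stand-in):

```python
import threading
import queue

stop_requested = threading.Event()
task_queue = queue.Queue()

def mapping_thread_loop():
    """Local-mapping loop that yields when a new key frame arrives."""
    while True:
        task = task_queue.get()
        for step in task.steps():       # hypothetical task interface
            if stop_requested.is_set():
                break                   # abandon the current mapping task
            step.run()
        stop_requested.clear()

def insert_key_frame(kf):
    stop_requested.set()   # signal the busy mapping thread to yield
    task_queue.put(kf)     # the key frame is processed next
```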
In an optional embodiment, the map points are specifically:
the visual sensor observes the road sign during the navigation of the unmanned ship.
In an optional embodiment, the reference key frame specifically includes:
the reference key frame is the key frame in the key frame database whose feature points have the highest degree of matching with those of the image frame.
In an optional embodiment, the key frame instance segmenting step further includes:
in the process of performing instance segmentation on the key frames with Mask R-CNN, a boundary refinement module is also used to fit the boundaries of the segmented object instances, so as to obtain more accurate object instance boundary contours.
In an optional embodiment, in the step of segmenting the key frame instance, semantic labeling is performed on the key frame instance, specifically:
the semantic annotation consists of the categories of the object instances in the key frame and the relations between the object instances.
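Concretely, such an annotation might be represented as follows (a hypothetical layout; the patent does not fix a data structure):

```python
from dataclasses import dataclass, field

@dataclass
class InstanceAnnotation:
    category: str                  # e.g. "buoy", "ship", "bridge pier"
    instance_id: int               # distinguishes individuals of one class
    # Relations to other instances in the same key frame,
    # e.g. {"left_of": 3} (keys and values are illustrative).
    relations: dict = field(default_factory=dict)
```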
Another object of the present invention is to provide a system for generating a three-dimensional semantic map for an unmanned ship, comprising the following units:
Key frame acquisition unit: continuously acquiring image frames with a visual sensor during the sailing of the unmanned ship to obtain an image frame sequence, processing the continuous image frames in the image frame sequence with the ORB-SLAM algorithm to obtain an initialized three-dimensional point cloud map, simultaneously processing the extracted feature points of the continuous image frames to obtain continuous key frames, and then passing to the key frame instance segmentation unit and the three-dimensional point cloud map updating unit respectively;
Key frame instance segmentation unit: performing instance segmentation on each key frame in the continuous key frames by using the Mask R-CNN algorithm to obtain the object instances contained in each key frame, performing semantic annotation on the object instances, and then passing to the feature fusion unit;
a three-dimensional point cloud map updating unit: adding the continuous key frames into the initialized three-dimensional point cloud map, updating the three-dimensional point cloud map, and transferring to a feature fusion unit;
Feature fusion unit: matching the feature points and object instances extracted from each key frame in the continuous key frames against the feature points and object instances of the previous key frame, and performing local nonlinear optimization on the pose estimation result of each key frame by combining the feature point and object instance matching results, so as to obtain the object instance semantic annotation information carried by each key frame in the continuous key frames;
constructing a three-dimensional semantic map unit: and associating the object instance semantic annotation information carried by each key frame in the continuous key frames with the coordinates of each key frame on the three-dimensional point cloud map, so as to realize semantic annotation on the three-dimensional point cloud map and obtain the three-dimensional semantic map.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention provides a method and a system for generating a three-dimensional semantic map by an unmanned ship, wherein an image frame acquired by an unmanned ship vision sensor is processed by using a vision SLAM algorithm to obtain a key frame and an initial three-dimensional point cloud map, and the obtained key frame is subjected to instance segmentation by using a Mask R-CNN algorithm to obtain all object instances in the key frame.
On the other hand, matching the feature points extracted from the current key frame and the object example obtained after the example is segmented with the feature points and the object example of the nearest key frame, and performing local nonlinear optimization on the pose estimation result of the key frame by integrating the feature points and the example segmentation matching result to obtain the key frame carrying object example semantic annotation information; and mapping the key frame carrying object semantic annotation information to the coordinates of the key frame on the three-dimensional point cloud map to obtain the three-dimensional semantic map.
The invention combines two algorithms of visual SLAM and Mask R-CNN, improves the Mask R-CNN algorithm, adds a boundary thinning module, realizes autonomous navigation of the unmanned ship in unknown environment or navigation environment with poor communication signals, and can identify the type of the object ahead, thereby greatly improving the task execution effect.
Drawings
FIG. 1 is a flow chart of a method for an unmanned ship to generate a three-dimensional semantic map of the present invention;
FIG. 2 is a Mask R-CNN instance segmentation algorithm model with an added boundary refinement module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1, the method for generating a three-dimensional semantic map for an unmanned ship of the present invention includes the following steps:
(1) Acquiring a key frame: continuously acquiring image frames by using a visual sensor in the process of sailing of the unmanned ship to obtain an image frame sequence, processing the continuous image frames in the image frame sequence by adopting an ORB-SLAM algorithm to obtain an initialized three-dimensional point cloud map, simultaneously processing the extracted feature points of the continuous image frames to obtain continuous key frames, and respectively performing the step (2) and the step (3) below;
(2) Key frame instance segmentation step: performing instance segmentation on each key frame in the continuous key frames by using the Mask R-CNN algorithm to obtain the object instances contained in each key frame, performing semantic annotation on the object instances, and turning to step (4);
(3) Updating the three-dimensional point cloud map: adding the continuous key frames into the initialized three-dimensional point cloud map to update the three-dimensional point cloud map, and turning to the step (4);
(4) Feature fusion step: matching the feature points and object instances extracted from each key frame in the continuous key frames against the feature points and object instances of the previous key frame, and performing local nonlinear optimization on the pose estimation result of each key frame by combining the feature point and object instance matching results, so as to obtain the object instance semantic annotation information carried by each key frame in the continuous key frames;
(5) Constructing a three-dimensional semantic map: and associating the object instance semantic annotation information carried by each key frame in the continuous key frames with the coordinates of each key frame on the three-dimensional point cloud map, so as to realize semantic annotation on the three-dimensional point cloud map and obtain the three-dimensional semantic map.
Specifically, the method for generating the three-dimensional semantic map by the unmanned ship provided by the invention comprises the following steps of:
(1) Acquiring a key frame:
extracting feature points of each image frame by using the ORB-SLAM algorithm, and matching the feature points extracted from each image frame with the feature points of the nearest key frame to obtain a pose estimation result of the visual sensor at the time each image frame was acquired; if matching with the nearest key frame fails, global relocation is needed to obtain the pose estimation result of the current image frame; after the pose estimation result of each image frame is obtained, each image frame is screened to obtain the key frames.
(2) And a key frame example segmentation step.
Instance segmentation is performed on the key frames mainly to acquire all object instances in each key frame. Instance segmentation is a technique combining semantic segmentation and target detection: compared with the bounding box of target detection, instance segmentation obtains more accurate object edges; compared with semantic segmentation, instance segmentation distinguishes different individuals of the same object class.
The instance segmentation algorithm adopted by the embodiment of the invention is Mask R-CNN. The Mask R-CNN algorithm performs instance segmentation while performing target detection and achieves excellent results. The Mask R-CNN pipeline comprises two stages: the first scans the key frame and generates proposals, i.e., regions that may contain a target; the second classifies the proposals and generates bounding boxes and masks.
The specific flow is shown in fig. 2: the continuous key frames are input into a backbone network (Backbone) for feature extraction to generate a feature map; the obtained feature map is input into a Feature Pyramid Network (FPN), which makes top-down lateral connections between the low-resolution, semantically strong high-level features and the high-resolution, semantically weak low-level features, so that the features at all scales carry rich semantic information; the FPN-processed feature map is input into an RPN (Region Proposal Network), a lightweight neural network that scans the image with a sliding window and searches for regions containing targets; the candidate regions generated by the RPN undergo a RoIAlign operation, each RoI producing a fixed-size feature map through the RoIAlign layer; these processed feature maps then feed two branches: a classifier branch for generating the regression box and predicted category, and a Mask branch for generating the mask.
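For reference, the standard (unmodified) Mask R-CNN pipeline just described is available off the shelf. A minimal inference sketch with torchvision (version 0.13 or later) follows; it is not the patent's implementation, which additionally adds a boundary refinement module, and the random tensor stands in for a real key frame:

```python
import torch
import torchvision

# Pretrained Mask R-CNN with a ResNet-50 + FPN backbone, as described above.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A key frame as a CHW float tensor in [0, 1] (placeholder data here).
key_frame = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([key_frame])[0]

# Each detection carries a box, a class label, a score, and a soft mask.
boxes, labels = predictions["boxes"], predictions["labels"]
masks = predictions["masks"]  # shape (N, 1, H, W), values in [0, 1]
```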
In general, objects have complex and irregular boundary shapes, and accurate prediction of the boundary shape strongly affects the accuracy with which the unmanned ship intelligently recognizes the objects ahead. Therefore, in order to improve the segmentation accuracy at the edges, in the embodiment of the invention a boundary refinement module (BR for short) is added after the Mask branch; the module optimizes the generated mask through a residual structure to further fit the instance boundary, so that the localization capability at the boundary is greatly improved.
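The patent does not spell out the module's exact layers; the following PyTorch sketch shows a generic residual boundary-refinement block of the kind described (a plausible reading, not the patented design):

```python
import torch.nn as nn

class BoundaryRefinement(nn.Module):
    """Residual block that refines a predicted mask near object boundaries."""

    def __init__(self, channels: int):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, mask_features):
        # The identity path keeps the coarse mask; the residual path learns
        # a correction that is largest along object boundaries.
        return mask_features + self.residual(mask_features)
```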
In the embodiment of the present invention, a joint training loss function is used to optimize the mask. The joint loss $L$ is

$L = L_{cls} + L_{box} + L_{mask}$

where $L_{cls}$ is the cross entropy of the target-category prediction for the region of interest in the target detection task, representing the classification error; $L_{box}$ is the bounding-box correction (regression) loss for the region of interest in target detection, representing the detection error; and $L_{mask}$ is the sum of the pixel-by-pixel cross entropies between the predicted mask and the ground-truth mask in the semantic segmentation task, representing the segmentation error.
Here $L_{mask}$ is the average binary cross-entropy loss, which can be written as

$L_{mask} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$

where $y$ denotes the binarized, correctly labeled data and $\hat{y}$ the binarized predicted segmentation result. The weakness of the cross-entropy loss in the segmentation task is that it depends only on region information, so the prediction of the boundary is neglected and the boundary-segmentation accuracy of the final result is low.
In the embodiment of the invention, a boundary weighted loss (BWL) term is added to $L_{mask}$ to optimize the generated mask. The optimized loss $L_{mask\text{-}bwl}$ takes the form

$L_{mask\text{-}bwl} = L_{mask} + \alpha L_{dis}$

where $L_{dis}$ is the boundary weighted loss, $\alpha$ is the weighting coefficient, $B$ denotes the boundary of the segmentation result, $R$ the entire segmentation region, and $M_{dist}$ the distance map obtained from the distance transform of the correctly labeled segmentation boundary. This is equivalent to increasing the weight of the boundary loss within $L_{mask}$, so the segmentation result becomes more accurate.
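As an illustration of the idea only (the patent defines $L_{dis}$ over $B$, $R$, and $M_{dist}$; its exact formula appears in the original figures), a distance-transform-weighted mask loss might be sketched as:

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_weighted_mask_loss(pred, target, alpha=1.0):
    """BCE mask loss plus a boundary term weighted by a distance map.

    pred:   (H, W) float tensor of predicted mask probabilities in [0, 1]
    target: (H, W) float tensor, binary ground-truth mask
    """
    bce = F.binary_cross_entropy(pred, target)

    # Distance transform of the ground-truth mask: small values near the
    # object boundary, large values deep inside the object.
    dist = torch.from_numpy(distance_transform_edt(target.numpy())).float()
    boundary_weight = torch.exp(-dist)  # emphasize pixels near the boundary

    per_pixel = F.binary_cross_entropy(pred, target, reduction="none")
    l_dis = (boundary_weight * per_pixel).mean()
    return bce + alpha * l_dis
```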
The object instances in the key frame produced after the boundary refinement module are combined with the category information from the classifier branch to give the output of the Mask R-CNN algorithm, namely the object instances in the key frame together with their semantic annotation information. After the Mask R-CNN processing of the continuous key frames is finished, go to step (4).
(3) And updating the three-dimensional point cloud map.
Adding the continuous key frames into the initialized three-dimensional point cloud map to update the three-dimensional point cloud map, and turning to the step (4);
(4) Feature fusion step.
Matching the feature points and object instances extracted from each key frame in the continuous key frames against the feature points and object instances of the previous key frame, and performing local nonlinear optimization on the pose estimation result of each key frame by combining the feature point and object instance matching results, so as to obtain the object instance semantic annotation information carried by each key frame in the continuous key frames;
(5) Three-dimensional semantic map construction step.
And associating object instance semantic annotation information carried by each key frame in the continuous key frames with the coordinates of each key frame on the three-dimensional point cloud map to realize semantic annotation of the three-dimensional point cloud map and obtain the three-dimensional semantic map.
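One way to picture this association (the data layout is hypothetical; the patent does not specify one): each key frame's instance labels are attached to the 3D map points that the key frame observes.

```python
def annotate_point_cloud(keyframes, point_cloud):
    """Attach each key frame's instance labels to its 3D map points."""
    for kf in keyframes:
        for instance in kf.instances:        # Mask R-CNN + BR output
            for u, v in instance.pixels:     # pixels inside the instance mask
                point_id = kf.pixel_to_map_point.get((u, v))
                if point_id is not None:
                    # The map point inherits the instance's semantic label.
                    point_cloud[point_id].label = instance.category
    return point_cloud
```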
Specifically, the system for generating the three-dimensional semantic map by the unmanned ship comprises the following units:
Key frame acquisition unit: continuously acquiring image frames with a visual sensor during the sailing of the unmanned ship to obtain an image frame sequence, processing the continuous image frames in the image frame sequence with the ORB-SLAM algorithm to obtain an initialized three-dimensional point cloud map, simultaneously processing the extracted feature points of the continuous image frames to obtain continuous key frames, and then passing to the key frame instance segmentation unit and the three-dimensional point cloud map updating unit respectively;
Key frame instance segmentation unit: performing instance segmentation on each key frame in the continuous key frames by using the Mask R-CNN algorithm to obtain the object instances contained in each key frame, performing semantic annotation on the object instances, and then passing to the feature fusion unit;
a three-dimensional point cloud map updating unit: adding the continuous key frames into the initialized three-dimensional point cloud map, updating the three-dimensional point cloud map, and transferring to a feature fusion unit;
a feature fusion unit: matching the feature points extracted from each key frame in the continuous key frames and the obtained object examples with the feature points and the object examples of the previous key frame, and performing local nonlinear optimization on the pose estimation result of each key frame by integrating the feature points and the object example matching results to obtain object example semantic labeling information carried by each key frame in the continuous key frames;
constructing a three-dimensional semantic map unit: and associating the object instance semantic annotation information carried by each key frame in the continuous key frames with the coordinates of each key frame on the three-dimensional point cloud map, so as to realize semantic annotation on the three-dimensional point cloud map and obtain the three-dimensional semantic map.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for generating a three-dimensional semantic map by an unmanned ship is characterized by comprising the following steps:
(1) Acquiring a key frame: continuously acquiring image frames by using a visual sensor in the process of sailing of the unmanned ship to obtain an image frame sequence, processing the continuous image frames in the image frame sequence by adopting an ORB-SLAM algorithm to obtain an initialized three-dimensional point cloud map, simultaneously processing the extracted feature points of the continuous image frames to obtain continuous key frames, and respectively performing the step (2) and the step (3) below;
(2) Key frame instance segmentation step: performing instance segmentation on each key frame in the continuous key frames by using a Mask R-CNN algorithm to obtain object instances contained in each key frame in the continuous key frames, performing semantic annotation on the object instances, and turning to step (4);
(3) Updating the three-dimensional point cloud map: adding the continuous key frames into the initialized three-dimensional point cloud map to update the three-dimensional point cloud map, and turning to the step (4);
(4) Feature fusion step: matching the feature points and object instances extracted from each key frame in the continuous key frames against the feature points and object instances of the previous key frame, and performing local nonlinear optimization on the pose estimation result of each key frame by combining the feature point and object instance matching results, so as to obtain the object instance semantic annotation information carried by each key frame in the continuous key frames;
(5) Constructing a three-dimensional semantic map: and associating the object instance semantic annotation information carried by each key frame in the continuous key frames with the coordinates of each key frame on the three-dimensional point cloud map, so as to realize semantic annotation on the three-dimensional point cloud map and obtain the three-dimensional semantic map.
2. The method for generating the three-dimensional semantic map by the unmanned ship as claimed in claim 1, wherein the step of acquiring the keyframes comprises processing feature points of the extracted continuous image frames to obtain continuous keyframes, and the specific process comprises:
extracting feature points of each image frame by adopting an ORB-SLAM algorithm, and matching the feature points extracted from each image frame with the feature points of the previous image frame to obtain a pose estimation result of the visual sensor when each image frame is acquired; if the matching with the previous image frame fails, the image frame needs to be matched with the previous key frame to obtain a pose estimation result of the image frame, if the matching with the previous key frame fails, the image frame needs to be subjected to global repositioning to obtain a pose estimation result of the image frame, and after the pose estimation result of each image frame is obtained, each image frame is screened to obtain a key frame.
3. The method for unmanned ship to generate three-dimensional semantic map according to claim 2, characterized in that the global relocation is specifically:
and generating a key frame database by the acquired continuous key frames, matching the image frames with all the key frames in the key frame database, and determining the pose estimation result of the image frames.
4. The method for generating the three-dimensional semantic map by the unmanned ship as claimed in claim 2, wherein the filtering of each image frame to obtain the key frame specifically comprises:
selecting the image frames meeting the following conditions as key frames:
the image frames are spaced at least 20 image frames from the last globally repositioned image frame;
the mapping thread is idle, or more than 20 image frames have passed since the last key frame was inserted;
the image frame tracks at least 50 map points;
the number of map points matched by the image frame cannot exceed 90 percent of the number of map points matched by the reference key frame.
5. The method for an unmanned ship to generate a three-dimensional semantic map of claim 4, wherein the mapping thread being idle further comprises:
if the mapping thread is busy when a key frame is to be added, sending a signal to the mapping thread to stop its current task so that the key frame can be processed in time, wherein the mapping thread is the thread in the ORB-SLAM algorithm used for constructing the navigation map of the unmanned ship.
6. The method for unmanned ship to generate three-dimensional semantic map according to claim 4, wherein the map points are specifically:
the visual sensor observes the road sign during the navigation of the unmanned ship.
7. The method for unmanned ship to generate three-dimensional semantic map according to claim 4, wherein the reference key frame is specifically:
the reference key frame is a key frame which is in a key frame database and has the highest matching degree with the feature points of the image frame.
8. The method for unmanned ship to generate a three-dimensional semantic map of claim 1, wherein the keyframe instance segmentation step further comprises:
in the process of performing instance segmentation on the key frames with Mask R-CNN, a boundary refinement module is also used to fit the boundaries of the segmented object instances, so as to obtain more accurate object instance boundary contours.
9. The method for generating a three-dimensional semantic map for an unmanned ship according to claim 1, wherein the key frame instance is segmented and semantically labeled in the key frame instance segmentation step, specifically:
the semantic labels are the categories of the object instances in the key frame and the relationship between them.
10. A system for generating a three-dimensional semantic map for an unmanned ship is characterized by comprising the following units:
acquiring a key frame unit: continuously acquiring image frames by using a visual sensor in the process of sailing of the unmanned ship to obtain an image frame sequence, processing the continuous image frames in the image frame sequence by adopting an ORB-SLAM algorithm to obtain an initialized three-dimensional point cloud map, simultaneously processing the feature points of the extracted continuous image frames to obtain continuous key frames, and respectively switching to a key frame example segmentation unit and a three-dimensional point cloud map updating unit;
Key frame instance segmentation unit: performing instance segmentation on each key frame in the continuous key frames by using a Mask R-CNN algorithm to obtain object instances contained in each key frame in the continuous key frames, performing semantic annotation on the object instances, and then passing to the feature fusion unit;
a three-dimensional point cloud map updating unit: adding the continuous key frames into the initialized three-dimensional point cloud map to update the three-dimensional point cloud map, and transferring to a feature fusion unit;
A feature fusion unit: matching the feature points and object instances extracted from each key frame in the continuous key frames against the feature points and object instances of the previous key frame, and performing local nonlinear optimization on the pose estimation result of each key frame by combining the feature point and object instance matching results, so as to obtain the object instance semantic annotation information carried by each key frame in the continuous key frames;
constructing a three-dimensional semantic map unit: and associating the object instance semantic annotation information carried by each key frame in the continuous key frames with the coordinates of each key frame on the three-dimensional point cloud map, so as to realize semantic annotation on the three-dimensional point cloud map and obtain the three-dimensional semantic map.
CN202111559071.1A 2021-12-20 2021-12-20 Method and system for generating three-dimensional semantic map for unmanned ship Active CN114359493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111559071.1A CN114359493B (en) 2021-12-20 2021-12-20 Method and system for generating three-dimensional semantic map for unmanned ship


Publications (2)

Publication Number Publication Date
CN114359493A CN114359493A (en) 2022-04-15
CN114359493B true CN114359493B (en) 2023-01-03

Family

ID=81100951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111559071.1A Active CN114359493B (en) 2021-12-20 2021-12-20 Method and system for generating three-dimensional semantic map for unmanned ship

Country Status (1)

Country Link
CN (1) CN114359493B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115630185B (en) * 2022-09-23 2024-02-02 深圳市云洲创新科技有限公司 Repositioning method, water surface aircraft and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816686A (en) * 2019-01-15 2019-05-28 山东大学 Robot semanteme SLAM method, processor and robot based on object example match
CN110243370A (en) * 2019-05-16 2019-09-17 西安理工大学 A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning
CN111462135B (en) * 2020-03-31 2023-04-21 华东理工大学 Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation
CN111563442B (en) * 2020-04-29 2023-05-02 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN112288857A (en) * 2020-10-30 2021-01-29 西安工程大学 Robot semantic map object recognition method based on deep learning

Also Published As

Publication number Publication date
CN114359493A (en) 2022-04-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant