CN111767854B - SLAM loop detection method combined with scene text semantic information

SLAM loop detection method combined with scene text semantic information

Info

Publication number
CN111767854B
CN111767854B (application CN202010608535.2A)
Authority
CN
China
Prior art keywords
text
feature
model
slam
current frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010608535.2A
Other languages
Chinese (zh)
Other versions
CN111767854A (en)
Inventor
杨国青 (Yang Guoqing)
李夷奇 (Li Yiqi)
李红 (Li Hong)
吕攀 (Lyu Pan)
吴朝晖 (Wu Zhaohui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010608535.2A
Publication of CN111767854A
Application granted
Publication of CN111767854B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/14 - Image acquisition
    • G06V 30/148 - Segmentation of character regions
    • G06V 30/153 - Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a SLAM loop detection method combining scene text semantic information. A deep neural network is used to extract features from the images provided by the sensor, the texts appearing in the images are detected and recognized, and the feature-point similarity and the text semantic similarity are fused by weighting. To meet the real-time requirement of SLAM under the limited computing resources of embedded platforms, the invention further provides a lightweight text detection model, EAST-light, built on the EAST model, in which the VGG16 model in the feature extraction module is replaced by a ShuffleNet V2 model; this greatly increases the running speed of the model and achieves a better balance between speed and accuracy.

Description

SLAM loop detection method combined with scene text semantic information
Technical Field
The invention belongs to the technical field of simultaneous localization and mapping (SLAM), and particularly relates to a SLAM loop detection method combined with scene text semantic information.
Background
Intelligent mobile robots have attracted wide attention because of their broad application prospects. With the development of artificial intelligence, advances in machine learning and related fields have been integrated into robotics, improving the mobility and intelligence of robots. To play a greater role in industry and everyday life, an intelligent mobile robot needs the capability of autonomous movement, i.e., localization and navigation based on perceived environmental information, which is the problem addressed by Simultaneous Localization and Mapping (SLAM). A robot based on SLAM can localize itself from pose estimates and sensor data while moving, incrementally build a map of the surrounding environment, and on that basis realize path planning, navigation and other functions.
Loop detection is an important link in SLAM: by enabling the robot to recognize a scene it has visited before, it mitigates the drift of pose estimation over time. In visual SLAM, loop detection amounts to finding the similarity between two frame images. Traditional loop detection usually computes similarity with a Bag-of-Words (BoW) model: after hand-crafted visual features are extracted from the images, the BoW model clusters the feature descriptors into words and builds a dictionary, the words contained in each frame are then assembled into a description vector, and whether a loop occurs is judged by the similarity between these vectors. The drawback of the BoW model is that it only considers whether a word appears in an image and ignores the relative spatial position of the word; moreover, it relies entirely on hand-crafted visual features and is prone to error under illumination changes or camera shake.
The vigorous development of deep learning has driven great progress in computer vision; features extracted by neural networks are more robust than hand-crafted ones and represent the original data better. Advances in text detection and recognition also make it possible to exploit text, an element that frequently appears in SLAM scenes, and its semantic information offers a new idea for loop detection. Gaoyang et al., in "Loop Detection for Visual SLAM Systems Using Deep Neural Networks", propose using a deep neural network structure, a stacked auto-encoder, to learn how to extract features from images and to use the learned features for detecting loops. The Chinese patent application No. 201910999570.9 proposes a visual SLAM method based on instance segmentation, which uses Mask R-CNN to perform instance segmentation and builds a semantic map from the image classification results, thereby implementing loop detection. Boying Li et al., in "TextSLAM with Planar Text Features", propose a way of using text information from the scene in SLAM, but text is only treated as a planar feature, and the semantic information contained in the text itself is not well exploited.
In some application scenes of visual SLAM, such as supermarkets, parking lots and shopping malls, text signs appear frequently and contain rich texture features and semantic information. Existing methods do not make full use of the texture and semantic features of such text; if text features could be incorporated into the SLAM method, its performance in these scenes could be expected to improve significantly.
Disclosure of Invention
In view of the above, the invention provides a SLAM loop detection method combined with scene text semantic information, which addresses the problems of loop detection methods based on the bag-of-words model: image features are extracted automatically with a neural network and fused with the semantic information of the text landmarks in the scene and their relative positions in space.
A SLAM loop detection method combined with scene text semantic information comprises the following steps:
(1) building and training a text detection model and a text recognition model based on a lightweight neural network;
(2) acquiring an environment image with a monocular camera, detecting the text in the image with the text detection model, outputting the text box coordinates, and saving the feature map output by the second stage of the feature extraction part of the text detection model;
(3) recognizing the detected text area by using a text recognition model;
(4) calculating the feature information vector and the semantic information vector of the current frame according to the text detection result and the recognition result obtained in step (2) and step (3), and obtaining a total information vector through weighted fusion;
(5) calculating the cosine similarity between the total information vector of each key frame in the key frame set and the total information vector of the current frame, and taking every key frame whose similarity exceeds a certain threshold and that is not directly adjacent to the current frame as a loop candidate frame (a sketch of this decision logic is given after the list of steps);
(6) when three consecutive adjacent loop candidate frames appear, the loop is judged to appear.
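For concreteness, a minimal sketch of the decision logic in steps (5) and (6) is given below, assuming the total information vectors have already been computed for the current frame and the key frame set. The threshold value of 0.9, the use of frame indices to test adjacency, and the reading of step (6) as three candidate key frames with consecutive indices are illustrative assumptions rather than values or definitions fixed by the invention.

```python
import numpy as np

def cosine_similarity(m, n):
    # cos(m, n) = (m . n) / (|m| * |n|)
    return float(np.dot(m, n) / (np.linalg.norm(m) * np.linalg.norm(n)))

def detect_loop(keyframes, current, threshold=0.9, run_length=3):
    """keyframes: list of (frame_index, total_info_vector); current: (frame_index, total_info_vector)."""
    cur_idx, cur_vec = current
    candidates = []
    for idx, vec in keyframes:
        if abs(idx - cur_idx) <= 1:          # skip key frames directly adjacent to the current frame
            continue
        if cosine_similarity(vec, cur_vec) > threshold:
            candidates.append(idx)           # step (5): loop candidate frame
    # step (6), read here as: a loop is declared when `run_length` candidates
    # with consecutive frame indices are found
    candidates.sort()
    run = 1
    for prev, nxt in zip(candidates, candidates[1:]):
        run = run + 1 if nxt == prev + 1 else 1
        if run >= run_length:
            return True
    return False
```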
Further, to meet the real-time requirement of SLAM under the limited computing resources of embedded platforms, step (1) improves on the EAST (Efficient and Accurate Scene Text) model to obtain a text detection model based on a lightweight neural network: the input is a picture, a full convolution network directly predicts the regions of the picture corresponding to text information, non-maximum suppression is applied to the predicted regions whose score exceeds a set threshold, and the result of the non-maximum suppression is the final output of the model, namely the text box coordinates on the picture.
Further, step (1) adopts a CRNN (Convolutional Recurrent Neural Network) model as the text recognition model based on a lightweight neural network.
Further, the full convolution network comprises three parts: feature extraction, feature fusion and an output layer. The feature extraction part adopts a ShuffleNet V2 model and outputs feature maps of four levels, f1, f2, f3 and f4, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the original image, respectively.
Further, the feature fusion part performs stage-by-stage feature fusion on the four levels of feature maps f1, f2, f3 and f4 output by the ShuffleNet V2 model. There are three feature fusion stages. In each stage, the feature map from the previous stage is first up-sampled to the same size as the current feature map, the two are then concatenated along the channel direction, a 1 × 1 convolutional layer reduces the number of channels of the concatenated feature map to cut the amount of computation, and finally a 3 × 3 convolutional layer fuses the information to produce the result of the current stage. After the last feature fusion stage, a 3 × 3 convolutional layer generates the final feature map, which is fed to the output layer. The numbers of channels of the 1 × 1 convolutional layers in the three feature fusion stages are 1256, 244 and 88, respectively; the numbers of channels of the 3 × 3 convolutional layers in the three stages are 128, 1256 and 32, respectively; and the 3 × 3 convolutional layer after the last feature fusion stage has 32 channels.
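To make the fusion procedure concrete, a minimal PyTorch sketch of one feature fusion stage is given below: upsample the incoming map, concatenate it with the current-level map along the channel dimension, reduce the channel count with a 1 × 1 convolution, then fuse with a 3 × 3 convolution. The channel counts are passed in as parameters; the class name and the use of bilinear upsampling are illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStage(nn.Module):
    """One feature fusion stage of the EAST-light full convolution network (sketch)."""
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        # 1x1 convolution reduces the channel count of the concatenated map
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        # 3x3 convolution fuses the information and produces the stage result
        self.fuse = nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, previous, current):
        # upsample the map from the previous stage to the size of the current map
        x = F.interpolate(previous, size=current.shape[2:], mode='bilinear', align_corners=False)
        # concatenate along the channel direction, then reduce and fuse
        x = torch.cat([x, current], dim=1)
        x = self.reduce(x)
        return self.fuse(x)
```

Chaining three such stages from f1 (the 1/32-scale map) through f2, f3 and f4, followed by a final 3 × 3 convolution with 32 output channels, reproduces the fusion branch described above.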
Further, in step (4), for the current frame, the feature map f2 output by the second stage of the feature extraction part of the text detection model is taken and global average pooling is performed over each of its channels to obtain the feature information vector f of the current frame, in which each element is the average value of the corresponding channel of f2.
Further, in step (4), for the current frame, its semantic information is described by a vector, denoted t = [e1, e2, …, eN], where ei = [pi, x1i, y1i, x2i, y2i], N is the number of text landmarks, and ei describes the information of the i-th text landmark in the current frame: pi indicates whether the i-th text landmark appears in the current frame, pi = 1 if it appears and pi = 0 otherwise; (x1i, y1i) and (x2i, y2i) are the coordinates of the upper-left and lower-right corners of the text box corresponding to the i-th text landmark in the current frame. This information is output by the trained text detection model and text recognition model.
Further, in step (4), the feature information vector f and the semantic information vector t are weighted and fused by the formula s = λt + f to obtain the total information vector s, where λ is the weight of the semantic information vector t and may be set to 0.1.
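To make step (4) concrete, a minimal numerical sketch is given below, assuming the feature map f2 is available as an array of shape (C, H, W) and that the detected and recognized texts have already been matched to a fixed list of known text landmarks. The dictionary-based detection format, the zero box coordinates for absent landmarks, and the zero-padding that brings f and t to a common length before computing s = λt + f are illustrative assumptions; the patent text does not spell out how the two vectors are aligned.

```python
import numpy as np

def feature_vector(f2):
    # global average pooling over each channel of the stage-2 feature map, shape (C, H, W)
    return f2.mean(axis=(1, 2))

def semantic_vector(detections, num_landmarks):
    # detections: {landmark_index: (x1, y1, x2, y2)} for the text landmarks recognized in the frame
    t = []
    for i in range(num_landmarks):
        if i in detections:
            x1, y1, x2, y2 = detections[i]
            t.extend([1.0, x1, y1, x2, y2])      # p_i = 1 plus the box corners
        else:
            t.extend([0.0, 0.0, 0.0, 0.0, 0.0])  # p_i = 0; zero box coordinates are an assumption
    return np.asarray(t, dtype=np.float64)

def total_vector(f, t, lam=0.1):
    # assumption: the shorter vector is zero-padded so that s = lam * t + f is well defined
    size = max(f.size, t.size)
    f = np.pad(f, (0, size - f.size))
    t = np.pad(t, (0, size - t.size))
    return lam * t + f
```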
Further, in step (5), for two total information vectors m and n, their cosine similarity cos(m, n) is calculated by the following formula:
cos(m, n) = (m · n) / (‖m‖ ‖n‖)
The first step in using text information from the scene in visual SLAM is to extract it from the images captured by the camera sensor. To avoid the limitations of hand-crafted visual features, the invention uses a deep neural network to extract image features automatically, and, given the limited computing resources of the embedded platforms on which visual SLAM algorithms are usually deployed, designs a text detection model EAST-light based on a lightweight neural network that satisfies the real-time requirement of SLAM. The method uses the EAST-light model to extract image features and detect text at the same time, uses another neural network model, CRNN, to recognize the text, and applies the extracted image features and the text semantic information to SLAM loop detection. The coordinate information of every detected text object is added to the feature vector, which overcomes the limitation that semantic information alone cannot fully represent the image and improves the accuracy of loop detection.
Compared with the prior art, the invention has the following advantages:
1. The invention provides a SLAM loop detection method combined with scene text semantic information, which uses a deep neural network to extract image features.
2. The method detects and recognizes the text objects appearing in the scene and extracts their semantic information. Compared with visual features such as ORB (Oriented FAST and Rotated BRIEF), the semantic information in an image is a more stable quantity: when dynamic interference exists in the scene, its influence on the semantic information is smaller than on the visual features. Visual loop detection is essentially an algorithm that computes the similarity of image data; properly weighted fusion of the feature-point similarity and the semantic similarity improves the accuracy of the similarity judgment, thereby improving the accuracy of loop detection and enhancing the robustness of the SLAM system.
3. For the real-time requirement of SLAM and the limited computing resources of embedded platforms, the invention provides the lightweight text detection model EAST-light by improving the EAST model: EAST-light replaces the VGG16 feature extraction network of EAST with a ShuffleNet V2 network, which greatly increases the running speed. Processing an image with a resolution of 512 × 512 on a Jetson TX2 development board takes 0.42 s with the EAST model but only 0.06 s with EAST-light; on the public ICDAR2015 test set, the accuracy of EAST is 80.46% and that of EAST-light is 71.54%, so EAST-light achieves a better balance between speed and accuracy.
Drawings
FIG. 1 is a schematic flow chart of EAST-light model according to the present invention.
FIG. 2 is a schematic diagram of a full convolutional network structure in EAST-light of the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
The invention relates to a SLAM loop detection method combined with scene text semantic information, which comprises the following steps:
step 1: and constructing and training a text detection and recognition model based on the lightweight neural network model.
The invention provides the lightweight text detection model EAST-light on the basis of the EAST model. As shown in Fig. 1, EAST-light consists of two parts, a multi-channel full convolution network and non-maximum suppression: the full convolution network directly predicts the text regions, non-maximum suppression is applied to those predicted regions whose score exceeds a preset threshold, and the result of the non-maximum suppression is the final output of the model, namely the coordinates of the detected text boxes in the picture.
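As an illustration of the thresholding and non-maximum suppression step, a simplified sketch is given below. It treats the predicted boxes as axis-aligned rectangles and uses plain IoU-based suppression; the original EAST pipeline also decodes a rotation angle and uses locality-aware NMS, so this is an approximation under stated assumptions rather than the exact procedure of the model.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def postprocess(boxes, scores, score_thresh=0.8, nms_thresh=0.2):
    """boxes: (N, 4) array predicted by the FCN; scores: (N,) text-region probabilities."""
    keep_mask = scores > score_thresh          # keep only predictions above the score threshold
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)                # highest score first
    kept = []
    for idx in order:
        if all(iou(boxes[idx], boxes[j]) < nms_thresh for j in kept):
            kept.append(idx)
    return boxes[kept]                         # final text box coordinates
```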
As shown in Fig. 2, the full convolution network is divided into three parts: a feature extraction part, a feature fusion part and an output layer.
The feature extraction part adopts a ShuffleNet V2 model and outputs feature maps of four levels, f1, f2, f3 and f4, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the original image, respectively.
The feature fusion part performs stage-by-stage feature fusion on the four levels of feature maps f1, f2, f3 and f4 output by the ShuffleNet V2 model; there are 3 feature fusion stages in total. In each stage, the feature map from the previous stage is first up-sampled to the same size as the current feature map, the two are then concatenated along the channel direction, a 1 × 1 convolutional layer reduces the number of channels and the amount of computation, and finally a 3 × 3 convolutional layer fuses the information to generate the result of the stage. After the last feature fusion stage, a 3 × 3 convolutional layer generates the final feature map, which is fed to the output layer. The numbers of channels of the 1 × 1 convolutional layers in the 3 feature fusion stages are 1256, 244 and 88, respectively; the numbers of channels of the 3 × 3 convolutional layers in the 3 stages are 128, 1256 and 32, respectively; the 3 × 3 convolutional layer after the last feature fusion stage has 32 channels. The specific network settings and output sizes of each stage are shown in Table 1:
TABLE 1
(Table 1, listing the network settings and output sizes of each stage, is provided as an image in the original publication.)
The output layer outputs, for each pixel of the image, the probability that it belongs to a text region together with the geometric information of the text box. The geometry is represented by a 4-dimensional axis-aligned bounding box (AABB) parameter R and a 1-dimensional rotation angle θ, where the 4 components of R are the distances from the pixel to the top, right, bottom and left boundaries of the rectangular box, respectively.
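A minimal PyTorch sketch of such an output head is given below: a 1-channel score map giving the per-pixel probability of belonging to a text region, a 4-channel geometry map giving the distances to the top, right, bottom and left box boundaries, and a 1-channel rotation angle θ. The default of 32 input channels matches the final feature map described above; the maximum distance used to scale the regression output and the angle range are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """EAST-style output head: score map, AABB distances and rotation angle (sketch)."""
    def __init__(self, in_channels=32, max_distance=512.0):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)  # per-pixel text-region probability
        self.geo = nn.Conv2d(in_channels, 4, kernel_size=1)    # distances to top/right/bottom/left
        self.angle = nn.Conv2d(in_channels, 1, kernel_size=1)  # rotation angle theta
        self.max_distance = max_distance

    def forward(self, x):
        score = torch.sigmoid(self.score(x))
        # distances are non-negative and bounded by an assumed maximum box size
        distances = torch.sigmoid(self.geo(x)) * self.max_distance
        # angle mapped to (-pi/2, pi/2); the exact range is an assumption
        theta = (torch.sigmoid(self.angle(x)) - 0.5) * math.pi
        return score, distances, theta
```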
After the model is built with the open-source deep learning framework PyTorch, images of the application scene are collected with a monocular camera to make a data set; the data set is used to train the text detection model EAST-light and the text recognition model CRNN on a computer with a GPU, and the trained model weights are saved.
Step 2: an NVIDIA Jetson TX2 development board is used as the computing platform of the SLAM loop detection method. It receives as input the environment images collected by the monocular camera sensor, detects the text in the image with the text detection model EAST-light, outputs the text box coordinates, and saves the feature map f2 output by the second stage of the feature extraction network ShuffleNet V2 of the text detection model.
Step 3: the detected text regions are recognized with the text recognition model CRNN.
Step 4: according to the text detection and recognition results obtained in steps 2 and 3, global average pooling is performed over each channel of the feature map f2 from the second stage of the ShuffleNet V2 model to obtain the feature information vector f, in which each element is the average value of the corresponding channel of f2.
The semantic information of the image is described by a vector; the text semantic information vector is denoted t = [e1, e2, …, eN], where ei = [pi, x1i, y1i, x2i, y2i], N is the number of text landmarks, ei describes the information of the i-th text landmark in the image, pi indicates whether the i-th text landmark appears in the image, and (x1i, y1i) and (x2i, y2i) are the coordinates of the upper-left and lower-right corners of its text box. This information is output by the trained text detection and recognition models.
The total information vector is obtained by weighted fusion of the feature information vector and the semantic information vector of the current frame: s = λt + f, where λ is the weight of the semantic information vector and may be set to 0.1. Similarity is measured by the cosine value; the cosine similarity between a vector m and a vector n is
cos(m, n) = (m · n) / (‖m‖ ‖n‖)
Step 5: for each key frame in the key frame set, the cosine similarity between its total information vector si and the total information vector sj of the current frame is calculated as
cos(si, sj) = (si · sj) / (‖si‖ ‖sj‖)
and every key frame whose similarity exceeds a certain threshold and that is not directly adjacent to the current frame is taken as a loop candidate frame.
Step 6: if three consecutive adjacent loop candidate frames occur, a loop is considered to occur.
The method uses a deep neural network to extract image features, which are more robust than hand-crafted features and represent the original data better when the scene changes slightly. At the same time it detects and recognizes the text objects appearing in the scene, extracts their semantic information, and performs weighted fusion of the feature-point similarity and the semantic similarity, which improves the precision of the similarity judgment, raises the accuracy of loop detection and enhances the robustness of the SLAM system. Compared with EAST, the EAST-light model greatly increases the running speed and achieves a better balance between speed and accuracy under the real-time requirement of SLAM and the limited computing resources of embedded platforms.
The embodiments described above are presented to enable a person of ordinary skill in the art to make and use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art based on this disclosure fall within the protection scope of the present invention.

Claims (5)

1. A SLAM loop detection method combined with scene text semantic information comprises the following steps:
(1) building and training a text detection model and a text recognition model based on a lightweight neural network;
aiming at the real-time requirement of SLAM and the limited computing resources of embedded platforms, the EAST model is improved to obtain a text detection model based on a lightweight neural network, specifically: the input is a picture, a full convolution network directly predicts the regions of the picture corresponding to text information, non-maximum suppression is applied to the predicted regions whose score exceeds a set threshold, and the result of the non-maximum suppression is the final output of the model, namely the coordinates of the text boxes in the picture;
the full convolution network comprises three parts: feature extraction, feature fusion and an output layer; the feature extraction part adopts a ShuffleNet V2 model and outputs feature maps of four levels, f1, f2, f3 and f4, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the original image, respectively; the feature fusion part performs stage-by-stage feature fusion on the four levels of feature maps f1, f2, f3 and f4 output by the ShuffleNet V2 model, with three feature fusion stages in total; in each stage, the feature map from the previous stage is first up-sampled to the same size as the current feature map, the two are then concatenated along the channel direction, a 1 × 1 convolutional layer reduces the number of channels of the concatenated feature map to cut the amount of computation, and finally a 3 × 3 convolutional layer fuses the information to produce the result of the current stage; after the last feature fusion stage, a 3 × 3 convolutional layer generates the final feature map, which is fed to the output layer; the numbers of channels of the 1 × 1 convolutional layers in the three feature fusion stages are 1256, 244 and 88, respectively, the numbers of channels of the 3 × 3 convolutional layers in the three stages are 128, 1256 and 32, respectively, and the 3 × 3 convolutional layer after the last feature fusion stage has 32 channels;
(2) acquiring an environment image with a monocular camera, detecting the text in the image with the text detection model, outputting the text box coordinates, and saving the feature map output by the second stage of the feature extraction part of the text detection model;
(3) recognizing the detected text area by using a text recognition model;
(4) calculating the feature information vector and the semantic information vector of the current frame according to the text detection result and the recognition result obtained in step (2) and step (3), and obtaining a total information vector through weighted fusion;
for the current frame, the feature map f2 output by the second stage of the feature extraction part of the text detection model is taken, and global average pooling is performed over each of its channels to obtain the feature information vector f of the current frame, in which each element is the average value of the corresponding channel of f2;
(5) calculating the cosine similarity of the total information vector of any key frame in the key frame set and the total information vector of the current frame, and taking the key frame which has the similarity larger than a certain threshold and is not directly adjacent to the current frame as a loop candidate frame;
(6) when three consecutive adjacent loop candidate frames appear, the loop is judged to appear.
2. The SLAM loop detection method according to claim 1, wherein: step (1) adopts a CRNN model as the text recognition model based on a lightweight neural network.
3. The SLAM loop detection method according to claim 1, wherein: in step (4), for the current frame, its semantic information is described by a vector, denoted t = [e1, e2, …, eN], where ei = [pi, x1i, y1i, x2i, y2i], N is the number of text landmarks, ei describes the information of the i-th text landmark in the current frame, pi indicates whether the i-th text landmark appears in the current frame, pi = 1 if it appears and pi = 0 otherwise, and (x1i, y1i) and (x2i, y2i) are the coordinates of the upper-left and lower-right corners of the text box corresponding to the i-th text landmark in the current frame; this information is output by the trained text detection model and text recognition model.
4. The SLAM loop detection method according to claim 1, wherein: in step (4), the feature information vector f and the semantic information vector t are weighted and fused by the formula s = λt + f to obtain the total information vector s, where λ is the weight of the semantic information vector t.
5. The SLAM loop detection method according to claim 1, wherein: in step (5), for two total information vectors m and n, their cosine similarity cos(m, n) is calculated by the following formula:
cos(m, n) = (m · n) / (‖m‖ ‖n‖)
CN202010608535.2A 2020-06-29 2020-06-29 SLAM loop detection method combined with scene text semantic information Active CN111767854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010608535.2A CN111767854B (en) 2020-06-29 2020-06-29 SLAM loop detection method combined with scene text semantic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010608535.2A CN111767854B (en) 2020-06-29 2020-06-29 SLAM loop detection method combined with scene text semantic information

Publications (2)

Publication Number Publication Date
CN111767854A CN111767854A (en) 2020-10-13
CN111767854B true CN111767854B (en) 2022-07-01

Family

ID=72724102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010608535.2A Active CN111767854B (en) 2020-06-29 2020-06-29 SLAM loop detection method combined with scene text semantic information

Country Status (1)

Country Link
CN (1) CN111767854B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784942B (en) * 2020-12-29 2022-08-23 浙江大学 Special color block coding method for positioning navigation in large-scale scene
CN112990220B (en) * 2021-04-19 2022-08-05 烟台中科网络技术研究所 Intelligent identification method and system for target text in image
CN113723379A (en) * 2021-11-02 2021-11-30 深圳市普渡科技有限公司 Artificial intelligence device, visual positioning method, device and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784232A (en) * 2018-12-29 2019-05-21 佛山科学技术学院 A kind of vision SLAM winding detection method and device merging depth information
CN111126404A (en) * 2019-12-11 2020-05-08 杭州电子科技大学 Ancient character and font identification method based on improved YOLO v3

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784232A (en) * 2018-12-29 2019-05-21 佛山科学技术学院 A kind of vision SLAM winding detection method and device merging depth information
CN111126404A (en) * 2019-12-11 2020-05-08 杭州电子科技大学 Ancient character and font identification method based on improved YOLO v3

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Loop Closure Detection for Visual SLAM Fusing Semantic Information; Mingyue Hu et al.; Proceedings of the 38th Chinese Control Conference; 2019-07-30; pp. 4136-4238 *

Also Published As

Publication number Publication date
CN111767854A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
Yuan et al. VSSA-NET: Vertical spatial sequence attention network for traffic sign detection
CN110084850B (en) Dynamic scene visual positioning method based on image semantic segmentation
Neubert et al. Superpixel-based appearance change prediction for long-term navigation across seasons
Mattyus et al. Enhancing road maps by parsing aerial images around the world
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
US20200364554A1 (en) Systems and methods for deep localization and segmentation with a 3d semantic map
CN110738673A (en) Visual SLAM method based on example segmentation
CN111080659A (en) Environmental semantic perception method based on visual information
Matzen et al. Nyc3dcars: A dataset of 3d vehicles in geographic context
CN106951830B (en) Image scene multi-object marking method based on prior condition constraint
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN108564120B (en) Feature point extraction method based on deep neural network
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
Lu et al. Cascaded multi-task road extraction network for road surface, centerline, and edge extraction
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN115273032A (en) Traffic sign recognition method, apparatus, device and medium
Lowphansirikul et al. 3D Semantic segmentation of large-scale point-clouds in urban areas using deep learning
Zhang et al. Improved Lane Detection Method Based on Convolutional Neural Network Using Self-attention Distillation.
Wilson et al. Visual and object geo-localization: A comprehensive survey
Abdigapporov et al. Joint multiclass object detection and semantic segmentation for autonomous driving
Rong et al. Guided text spotting for assistive blind navigation in unfamiliar indoor environments
CN117370498B (en) Unified modeling method for 3D open vocabulary detection and closed caption generation
Park et al. Estimating the camera direction of a geotagged image using reference images
Chen et al. Occlusion and multi-scale pedestrian detection A review
CN116563553B (en) Unmanned aerial vehicle image segmentation method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant