CN113591770A - Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding - Google Patents

Info

Publication number
CN113591770A
CN113591770A (application CN202110913691.4A)
Authority
CN
China
Prior art keywords
feature map
neural network
images
convolutional neural
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110913691.4A
Other languages
Chinese (zh)
Other versions
CN113591770B (en)
Inventor
秦文健
张旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110913691.4A priority Critical patent/CN113591770B/en
Publication of CN113591770A publication Critical patent/CN113591770A/en
Priority to PCT/CN2021/138104 priority patent/WO2023015799A1/en
Application granted granted Critical
Publication of CN113591770B publication Critical patent/CN113591770B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multimodal fusion obstacle detection method based on artificial-intelligence blind guiding, comprising the following steps: an infrared camera and a color camera respectively acquire an infrared image and a color image of a scene; the acquired infrared and color bimodal images are transmitted to a convolutional neural network Q1 and a convolutional neural network Q2 respectively, which convert the images into a first multichannel feature map and a second multichannel feature map that are subsequently flattened into vectors; the first and second multichannel feature maps are vectorized, and the resulting feature-vector sequence is encoded to generate a plurality of prediction vectors; and classification and position prediction are performed on the generated prediction vectors. By introducing a Transformer structure (a Transformer block) into the obstacle-detection process, the invention realizes multimodal fusion more effectively: the features of the infrared and color images are fully fused, and the obstacle-detection accuracy in low-illumination scenes is improved.

Description

Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding
Technical Field
The invention relates to the technical field of natural image processing, in particular to a multimode fusion obstacle detection method and device based on artificial intelligence blind guiding.
Background
According to statistics from the China Disabled Persons' Federation, there are at least five million blind people in China at present, and the number increases year by year as the population ages. Guiding the blind has long been a hot research problem, and intelligent blind guiding has always been the solution pursued by researchers. Since the breakout of artificial intelligence this century, that pursuit has increasingly become reality: the emergence of deep learning and convolutional neural networks allows computer vision to gradually replace traditional blind-guiding technologies that rely on ultrasound and the like for obstacle avoidance, and addresses the complexity of the obstacle-detection problem.
At present, most of the latest blind-guiding technologies that apply deep object detection upload the collected images to a server, process them with a network trained by supervised or unsupervised methods, and guide the blind by combining other sensing information. This approach makes full use of deep learning's strength in processing complex images and performs well in common blind-guiding situations: experiments show that such devices can accurately identify common objects in the daily scenes of the blind, such as garbage cans, chairs and people. Although this type of method works well, its detection results are unsatisfactory in dark scenes. Most vision-based blind-guiding technologies train their networks on color images taken under bright illumination, and a bright image of a dark scene is difficult to obtain. One solution is multimodal image fusion: an infrared image and an ordinary color image of the dark scene are acquired, and a more reliable detection result is obtained by separately extracting and then fusing the features of the two. In dark scenes the effectiveness of color-image features is greatly reduced and object contours are hard to identify, whereas an infrared image captures contour information much more easily. Fusing the features the neural network extracts from the two images can therefore greatly improve its object-detection performance. Most existing multimodal image fusion is based on CNNs, which cannot fully fuse multimodal features; a Transformer structure is therefore introduced so that the image features of different modalities can be fully fused and the detection accuracy improved.
At present, the obstacle-detection methods used by blind-guiding equipment fall into four categories: traditional non-vision methods, traditional machine-vision methods, machine-vision methods based on deep learning, and CNN-based multimodal methods.
(1) Traditional non-vision approaches mostly use only ultrasonic and infrared sensors; their judgment of an obstacle is limited to bearing and distance, and their precision is low.
(2) Traditional machine vision mainly uses hand-written algorithms to identify target features in an image; such methods transfer poorly to new settings and are not intelligent.
(3) Machine vision based on deep learning trains on a data set to learn image features, can recognize images of various scenes and perform object detection, and detects well; in a dark scene, however, the color image carries little object information and obstacles are difficult to detect effectively.
(4) CNN-based multimodal obstacle detection extracts and fuses infrared and color bimodal image features to detect obstacles better, but the features cannot be fully fused.
Disclosure of Invention
The invention aims to introduce a Transformer structure into the obstacle-detection process so as to realize multimodal fusion more effectively: a Transformer block is introduced, the features of the infrared and color images are fully fused, and the obstacle-detection accuracy in low-illumination situations is improved.
In a first aspect, the invention provides a multimode fusion obstacle detection method based on artificial intelligence blind guiding, which comprises the following steps:
the method comprises the steps that an infrared camera and a color camera are respectively responsible for acquiring an infrared image and a color image of a scene;
the acquired infrared and color bimodal images are transmitted to a convolutional neural network Q1 and a convolutional neural network Q2 respectively, and Q1 and Q2 convert the images into a first multichannel feature map and a second multichannel feature map, which are subsequently flattened into vectors;
vectorizing the first multichannel feature map and the second multichannel feature map, and performing feature-vector encoding on the resulting sequence to generate a plurality of prediction vectors;
and performing classification and position prediction on the generated multiple prediction vectors.
Preferably, transmitting the acquired infrared and color bimodal images to the convolutional neural networks Q1 and Q2, converting them into the first and second multichannel feature maps, and subsequently flattening them into vectors specifically comprises: scaling, padding and deforming color or infrared images of different specifications to a size of 227 × 227, inputting them into a VGG-16 backbone network with the fully connected layers cut off, and obtaining 512 feature maps of size 7 × 7 after convolution and pooling.
Preferably, vectorizing the first and second multichannel feature maps, performing feature-vector encoding on their sequence, and generating the plurality of prediction vectors comprises: first flattening the first and second multichannel feature maps of the infrared and color images to obtain 512 × 49 feature maps; then regarding these as 49 512-dimensional feature vectors, so that the patches can fully attend to one another at the pixel level; and finally splicing the vectors of the two modalities into 98 feature vectors of dimension 512.
Preferably, classifying and predicting the position of the generated prediction vectors specifically comprises performing a loss calculation on the plurality of prediction vectors with a set loss function and the labels.
Preferably, before the loss calculation is performed on the plurality of prediction vectors with the set loss function and the labels, a bipartite-graph matching method is used to find the best match between each prediction vector and a label; the category loss is then calculated with cross entropy and added to the position loss, calculated by regression, to give the global loss.
In a second aspect, the invention further provides a multimode fusion obstacle detection device based on artificial intelligence blind guiding, which comprises
The image acquisition module consists of an infrared camera and a color camera and is used for respectively acquiring an infrared image and a color image of a scene;
the feature extraction module is used for transmitting the acquired infrared and color bimodal images to a convolutional neural network Q1 and a convolutional neural network Q2 respectively, wherein Q1 and Q2 convert the images into a first multichannel feature map and a second multichannel feature map, which are subsequently flattened into vectors;
the feature fusion module is used for vectorizing and representing the first multi-channel feature map and the second multi-channel feature map, and performing feature vector coding on the first multi-channel feature map and the second multi-channel feature map sequence to generate a plurality of prediction vectors;
a classification module to classify and position predict the generated plurality of prediction vectors.
Preferably, the feature fusion module comprises an encoder and a decoder.
Preferably, the encoder comprises an embedded-token layer, a regularization layer, a multi-head self-attention layer and a feed-forward neural network layer; the decoder comprises a regularization layer, a multi-head self-attention layer and a feed-forward neural network layer.
The method of the invention has the following advantages:
in the invention, a Transformer structure is introduced into the obstacle-detection process to realize multimodal fusion more effectively; a Transformer block is introduced, the features of the infrared and color images are fully fused, and the obstacle-detection accuracy in low-illumination situations is improved.
Drawings
Fig. 1 is a flow chart of a multimode fusion obstacle detection method based on artificial intelligence blind guiding provided by the invention.
Fig. 2 is a schematic diagram of a sensor space structure of the multi-modal fusion obstacle detection method based on artificial intelligence blind guiding provided by the invention.
FIG. 3 is a schematic diagram of a feature fusion module provided in the present invention.
Detailed Description
The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention. In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
As shown in fig. 1, the invention provides a multi-modal fusion obstacle detection method based on artificial intelligence blind guiding,
the method comprises the following steps:
s1, acquiring an infrared image and a color image of a scene through the infrared camera and the color camera respectively;
s2, respectively transmitting the acquired infrared and color bimodal images to a convolutional neural network Q1 and a convolutional neural network Q2, respectively converting the images into a first multichannel feature map and a second multichannel feature map by the convolutional neural network Q1 and the convolutional neural network Q2, and flattening the images into vectors in preparation for later use;
s3, vectorizing and representing the first multichannel feature map and the second multichannel feature map, and performing feature vector coding on the first multichannel feature map and the second multichannel feature map sequence to generate a plurality of prediction vectors;
and S4, classifying and predicting the position of the generated prediction vectors.
Transmitting the acquired infrared and color bimodal images to the convolutional neural networks Q1 and Q2, converting them into the first and second multichannel feature maps, and subsequently flattening them into vectors specifically comprises: scaling, padding and deforming color or infrared images of different specifications to a size of 227 × 227, inputting them into a VGG-16 backbone network with the fully connected layers cut off, and obtaining 512 feature maps of size 7 × 7 after convolution and pooling.
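The backbone step above can be sketched as follows. This is an illustrative PyTorch stand-in, not the patented network: the layer widths and depths are chosen only so that a 227 × 227 input yields the described 512 feature maps of size 7 × 7 once the fully connected layers are removed.

```python
# Illustrative sketch of the feature-extraction step: a VGG-style convolutional
# stack (fully connected layers cut off) mapping a 3x227x227 image to a
# 512-channel 7x7 feature map. Layer widths are assumptions, not the patent's.
import torch
import torch.nn as nn

class TruncatedBackbone(nn.Module):
    """Maps one 3x227x227 color or infrared image to 512 feature maps of 7x7."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # -> 113
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 56
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # -> 28
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # -> 14
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # -> 7
        )

    def forward(self, x):
        return self.features(x)

img = torch.randn(1, 3, 227, 227)   # one scaled-and-padded input image
fmap = TruncatedBackbone()(img)
print(fmap.shape)                   # torch.Size([1, 512, 7, 7])
```

In practice one would load torchvision's pretrained VGG-16 and keep only its convolutional `features` part; the hand-rolled stack here merely mirrors the output shape.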
Vectorizing the first and second multichannel feature maps, performing feature-vector encoding on their sequence, and generating the plurality of prediction vectors comprises: first flattening the first and second multichannel feature maps of the infrared and color images to obtain 512 × 49 feature maps; then regarding these as 49 512-dimensional feature vectors, so that the patches can fully attend to one another at the pixel level; and finally splicing the vectors of the two modalities into 98 feature vectors of dimension 512.
Classifying and predicting the position of the generated prediction vectors specifically comprises performing a loss calculation on the plurality of prediction vectors with a set loss function and the labels.
Before the loss calculation is performed on the plurality of prediction vectors with the set loss function and the labels, a bipartite-graph matching method is used to find the best match between each prediction vector and a label; the category loss is then calculated with cross entropy and added to the position loss, calculated by regression, to give the global loss.
As shown in fig. 2, the invention further provides a multi-modal fusion obstacle detection device based on artificial intelligence blind guiding, which comprises an image acquisition module, a feature extraction module, a feature fusion module and a classification module.
An image acquisition module: the system consists of an infrared camera and a color camera which are respectively responsible for acquiring an infrared image and a color image of a scene.
The feature extraction module is used for transmitting the acquired infrared and color bimodal images to the convolutional neural network Q1 and the convolutional neural network Q2 respectively; Q1 and Q2 convert the images into a first multichannel feature map and a second multichannel feature map, which are subsequently flattened into vectors. In this module, the convolutional neural network can be a classic CNN framework such as VGG-16. Color images of different specifications are transformed by scaling and padding into 227 × 227 images and input into a VGG-16 backbone network with the fully connected layers cut off; 512 feature maps of size 7 × 7 are obtained after convolution and pooling. Similarly, 512 feature maps of 7 × 7 are obtained for the infrared image.
The feature fusion module (a Transformer block) is configured to vectorize the first and second multichannel feature maps, perform feature-vector encoding on them, and generate a plurality of prediction vectors. As shown in fig. 3, the feature fusion module mainly comprises an encoder and a decoder. The encoder comprises an embedded-token layer, a regularization layer, a multi-head self-attention layer and a feed-forward neural network layer; the decoder comprises a regularization layer, a multi-head self-attention layer and a feed-forward neural network layer.
The function of the embedded tokens is to vectorize the image so that it matches the input form expected by the Transformer encoder. The Transformer was originally a model for natural language processing (NLP), whose encoder takes word vectors as input; when a Transformer is used to process images, the image information must therefore be converted into vector form before being input to the encoder. Here the embedded tokens are the vector forms obtained by passing the two modal images through the convolutional neural networks, and thus conform to the encoder's input form.
The module first flattens the multichannel feature maps of the infrared and color images separately to obtain 512 × 49 feature maps, then regards them as 49 512-dimensional feature vectors, so that the patches can fully attend to one another at the pixel level. The vectors of the two modalities are then spliced into 98 feature vectors of dimension 512 and input into the Transformer block; the encoder encodes the feature vectors and sends the results to the classification module. The number of encoders is selectable, and stacking more of them improves performance to a certain extent.
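The flattening and splicing just described amounts to a few tensor reshapes. A minimal PyTorch sketch, with random tensors standing in for the backbone outputs:

```python
# Sketch of the flatten-and-splice step: 512x7x7 maps from each modality become
# 49 tokens of dimension 512, and the two token sets are concatenated into 98.
import torch

f_ir  = torch.randn(512, 7, 7)            # infrared feature maps (from Q1)
f_rgb = torch.randn(512, 7, 7)            # color feature maps (from Q2)

t_ir  = f_ir.flatten(1).t()               # 512x49, transposed: 49 tokens of dim 512
t_rgb = f_rgb.flatten(1).t()              # likewise for the color modality
tokens = torch.cat([t_ir, t_rgb], dim=0)  # splice the modalities: 98 x 512
```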
In each encoder, each feature vector is added to its corresponding position-encoding vector and input into a multi-head attention layer composed of several single-head self-attentions. The single-head case works as follows: assuming the input is X, single-head attention applies three different linear transformations Wq, Wk and Wv to X; the three results are called the query, key and value, written Q = WqX, K = WkX and V = WvX. Each Q is then multiplied with each K, giving QK^T, which passes through a softmax layer and is multiplied with the corresponding V to obtain the final result O = softmax(QK^T)V.
This is the attention of a single head; a multi-head layer cuts the input X into n segments, applies the linear transformations to each segment separately, and splices the results back together after the transformation.
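Under the row-vector convention (X @ W rather than WX), the single-head formula O = softmax(QK^T)V and the segment-wise multi-head variant described above can be sketched as follows; the head count n = 8 and the scaling of the random weights are illustrative assumptions:

```python
# Single-head self-attention and a segment-wise multi-head variant, exactly
# following the text's recipe (cut X into n segments, attend, splice back).
import torch

def self_attention(X, Wq, Wk, Wv):
    # O = softmax(Q K^T) V, with Q = X Wq, K = X Wk, V = X Wv (row vectors)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return torch.softmax(Q @ K.t(), dim=-1) @ V

d, n = 512, 8                                  # n = 8 heads is an assumption
X = torch.randn(98, d)                         # the 98 spliced feature vectors
W = [torch.randn(d, d) / d**0.5 for _ in range(3)]
O = self_attention(X, *W)                      # single-head result

# multi-head: each of the n segments gets its own smaller Wq/Wk/Wv
Wh = [[torch.randn(d // n, d // n) / (d // n)**0.5 for _ in range(3)]
      for _ in range(n)]
multi = torch.cat([self_attention(seg, *w)
                   for seg, w in zip(X.split(d // n, dim=1), Wh)], dim=1)
```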
The previous input X and the output O are then joined by a residual connection and passed through a layer norm (regularization layer); the normalized result is input into a fully connected feed-forward neural network layer, and after another residual connection and regularization one encoder layer is complete.
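One encoder layer as just described (attention, residual connection, layer norm, feed-forward layer, then residual and norm again) might look like this in PyTorch; the head count (8) and feed-forward width (2048) are assumptions not given in the text:

```python
# Sketch of a single encoder layer: multi-head self-attention plus residual
# and layer norm, then a feed-forward layer plus residual and layer norm.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d=512, heads=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, d))

    def forward(self, x, pos):
        q = x + pos                                # add position encoding
        x = self.norm1(x + self.attn(q, q, x)[0])  # residual, then layer norm
        return self.norm2(x + self.ffn(x))         # residual, then layer norm

tokens = torch.randn(1, 98, 512)   # the 98 fused feature vectors (batch of 1)
pos = torch.randn(1, 98, 512)      # position-encoding vectors
out = EncoderLayer()(tokens, pos)
```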
The decoder has basically the same structure as the encoder, except that its input differs: what is multiplied by Wq are the object queries. Each of these vectors acts on a different location in the attention map, similar to asking what is at a certain place in the image; the vectors are randomly initialized and learned during training. The encoder's output is transformed by Wv and Wk simultaneously, and position encoding is added in advance to the part transformed by Wk.
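A minimal sketch of the decoder's cross-attention with object queries, assuming 20 queries (the text does not fix the number): the queries supply Q, while the encoder output supplies K (with position encoding added, as described) and V.

```python
# Cross-attention of learned object queries over the encoder memory.
# The query count (20) and random initializations are illustrative assumptions.
import torch
import torch.nn as nn

d, n_queries = 512, 20
object_queries = torch.randn(n_queries, 1, d)  # learned in practice; random here
memory = torch.randn(98, 1, d)                 # encoder output for 98 tokens
pos = torch.randn(98, 1, d)                    # position encoding on the key side

cross = nn.MultiheadAttention(d, 8)            # shapes are (seq, batch, dim)
out, _ = cross(object_queries, memory + pos, memory)
```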
The classification module classifies the generated prediction vectors and predicts their positions; it is mainly a set loss function. The parallel vectors output by the decoder are the prediction vectors, and the number of object-query vectors determines the number of predictions. Each vector predicts the category and position information of one target, and a loss calculation is performed with the set loss function and the labels (ground truth). Before the calculation, the best match between each prediction vector and a label is found using a bipartite-graph matching method (the Hungarian algorithm); the category loss is then calculated with cross entropy and added to the position loss, calculated by regression, to give the global loss.
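The matching-plus-loss procedure can be sketched with SciPy's Hungarian solver; the cost terms, the equal weighting of class and box loss, and the sizes (20 predictions over 5 classes, 2 ground-truth objects) are illustrative assumptions:

```python
# Sketch of the set-prediction loss: Hungarian matching between predictions
# and labels, then cross-entropy class loss plus an L1 regression box loss.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def detection_loss(pred_logits, pred_boxes, gt_classes, gt_boxes):
    # cost of assigning prediction i to ground-truth j: class prob + L1 box gap
    prob = pred_logits.softmax(-1)
    cost = -prob[:, gt_classes] + torch.cdist(pred_boxes, gt_boxes, p=1)
    rows, cols = linear_sum_assignment(cost.detach().numpy())  # best bipartite match
    cls_loss = F.cross_entropy(pred_logits[rows], gt_classes[cols])
    box_loss = F.l1_loss(pred_boxes[rows], gt_boxes[cols])
    return cls_loss + box_loss                                 # global loss

loss = detection_loss(torch.randn(20, 5), torch.rand(20, 4),
                      torch.tensor([1, 3]), torch.rand(2, 4))
```

A full implementation would also penalize unmatched predictions as "no object"; that bookkeeping is omitted here for brevity.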
In the invention, a Transformer structure is introduced into the obstacle-detection process to realize multimodal fusion more effectively; a Transformer block is introduced, the features of the infrared and color images are fully fused, and the obstacle-detection accuracy in low-illumination situations is improved.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (8)

1. A multimodal fusion obstacle detection method based on artificial-intelligence blind guiding, characterized by comprising:
The method comprises the steps that an infrared camera and a color camera are respectively responsible for acquiring an infrared image and a color image of a scene;
the acquired infrared and color bimodal images are transmitted to a convolutional neural network Q1 and a convolutional neural network Q2 respectively, and Q1 and Q2 convert the images into a first multichannel feature map and a second multichannel feature map, which are subsequently flattened into vectors;
vectorizing the first multichannel feature map and the second multichannel feature map, and performing feature-vector encoding on the resulting sequence to generate a plurality of prediction vectors;
and performing classification and position prediction on the generated multiple prediction vectors.
2. The method for detecting multimodal fusion obstacles based on artificial-intelligence blind guiding as claimed in claim 1, characterized in that:
transmitting the acquired infrared and color bimodal images to the convolutional neural networks Q1 and Q2, converting them into the first and second multichannel feature maps, and subsequently flattening them into vectors specifically comprises: scaling, padding and deforming color or infrared images of different specifications to a size of 227 × 227, inputting them into a VGG-16 backbone network with the fully connected layers cut off, and obtaining 512 feature maps of size 7 × 7 after convolution and pooling.
3. The method for detecting multimodal fusion obstacles based on artificial-intelligence blind guiding as claimed in claim 1, characterized in that: vectorizing the first and second multichannel feature maps, performing feature-vector encoding on their sequence, and generating the plurality of prediction vectors comprises: first flattening the first and second multichannel feature maps of the infrared and color images to obtain 512 × 49 feature maps; then regarding these as 49 512-dimensional feature vectors, so that the patches can fully attend to one another at the pixel level; and finally splicing the vectors of the two modalities into 98 feature vectors of dimension 512.
4. The method for detecting multimodal fusion obstacles based on artificial-intelligence blind guiding as claimed in claim 1, characterized in that: classifying and predicting the position of the generated prediction vectors specifically comprises performing a loss calculation on the plurality of prediction vectors with a set loss function and the labels.
5. The method for detecting multimodal fusion obstacles based on artificial-intelligence blind guiding as claimed in claim 4, characterized in that: before the loss calculation is performed on the plurality of prediction vectors with the set loss function and the labels, a bipartite-graph matching method is used to find the best match between each prediction vector and a label; the category loss is then calculated with cross entropy and added to the position loss, calculated by regression, to give the global loss.
6. A multimodal fusion obstacle detection device based on artificial-intelligence blind guiding, characterized by comprising:
The image acquisition module consists of an infrared camera and a color camera and is used for respectively acquiring an infrared image and a color image of a scene;
the feature extraction module is used for transmitting the acquired infrared and color bimodal images to a convolutional neural network Q1 and a convolutional neural network Q2 respectively, wherein Q1 and Q2 convert the images into a first multichannel feature map and a second multichannel feature map, which are subsequently flattened into vectors;
the feature fusion module is used for vectorizing and representing the first multi-channel feature map and the second multi-channel feature map, and performing feature vector coding on the first multi-channel feature map and the second multi-channel feature map sequence to generate a plurality of prediction vectors;
a classification module to classify and position predict the generated plurality of prediction vectors.
7. The multi-modal fusion obstacle detection device based on artificial intelligence blind guiding of claim 6, wherein: the feature fusion module includes an encoder and a decoder.
8. The multi-modal fusion obstacle detection device based on artificial intelligence blind guiding of claim 7, wherein: the encoder comprises an embedded tokens, a regularization layer, a multi-head self-attention layer and a feedforward neural network layer; the decoder includes a regularization layer, a multi-headed self-attention layer, and a feed-forward neural network layer.
CN202110913691.4A 2021-08-10 2021-08-10 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding Active CN113591770B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110913691.4A CN113591770B (en) 2021-08-10 2021-08-10 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding
PCT/CN2021/138104 WO2023015799A1 (en) 2021-08-10 2021-12-14 Multimodal fusion obstacle detection method and apparatus based on artificial intelligence blindness guiding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110913691.4A CN113591770B (en) 2021-08-10 2021-08-10 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding

Publications (2)

Publication Number Publication Date
CN113591770A true CN113591770A (en) 2021-11-02
CN113591770B CN113591770B (en) 2023-07-18

Family

ID=78256776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110913691.4A Active CN113591770B (en) 2021-08-10 2021-08-10 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding

Country Status (2)

Country Link
CN (1) CN113591770B (en)
WO (1) WO2023015799A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071226B (en) * 2023-03-06 2023-07-18 中国科学技术大学 Electronic microscope image registration system and method based on attention network
CN116403163B (en) * 2023-04-20 2023-10-27 慧铁科技有限公司 Method and device for identifying opening and closing states of handles of cut-off plug doors
CN117274899B (en) * 2023-09-20 2024-05-28 中国人民解放军海军航空大学 Storage hidden danger detection method based on visible light and infrared light image feature fusion
CN117091588B (en) * 2023-10-16 2024-01-26 珠海太川云社区技术股份有限公司 Hospital diagnosis guiding method and system based on multi-mode fusion
CN117173639B (en) * 2023-11-01 2024-02-06 伊特拉姆成都能源科技有限公司 Behavior analysis and safety early warning method and system based on multi-source equipment
CN117726991B (en) * 2024-02-07 2024-05-24 金钱猫科技股份有限公司 High-altitude hanging basket safety belt detection method and terminal

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368118A (en) * 2020-02-13 2020-07-03 中山大学 Image description generation method, system, device and storage medium
CN111523534A (en) * 2020-03-31 2020-08-11 华东师范大学 Image description method
CN112418163A (en) * 2020-12-09 2021-02-26 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
CN112926700A (en) * 2021-04-27 2021-06-08 支付宝(杭州)信息技术有限公司 Class identification method and device for target image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9542626B2 (en) * 2013-09-06 2017-01-10 Toyota Jidosha Kabushiki Kaisha Augmenting layer-based object detection with deep convolutional neural networks
CN113591770B (en) * 2021-08-10 2023-07-18 中国科学院深圳先进技术研究院 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZOU Wei et al.: "Low-discernibility target detection method for autonomous driving vehicles based on multi-modal feature fusion" (in Chinese) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023015799A1 (en) * 2021-08-10 2023-02-16 中国科学院深圳先进技术研究院 Multimodal fusion obstacle detection method and apparatus based on artificial intelligence blindness guiding
CN114999637A (en) * 2022-07-18 2022-09-02 华东交通大学 Pathological image diagnosis method and system based on multi-angle coding and embedded mutual learning
CN116485729A (en) * 2023-04-03 2023-07-25 兰州大学 Multistage bridge defect detection method based on transformer
CN116485729B (en) * 2023-04-03 2024-01-12 兰州大学 Multistage bridge defect detection method based on transformer

Also Published As

Publication number Publication date
WO2023015799A1 (en) 2023-02-16
CN113591770B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN113591770A (en) Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding
CN109800648B (en) Face detection and recognition method and device based on face key point correction
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN112132197B (en) Model training, image processing method, device, computer equipment and storage medium
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
IL281302B1 (en) Method and device for classifying objects
CN112651262A (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN114663514B (en) Object 6D attitude estimation method based on multi-mode dense fusion network
CN112560579A (en) Obstacle detection method based on artificial intelligence
CN112348033B (en) Collaborative saliency target detection method
CN113989718A (en) Human body target detection method facing radar signal heat map
CN116434143A (en) Cross-modal pedestrian re-identification method and system based on feature reconstruction
CN116704554A (en) Method, equipment and medium for estimating and identifying hand gesture based on deep learning
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN112200840B (en) Moving object detection system in visible light and infrared image combination
CN114898429A (en) Thermal infrared-visible light cross-modal face recognition method
CN115049894A (en) Target re-identification method of global structure information embedded network based on graph learning
CN114694174A (en) Human body interaction behavior identification method based on space-time diagram convolution
CN117953590B (en) Ternary interaction detection method, system, equipment and medium
CN113191943B (en) Multi-path parallel image content characteristic separation style migration method and system
CN116185182B (en) Controllable image description generation system and method for fusing eye movement attention
CN117274690A (en) Weak supervision target positioning method based on multiple modes
CN117975499A (en) Complex scene pedestrian detection method and system and electronic equipment
CN116503949A (en) Video motion recognition method based on improved long-term cyclic convolution network
CN114898397A (en) ViT-fused cross-modal pedestrian re-identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant