CN114266938A - Scene recognition method based on multi-mode information and global attention mechanism - Google Patents

Scene recognition method based on multi-mode information and global attention mechanism

Info

Publication number
CN114266938A
Authority
CN
China
Prior art keywords
global attention
image
network
scene
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111592561.1A
Other languages
Chinese (zh)
Inventor
孙宁
李响
朱良伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202111592561.1A priority Critical patent/CN114266938A/en
Publication of CN114266938A publication Critical patent/CN114266938A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a scene recognition method based on multi-modal information and a global attention mechanism, which specifically comprises the following steps. Step 1: select RGB images and depth images of a plurality of scenes, pair the encoded depth images with the RGB images, and divide the paired images into a training set and a test set. Step 2: construct a two-channel deep neural network model. Step 3: send the training set divided in step 1 into the two-channel deep neural network of step 2 for training. Step 4: identify a scene picture. The method effectively exploits the complementarity between the RGB image and the depth image: the RGB image and the depth image are each passed through a global attention coding network to obtain corresponding learnable category vectors, which are then used for scene classification.

Description

Scene recognition method based on multi-mode information and global attention mechanism
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a scene recognition method based on multi-mode information and a global attention mechanism.
Background
Currently, scene recognition has become an important branch of computer vision, with applications in image retrieval, robot navigation, intelligent video surveillance, automatic driving, disaster detection and other fields. With the revival of deep neural networks and the appearance of large-scale datasets, scene recognition performance has improved significantly. However, because scene images contain numerous objects, complex spatial layouts, large intra-class differences and small inter-class differences, scene recognition that relies entirely on data from the single RGB modality still falls far short of human discrimination ability.
With the rapid development of depth sensors, researchers have found that RGB images and depth images are strongly complementary, so scene recognition methods that combine RGB images and depth images have developed rapidly. Studies show that multi-modal scene recognition methods based on RGB and depth images have clear advantages over methods using single-modality data.
In recent years, research has shown that, when recognizing objects, humans concentrate on key, useful information and pay less attention to useless information. Based on this characteristic of the human visual mechanism, researchers have proposed attention mechanism architectures. However, in the current image recognition field, most attention mechanism architectures are realized by attaching an attention module to a convolutional neural network.
Disclosure of Invention
In order to solve the above problems, the invention provides a scene recognition method based on multi-modal information and a global attention mechanism, in which an RGB image and a depth image are fed into the network together, the relationship between the RGB image and the depth image is further mined, and their complementarity is fully utilized. In addition, the invention provides a global attention architecture without a convolutional neural network, built from a graph embedding network and a global attention coding network, which both increases the feature extraction capability and retains the ability to compute in parallel.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a scene recognition method based on multi-modal information and a global attention mechanism, which comprises the following steps:
step (1): selecting RGB images and depth images of a plurality of scenes from a multi-modal scene database, recoding the depth images by using three channels, pairing the coded depth images (hereinafter referred to as HHA images) with the RGB images, and dividing the paired images into a training set and a test set according to corresponding proportions.
Step (2): construct an end-to-end trainable two-channel deep neural network model combining a global attention mechanism and multi-modal information (hereinafter referred to as SR-MGA). The SR-MGA comprises a graph embedding network, a global attention coding network, a feature fusion network and a classification network. The two channels have the same structure, each consisting of a graph embedding network and a global attention coding network. The paired RGB images and HHA images obtained in step (1) are respectively input into the graph embedding networks to obtain corresponding RGB image block sequences and HHA image block sequences. The two block sequences are then respectively input into the global attention coding networks for learning, with a lateral connection added between the two global attention coding networks. The learned RGB image features and HHA image features are sent into the feature fusion network and spliced to obtain fusion features, which are finally sent into the classification network.
Step (3): send the training set divided in step (1) into the deep neural network for training. During training, a weighted cross-entropy loss function is used to address the class-imbalance problem.
Step (4): when identifying scene pictures, the paired RGB images and HHA images obtained in step (1) are input into the SR-MGA network model to obtain, for each paired multi-modal image, the prediction probability of every scene category. If the category with the highest prediction probability is consistent with the true category, the prediction is correct; finally, the classification accuracy on the scene images is obtained. The classification accuracy is the ratio of the number of correct predictions to the total number of predictions.
The invention is further improved in that: the three channels in step (1) refer, respectively, to the horizontal disparity, the height above ground, and the angle between the local surface normal at each pixel and the inferred gravity direction.
The invention is further improved in that: the paired RGB images and HHA images are respectively converted, through the graph embedding network of SR-MGA, into corresponding 1-dimensional RGB image block sequences and 1-dimensional HHA image block sequences. The graph embedding network consists of a single convolutional layer. Specifically, the input 2-dimensional image is denoted x ∈ R^(H×W×C), where H and W are the height and width of the image and C is the number of channels. The image is divided into image blocks of size P × P, yielding a block sequence x_p ∈ R^(N×(P²·C)), where N = HW/P² is the number of image blocks and also the length of the image block sequence. Each block is then mapped to a dimension of size D by a linear transformation E. In order to obtain image features later, a learnable category vector X_class is added to the resulting matrix; at the same time, a position code E_pos is added to encode position information, finally giving the image block sequence z_0.
The expression of the above steps is:
z_0 = [X_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos, with E ∈ R^((P²·C)×D) and E_pos ∈ R^((N+1)×D)
the invention is further improved in that: the global attention coding network of the SR-MGA in the step (2) is composed of 12 same global attention coding modules. Wherein the global attention coding module is composed of two residual error blocks connected in series. The first residual block consists of one layer normalization, three fully connected layers, one self-attention mechanism and one fully connected layer. The second residual block consists of one layer normalization, two fully connected layers, and two feature loss layers. In order to further solve the problem that the network is easy to be over-fitted, a feature loss layer is added at the connection jump of the two residual blocks.
The invention is further improved in that: a lateral connection is added between the two channels of the SR-MGA in step (2); the output of the Nth global attention coding module in the global attention coding network corresponding to the HHA image is added to the input of the (N+1)th global attention coding module in the global attention coding network corresponding to the RGB image.
The invention is further improved in that: during training in step (3), the class-imbalance problem is addressed with a weighted cross-entropy loss function, which increases the weight of categories with few samples and decreases the weight of categories with many samples. The weight calculation formula is:
w_a = ( Σ_{n=1}^{A} N_n ) / ( A · N_a )
where a denotes the a-th scene category, A is the total number of scene categories, and N_n is the number of images in the n-th scene category. The weighted cross-entropy loss function is:
L = − Σ_{a=1}^{A} w_a · y_a · log(p_a)
where y_a indicates whether the true category is a and p_a is the predicted probability of category a.
the invention has the beneficial effects that: coding the depth image by using three channels to obtain an HHA image, further acquiring rich information similar to a gray image from the depth image, inputting a paired RGB image and the HHA image into an image embedding network, respectively converting the RGB image and the HHA image into image block sequences beneficial to global attention coding network learning, and focusing on information from different areas at different positions through the global attention coding network and lateral connection; the complementary relation between the RGB image and the depth image is further mined, the problem of uneven distribution of categories is solved through a cross entropy loss function with weight, the weight of the category with small number is improved, and the weight of the category with large number is reduced; the module improves the parallel computing capability and the interpretability of the network while abandoning the tradition of a convolutional neural network connection attention mechanism. In general, the SR-MGA provided by the invention improves the accuracy of multi-modal scene recognition.
Drawings
Fig. 1 is a flow chart illustrating a structure of a scene recognition method according to the present invention.
FIG. 2 is input data for the method of the present invention, where FIG. 2(a) is an RGB image, FIG. 2(b) is a depth image, and FIG. 2(c) is an HHA image.
FIG. 3 is a schematic diagram of a global attention coding network structure according to the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
As illustrated in fig. 1: the invention provides a scene recognition method based on multi-modal information and a global attention mechanism, which comprises the following steps of:
Step 1: select RGB images and depth images of a plurality of scenes from a multi-modal scene database and re-encode the depth images with three channels, which refer, respectively, to the horizontal disparity, the height above ground, and the angle between the local surface normal at each pixel and the inferred gravity direction. The encoded depth images are paired with the RGB images, and the paired images are divided into a training set and a test set according to a set proportion. An RGB image is shown in fig. 2(a) and a depth image in fig. 2(b); the encoded depth image, hereinafter referred to as the HHA image, is shown in fig. 2(c).
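As an illustration of step 1, the following is a minimal Python sketch of how paired RGB/HHA samples could be collected and split into a training set and a test set. It assumes the HHA encoding has already been produced and that RGB and HHA files share the same relative path under hypothetical rgb_root and hha_root directories; these names and the directory layout are assumptions, not part of the disclosure.

```python
import os
import random

def build_pairs(rgb_root, hha_root, train_ratio=0.8, seed=0):
    """Pair RGB images with HHA-encoded depth images by file name and
    split the paired list into a training set and a test set."""
    pairs = []
    for scene in sorted(os.listdir(rgb_root)):            # one sub-folder per scene category (assumed layout)
        for name in sorted(os.listdir(os.path.join(rgb_root, scene))):
            rgb_path = os.path.join(rgb_root, scene, name)
            hha_path = os.path.join(hha_root, scene, name)
            if os.path.exists(hha_path):                   # keep only images that have an HHA partner
                pairs.append((rgb_path, hha_path, scene))
    random.Random(seed).shuffle(pairs)
    split = int(train_ratio * len(pairs))
    return pairs[:split], pairs[split:]                    # training set, test set
```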
Step 2: construct an end-to-end trainable two-channel deep neural network model combining a global attention mechanism and multi-modal information, hereinafter referred to as SR-MGA. The SR-MGA comprises a graph embedding network, a global attention coding network, a feature fusion network and a classification network. The two channels have the same structure, each consisting of a graph embedding network and a global attention coding network. The paired RGB images and HHA images obtained in step 1 are respectively input into the graph embedding networks to obtain corresponding RGB image block sequences and HHA image block sequences; the two block sequences are then respectively input into the global attention coding networks for learning, with a lateral connection added between the two channels' global attention coding networks. The learned RGB image features and HHA image features are sent into the feature fusion network and spliced to obtain fusion features, which are finally sent into the classification network.
The paired RGB images and HHA images are respectively converted, through the graph embedding network of SR-MGA, into corresponding 1-dimensional RGB image block sequences and 1-dimensional HHA image block sequences. The graph embedding network consists of a single convolutional layer. Specifically, the input 2-dimensional image is denoted x ∈ R^(H×W×C), where H and W are the height and width of the image, both 224 in this embodiment, and C is the number of channels, which is 3. The image is divided into image blocks of size P × P, with P = 16 in this embodiment, yielding a block sequence x_p ∈ R^(N×(P²·C)), where N = HW/P² is the number of image blocks and also the length of the image block sequence, i.e. 196. Each block is then mapped by a linear transformation E to a dimension of size D, set to 768. In order to obtain image features later, a learnable category vector X_class is added to the resulting matrix; at the same time, a position code E_pos is added to encode position information, finally giving the image block sequence z_0.
The expression of the above steps is:
z_0 = [X_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos, with E ∈ R^((P²·C)×D) and E_pos ∈ R^((N+1)×D)
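For illustration, the following is a minimal PyTorch sketch of the graph embedding network described above, implementing the patch projection as a single convolution with kernel size and stride P = 16, plus a learnable category vector and a learnable position code. The class and variable names are illustrative rather than taken from the disclosure.

```python
import torch
import torch.nn as nn

class GraphEmbedding(nn.Module):
    """Convert a 224x224x3 image into N = 196 patch embeddings of size D = 768,
    prepend a learnable category vector and add a learnable position code."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2                              # N = HW / P^2 = 196
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)       # the single convolutional layer
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                    # learnable category vector X_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim)) # position code E_pos

    def forward(self, x):                                     # x: (B, 3, 224, 224)
        z = self.proj(x).flatten(2).transpose(1, 2)           # (B, N, D) patch sequence
        cls = self.cls_token.expand(x.shape[0], -1, -1)       # (B, 1, D)
        return torch.cat([cls, z], dim=1) + self.pos_embed    # z_0: (B, N+1, D)
```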
Next, the RGB image block sequence and the HHA image block sequence are each fed into a global attention coding network consisting of 12 identical global attention coding modules, as shown in fig. 3. Each global attention coding module consists of two residual blocks connected in series. The first residual block consists of a layer normalization, three fully connected layers, a self-attention mechanism and a fully connected layer. The second residual block consists of a layer normalization, two fully connected layers and two feature loss layers. To further alleviate the tendency of the network to over-fit, a feature loss layer is added at the skip connections of the two residual blocks.
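The following PyTorch sketch shows one possible reading of a single global attention coding module: the first residual block uses layer normalization, three fully connected layers producing queries, keys and values, scaled dot-product self-attention and an output fully connected layer; the second residual block uses layer normalization, two fully connected layers and two feature loss layers (interpreted here as dropout), with an additional dropout on the skip connection. The exact placement of the feature loss layers is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.heads = heads
        # first residual block: layer norm + q/k/v projections + self-attention + output FC
        self.norm1 = nn.LayerNorm(dim)
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)
        # second residual block: layer norm + two FC layers + two feature loss (dropout) layers
        self.norm2 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * mlp_ratio)
        self.fc2 = nn.Linear(dim * mlp_ratio, dim)
        self.drop1, self.drop2 = nn.Dropout(drop), nn.Dropout(drop)
        self.skip_drop = nn.Dropout(drop)                      # extra dropout on the skip connection

    def attention(self, z):
        B, N, D = z.shape
        h, d = self.heads, D // self.heads
        q, k, v = (t.view(B, N, h, d).transpose(1, 2) for t in (self.q(z), self.k(z), self.v(z)))
        a = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # scaled dot-product attention
        return self.out((a @ v).transpose(1, 2).reshape(B, N, D))

    def forward(self, z):
        z = z + self.skip_drop(self.attention(self.norm1(z)))                               # first residual block
        z = z + self.drop2(self.fc2(self.drop1(F.gelu(self.fc1(self.norm2(z))))))           # second residual block
        return z
```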
In order to better exploit the complementarity between the RGB image and the depth image, a lateral connection is added between the two channels of the SR-MGA: the output of the Nth global attention coding module in the global attention coding network corresponding to the HHA image is added to the input of the (N+1)th global attention coding module in the global attention coding network corresponding to the RGB image, with N set to 10.
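A sketch of how the two channels, the lateral connection at N = 10, and the feature fusion and classification networks could fit together, reusing the GraphEmbedding and GlobalAttentionBlock modules sketched above. Fusing the two learnable category vectors by concatenation followed by a single fully connected classifier is an assumption consistent with the description, not a detail stated in the disclosure.

```python
import torch
import torch.nn as nn

class SRMGA(nn.Module):
    def __init__(self, num_classes, dim=768, depth=12, lateral_n=10):
        super().__init__()
        self.embed_rgb, self.embed_hha = GraphEmbedding(dim=dim), GraphEmbedding(dim=dim)
        self.blocks_rgb = nn.ModuleList([GlobalAttentionBlock(dim) for _ in range(depth)])
        self.blocks_hha = nn.ModuleList([GlobalAttentionBlock(dim) for _ in range(depth)])
        self.lateral_n = lateral_n
        self.classifier = nn.Linear(2 * dim, num_classes)     # classification network on the fused features

    def forward(self, rgb, hha):
        zr, zh = self.embed_rgb(rgb), self.embed_hha(hha)
        for i, (br, bh) in enumerate(zip(self.blocks_rgb, self.blocks_hha), start=1):
            zr, zh = br(zr), bh(zh)
            if i == self.lateral_n:                           # lateral connection: output of HHA block N is
                zr = zr + zh                                  # added to the input of RGB block N+1
        feat = torch.cat([zr[:, 0], zh[:, 0]], dim=-1)        # fuse the two learnable category vectors
        return self.classifier(feat)
```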
Step 3: send the training set divided in step 1 into the deep neural network for training. During training, a weighted cross-entropy loss function is used to address the class-imbalance problem.
The weight calculation formula is:
w_a = ( Σ_{n=1}^{A} N_n ) / ( A · N_a )
where a denotes the a-th scene category, A is the total number of scene categories, and N_n is the number of images in the n-th scene category. The weighted cross-entropy loss function is:
L = − Σ_{a=1}^{A} w_a · y_a · log(p_a)
where y_a indicates whether the true category is a and p_a is the predicted probability of category a.
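A short sketch of the weighted loss used in step 3, assuming per-class weights inversely proportional to class frequency (the normalisation shown is an assumption) and PyTorch's built-in weighted cross-entropy.

```python
import torch
import torch.nn as nn

def make_class_weights(counts):
    """counts[a] = number of training images of scene category a.
    Rare categories get a larger weight, frequent categories a smaller one
    (assumed inverse-frequency normalisation)."""
    counts = torch.tensor(counts, dtype=torch.float)
    return counts.sum() / (len(counts) * counts)

# usage inside the training loop (model, loader and labels are assumed to exist):
# criterion = nn.CrossEntropyLoss(weight=make_class_weights(train_counts).to(device))
# loss = criterion(model(rgb, hha), labels)
```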
Step 4: when identifying scene pictures, the paired RGB images and HHA images obtained in step 1 are input into the SR-MGA network model to obtain, for each paired multi-modal image, the prediction probability of every scene category; if the category with the highest prediction probability is consistent with the true category, the prediction is correct. Finally, the classification accuracy of the scene images is obtained, namely the ratio of the number of correct predictions to the total number of predictions.
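A minimal sketch of the evaluation in step 4: the scene category with the highest predicted probability is compared with the true category for every paired RGB/HHA test image, and the classification accuracy is the ratio of correct predictions to the total number of predictions. The model and test_loader names are illustrative.

```python
import torch

@torch.no_grad()
def evaluate(model, test_loader, device="cuda"):
    model.eval()
    correct = total = 0
    for rgb, hha, labels in test_loader:                # paired RGB / HHA images with true categories
        logits = model(rgb.to(device), hha.to(device))
        pred = logits.argmax(dim=1).cpu()               # category with the highest predicted probability
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total                              # classification accuracy
```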
The method effectively exploits the complementarity between the RGB image and the depth image: the RGB image and the depth image are each processed by a global attention coding network to obtain corresponding learnable category vectors, which are then used for scene classification.
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (6)

1. A scene recognition method based on multi-modal information and a global attention mechanism is characterized in that: the scene recognition method comprises the following steps:
step 1: selecting RGB images and depth images of a plurality of scenes from a multi-modal scene database, recoding the depth images by using three channels, pairing the coded depth images with the RGB images, and dividing the paired images into a training set and a test set according to corresponding proportions;
step 2: constructing an end-to-end trainable double-channel deep neural network model combining a global attention mechanism and multi-mode information;
and step 3: sending the training set divided in the step 1 into the two-channel deep neural network in the step 2 for training;
step 4: identifying a scene picture: inputting the paired RGB images and HHA images obtained in step 1 into the two-channel deep neural network model of step 2 to obtain, for each paired multi-modal image, the prediction probability value of each scene category among the plurality of scene categories; if the scene category with the highest prediction probability value is consistent with the true category, the prediction is correct, and finally the classification accuracy of the scene images is obtained.
2. The method of claim 1, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: the two-channel deep neural network model in the step 2 comprises a graph embedding network, a global attention coding network, a feature fusion network and a classification network, wherein two channels are formed by the graph embedding network and the global attention coding network, and the construction process of the two-channel deep neural network model is as follows:
step 2-1: inputting the RGB image and the depth image which are well paired in the step 1 into an image embedding network respectively to obtain a corresponding RGB image block sequence and a corresponding depth image block sequence;
step 2-2: inputting the RGB image block sequence and the depth image block sequence obtained in the step 2-1 into a global attention coding network for learning;
step 2-3: and sending the RGB image characteristics and the depth image characteristics obtained by learning into a characteristic fusion network for splicing to obtain fusion characteristics, and finally sending the characteristics into a classification network.
3. The method of claim 2, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: in the step 2-2, a lateral connection is added between the two-channel global attention coding networks, specifically: and adding the output of the Nth global attention coding module in the global attention coding network corresponding to the depth image to the input of the (N + 1) th global attention coding module in the global attention coding network corresponding to the RGB image.
4. The method of claim 2, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: the global attention coding network in step 2 is composed of 12 identical global attention coding modules, wherein each global attention coding module is composed of two residual blocks connected in series; the first residual block is composed of a layer normalization, three fully connected layers, a self-attention mechanism and a fully connected layer; the second residual block is composed of a layer normalization, two fully connected layers and two feature loss layers; and a feature loss layer is added at the skip connections of the two residual blocks.
5. The method of claim 1, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: when the training set divided in the step 1 is sent to the two-channel deep neural network in the step 2 for training, the problem of class imbalance is solved by using a cross entropy loss function with weight, the weight of classes with small number is improved, and the weight of classes with large number is reduced, wherein a weight calculation formula is as follows:
w_a = ( Σ_{n=1}^{A} N_n ) / ( A · N_a )
where a denotes the a-th scene category, A is the total number of scene categories, and N_n is the number of images in the n-th scene category;
the weighted cross-entropy loss function is:
L = − Σ_{a=1}^{A} w_a · y_a · log(p_a)
where y_a indicates whether the true category is a and p_a is the predicted probability of category a.
6. The method of claim 1, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: the three channels in step 1 refer, respectively, to the horizontal disparity, the height above ground, and the angle between the local surface normal at each pixel and the inferred gravity direction.
CN202111592561.1A 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism Pending CN114266938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111592561.1A CN114266938A (en) 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111592561.1A CN114266938A (en) 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism

Publications (1)

Publication Number Publication Date
CN114266938A true CN114266938A (en) 2022-04-01

Family

ID=80829307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111592561.1A Pending CN114266938A (en) 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism

Country Status (1)

Country Link
CN (1) CN114266938A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898080A (en) * 2022-04-19 2022-08-12 杭州电子科技大学 Image imaging equipment identification method based on ViT network
CN115359306A (en) * 2022-10-24 2022-11-18 中铁科学技术开发有限公司 Intelligent identification method and system for high-definition images of railway freight inspection
CN117752308A (en) * 2024-02-21 2024-03-26 中国科学院自动化研究所 epilepsy prediction method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582225A (en) * 2020-05-19 2020-08-25 长沙理工大学 Remote sensing image scene classification method and device
CN111860116A (en) * 2020-06-03 2020-10-30 南京邮电大学 Scene identification method based on deep learning and privilege information
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
CN111582225A (en) * 2020-05-19 2020-08-25 长沙理工大学 Remote sensing image scene classification method and device
CN111860116A (en) * 2020-06-03 2020-10-30 南京邮电大学 Scene identification method based on deep learning and privilege information
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898080A (en) * 2022-04-19 2022-08-12 杭州电子科技大学 Image imaging equipment identification method based on ViT network
CN114898080B (en) * 2022-04-19 2024-05-31 杭州电子科技大学 Image imaging equipment identification method based on ViT network
CN115359306A (en) * 2022-10-24 2022-11-18 中铁科学技术开发有限公司 Intelligent identification method and system for high-definition images of railway freight inspection
CN117752308A (en) * 2024-02-21 2024-03-26 中国科学院自动化研究所 epilepsy prediction method and device
CN117752308B (en) * 2024-02-21 2024-05-24 中国科学院自动化研究所 Epilepsy prediction method and device

Similar Documents

Publication Publication Date Title
CN114266938A (en) Scene recognition method based on multi-mode information and global attention mechanism
CN111325797A (en) Pose estimation method based on self-supervision learning
CN111368943B (en) Method and device for identifying object in image, storage medium and electronic device
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
WO2024060321A1 (en) Joint modeling method and apparatus for enhancing local features of pedestrians
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN114359130A (en) Road crack detection method based on unmanned aerial vehicle image
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
Zhang et al. LiSeg: Lightweight road-object semantic segmentation in 3D LiDAR scans for autonomous driving
CN112348033A (en) Cooperative significance target detection method
CN115409989A (en) Three-dimensional point cloud semantic segmentation method for optimizing boundary
CN117274883A (en) Target tracking method and system based on multi-head attention optimization feature fusion network
Mukhopadhyay et al. A hybrid lane detection model for wild road conditions
Hou et al. Fe-fusion-vpr: Attention-based multi-scale network architecture for visual place recognition by fusing frames and events
CN114596548A (en) Target detection method, target detection device, computer equipment and computer-readable storage medium
Deng et al. Incremental joint learning of depth, pose and implicit scene representation on monocular camera in large-scale scenes
CN117710429A (en) Improved lightweight monocular depth estimation method integrating CNN and transducer
CN117975565A (en) Action recognition system and method based on space-time diffusion and parallel convertors
CN113870312A (en) Twin network-based single target tracking method
CN110516640B (en) Vehicle re-identification method based on feature pyramid joint representation
CN116501908B (en) Image retrieval method based on feature fusion learning graph attention network
CN116994164A (en) Multi-mode aerial image fusion and target detection combined learning method
CN116824133A (en) Intelligent interpretation method for remote sensing image
Hu et al. Lightweight attention‐guided redundancy‐reuse network for real‐time semantic segmentation
Xu et al. Unsupervised learning of depth estimation and camera pose with multi-scale GANs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination