CN114266938A - Scene recognition method based on multi-mode information and global attention mechanism - Google Patents

Scene recognition method based on multi-mode information and global attention mechanism

Info

Publication number
CN114266938A
Authority
CN
China
Prior art keywords
global attention
image
network
scene
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111592561.1A
Other languages
Chinese (zh)
Inventor
孙宁
李响
朱良伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202111592561.1A priority Critical patent/CN114266938A/en
Publication of CN114266938A publication Critical patent/CN114266938A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a scene recognition method based on multi-modal information and a global attention mechanism, which specifically comprises the following steps. Step 1: select RGB images and depth images of a plurality of scenes, pair the encoded depth images with the RGB images, and divide the paired images into a training set and a test set. Step 2: construct a two-channel deep neural network model. Step 3: send the training set divided in step 1 into the two-channel deep neural network of step 2 for training. Step 4: identify a scene picture. The method effectively exploits the complementarity between the RGB image and the depth image: the RGB image and the depth image are each passed through a global attention coding network to obtain corresponding learnable category vectors, which are then used for scene classification.

Description

Scene recognition method based on multi-mode information and global attention mechanism
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a scene recognition method based on multi-mode information and a global attention mechanism.
Background
Currently, scene recognition has become an important branch of computer vision, with applications in image retrieval, robot navigation, intelligent video surveillance, automatic driving, disaster detection and other fields. With the revival of deep neural networks and the appearance of large-scale datasets, scene recognition performance has improved significantly. However, because scene images contain numerous objects, complex spatial layouts, large intra-class differences and small inter-class differences, scene recognition that relies entirely on data from the single RGB modality still falls far short of human discrimination ability.
With the rapid development of depth sensors, researchers have found that RGB images and depth images are strongly complementary, so scene recognition methods that combine RGB images and depth images have developed rapidly. Studies show that multi-modal scene recognition methods based on RGB and depth images have clear advantages over methods using single-modality data.
In recent years, research has shown that, when recognizing objects, humans concentrate on key, useful information and pay less attention to useless information. Based on this characteristic of the human visual mechanism, researchers have proposed attention mechanism architectures. However, in the current image recognition field, most attention mechanism architectures are realized by attaching an attention module to a convolutional neural network.
Disclosure of Invention
In order to solve the above problems, the invention provides a scene recognition method based on multi-modal information and a global attention mechanism, in which an RGB image and a depth image are fed into the network together, the relationship between the RGB image and the depth image is further mined, and their complementarity is fully utilized. In addition, the invention provides a global attention architecture without a convolutional neural network, built from a graph embedding network and a global attention coding network, which both increases the feature extraction capability and retains the ability to compute in parallel.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a scene recognition method based on multi-modal information and a global attention mechanism, which comprises the following steps:
step (1): selecting RGB images and depth images of a plurality of scenes from a multi-modal scene database, recoding the depth images by using three channels, pairing the coded depth images (hereinafter referred to as HHA images) with the RGB images, and dividing the paired images into a training set and a test set according to corresponding proportions.
Step (2): construct an end-to-end trainable two-channel deep neural network model combining a global attention mechanism and multi-modal information (hereinafter referred to as SR-MGA). The SR-MGA comprises a graph embedding network, a global attention coding network, a feature fusion network and a classification network. The two channels have the same structure, each consisting of a graph embedding network and a global attention coding network. The paired RGB images and HHA images obtained in step (1) are respectively input into the graph embedding networks to obtain corresponding RGB image block sequences and HHA image block sequences. The two block sequences are then respectively input into the global attention coding networks for learning, with a lateral connection added between the two global attention coding networks. The learned RGB image features and HHA image features are sent into the feature fusion network and spliced to obtain fusion features, which are finally sent into the classification network.
Step (3): send the training set divided in step (1) into the deep neural network for training. During training, a weighted cross-entropy loss function is used to address the class-imbalance problem.
Step (4): when identifying scene pictures, the paired RGB images and HHA images obtained in step (1) are input into the SR-MGA network model to obtain, for each paired multi-modal image, the prediction probability of every scene category. If the category with the highest prediction probability is consistent with the true category, the prediction is correct; finally, the classification accuracy on the scene images is obtained. The classification accuracy is the ratio of the number of correct predictions to the total number of predictions.
The invention is further improved in that: the three channels in step (1) refer, respectively, to the horizontal disparity, the height above ground, and the angle between the local surface normal at each pixel and the inferred gravity direction.
The invention is further improved in that: the paired RGB images and HHA images are respectively converted, through the graph embedding network of SR-MGA, into corresponding 1-dimensional RGB image block sequences and 1-dimensional HHA image block sequences. The graph embedding network consists of a single convolutional layer. Specifically, the input 2-dimensional image is denoted x ∈ R^(H×W×C), where H and W are the height and width of the image and C is the number of channels. The image is divided into image blocks of size P × P, yielding a block sequence x_p ∈ R^(N×(P²·C)), where N = HW/P² is the number of image blocks and also the length of the image block sequence. Each block is then mapped to a dimension of size D by a linear transformation E. In order to obtain image features later, a learnable category vector X_class is added to the resulting matrix; at the same time, a position code E_pos is added to encode position information, finally giving the image block sequence z_0.
The expression of the above steps is:
z_0 = [X_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos, with E ∈ R^((P²·C)×D) and E_pos ∈ R^((N+1)×D)
the invention is further improved in that: the global attention coding network of the SR-MGA in the step (2) is composed of 12 same global attention coding modules. Wherein the global attention coding module is composed of two residual error blocks connected in series. The first residual block consists of one layer normalization, three fully connected layers, one self-attention mechanism and one fully connected layer. The second residual block consists of one layer normalization, two fully connected layers, and two feature loss layers. In order to further solve the problem that the network is easy to be over-fitted, a feature loss layer is added at the connection jump of the two residual blocks.
The invention is further improved in that: a lateral connection is added between the two channels of the SR-MGA in step (2); the output of the Nth global attention coding module in the global attention coding network corresponding to the HHA image is added to the input of the (N+1)th global attention coding module in the global attention coding network corresponding to the RGB image.
The invention is further improved in that: during training in step (3), the class-imbalance problem is addressed with a weighted cross-entropy loss function, which increases the weight of categories with few samples and decreases the weight of categories with many samples. The weight calculation formula is:
w_a = ( Σ_{n=1}^{A} N_n ) / ( A · N_a )
where a denotes the a-th scene category, A is the total number of scene categories, and N_n is the number of images in the n-th scene category. The weighted cross-entropy loss function is:
L = − Σ_{a=1}^{A} w_a · y_a · log(p_a)
where y_a indicates whether the true category is a and p_a is the predicted probability of category a.
the invention has the beneficial effects that: coding the depth image by using three channels to obtain an HHA image, further acquiring rich information similar to a gray image from the depth image, inputting a paired RGB image and the HHA image into an image embedding network, respectively converting the RGB image and the HHA image into image block sequences beneficial to global attention coding network learning, and focusing on information from different areas at different positions through the global attention coding network and lateral connection; the complementary relation between the RGB image and the depth image is further mined, the problem of uneven distribution of categories is solved through a cross entropy loss function with weight, the weight of the category with small number is improved, and the weight of the category with large number is reduced; the module improves the parallel computing capability and the interpretability of the network while abandoning the tradition of a convolutional neural network connection attention mechanism. In general, the SR-MGA provided by the invention improves the accuracy of multi-modal scene recognition.
Drawings
Fig. 1 is a flow chart illustrating a structure of a scene recognition method according to the present invention.
FIG. 2 is input data for the method of the present invention, where FIG. 2(a) is an RGB image, FIG. 2(b) is a depth image, and FIG. 2(c) is an HHA image.
FIG. 3 is a schematic diagram of a global attention coding network structure according to the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
As illustrated in fig. 1: the invention provides a scene recognition method based on multi-modal information and a global attention mechanism, which comprises the following steps of:
Step 1: select RGB images and depth images of a plurality of scenes from a multi-modal scene database and re-encode the depth images with three channels, which refer, respectively, to the horizontal disparity, the height above ground, and the angle between the local surface normal at each pixel and the inferred gravity direction. The encoded depth images are paired with the RGB images, and the paired images are divided into a training set and a test set according to a set proportion. An RGB image is shown in fig. 2(a) and a depth image in fig. 2(b); the encoded depth image, hereinafter referred to as the HHA image, is shown in fig. 2(c).
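As an illustration of step 1, the following is a minimal Python sketch of how paired RGB/HHA samples could be collected and split into a training set and a test set. It assumes the HHA encoding has already been produced and that RGB and HHA files share the same relative path under hypothetical rgb_root and hha_root directories; these names and the directory layout are assumptions, not part of the disclosure.

```python
import os
import random

def build_pairs(rgb_root, hha_root, train_ratio=0.8, seed=0):
    """Pair RGB images with HHA-encoded depth images by file name and
    split the paired list into a training set and a test set."""
    pairs = []
    for scene in sorted(os.listdir(rgb_root)):            # one sub-folder per scene category (assumed layout)
        for name in sorted(os.listdir(os.path.join(rgb_root, scene))):
            rgb_path = os.path.join(rgb_root, scene, name)
            hha_path = os.path.join(hha_root, scene, name)
            if os.path.exists(hha_path):                   # keep only images that have an HHA partner
                pairs.append((rgb_path, hha_path, scene))
    random.Random(seed).shuffle(pairs)
    split = int(train_ratio * len(pairs))
    return pairs[:split], pairs[split:]                    # training set, test set
```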
Step 2: construct an end-to-end trainable two-channel deep neural network model combining a global attention mechanism and multi-modal information, hereinafter referred to as SR-MGA. The SR-MGA comprises a graph embedding network, a global attention coding network, a feature fusion network and a classification network. The two channels have the same structure, each consisting of a graph embedding network and a global attention coding network. The paired RGB images and HHA images obtained in step 1 are respectively input into the graph embedding networks to obtain corresponding RGB image block sequences and HHA image block sequences; the two block sequences are then respectively input into the global attention coding networks for learning, with a lateral connection added between the two channels' global attention coding networks. The learned RGB image features and HHA image features are sent into the feature fusion network and spliced to obtain fusion features, which are finally sent into the classification network.
The paired RGB images and HHA images are respectively converted, through the graph embedding network of SR-MGA, into corresponding 1-dimensional RGB image block sequences and 1-dimensional HHA image block sequences. The graph embedding network consists of a single convolutional layer. Specifically, the input 2-dimensional image is denoted x ∈ R^(H×W×C), where H and W are the height and width of the image, both 224 in this embodiment, and C is the number of channels, which is 3. The image is divided into image blocks of size P × P, with P = 16 in this embodiment, yielding a block sequence x_p ∈ R^(N×(P²·C)), where N = HW/P² is the number of image blocks and also the length of the image block sequence, i.e. 196. Each block is then mapped by a linear transformation E to a dimension of size D, set to 768. In order to obtain image features later, a learnable category vector X_class is added to the resulting matrix; at the same time, a position code E_pos is added to encode position information, finally giving the image block sequence z_0.
The expression of the above steps is:
z_0 = [X_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos, with E ∈ R^((P²·C)×D) and E_pos ∈ R^((N+1)×D)
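For illustration, the following is a minimal PyTorch sketch of the graph embedding network described above, implementing the patch projection as a single convolution with kernel size and stride P = 16, plus a learnable category vector and a learnable position code. The class and variable names are illustrative rather than taken from the disclosure.

```python
import torch
import torch.nn as nn

class GraphEmbedding(nn.Module):
    """Convert a 224x224x3 image into N = 196 patch embeddings of size D = 768,
    prepend a learnable category vector and add a learnable position code."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2                              # N = HW / P^2 = 196
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)       # the single convolutional layer
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                    # learnable category vector X_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim)) # position code E_pos

    def forward(self, x):                                     # x: (B, 3, 224, 224)
        z = self.proj(x).flatten(2).transpose(1, 2)           # (B, N, D) patch sequence
        cls = self.cls_token.expand(x.shape[0], -1, -1)       # (B, 1, D)
        return torch.cat([cls, z], dim=1) + self.pos_embed    # z_0: (B, N+1, D)
```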
Next, the RGB image block sequence and the HHA image block sequence are each fed into a global attention coding network consisting of 12 identical global attention coding modules, as shown in fig. 3. Each global attention coding module consists of two residual blocks connected in series. The first residual block consists of a layer normalization, three fully connected layers, a self-attention mechanism and a fully connected layer. The second residual block consists of a layer normalization, two fully connected layers and two feature loss layers. To further alleviate the tendency of the network to over-fit, a feature loss layer is added at the skip connections of the two residual blocks.
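The following PyTorch sketch shows one possible reading of a single global attention coding module: the first residual block uses layer normalization, three fully connected layers producing queries, keys and values, scaled dot-product self-attention and an output fully connected layer; the second residual block uses layer normalization, two fully connected layers and two feature loss layers (interpreted here as dropout), with an additional dropout on the skip connection. The exact placement of the feature loss layers is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.heads = heads
        # first residual block: layer norm + q/k/v projections + self-attention + output FC
        self.norm1 = nn.LayerNorm(dim)
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)
        # second residual block: layer norm + two FC layers + two feature loss (dropout) layers
        self.norm2 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * mlp_ratio)
        self.fc2 = nn.Linear(dim * mlp_ratio, dim)
        self.drop1, self.drop2 = nn.Dropout(drop), nn.Dropout(drop)
        self.skip_drop = nn.Dropout(drop)                      # extra dropout on the skip connection

    def attention(self, z):
        B, N, D = z.shape
        h, d = self.heads, D // self.heads
        q, k, v = (t.view(B, N, h, d).transpose(1, 2) for t in (self.q(z), self.k(z), self.v(z)))
        a = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # scaled dot-product attention
        return self.out((a @ v).transpose(1, 2).reshape(B, N, D))

    def forward(self, z):
        z = z + self.skip_drop(self.attention(self.norm1(z)))                               # first residual block
        z = z + self.drop2(self.fc2(self.drop1(F.gelu(self.fc1(self.norm2(z))))))           # second residual block
        return z
```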
In order to better exploit the complementarity between the RGB image and the depth image, a lateral connection is added between the two channels of the SR-MGA: the output of the Nth global attention coding module in the global attention coding network corresponding to the HHA image is added to the input of the (N+1)th global attention coding module in the global attention coding network corresponding to the RGB image, with N set to 10.
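A sketch of how the two channels, the lateral connection at N = 10, and the feature fusion and classification networks could fit together, reusing the GraphEmbedding and GlobalAttentionBlock modules sketched above. Fusing the two learnable category vectors by concatenation followed by a single fully connected classifier is an assumption consistent with the description, not a detail stated in the disclosure.

```python
import torch
import torch.nn as nn

class SRMGA(nn.Module):
    def __init__(self, num_classes, dim=768, depth=12, lateral_n=10):
        super().__init__()
        self.embed_rgb, self.embed_hha = GraphEmbedding(dim=dim), GraphEmbedding(dim=dim)
        self.blocks_rgb = nn.ModuleList([GlobalAttentionBlock(dim) for _ in range(depth)])
        self.blocks_hha = nn.ModuleList([GlobalAttentionBlock(dim) for _ in range(depth)])
        self.lateral_n = lateral_n
        self.classifier = nn.Linear(2 * dim, num_classes)     # classification network on the fused features

    def forward(self, rgb, hha):
        zr, zh = self.embed_rgb(rgb), self.embed_hha(hha)
        for i, (br, bh) in enumerate(zip(self.blocks_rgb, self.blocks_hha), start=1):
            zr, zh = br(zr), bh(zh)
            if i == self.lateral_n:                           # lateral connection: output of HHA block N is
                zr = zr + zh                                  # added to the input of RGB block N+1
        feat = torch.cat([zr[:, 0], zh[:, 0]], dim=-1)        # fuse the two learnable category vectors
        return self.classifier(feat)
```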
Step 3: send the training set divided in step 1 into the deep neural network for training. During training, a weighted cross-entropy loss function is used to address the class-imbalance problem.
The weight calculation formula is:
w_a = ( Σ_{n=1}^{A} N_n ) / ( A · N_a )
where a denotes the a-th scene category, A is the total number of scene categories, and N_n is the number of images in the n-th scene category. The weighted cross-entropy loss function is:
L = − Σ_{a=1}^{A} w_a · y_a · log(p_a)
where y_a indicates whether the true category is a and p_a is the predicted probability of category a.
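A short sketch of the weighted loss used in step 3, assuming per-class weights inversely proportional to class frequency (the normalisation shown is an assumption) and PyTorch's built-in weighted cross-entropy.

```python
import torch
import torch.nn as nn

def make_class_weights(counts):
    """counts[a] = number of training images of scene category a.
    Rare categories get a larger weight, frequent categories a smaller one
    (assumed inverse-frequency normalisation)."""
    counts = torch.tensor(counts, dtype=torch.float)
    return counts.sum() / (len(counts) * counts)

# usage inside the training loop (model, loader and labels are assumed to exist):
# criterion = nn.CrossEntropyLoss(weight=make_class_weights(train_counts).to(device))
# loss = criterion(model(rgb, hha), labels)
```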
Step 4: when identifying scene pictures, the paired RGB images and HHA images obtained in step 1 are input into the SR-MGA network model to obtain, for each paired multi-modal image, the prediction probability of every scene category; if the category with the highest prediction probability is consistent with the true category, the prediction is correct. Finally, the classification accuracy of the scene images is obtained, namely the ratio of the number of correct predictions to the total number of predictions.
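A minimal sketch of the evaluation in step 4: the scene category with the highest predicted probability is compared with the true category for every paired RGB/HHA test image, and the classification accuracy is the ratio of correct predictions to the total number of predictions. The model and test_loader names are illustrative.

```python
import torch

@torch.no_grad()
def evaluate(model, test_loader, device="cuda"):
    model.eval()
    correct = total = 0
    for rgb, hha, labels in test_loader:                # paired RGB / HHA images with true categories
        logits = model(rgb.to(device), hha.to(device))
        pred = logits.argmax(dim=1).cpu()               # category with the highest predicted probability
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total                              # classification accuracy
```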
The method effectively exploits the complementarity between the RGB image and the depth image: the RGB image and the depth image are each processed by a global attention coding network to obtain corresponding learnable category vectors, which are then used for scene classification.
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (6)

1. A scene recognition method based on multi-modal information and a global attention mechanism is characterized in that: the scene recognition method comprises the following steps:
step 1: selecting RGB images and depth images of a plurality of scenes from a multi-modal scene database, recoding the depth images by using three channels, pairing the coded depth images with the RGB images, and dividing the paired images into a training set and a test set according to corresponding proportions;
step 2: constructing an end-to-end trainable double-channel deep neural network model combining a global attention mechanism and multi-mode information;
and step 3: sending the training set divided in the step 1 into the two-channel deep neural network in the step 2 for training;
step 4: identifying a scene picture: inputting the paired RGB images and HHA images obtained in step 1 into the two-channel deep neural network model of step 2 to obtain, for each paired multi-modal image, the prediction probability value of each scene category among the plurality of scene categories; if the scene category with the highest prediction probability value is consistent with the true category, the prediction is correct, and finally the classification accuracy of the scene images is obtained.
2. The method of claim 1, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: the two-channel deep neural network model in the step 2 comprises a graph embedding network, a global attention coding network, a feature fusion network and a classification network, wherein two channels are formed by the graph embedding network and the global attention coding network, and the construction process of the two-channel deep neural network model is as follows:
step 2-1: inputting the RGB image and the depth image which are well paired in the step 1 into an image embedding network respectively to obtain a corresponding RGB image block sequence and a corresponding depth image block sequence;
step 2-2: inputting the RGB image block sequence and the depth image block sequence obtained in the step 2-1 into a global attention coding network for learning;
step 2-3: and sending the RGB image characteristics and the depth image characteristics obtained by learning into a characteristic fusion network for splicing to obtain fusion characteristics, and finally sending the characteristics into a classification network.
3. The method of claim 2, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: in the step 2-2, a lateral connection is added between the two-channel global attention coding networks, specifically: and adding the output of the Nth global attention coding module in the global attention coding network corresponding to the depth image to the input of the (N + 1) th global attention coding module in the global attention coding network corresponding to the RGB image.
4. The method of claim 2, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: the global attention coding network in step 2 is composed of 12 identical global attention coding modules, wherein each global attention coding module is composed of two residual blocks connected in series; the first residual block is composed of a layer normalization, three fully connected layers, a self-attention mechanism and a fully connected layer; the second residual block is composed of a layer normalization, two fully connected layers and two feature loss layers; and a feature loss layer is added at the skip connections of the two residual blocks.
5. The method of claim 1, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: when the training set divided in the step 1 is sent to the two-channel deep neural network in the step 2 for training, the problem of class imbalance is solved by using a cross entropy loss function with weight, the weight of classes with small number is improved, and the weight of classes with large number is reduced, wherein a weight calculation formula is as follows:
w_a = ( Σ_{n=1}^{A} N_n ) / ( A · N_a )
where a denotes the a-th scene category, A is the total number of scene categories, and N_n is the number of images in the n-th scene category;
the weighted cross-entropy loss function is:
L = − Σ_{a=1}^{A} w_a · y_a · log(p_a)
where y_a indicates whether the true category is a and p_a is the predicted probability of category a.
6. The method of claim 1, wherein the scene recognition method based on multi-modal information and global attention mechanism comprises: the three channels in step 1 refer, respectively, to the horizontal disparity, the height above ground, and the angle between the local surface normal at each pixel and the inferred gravity direction.
CN202111592561.1A 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism Pending CN114266938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111592561.1A CN114266938A (en) 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111592561.1A CN114266938A (en) 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism

Publications (1)

Publication Number Publication Date
CN114266938A true CN114266938A (en) 2022-04-01

Family

ID=80829307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111592561.1A Pending CN114266938A (en) 2021-12-23 2021-12-23 Scene recognition method based on multi-mode information and global attention mechanism

Country Status (1)

Country Link
CN (1) CN114266938A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898080A (en) * 2022-04-19 2022-08-12 杭州电子科技大学 Image imaging equipment identification method based on ViT network
CN115359306A (en) * 2022-10-24 2022-11-18 中铁科学技术开发有限公司 Intelligent identification method and system for high-definition images of railway freight inspection
CN117752308A (en) * 2024-02-21 2024-03-26 中国科学院自动化研究所 epilepsy prediction method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582225A (en) * 2020-05-19 2020-08-25 长沙理工大学 Remote sensing image scene classification method and device
CN111860116A (en) * 2020-06-03 2020-10-30 南京邮电大学 Scene identification method based on deep learning and privilege information
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
CN111582225A (en) * 2020-05-19 2020-08-25 长沙理工大学 Remote sensing image scene classification method and device
CN111860116A (en) * 2020-06-03 2020-10-30 南京邮电大学 Scene identification method based on deep learning and privilege information
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898080A (en) * 2022-04-19 2022-08-12 杭州电子科技大学 Image imaging equipment identification method based on ViT network
CN114898080B (en) * 2022-04-19 2024-05-31 杭州电子科技大学 Image imaging equipment identification method based on ViT network
CN115359306A (en) * 2022-10-24 2022-11-18 中铁科学技术开发有限公司 Intelligent identification method and system for high-definition images of railway freight inspection
CN117752308A (en) * 2024-02-21 2024-03-26 中国科学院自动化研究所 epilepsy prediction method and device
CN117752308B (en) * 2024-02-21 2024-05-24 中国科学院自动化研究所 Epilepsy prediction method and device

Similar Documents

Publication Publication Date Title
CN114266938A (en) Scene recognition method based on multi-mode information and global attention mechanism
CN111325797A (en) Pose estimation method based on self-supervision learning
CN111368943B (en) Method and device for identifying object in image, storage medium and electronic device
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
WO2024060321A1 (en) Joint modeling method and apparatus for enhancing local features of pedestrians
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN114359130A (en) Road crack detection method based on unmanned aerial vehicle image
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
Zhang et al. LiSeg: Lightweight road-object semantic segmentation in 3D LiDAR scans for autonomous driving
CN112348033A (en) Cooperative significance target detection method
CN115409989A (en) Three-dimensional point cloud semantic segmentation method for optimizing boundary
CN117274883A (en) Target tracking method and system based on multi-head attention optimization feature fusion network
Mukhopadhyay et al. A hybrid lane detection model for wild road conditions
Hou et al. Fe-fusion-vpr: Attention-based multi-scale network architecture for visual place recognition by fusing frames and events
CN114596548A (en) Target detection method, target detection device, computer equipment and computer-readable storage medium
Deng et al. Incremental joint learning of depth, pose and implicit scene representation on monocular camera in large-scale scenes
CN117710429A (en) Improved lightweight monocular depth estimation method integrating CNN and transducer
CN117975565A (en) Action recognition system and method based on space-time diffusion and parallel convertors
CN113870312A (en) Twin network-based single target tracking method
CN110516640B (en) Vehicle re-identification method based on feature pyramid joint representation
CN116501908B (en) Image retrieval method based on feature fusion learning graph attention network
CN116994164A (en) Multi-mode aerial image fusion and target detection combined learning method
CN116824133A (en) Intelligent interpretation method for remote sensing image
Hu et al. Lightweight attention‐guided redundancy‐reuse network for real‐time semantic segmentation
Xu et al. Unsupervised learning of depth estimation and camera pose with multi-scale GANs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination