CN113822232B - Pyramid attention-based scene recognition method, training method and device - Google Patents

Pyramid attention-based scene recognition method, training method and device

Info

Publication number
CN113822232B
CN113822232B (application CN202111372903.9A)
Authority
CN
China
Prior art keywords
attention
layer
final
depth
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111372903.9A
Other languages
Chinese (zh)
Other versions
CN113822232A (en)
Inventor
杨铀
熊若非
刘琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202111372903.9A priority Critical patent/CN113822232B/en
Publication of CN113822232A publication Critical patent/CN113822232A/en
Application granted
Publication of CN113822232B publication Critical patent/CN113822232B/en
Priority to US17/835,361 priority patent/US11514660B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pyramid attention-based scene recognition method, training method and device, belonging to the field of computer vision. The method comprises the following steps: pyramid layering is performed on the color feature map and the depth feature map respectively, and the attention map and attention output corresponding to each layer are calculated based on an attention mechanism; the attention output of the last layer is taken as the final feature map of the last layer, and for each remaining layer the final feature map of the previous layer is up-sampled and added to the attention output of the current layer to obtain the final feature map of the current layer; for each layer, the attention map and the final feature map are scale-transformed into two new attention maps, whose average is taken as the final attention map, and the k largest positions in the final attention map are mapped onto the final feature map of the layer to obtain the local features of the layer; after the global features and the local features of each layer are fused, the accuracy of scene recognition can be improved.

Description

Pyramid attention-based scene recognition method, training method and device
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a scene recognition method, a training method and a device based on pyramid attention.
Background
Since indoor scenes usually contain many objects and have diversified spatial layouts, it is difficult to obtain a robust indoor scene representation. A depth map can provide information on spatial layout and geometric position, and RGB-D scene recognition has developed rapidly in recent years. However, global features alone are not sufficient to represent complex indoor scenes. Representing a scene with local, object-based features can avoid noisy information in some scenes, but using only local or only global features may result in poor recognition performance. At the same time, not all objects contribute to scene recognition, so the designed model must be able to adaptively select the features that are critical to scene recognition. In addition, the semantic gap between the two modalities cannot be ignored, and how to effectively realize multi-modal fusion still requires further research.
Chinese patent CN113408590A discloses a scene recognition method, a training method, a device, an electronic device and a program product based on a graph convolution network. On the basis of extracting the global features of the two modal images, the method firstly utilizes a space attention mechanism to extract important local features in a color image and a depth image, and utilizes a graph convolution network to aggregate and update the local features of the two modal images so as to reduce the semantic difference between the two modes and further improve the accuracy of scene recognition.
However, the method only considers local features of a single scale, and is not suitable for indoor scenes with various object types and diversified layouts.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides a scene recognition method based on pyramid attention, a training method, a training device, electronic equipment and a computer-readable storage medium, and aims to solve the technical problem that the local features of a single scale are not enough to express a complex indoor scene in the prior art.
In order to achieve the above object, in a first aspect, the present invention provides a method for scene recognition based on pyramid attention, including:
acquiring a color image and a depth image of a scene to be identified, and performing feature extraction respectively to obtain a corresponding color feature map and depth feature map; performing feature transformation on the color feature map and the depth feature map respectively to obtain corresponding color global features and depth global features; performing pyramid layering on the color feature map and the depth feature map respectively, and calculating the attention map and attention output corresponding to each layer based on an attention mechanism; taking the attention output of the last layer as the final feature map of the last layer, and, for each remaining layer, adding the up-sampled final feature map of the previous layer to the attention output of the current layer to obtain the final feature map of the current layer; performing scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; taking the average of the two new attention maps as the final attention map, and mapping the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer; and fusing the color global features, the depth global features and the local features of each layer to obtain multi-modal features of the scene to be recognized, and performing scene recognition based on the multi-modal features.
Further, the performing scale transformation on the attention map and the final feature map corresponding to each of the layers to obtain two new attention maps includes: summing the attention diagrams corresponding to each layer in the layers along the column direction and performing reshape operation to obtain a new attention diagram; and performing two-dimensional convolution operation on the final characteristic diagram corresponding to each layer in the layers to obtain another new attention diagram.
Further, the fusing the color global features, the depth global features and the local features of each layer to obtain the multi-modal features of the scene to be recognized includes: performing semantic-based feature fusion on the local features of each layer by using a GCN algorithm to obtain final local features; and fusing the color global feature, the depth global feature and the final local feature to obtain the multi-modal feature of the scene to be recognized.
Further, the performing semantic-based feature fusion on the local features of each layer by using the GCN algorithm to obtain a final local feature includes: respectively constructing a color image structure and a depth image structure based on the local features of each layer of the color feature image and the local features of each layer of the depth feature image, wherein the color image structure is used for representing the position incidence relation among object nodes in the color image, and the depth image structure is used for representing the position incidence relation among the object nodes in the depth image; according to the characteristics of the nodes in the color graph structure, connecting the nodes of each layer in the color graph structure by sparse connection, and obtaining a first local characteristic through the aggregation and updating operation of a GCN algorithm; according to the characteristics of the nodes in the depth map structure, connecting the nodes of each layer in the depth map structure by sparse connection, and obtaining a second local characteristic through aggregation and update operations of a GCN algorithm; according to the characteristics of the nodes in the color graph structure and the characteristics of the nodes in the depth graph structure, connecting the nodes of each layer in the color graph structure and the nodes of the corresponding layer in the depth graph structure by sparse connection, and obtaining a third local characteristic through the aggregation and update operations of a GCN algorithm; and performing cascade processing and feature transformation on the first local feature, the second local feature and the third local feature to obtain a final local feature.
In a second aspect, the present invention provides a training method for a scene recognition model, including:
acquiring a training data set, wherein the training data set comprises at least one group of color training images, depth training images and scene category labels of training scenes; and training a preset scene recognition model by using the training data set to obtain a trained scene recognition model, wherein the trained scene recognition model is used for processing the color training image and the depth training image according to the scene recognition method of any one of the first aspect.
Further, the training a preset scene recognition model by using the training data set includes: inputting the color training images and the depth training images into a preset scene recognition model, so that the preset scene recognition model respectively extracts the characteristics of the color training images and the depth training images, and obtaining color global training characteristics corresponding to the color training images, depth global training characteristics corresponding to the depth training images and local training characteristics of each layer; fusing the color global training features, the depth global training features and the local training features of all layers to obtain multi-mode training features of a training scene; and performing parameter adjustment processing on the preset scene recognition model based on the cross entropy loss function of the multi-modal training features and the scene category labels until the training is completed.
In a third aspect, the present invention provides a pyramid attention-based scene recognition apparatus, including:
the first image acquisition module is used for acquiring a color image and a depth image of a scene to be identified, and performing feature extraction respectively to obtain a corresponding color feature map and depth feature map; the global feature acquisition module is used for performing feature transformation on the color feature map and the depth feature map respectively to obtain corresponding color global features and depth global features; the local feature acquisition module is used for performing pyramid layering on the color feature map and the depth feature map respectively, and calculating the attention map and attention output corresponding to each layer based on an attention mechanism; taking the attention output of the last layer as the final feature map of the last layer, and, for each remaining layer, adding the up-sampled final feature map of the previous layer to the attention output of the current layer to obtain the final feature map of the current layer; performing scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; taking the average of the two new attention maps as the final attention map, and mapping the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer; and the fusion and recognition module is used for fusing the color global features, the depth global features and the local features of each layer to obtain multi-modal features of the scene to be recognized, and performing scene recognition based on the multi-modal features.
In a fourth aspect, the present invention provides a training apparatus for a scene recognition model, including:
the second image acquisition module is used for acquiring a training data set, wherein the training data set comprises at least one group of color training images, depth training images and scene category labels of a training scene; a training module, configured to train a preset scene recognition model with the training data set to obtain a trained scene recognition model, where the trained scene recognition model is used to process the color training image and the depth training image according to the scene recognition method of any one of the first aspect.
In a fifth aspect, the present invention provides an electronic device, comprising: a memory and at least one processor; the memory stores computer-executable instructions; the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method for scene recognition according to any one of the first aspect, or the method for training the scene recognition model according to any one of the second aspect.
In a sixth aspect, the present invention provides a computer-readable storage medium, having stored thereon computer-executable instructions, which, when executed by a processor, implement the scene recognition method according to any one of the first aspect, or implement the training method of the scene recognition model according to any one of the second aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
after feature extraction is performed on the color image and the depth image of the scene to be recognized, pyramid layering is performed on the color feature map and the depth feature map respectively, and the attention map and attention output corresponding to each layer are calculated based on an attention mechanism; the attention output of the last layer is taken as the final feature map of the last layer, and for each remaining layer the final feature map of the previous layer is up-sampled and added to the attention output of the current layer to obtain the final feature map of the current layer; the attention map and the final feature map of each layer are then scale-transformed into two new attention maps, whose average is taken as the final attention map, and the k largest positions in the final attention map are mapped onto the final feature map of the layer to obtain the local features of the layer; the local features of all layers are further fused to obtain the final local feature. Compared with existing methods for acquiring local features, the method can extract local features that capture long-range dependencies and can express complex indoor scenes, so the accuracy of scene recognition can be improved after the global features and the local features are fused.
Drawings
Fig. 1 is a flowchart illustrating a scene recognition method based on pyramid attention according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating a process of calculating a final feature map of each layer according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart illustrating a process of fusing local features of layers according to an embodiment of the present invention.
Fig. 4 is a schematic flowchart of a training method for a scene recognition model according to an embodiment of the present invention.
Fig. 5 is a block diagram of a scene recognition apparatus based on pyramid attention according to an embodiment of the present invention.
Fig. 6 is a block diagram of a training apparatus for a scene recognition model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example one
Referring to fig. 1, an embodiment of the present invention provides a scene identification method based on pyramid attention, including:
s101, acquiring a color image and a depth image of a scene to be identified, and respectively extracting features to obtain a corresponding color feature map and a corresponding depth feature map.
In this embodiment, the feature extraction algorithm may be a ResNet101 algorithm, a VGG algorithm, an AlexNet algorithm, or the like. By the feature extraction algorithm, the corresponding color feature map and depth feature map can be obtained.
And S102, respectively carrying out feature transformation on the color feature map and the depth feature map to obtain corresponding color global features and depth global features.
In this embodiment, the two fully-connected layers of the preset feature extraction network are used to perform feature transformation on the color feature map and the depth feature map respectively, so as to obtain corresponding color global features and depth global features.
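As an illustrative sketch only (not the disclosed implementation), steps S101 and S102 could be realized in PyTorch with a ResNet-101 backbone per modality followed by two fully-connected layers. The class name GlobalBranch, the feature dimensions, the global average pooling before the fully-connected layers and the replication of the single-channel depth map to three channels are assumptions made for this example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class GlobalBranch(nn.Module):
    """Extracts a feature map with a ResNet-101 backbone and transforms it
    into a global feature vector with two fully-connected layers."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet101(weights=None)  # pretrained weights optional
        # keep everything up to the last residual block (drop avgpool / fc)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.fc = nn.Sequential(
            nn.Linear(2048, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, x):
        fmap = self.encoder(x)             # (B, 2048, H/32, W/32) feature map
        pooled = fmap.mean(dim=(2, 3))     # global average pooling (assumption)
        return fmap, self.fc(pooled)       # feature map + global feature

# one branch per modality; the depth map is tiled to 3 channels (assumption)
rgb_branch, depth_branch = GlobalBranch(), GlobalBranch()
rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224).repeat(1, 3, 1, 1)
F_r, g_r = rgb_branch(rgb)
F_d, g_d = depth_branch(depth)
```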
S103, pyramid layering is carried out on the color feature map and the depth feature map respectively, and attention maps corresponding to the layers and attention outputs are obtained through calculation based on an attention mechanism; and taking the attention output of the last layer as the final feature map of the last layer, and adding the result obtained after the upsampling of the final feature map of the previous layer and the attention output of the current layer by the rest layers to obtain the final feature map of the current layer.
It should be noted that the number of pyramid layers may be selected according to actual experimental results: too few layers are not enough to express multi-level features, while too many layers bring a larger amount of computation. In this embodiment, the number of pyramid layers is 3. In the present invention, a Transformer is used as the attention mechanism for capturing non-local dependencies, but other attention mechanisms may be used instead.
Let F_r and F_d denote the last-layer feature maps of the RGB and depth modalities, respectively, each of size (B, C, H, W), where B is the batch size during training, C the number of channels, and H and W the height and width of the feature map. Taking the RGB map as an example, as shown in Fig. 2, feature maps at the smaller pyramid scales are derived from F_r, and two-dimensional convolutions are used to compute Q, K and V of the Transformer structure at each pyramid scale m, yielding the attention map

A_m = softmax(Q_m K_m^T)

where T denotes the transpose operation, and the softmax activation function is used to regularize the calculated attention map.

The final self-attention output X_m can then be calculated as

X_m = A_m V_m
since the low-resolution feature map usually contains more semantic information, and the high-resolution feature map has more spatial information, the two can be complementary. Therefore, fusing features of different scales is more helpful for the selection of the subsequent key features.
After the attention maps and attention outputs of all layers are obtained, the attention output of the last layer is taken as the final feature map of the last layer, and for each remaining layer the up-sampled final feature map of the previous layer is added to the attention output of the current layer to obtain the final feature map of the current layer. As shown in Fig. 2, the final feature map F_{m+1} of the previous (coarser) layer is up-sampled and added to the attention output X_m of the current layer to form the final feature map F_m of the layer:

F_m = Up(F_{m+1}) + X_m

where Up denotes an up-sampling operation.
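A minimal PyTorch sketch of this pyramid attention and top-down fusion step is given below, under the assumptions that the pyramid scales are built by average-pooling the last-layer feature map, that Q, K and V come from 1x1 two-dimensional convolutions, and that bilinear interpolation is used for up-sampling; the names PyramidSelfAttention and pyramid_features are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSelfAttention(nn.Module):
    """Self-attention at one pyramid scale: Q, K, V from 2-D convolutions,
    A = softmax(Q K^T), output X = A V reshaped back to a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.k(x).flatten(2)                   # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k, dim=-1)        # (B, HW, HW) attention map
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return attn, out

def pyramid_features(fmap, num_levels=3):
    """Build pyramid scales by average pooling, run attention at each scale,
    then fuse top-down: the coarser final map is up-sampled and added."""
    scales = [fmap if i == 0 else F.avg_pool2d(fmap, 2 ** i)
              for i in range(num_levels)]
    # in a real model the attention layers would be registered parameters,
    # possibly one per scale (assumption); a single shared layer is used here
    attn_layer = PyramidSelfAttention(fmap.shape[1])
    attns, outs = zip(*[attn_layer(s) for s in scales])
    finals = [None] * num_levels
    finals[-1] = outs[-1]                     # last (coarsest) layer: final = attention output
    for m in range(num_levels - 2, -1, -1):   # remaining layers, coarse to fine
        up = F.interpolate(finals[m + 1], size=outs[m].shape[-2:],
                           mode='bilinear', align_corners=False)
        finals[m] = outs[m] + up              # F_m = Up(F_{m+1}) + X_m
    return list(attns), finals

attn_maps, final_maps = pyramid_features(torch.randn(2, 64, 28, 28))
```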
S104, performing scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; taking the average of the two new attention maps as the final attention map, and mapping the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer.
In this embodiment, since the scene does not carry accurate label information for the key features, it is difficult to train the network model directly to find the key features. Even with an attention mechanism, it is difficult to obtain valid features in complex indoor scenes without the related constraints. In order to ensure the effectiveness of node selection, the attention map and the final feature map corresponding to each layer are scale-transformed into two new attention maps A'_m and A''_m:

A'_m = Re(Sum(A_m))

A''_m = Conv(F_m)

where Re and Sum denote the reshape operation and summation along the column direction respectively, Conv denotes a two-dimensional convolution operation, and m denotes the pyramid layer index. By forcing the two attention maps to be similar in spatial position during training, an effective representation of the key features can be obtained at each pyramid scale.

Finally, the average of the two new attention maps A'_m and A''_m is used as the final attention map, and the k largest positions in the final attention map are mapped onto the final feature map of the layer to obtain the local features of the layer.
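The node selection of step S104 could be sketched as follows, assuming the attention map has shape (B, HW, HW), that the column-direction sum is taken over the first spatial dimension, and that the two-dimensional convolution produces a single-channel map; the function name select_local_features and the 1x1 kernel are illustrative choices rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

def select_local_features(attn, final_map, conv1x1, k=16):
    """Node selection for one pyramid layer.

    attn:      (B, HW, HW) attention map from the Transformer-style attention
    final_map: (B, C, H, W) final feature map of the layer
    conv1x1:   2-D convolution producing a 1-channel spatial attention map
    Returns the k selected local features with shape (B, k, C).
    """
    b, c, h, w = final_map.shape
    # new attention map 1: sum along the column direction, then reshape to (B, H, W)
    a1 = attn.sum(dim=1).reshape(b, h, w)
    # new attention map 2: 2-D convolution on the final feature map
    a2 = conv1x1(final_map).squeeze(1)
    final_attn = (a1 + a2) / 2                      # average as the final attention map
    # indices of the k largest positions in the final attention map
    idx = final_attn.flatten(1).topk(k, dim=1).indices          # (B, k)
    feats = final_map.flatten(2).transpose(1, 2)                # (B, HW, C)
    local = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, c))
    return local                                                # (B, k, C)

# usage with assumed shapes (e.g. H = W = 7)
conv = nn.Conv2d(256, 1, kernel_size=1)
attn = torch.rand(2, 49, 49)
fmap = torch.randn(2, 256, 7, 7)
nodes = select_local_features(attn, fmap, conv, k=16)   # (2, 16, 256)
```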
And S105, fusing the color global features, the depth global features and the local features of all layers to obtain multi-modal features of the scene to be recognized, and recognizing the scene based on the multi-modal features.
Further, performing semantic-based feature fusion on the local features of each layer by using a GCN algorithm to obtain final local features; and then fusing the color global feature, the depth global feature and the final local feature to obtain the multi-modal feature of the scene to be recognized.
Furthermore, a color image structure and a depth image structure are respectively constructed based on the local features of each layer of the color feature image and the local features of each layer of the depth feature image, wherein the color image structure is used for representing the position association relationship among the object nodes in the color image, and the depth image structure is used for representing the position association relationship among the object nodes in the depth image; according to the characteristics of the nodes in the color graph structure, connecting the nodes of each layer in the color graph structure by sparse connection, and obtaining a first local characteristic through the aggregation and updating operation of a GCN algorithm; according to the characteristics of the nodes in the depth map structure, connecting the nodes of each layer in the depth map structure by sparse connection, and obtaining a second local characteristic through aggregation and update operations of a GCN algorithm; according to the characteristics of the nodes in the color graph structure and the characteristics of the nodes in the depth graph structure, connecting the nodes of each layer in the color graph structure and the nodes of the corresponding layer in the depth graph structure by sparse connection, and obtaining a third local characteristic through the aggregation and update operations of a GCN algorithm; and performing cascade processing and feature transformation on the first local feature, the second local feature and the third local feature to obtain a final local feature.
Illustratively, in order to effectively fuse the complementary information of the two modalities based on the selected features, a hierarchical graph model G = (V, E) is constructed to represent the indoor scene, where V denotes the selected local features described above and E denotes the connections between nodes. V can be divided into two categories: 2D color image nodes V_r and 3D depth map nodes V_d. E comprises three parts: connections within a single modality at a single scale, connections across modalities at a single scale, and connections within a single modality across scales.
Single-modal single-scale graph connections: First consider the construction of the single-modal single-scale graph model. Each node contributes differently to the scene recognition task, and this should be reflected in the graph modelling. In our graph model, the importance of each node is represented by its value in the attention map, where a larger value means a greater contribution to scene recognition. In addition, the nodes of the graph are represented as high-dimensional feature vectors along the channel direction, which helps to represent the key features in the scene. Specifically, the node selection of the previous step yields node features of shape (B, k, C) for the m-th scale, denoted V_m^r. Taking m = 1 as an example, we set k = 16, comprising 1 primary central node, 3 secondary central nodes and 12 leaf nodes. To build the intra-modal connections, the 3 secondary central nodes are connected to the primary central node, and the remaining leaf nodes are connected to the secondary central nodes according to Euclidean distance.
Multi-modal single-scale graph connections: Even in the same scene, the local features of the two modalities are different; in other words, there is a semantic gap between the two modalities. Thus, a sparse connection between the selected features of the two modalities is more appropriate than a full connection. When considering the connection of the RGB and depth modalities, only the corresponding primary central nodes and the corresponding secondary central nodes are connected, where v_{i,j}^r and v_{i,j}^d denote the j-th node at the i-th layer of the color graph and the depth graph, respectively.
Single-modal multi-scale graph connections: In order to exploit the multi-scale features, the relationships between different scales in the graph also need to be considered. Furthermore, given that node features can propagate over the entire graph within a few iterations, sparse connections are also used to construct the single-modal multi-scale graph. Taking m = 1 and m = 2 as an example, the nodes at scale 1 are connected to the corresponding primary central node and secondary central nodes at scale 2; the same applies to the depth images.
The multi-modal single-scale graph and the single-modal multi-scale graph are combined to obtain the final hierarchical graph. For each node v_{i,j}^r and v_{i,j}^d, its updated representation is learned by aggregating the features of its neighbors. Finally, the updated features are fused together to generate the final local representation for RGB-D scene recognition. Taking a pyramid of 3 levels as an example, the constructed hierarchical graph model is shown in Fig. 3.
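The aggregation and update over the hierarchical graph could be sketched with a standard graph-convolution layer; the symmetric normalization below follows the common Kipf & Welling formulation, which is an assumption here, as are the node count, feature dimension and the mean pooling of the updated nodes into a single local feature.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.shape[-1], device=adj.device)   # add self-loops
        deg = a_hat.sum(dim=-1)
        d_inv_sqrt = deg.clamp(min=1e-6).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(-1) * a_hat * d_inv_sqrt.unsqueeze(-2)
        return torch.relu(self.linear(norm_adj @ h))

# all RGB and depth nodes from every pyramid scale stacked into one node set,
# with `adj` holding the sparse intra-modal, cross-modal and cross-scale edges
num_nodes, dim = 96, 256          # e.g. 2 modalities x 3 scales x 16 nodes (assumption)
nodes = torch.randn(num_nodes, dim)
adj = torch.zeros(num_nodes, num_nodes)    # filled by the connection rules above
gcn = SimpleGCNLayer(dim, dim)
updated = gcn(nodes, adj)                  # aggregated and updated node features
local_feature = updated.mean(dim=0)        # pooled into one local representation (assumption)
```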
After the final local features are obtained, fusing the color global features, the depth global features and the final local features to obtain multi-modal features of the scene to be recognized; and carrying out scene recognition on the multi-modal characteristics of the scene to be recognized to obtain a recognition result of the scene to be recognized.
Example two
Referring to fig. 4, a schematic flow chart of a training method for a scene recognition model provided in an embodiment of the present invention includes:
s401, acquiring a training data set, wherein the training data set comprises at least one group of color training images, depth training images and scene category labels of a training scene;
s402, training a preset scene recognition model by using the training data set to obtain a trained scene recognition model, wherein the trained scene recognition model is used for processing the color training image and the depth training image according to the scene recognition method in the first embodiment.
In this embodiment, the training data set may be a SUN RGBD data set or a NYU Depth v2 data set. The training data set comprises a plurality of groups of training scenes, each group of training scenes comprises a plurality of training scenes, and each training scene comprises a color training image, a depth training image and a scene class label corresponding to the training scene.
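A minimal sketch of a dataset wrapper reflecting this triplet structure is given below; the actual file layout and loading of SUN RGB-D or NYU Depth v2 are not specified here, so the sketch simply wraps already-loaded tensors, and the class name RGBDSceneDataset is illustrative.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RGBDSceneDataset(Dataset):
    """Each sample is a (color image, depth image, scene label) triplet."""
    def __init__(self, rgb_images, depth_images, labels):
        assert len(rgb_images) == len(depth_images) == len(labels)
        self.rgb, self.depth, self.labels = rgb_images, depth_images, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return self.rgb[i], self.depth[i], self.labels[i]

# toy tensors standing in for a real SUN RGB-D / NYU Depth v2 loader
dataset = RGBDSceneDataset(torch.randn(8, 3, 224, 224),
                           torch.randn(8, 1, 224, 224),
                           torch.randint(0, 19, (8,)))
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```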
Further, training a preset scene recognition model by using the training data set, including: inputting the color training images and the depth training images into a preset scene recognition model, so that the preset scene recognition model respectively extracts the characteristics of the color training images and the depth training images, and obtaining color global training characteristics corresponding to the color training images, depth global training characteristics corresponding to the depth training images and local training characteristics of each layer; fusing the color global training features, the depth global training features and the local training features of all layers to obtain multi-mode training features of a training scene; and performing parameter adjustment processing on the preset scene recognition model based on the cross entropy loss function of the multi-modal training features and the scene category labels until the training is completed.
Specifically, not only are the features of the two modalities complementary; the global and local features are also complementary for scene recognition. As described above, F_r and F_d denote the final-layer feature maps of the RGB and depth modalities, respectively. The global features G_r and G_d are obtained by passing F_r and F_d through a fully connected layer, and two cross-entropy loss functions are used respectively to learn the global features. In addition, the local feature learned through the hierarchical graph model is denoted L_g. The local feature L_g and the global features G_r and G_d are then concatenated into the fused feature F used for final scene recognition:

F = Cat(G_r, G_d, L_g)

where Cat denotes the concatenation operation.
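The concatenation and classification step could look as follows; the feature dimensions and the 19 scene categories are example values, and the single linear classifier on the fused feature is an assumption.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates the two global features and the local feature from the
    hierarchical graph model, then predicts the scene category."""
    def __init__(self, global_dim=512, local_dim=256, num_classes=19):
        super().__init__()
        self.classifier = nn.Linear(2 * global_dim + local_dim, num_classes)

    def forward(self, g_r, g_d, local_feat):
        fused = torch.cat([g_r, g_d, local_feat], dim=1)   # F = Cat(G_r, G_d, L_g)
        return self.classifier(fused), fused

model = FusionClassifier()
g_r, g_d = torch.randn(2, 512), torch.randn(2, 512)
local_feat = torch.randn(2, 256)
logits, fused = model(g_r, g_d, local_feat)   # logits used for scene recognition
```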
Finally, the final scene recognition result is predicted with an additional cross-entropy loss function, and the total loss comprises three parts: 1) the global feature losses L_r and L_d; 2) the final classification loss L_cls; 3) the similarity loss L_sim. The total loss L_total is calculated as

L_total = L_r + L_d + L_cls + L_sim

where L_r, L_d and L_cls are cross-entropy losses computed in the same manner.
It should be noted that in the testing phase, only the fused feature F is used for the final scene recognition task.
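A minimal sketch of this training objective is shown below. The plain sum of the four terms follows the decomposition above, while measuring the similarity loss as a mean-squared error between the two per-scale attention maps, and the use of separate classifiers for the two global features, are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_fused, logits_rgb, logits_depth, labels, attn_pairs):
    """Total training objective: two global-feature losses, the final
    classification loss, and a similarity loss between the paired
    attention maps of every pyramid scale."""
    l_r = F.cross_entropy(logits_rgb, labels)      # global RGB feature loss
    l_d = F.cross_entropy(logits_depth, labels)    # global depth feature loss
    l_cls = F.cross_entropy(logits_fused, labels)  # final classification loss
    # similarity loss: encourage the two per-scale attention maps to agree (assumption: MSE)
    l_sim = sum(F.mse_loss(a1, a2) for a1, a2 in attn_pairs)
    return l_r + l_d + l_cls + l_sim

labels = torch.tensor([3, 7])
logits_fused, logits_rgb, logits_depth = (torch.randn(2, 19) for _ in range(3))
attn_pairs = [(torch.rand(2, 7, 7), torch.rand(2, 7, 7)) for _ in range(3)]
loss = total_loss(logits_fused, logits_rgb, logits_depth, labels, attn_pairs)
loss.backward()   # parameter adjustment during training
```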
EXAMPLE III
Referring to fig. 5, an embodiment of the present invention provides a pyramid attention-based scene recognition apparatus 500, where the apparatus 500 includes:
a first image obtaining module 510, configured to obtain a color image and a depth image of a scene to be identified, and perform feature extraction respectively to obtain a corresponding color feature map and a corresponding depth feature map;
a global feature obtaining module 520, configured to perform feature transformation on the color feature map and the depth feature map, respectively, to obtain corresponding color global features and depth global features;
a local feature obtaining module 530, configured to perform pyramid layering on the color feature map and the depth feature map respectively, and calculate the attention map and attention output corresponding to each layer based on an attention mechanism; take the attention output of the last layer as the final feature map of the last layer, and, for each remaining layer, add the up-sampled final feature map of the previous layer to the attention output of the current layer to obtain the final feature map of the current layer; perform scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; take the average of the two new attention maps as the final attention map, and map the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer;
and the fusion and recognition module 540 is configured to fuse the color global features, the depth global features, and the local features of each layer to obtain multi-modal features of the scene to be recognized, and perform scene recognition based on the multi-modal features.
In this embodiment, please refer to the description in the first embodiment for the specific implementation of each module, which will not be repeated herein.
Example four
Referring to fig. 6, an embodiment of the present invention provides a training apparatus 600 for a scene recognition model, where the apparatus 600 includes:
a second image obtaining module 610, configured to obtain a training data set, where the training data set includes at least one set of color training images, depth training images, and scene category labels of a training scene;
a training module 620, configured to train a preset scene recognition model with the training data set to obtain a trained scene recognition model, where the trained scene recognition model is used to process the color training image and the depth training image according to the scene recognition method described in the first embodiment.
In this embodiment, please refer to the description in the second embodiment for the specific implementation of each module, which will not be repeated here.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A pyramid attention-based scene recognition method is characterized by comprising the following steps:
acquiring a color image and a depth image of a scene to be identified, and respectively extracting features to obtain a corresponding color feature map and a corresponding depth feature map;
respectively carrying out feature transformation on the color feature map and the depth feature map to obtain corresponding color global features and depth global features;
pyramid layering is carried out on the color feature map and the depth feature map respectively, and attention maps and attention outputs corresponding to the layers are obtained through calculation based on an attention mechanism; taking the attention output of the last layer as the final feature map of the last layer, and adding the result obtained after the up-sampling of the final feature map of the last layer and the attention output of the current layer by the rest layers to obtain the final feature map of the current layer;
respectively carrying out scale transformation on the attention diagram and the final characteristic diagram corresponding to each layer in the layers to obtain two new attention diagrams; taking the average value of the two new attention diagrams as a final attention diagram, and mapping the k largest positions in the final attention diagram onto the final characteristic diagram of the layer to obtain the local characteristics of the layer;
and fusing the color global features, the depth global features and the local features of all layers to obtain multi-modal features of the scene to be recognized, and recognizing the scene based on the multi-modal features.
2. The method of claim 1, wherein the scaling the attention map and the final feature map corresponding to each of the layers respectively to obtain two new attention maps comprises:
summing the attention diagrams corresponding to each layer in the layers along the column direction and performing reshape operation to obtain a new attention diagram;
and performing two-dimensional convolution operation on the final characteristic diagram corresponding to each layer in the layers to obtain another new attention diagram.
3. The method according to claim 1 or 2, wherein the fusing the color global features, the depth global features and the local features of each layer to obtain multi-modal features of the scene to be recognized comprises:
performing semantic-based feature fusion on the local features of each layer by using a GCN algorithm to obtain final local features;
and fusing the color global feature, the depth global feature and the final local feature to obtain the multi-modal feature of the scene to be recognized.
4. The method according to claim 3, wherein the performing semantic-based feature fusion on the local features of each layer by using the GCN algorithm to obtain the final local feature comprises:
respectively constructing a color image structure and a depth image structure based on the local features of each layer of the color feature image and the local features of each layer of the depth feature image, wherein the color image structure is used for representing the position incidence relation among object nodes in the color image, and the depth image structure is used for representing the position incidence relation among the object nodes in the depth image;
according to the characteristics of the nodes in the color graph structure, connecting the nodes of each layer in the color graph structure by sparse connection, and obtaining a first local characteristic through the aggregation and updating operation of a GCN algorithm;
according to the characteristics of the nodes in the depth map structure, connecting the nodes of each layer in the depth map structure by sparse connection, and obtaining a second local characteristic through aggregation and update operations of a GCN algorithm;
according to the characteristics of the nodes in the color graph structure and the characteristics of the nodes in the depth graph structure, connecting the nodes of each layer in the color graph structure and the nodes of the corresponding layer in the depth graph structure by sparse connection, and obtaining a third local characteristic through the aggregation and update operations of a GCN algorithm;
and performing cascade processing and feature transformation on the first local feature, the second local feature and the third local feature to obtain a final local feature.
5. A pyramid attention-based scene recognition apparatus, comprising:
the first image acquisition module is used for acquiring a color image and a depth image of a scene to be identified, and respectively extracting features to obtain a corresponding color feature map and a corresponding depth feature map;
the global feature acquisition module is used for respectively carrying out feature transformation on the color feature map and the depth feature map to obtain corresponding color global features and depth global features;
the local feature acquisition module is used for respectively carrying out pyramid layering on the color feature map and the depth feature map, and obtaining an attention map corresponding to each layer and attention output based on attention mechanism calculation; taking the attention output of the last layer as the final feature map of the last layer, and adding the result obtained after the up-sampling of the final feature map of the last layer and the attention output of the current layer by the rest layers to obtain the final feature map of the current layer; respectively carrying out scale transformation on the attention diagram and the final characteristic diagram corresponding to each layer in the layers to obtain two new attention diagrams; taking the average value of the two new attention diagrams as a final attention diagram, and mapping the k largest positions in the final attention diagram onto the final characteristic diagram of the layer to obtain the local characteristics of the layer;
and the fusion and recognition module is used for fusing the color global features, the depth global features and the local features of all layers to obtain multi-modal features of the scene to be recognized, and recognizing the scene based on the multi-modal features.
6. An electronic device, comprising: a memory and at least one processor;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of any one of claims 1-4.
7. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-4.
CN202111372903.9A 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device Active CN113822232B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111372903.9A CN113822232B (en) 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device
US17/835,361 US11514660B1 (en) 2021-11-19 2022-06-08 Scene recognition method, training method and device based on pyramid attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111372903.9A CN113822232B (en) 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device

Publications (2)

Publication Number Publication Date
CN113822232A CN113822232A (en) 2021-12-21
CN113822232B true CN113822232B (en) 2022-02-08

Family

ID=78919297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111372903.9A Active CN113822232B (en) 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device

Country Status (2)

Country Link
US (1) US11514660B1 (en)
CN (1) CN113822232B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494276A (en) * 2022-04-18 2022-05-13 成都理工大学 Two-stage multi-modal three-dimensional instance segmentation method
US11915474B2 (en) 2022-05-31 2024-02-27 International Business Machines Corporation Regional-to-local attention for vision transformers

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8165407B1 (en) * 2006-10-06 2012-04-24 Hrl Laboratories, Llc Visual attention and object recognition system
CN103679718A (en) * 2013-12-06 2014-03-26 河海大学 Fast scenario analysis method based on saliency
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN111062386B (en) * 2019-11-28 2023-12-29 大连交通大学 Natural scene text detection method based on depth pyramid attention and feature fusion
CN111680678B (en) * 2020-05-25 2022-09-16 腾讯科技(深圳)有限公司 Target area identification method, device, equipment and readable storage medium
CN112784779A (en) * 2021-01-28 2021-05-11 武汉大学 Remote sensing image scene classification method based on feature pyramid multilevel feature fusion
CN113408590B (en) * 2021-05-27 2022-07-15 华中科技大学 Scene recognition method, training method, device, electronic equipment and program product

Also Published As

Publication number Publication date
CN113822232A (en) 2021-12-21
US11514660B1 (en) 2022-11-29


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant