CN113822232B - Pyramid attention-based scene recognition method, training method and device - Google Patents

Pyramid attention-based scene recognition method, training method and device

Info

Publication number
CN113822232B
CN113822232B (application CN202111372903.9A)
Authority
CN
China
Prior art keywords
attention
layer
final
depth
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111372903.9A
Other languages
Chinese (zh)
Other versions
CN113822232A (en)
Inventor
杨铀
熊若非
刘琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202111372903.9A priority Critical patent/CN113822232B/en
Publication of CN113822232A publication Critical patent/CN113822232A/en
Application granted
Publication of CN113822232B publication Critical patent/CN113822232B/en
Priority to US17/835,361 priority patent/US11514660B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pyramid attention-based scene recognition method, training method and device, belonging to the field of computer vision. The method comprises the following steps: pyramid layering is performed on the color feature map and the depth feature map respectively, and the attention map and attention output corresponding to each layer are calculated based on an attention mechanism; the attention output of the last layer is taken as the final feature map of the last layer, and for each remaining layer the final feature map of the previous layer is up-sampled and added to the attention output of the current layer to obtain the final feature map of the current layer; for each layer, the attention map and the final feature map are scale-transformed into two new attention maps, whose average is taken as the final attention map, and the k largest positions in the final attention map are mapped onto the final feature map of the layer to obtain the local features of the layer; after the global features and the local features of each layer are fused, the accuracy of scene recognition can be improved.

Description

Pyramid attention-based scene recognition method, training method and device
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a scene recognition method, a training method and a device based on pyramid attention.
Background
Since indoor scenes usually contain many objects and have diversified spatial layouts, it is difficult to obtain a robust indoor scene representation. A depth map can provide information on spatial layout and geometric position, and RGB-D scene recognition has developed rapidly in recent years. However, global features alone are not sufficient to represent complex indoor scenes. Representing a scene with local, object-based features can avoid noisy information in some scenes, but using only local or only global features may result in poor recognition performance. At the same time, not all objects contribute to scene recognition, so the designed model must be able to adaptively select the features that are critical to scene recognition. In addition, the semantic gap between the two modalities cannot be ignored, and how to effectively realize multi-modal fusion still requires further research.
Chinese patent CN113408590A discloses a scene recognition method, a training method, a device, an electronic device and a program product based on a graph convolution network. On the basis of extracting the global features of the two modal images, the method firstly utilizes a space attention mechanism to extract important local features in a color image and a depth image, and utilizes a graph convolution network to aggregate and update the local features of the two modal images so as to reduce the semantic difference between the two modes and further improve the accuracy of scene recognition.
However, the method only considers local features of a single scale, and is not suitable for indoor scenes with various object types and diversified layouts.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides a scene recognition method based on pyramid attention, a training method, a training device, electronic equipment and a computer-readable storage medium, and aims to solve the technical problem that the local features of a single scale are not enough to express a complex indoor scene in the prior art.
In order to achieve the above object, in a first aspect, the present invention provides a method for scene recognition based on pyramid attention, including:
acquiring a color image and a depth image of a scene to be identified, and performing feature extraction respectively to obtain a corresponding color feature map and depth feature map; performing feature transformation on the color feature map and the depth feature map respectively to obtain corresponding color global features and depth global features; performing pyramid layering on the color feature map and the depth feature map respectively, and calculating the attention map and attention output corresponding to each layer based on an attention mechanism; taking the attention output of the last layer as the final feature map of the last layer, and, for each remaining layer, adding the up-sampled final feature map of the previous layer to the attention output of the current layer to obtain the final feature map of the current layer; performing scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; taking the average of the two new attention maps as the final attention map, and mapping the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer; and fusing the color global features, the depth global features and the local features of each layer to obtain multi-modal features of the scene to be recognized, and performing scene recognition based on the multi-modal features.
Further, the performing scale transformation on the attention map and the final feature map corresponding to each of the layers to obtain two new attention maps includes: summing the attention diagrams corresponding to each layer in the layers along the column direction and performing reshape operation to obtain a new attention diagram; and performing two-dimensional convolution operation on the final characteristic diagram corresponding to each layer in the layers to obtain another new attention diagram.
Further, the fusing the color global features, the depth global features and the local features of each layer to obtain the multi-modal features of the scene to be recognized includes: performing semantic-based feature fusion on the local features of each layer by using a GCN algorithm to obtain final local features; and fusing the color global feature, the depth global feature and the final local feature to obtain the multi-modal feature of the scene to be recognized.
Further, the performing semantic-based feature fusion on the local features of each layer by using the GCN algorithm to obtain a final local feature includes: respectively constructing a color image structure and a depth image structure based on the local features of each layer of the color feature image and the local features of each layer of the depth feature image, wherein the color image structure is used for representing the position incidence relation among object nodes in the color image, and the depth image structure is used for representing the position incidence relation among the object nodes in the depth image; according to the characteristics of the nodes in the color graph structure, connecting the nodes of each layer in the color graph structure by sparse connection, and obtaining a first local characteristic through the aggregation and updating operation of a GCN algorithm; according to the characteristics of the nodes in the depth map structure, connecting the nodes of each layer in the depth map structure by sparse connection, and obtaining a second local characteristic through aggregation and update operations of a GCN algorithm; according to the characteristics of the nodes in the color graph structure and the characteristics of the nodes in the depth graph structure, connecting the nodes of each layer in the color graph structure and the nodes of the corresponding layer in the depth graph structure by sparse connection, and obtaining a third local characteristic through the aggregation and update operations of a GCN algorithm; and performing cascade processing and feature transformation on the first local feature, the second local feature and the third local feature to obtain a final local feature.
In a second aspect, the present invention provides a training method for a scene recognition model, including:
acquiring a training data set, wherein the training data set comprises at least one group of color training images, depth training images and scene category labels of training scenes; and training a preset scene recognition model by using the training data set to obtain a trained scene recognition model, wherein the trained scene recognition model is used for processing the color training image and the depth training image according to the scene recognition method of any one of the first aspect.
Further, the training a preset scene recognition model by using the training data set includes: inputting the color training images and the depth training images into a preset scene recognition model, so that the preset scene recognition model respectively extracts the characteristics of the color training images and the depth training images, and obtaining color global training characteristics corresponding to the color training images, depth global training characteristics corresponding to the depth training images and local training characteristics of each layer; fusing the color global training features, the depth global training features and the local training features of all layers to obtain multi-mode training features of a training scene; and performing parameter adjustment processing on the preset scene recognition model based on the cross entropy loss function of the multi-modal training features and the scene category labels until the training is completed.
In a third aspect, the present invention provides a pyramid attention-based scene recognition apparatus, including:
the first image acquisition module is used for acquiring a color image and a depth image of a scene to be identified, and performing feature extraction respectively to obtain a corresponding color feature map and depth feature map; the global feature acquisition module is used for performing feature transformation on the color feature map and the depth feature map respectively to obtain corresponding color global features and depth global features; the local feature acquisition module is used for performing pyramid layering on the color feature map and the depth feature map respectively, and calculating the attention map and attention output corresponding to each layer based on an attention mechanism; taking the attention output of the last layer as the final feature map of the last layer, and, for each remaining layer, adding the up-sampled final feature map of the previous layer to the attention output of the current layer to obtain the final feature map of the current layer; performing scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; taking the average of the two new attention maps as the final attention map, and mapping the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer; and the fusion and recognition module is used for fusing the color global features, the depth global features and the local features of each layer to obtain multi-modal features of the scene to be recognized, and performing scene recognition based on the multi-modal features.
In a fourth aspect, the present invention provides a training apparatus for a scene recognition model, including:
the second image acquisition module is used for acquiring a training data set, wherein the training data set comprises at least one group of color training images, depth training images and scene category labels of a training scene; a training module, configured to train a preset scene recognition model with the training data set to obtain a trained scene recognition model, where the trained scene recognition model is used to process the color training image and the depth training image according to the scene recognition method of any one of the first aspect.
In a fifth aspect, the present invention provides an electronic device, comprising: a memory and at least one processor; the memory stores computer-executable instructions; the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method for scene recognition according to any one of the first aspect, or the method for training the scene recognition model according to any one of the second aspect.
In a sixth aspect, the present invention provides a computer-readable storage medium, having stored thereon computer-executable instructions, which, when executed by a processor, implement the scene recognition method according to any one of the first aspect, or implement the training method of the scene recognition model according to any one of the second aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
after feature extraction is performed on the color image and the depth image of the scene to be recognized, pyramid layering is performed on the color feature map and the depth feature map respectively, and the attention map and attention output corresponding to each layer are calculated based on an attention mechanism; the attention output of the last layer is taken as the final feature map of the last layer, and for each remaining layer the final feature map of the previous layer is up-sampled and added to the attention output of the current layer to obtain the final feature map of the current layer; the attention map and the final feature map of each layer are then scale-transformed into two new attention maps, whose average is taken as the final attention map, and the k largest positions in the final attention map are mapped onto the final feature map of the layer to obtain the local features of the layer; the local features of all layers are further fused to obtain the final local feature. Compared with existing methods for acquiring local features, the method can extract local features that capture long-range dependencies and can express complex indoor scenes, so the accuracy of scene recognition can be improved after the global features and the local features are fused.
Drawings
Fig. 1 is a flowchart illustrating a scene recognition method based on pyramid attention according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating a process of calculating a final feature map of each layer according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart illustrating a process of fusing local features of layers according to an embodiment of the present invention.
Fig. 4 is a schematic flowchart of a training method for a scene recognition model according to an embodiment of the present invention.
Fig. 5 is a block diagram of a scene recognition apparatus based on pyramid attention according to an embodiment of the present invention.
Fig. 6 is a block diagram of a training apparatus for a scene recognition model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example one
Referring to fig. 1, an embodiment of the present invention provides a scene identification method based on pyramid attention, including:
s101, acquiring a color image and a depth image of a scene to be identified, and respectively extracting features to obtain a corresponding color feature map and a corresponding depth feature map.
In this embodiment, the feature extraction algorithm may be a ResNet101 algorithm, a VGG algorithm, an AlexNet algorithm, or the like. By the feature extraction algorithm, the corresponding color feature map and depth feature map can be obtained.
And S102, respectively carrying out feature transformation on the color feature map and the depth feature map to obtain corresponding color global features and depth global features.
In this embodiment, the two fully-connected layers of the preset feature extraction network are used to perform feature transformation on the color feature map and the depth feature map respectively, so as to obtain corresponding color global features and depth global features.
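As an illustrative sketch only (not the disclosed implementation), steps S101 and S102 could be realized in PyTorch with a ResNet-101 backbone per modality followed by two fully-connected layers. The class name GlobalBranch, the feature dimensions, the global average pooling before the fully-connected layers and the replication of the single-channel depth map to three channels are assumptions made for this example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class GlobalBranch(nn.Module):
    """Extracts a feature map with a ResNet-101 backbone and transforms it
    into a global feature vector with two fully-connected layers."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet101(weights=None)  # pretrained weights optional
        # keep everything up to the last residual block (drop avgpool / fc)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.fc = nn.Sequential(
            nn.Linear(2048, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, x):
        fmap = self.encoder(x)             # (B, 2048, H/32, W/32) feature map
        pooled = fmap.mean(dim=(2, 3))     # global average pooling (assumption)
        return fmap, self.fc(pooled)       # feature map + global feature

# one branch per modality; the depth map is tiled to 3 channels (assumption)
rgb_branch, depth_branch = GlobalBranch(), GlobalBranch()
rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224).repeat(1, 3, 1, 1)
F_r, g_r = rgb_branch(rgb)
F_d, g_d = depth_branch(depth)
```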
S103, pyramid layering is carried out on the color feature map and the depth feature map respectively, and attention maps corresponding to the layers and attention outputs are obtained through calculation based on an attention mechanism; and taking the attention output of the last layer as the final feature map of the last layer, and adding the result obtained after the upsampling of the final feature map of the previous layer and the attention output of the current layer by the rest layers to obtain the final feature map of the current layer.
It should be noted that the number of pyramid layers may be selected according to actual experimental results: too few layers are not enough to express multi-level features, while too many layers bring a larger amount of computation. In this embodiment, the number of pyramid layers is 3. In the present invention, a Transformer is used as the attention mechanism for capturing non-local dependencies, but other attention mechanisms may be used instead.
Let F_r and F_d denote the last-layer feature maps of the RGB and depth modalities, respectively, each of size (B, C, H, W), where B is the batch size during training, C the number of channels, and H and W the height and width of the feature map. Taking the RGB map as an example, as shown in Fig. 2, feature maps at the smaller pyramid scales are derived from F_r, and two-dimensional convolutions are used to compute Q, K and V of the Transformer structure at each pyramid scale m, yielding the attention map

A_m = softmax(Q_m K_m^T)

where T denotes the transpose operation, and the softmax activation function is used to regularize the calculated attention map.

The final self-attention output X_m can then be calculated as

X_m = A_m V_m
since the low-resolution feature map usually contains more semantic information, and the high-resolution feature map has more spatial information, the two can be complementary. Therefore, fusing features of different scales is more helpful for the selection of the subsequent key features.
After the attention maps and attention outputs of all layers are obtained, the attention output of the last layer is taken as the final feature map of the last layer, and for each remaining layer the up-sampled final feature map of the previous layer is added to the attention output of the current layer to obtain the final feature map of the current layer. As shown in Fig. 2, the final feature map F_{m+1} of the previous (coarser) layer is up-sampled and added to the attention output X_m of the current layer to form the final feature map F_m of the layer:

F_m = Up(F_{m+1}) + X_m

where Up denotes an up-sampling operation.
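A minimal PyTorch sketch of this pyramid attention and top-down fusion step is given below, under the assumptions that the pyramid scales are built by average-pooling the last-layer feature map, that Q, K and V come from 1x1 two-dimensional convolutions, and that bilinear interpolation is used for up-sampling; the names PyramidSelfAttention and pyramid_features are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSelfAttention(nn.Module):
    """Self-attention at one pyramid scale: Q, K, V from 2-D convolutions,
    A = softmax(Q K^T), output X = A V reshaped back to a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.k(x).flatten(2)                   # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k, dim=-1)        # (B, HW, HW) attention map
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return attn, out

def pyramid_features(fmap, num_levels=3):
    """Build pyramid scales by average pooling, run attention at each scale,
    then fuse top-down: the coarser final map is up-sampled and added."""
    scales = [fmap if i == 0 else F.avg_pool2d(fmap, 2 ** i)
              for i in range(num_levels)]
    # in a real model the attention layers would be registered parameters,
    # possibly one per scale (assumption); a single shared layer is used here
    attn_layer = PyramidSelfAttention(fmap.shape[1])
    attns, outs = zip(*[attn_layer(s) for s in scales])
    finals = [None] * num_levels
    finals[-1] = outs[-1]                     # last (coarsest) layer: final = attention output
    for m in range(num_levels - 2, -1, -1):   # remaining layers, coarse to fine
        up = F.interpolate(finals[m + 1], size=outs[m].shape[-2:],
                           mode='bilinear', align_corners=False)
        finals[m] = outs[m] + up              # F_m = Up(F_{m+1}) + X_m
    return list(attns), finals

attn_maps, final_maps = pyramid_features(torch.randn(2, 64, 28, 28))
```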
S104, performing scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; taking the average of the two new attention maps as the final attention map, and mapping the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer.
In this embodiment, since the scene does not carry accurate label information for the key features, it is difficult to train the network model directly to find the key features. Even with an attention mechanism, it is difficult to obtain valid features in complex indoor scenes without the related constraints. In order to ensure the effectiveness of node selection, the attention map and the final feature map corresponding to each layer are scale-transformed into two new attention maps A'_m and A''_m:

A'_m = Re(Sum(A_m))

A''_m = Conv(F_m)

where Re and Sum denote the reshape operation and summation along the column direction respectively, Conv denotes a two-dimensional convolution operation, and m denotes the pyramid layer index. By forcing the two attention maps to be similar in spatial position during training, an effective representation of the key features can be obtained at each pyramid scale.

Finally, the average of the two new attention maps A'_m and A''_m is used as the final attention map, and the k largest positions in the final attention map are mapped onto the final feature map of the layer to obtain the local features of the layer.
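The node selection of step S104 could be sketched as follows, assuming the attention map has shape (B, HW, HW), that the column-direction sum is taken over the first spatial dimension, and that the two-dimensional convolution produces a single-channel map; the function name select_local_features and the 1x1 kernel are illustrative choices rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

def select_local_features(attn, final_map, conv1x1, k=16):
    """Node selection for one pyramid layer.

    attn:      (B, HW, HW) attention map from the Transformer-style attention
    final_map: (B, C, H, W) final feature map of the layer
    conv1x1:   2-D convolution producing a 1-channel spatial attention map
    Returns the k selected local features with shape (B, k, C).
    """
    b, c, h, w = final_map.shape
    # new attention map 1: sum along the column direction, then reshape to (B, H, W)
    a1 = attn.sum(dim=1).reshape(b, h, w)
    # new attention map 2: 2-D convolution on the final feature map
    a2 = conv1x1(final_map).squeeze(1)
    final_attn = (a1 + a2) / 2                      # average as the final attention map
    # indices of the k largest positions in the final attention map
    idx = final_attn.flatten(1).topk(k, dim=1).indices          # (B, k)
    feats = final_map.flatten(2).transpose(1, 2)                # (B, HW, C)
    local = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, c))
    return local                                                # (B, k, C)

# usage with assumed shapes (e.g. H = W = 7)
conv = nn.Conv2d(256, 1, kernel_size=1)
attn = torch.rand(2, 49, 49)
fmap = torch.randn(2, 256, 7, 7)
nodes = select_local_features(attn, fmap, conv, k=16)   # (2, 16, 256)
```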
And S105, fusing the color global features, the depth global features and the local features of all layers to obtain multi-modal features of the scene to be recognized, and recognizing the scene based on the multi-modal features.
Further, performing semantic-based feature fusion on the local features of each layer by using a GCN algorithm to obtain final local features; and then fusing the color global feature, the depth global feature and the final local feature to obtain the multi-modal feature of the scene to be recognized.
Furthermore, a color image structure and a depth image structure are respectively constructed based on the local features of each layer of the color feature image and the local features of each layer of the depth feature image, wherein the color image structure is used for representing the position association relationship among the object nodes in the color image, and the depth image structure is used for representing the position association relationship among the object nodes in the depth image; according to the characteristics of the nodes in the color graph structure, connecting the nodes of each layer in the color graph structure by sparse connection, and obtaining a first local characteristic through the aggregation and updating operation of a GCN algorithm; according to the characteristics of the nodes in the depth map structure, connecting the nodes of each layer in the depth map structure by sparse connection, and obtaining a second local characteristic through aggregation and update operations of a GCN algorithm; according to the characteristics of the nodes in the color graph structure and the characteristics of the nodes in the depth graph structure, connecting the nodes of each layer in the color graph structure and the nodes of the corresponding layer in the depth graph structure by sparse connection, and obtaining a third local characteristic through the aggregation and update operations of a GCN algorithm; and performing cascade processing and feature transformation on the first local feature, the second local feature and the third local feature to obtain a final local feature.
Illustratively, in order to effectively fuse the complementary information of the two modalities based on the selected features, a hierarchical graph model G = (V, E) is constructed to represent the indoor scene, where V denotes the selected local features described above and E denotes the connections between nodes. V can be divided into two categories: 2D color image nodes V_r and 3D depth map nodes V_d. E comprises three parts: connections within a single modality at a single scale, connections across modalities at a single scale, and connections within a single modality across scales.
Single-modal single-scale graph connections: First consider the construction of the single-modal single-scale graph model. Each node contributes differently to the scene recognition task, and this should be reflected in the graph modelling. In our graph model, the importance of each node is represented by its value in the attention map, where a larger value means a greater contribution to scene recognition. In addition, the nodes of the graph are represented as high-dimensional feature vectors along the channel direction, which helps to represent the key features in the scene. Specifically, the node selection of the previous step yields node features of shape (B, k, C) for the m-th scale, denoted V_m^r. Taking m = 1 as an example, we set k = 16, comprising 1 primary central node, 3 secondary central nodes and 12 leaf nodes. To build the intra-modal connections, the 3 secondary central nodes are connected to the primary central node, and the remaining leaf nodes are connected to the secondary central nodes according to Euclidean distance.
Multi-modal single-scale graph connections: Even in the same scene, the local features of the two modalities are different; in other words, there is a semantic gap between the two modalities. Thus, a sparse connection between the selected features of the two modalities is more appropriate than a full connection. When considering the connection of the RGB and depth modalities, only the corresponding primary central nodes and the corresponding secondary central nodes are connected, where v_{i,j}^r and v_{i,j}^d denote the j-th node at the i-th layer of the color graph and the depth graph, respectively.
Single-modal multi-scale graph connections: In order to exploit the multi-scale features, the relationships between different scales in the graph also need to be considered. Furthermore, given that node features can propagate over the entire graph within a few iterations, sparse connections are also used to construct the single-modal multi-scale graph. Taking m = 1 and m = 2 as an example, the nodes at scale 1 are connected to the corresponding primary central node and secondary central nodes at scale 2; the same applies to the depth images.
The multi-modal single-scale graph and the single-modal multi-scale graph are combined to obtain the final hierarchical graph. For each node v_{i,j}^r and v_{i,j}^d, its updated representation is learned by aggregating the features of its neighbors. Finally, the updated features are fused together to generate the final local representation for RGB-D scene recognition. Taking a pyramid of 3 levels as an example, the constructed hierarchical graph model is shown in Fig. 3.
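The aggregation and update over the hierarchical graph could be sketched with a standard graph-convolution layer; the symmetric normalization below follows the common Kipf & Welling formulation, which is an assumption here, as are the node count, feature dimension and the mean pooling of the updated nodes into a single local feature.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.shape[-1], device=adj.device)   # add self-loops
        deg = a_hat.sum(dim=-1)
        d_inv_sqrt = deg.clamp(min=1e-6).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(-1) * a_hat * d_inv_sqrt.unsqueeze(-2)
        return torch.relu(self.linear(norm_adj @ h))

# all RGB and depth nodes from every pyramid scale stacked into one node set,
# with `adj` holding the sparse intra-modal, cross-modal and cross-scale edges
num_nodes, dim = 96, 256          # e.g. 2 modalities x 3 scales x 16 nodes (assumption)
nodes = torch.randn(num_nodes, dim)
adj = torch.zeros(num_nodes, num_nodes)    # filled by the connection rules above
gcn = SimpleGCNLayer(dim, dim)
updated = gcn(nodes, adj)                  # aggregated and updated node features
local_feature = updated.mean(dim=0)        # pooled into one local representation (assumption)
```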
After the final local features are obtained, fusing the color global features, the depth global features and the final local features to obtain multi-modal features of the scene to be recognized; and carrying out scene recognition on the multi-modal characteristics of the scene to be recognized to obtain a recognition result of the scene to be recognized.
Example two
Referring to fig. 4, a schematic flow chart of a training method for a scene recognition model provided in an embodiment of the present invention includes:
s401, acquiring a training data set, wherein the training data set comprises at least one group of color training images, depth training images and scene category labels of a training scene;
s402, training a preset scene recognition model by using the training data set to obtain a trained scene recognition model, wherein the trained scene recognition model is used for processing the color training image and the depth training image according to the scene recognition method in the first embodiment.
In this embodiment, the training data set may be a SUN RGBD data set or a NYU Depth v2 data set. The training data set comprises a plurality of groups of training scenes, each group of training scenes comprises a plurality of training scenes, and each training scene comprises a color training image, a depth training image and a scene class label corresponding to the training scene.
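A minimal sketch of a dataset wrapper reflecting this triplet structure is given below; the actual file layout and loading of SUN RGB-D or NYU Depth v2 are not specified here, so the sketch simply wraps already-loaded tensors, and the class name RGBDSceneDataset is illustrative.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RGBDSceneDataset(Dataset):
    """Each sample is a (color image, depth image, scene label) triplet."""
    def __init__(self, rgb_images, depth_images, labels):
        assert len(rgb_images) == len(depth_images) == len(labels)
        self.rgb, self.depth, self.labels = rgb_images, depth_images, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return self.rgb[i], self.depth[i], self.labels[i]

# toy tensors standing in for a real SUN RGB-D / NYU Depth v2 loader
dataset = RGBDSceneDataset(torch.randn(8, 3, 224, 224),
                           torch.randn(8, 1, 224, 224),
                           torch.randint(0, 19, (8,)))
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```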
Further, training a preset scene recognition model by using the training data set, including: inputting the color training images and the depth training images into a preset scene recognition model, so that the preset scene recognition model respectively extracts the characteristics of the color training images and the depth training images, and obtaining color global training characteristics corresponding to the color training images, depth global training characteristics corresponding to the depth training images and local training characteristics of each layer; fusing the color global training features, the depth global training features and the local training features of all layers to obtain multi-mode training features of a training scene; and performing parameter adjustment processing on the preset scene recognition model based on the cross entropy loss function of the multi-modal training features and the scene category labels until the training is completed.
Specifically, not only are the features of the two modalities complementary; the global and local features are also complementary for scene recognition. As described above, F_r and F_d denote the final-layer feature maps of the RGB and depth modalities, respectively. The global features G_r and G_d are obtained by passing F_r and F_d through a fully connected layer, and two cross-entropy loss functions are used respectively to learn the global features. In addition, the local feature learned through the hierarchical graph model is denoted L_g. The local feature L_g and the global features G_r and G_d are then concatenated into the fused feature F used for final scene recognition:

F = Cat(G_r, G_d, L_g)

where Cat denotes the concatenation operation.
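The concatenation and classification step could look as follows; the feature dimensions and the 19 scene categories are example values, and the single linear classifier on the fused feature is an assumption.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates the two global features and the local feature from the
    hierarchical graph model, then predicts the scene category."""
    def __init__(self, global_dim=512, local_dim=256, num_classes=19):
        super().__init__()
        self.classifier = nn.Linear(2 * global_dim + local_dim, num_classes)

    def forward(self, g_r, g_d, local_feat):
        fused = torch.cat([g_r, g_d, local_feat], dim=1)   # F = Cat(G_r, G_d, L_g)
        return self.classifier(fused), fused

model = FusionClassifier()
g_r, g_d = torch.randn(2, 512), torch.randn(2, 512)
local_feat = torch.randn(2, 256)
logits, fused = model(g_r, g_d, local_feat)   # logits used for scene recognition
```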
Finally, the final scene recognition result is predicted with an additional cross-entropy loss function, and the total loss comprises three parts: 1) the global feature losses L_r and L_d; 2) the final classification loss L_cls; 3) the similarity loss L_sim. The total loss L_total is calculated as

L_total = L_r + L_d + L_cls + L_sim

where L_r, L_d and L_cls are cross-entropy losses computed in the same manner.
It should be noted that in the testing phase, only the fused feature F is used for the final scene recognition task.
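A minimal sketch of this training objective is shown below. The plain sum of the four terms follows the decomposition above, while measuring the similarity loss as a mean-squared error between the two per-scale attention maps, and the use of separate classifiers for the two global features, are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_fused, logits_rgb, logits_depth, labels, attn_pairs):
    """Total training objective: two global-feature losses, the final
    classification loss, and a similarity loss between the paired
    attention maps of every pyramid scale."""
    l_r = F.cross_entropy(logits_rgb, labels)      # global RGB feature loss
    l_d = F.cross_entropy(logits_depth, labels)    # global depth feature loss
    l_cls = F.cross_entropy(logits_fused, labels)  # final classification loss
    # similarity loss: encourage the two per-scale attention maps to agree (assumption: MSE)
    l_sim = sum(F.mse_loss(a1, a2) for a1, a2 in attn_pairs)
    return l_r + l_d + l_cls + l_sim

labels = torch.tensor([3, 7])
logits_fused, logits_rgb, logits_depth = (torch.randn(2, 19) for _ in range(3))
attn_pairs = [(torch.rand(2, 7, 7), torch.rand(2, 7, 7)) for _ in range(3)]
loss = total_loss(logits_fused, logits_rgb, logits_depth, labels, attn_pairs)
loss.backward()   # parameter adjustment during training
```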
EXAMPLE III
Referring to fig. 5, an embodiment of the present invention provides a pyramid attention-based scene recognition apparatus 500, where the apparatus 500 includes:
a first image obtaining module 510, configured to obtain a color image and a depth image of a scene to be identified, and perform feature extraction respectively to obtain a corresponding color feature map and a corresponding depth feature map;
a global feature obtaining module 520, configured to perform feature transformation on the color feature map and the depth feature map, respectively, to obtain corresponding color global features and depth global features;
a local feature obtaining module 530, configured to perform pyramid layering on the color feature map and the depth feature map respectively, and calculate the attention map and attention output corresponding to each layer based on an attention mechanism; take the attention output of the last layer as the final feature map of the last layer, and, for each remaining layer, add the up-sampled final feature map of the previous layer to the attention output of the current layer to obtain the final feature map of the current layer; perform scale transformation on the attention map and the final feature map corresponding to each layer to obtain two new attention maps; take the average of the two new attention maps as the final attention map, and map the k largest positions in the final attention map onto the final feature map of the layer to obtain the local features of the layer;
and the fusion and recognition module 540 is configured to fuse the color global features, the depth global features, and the local features of each layer to obtain multi-modal features of the scene to be recognized, and perform scene recognition based on the multi-modal features.
In this embodiment, please refer to the description in the first embodiment for the specific implementation of each module, which will not be repeated herein.
Example four
Referring to fig. 6, an embodiment of the present invention provides a training apparatus 600 for a scene recognition model, where the apparatus 600 includes:
a second image obtaining module 610, configured to obtain a training data set, where the training data set includes at least one set of color training images, depth training images, and scene category labels of a training scene;
a training module 620, configured to train a preset scene recognition model with the training data set to obtain a trained scene recognition model, where the trained scene recognition model is used to process the color training image and the depth training image according to the scene recognition method described in the first embodiment.
In this embodiment, please refer to the description in the second embodiment for the specific implementation of each module, which will not be repeated here.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A pyramid attention-based scene recognition method is characterized by comprising the following steps:
acquiring a color image and a depth image of a scene to be identified, and respectively extracting features to obtain a corresponding color feature map and a corresponding depth feature map;
respectively carrying out feature transformation on the color feature map and the depth feature map to obtain corresponding color global features and depth global features;
pyramid layering is carried out on the color feature map and the depth feature map respectively, and attention maps and attention outputs corresponding to the layers are obtained through calculation based on an attention mechanism; taking the attention output of the last layer as the final feature map of the last layer, and adding the result obtained after the up-sampling of the final feature map of the last layer and the attention output of the current layer by the rest layers to obtain the final feature map of the current layer;
respectively carrying out scale transformation on the attention diagram and the final characteristic diagram corresponding to each layer in the layers to obtain two new attention diagrams; taking the average value of the two new attention diagrams as a final attention diagram, and mapping the k largest positions in the final attention diagram onto the final characteristic diagram of the layer to obtain the local characteristics of the layer;
and fusing the color global features, the depth global features and the local features of all layers to obtain multi-modal features of the scene to be recognized, and recognizing the scene based on the multi-modal features.
2. The method of claim 1, wherein the scaling the attention map and the final feature map corresponding to each of the layers respectively to obtain two new attention maps comprises:
summing the attention diagrams corresponding to each layer in the layers along the column direction and performing reshape operation to obtain a new attention diagram;
and performing two-dimensional convolution operation on the final characteristic diagram corresponding to each layer in the layers to obtain another new attention diagram.
3. The method according to claim 1 or 2, wherein the fusing the color global features, the depth global features and the local features of each layer to obtain multi-modal features of the scene to be recognized comprises:
performing semantic-based feature fusion on the local features of each layer by using a GCN algorithm to obtain final local features;
and fusing the color global feature, the depth global feature and the final local feature to obtain the multi-modal feature of the scene to be recognized.
4. The method according to claim 3, wherein the performing semantic-based feature fusion on the local features of each layer by using the GCN algorithm to obtain the final local feature comprises:
respectively constructing a color image structure and a depth image structure based on the local features of each layer of the color feature image and the local features of each layer of the depth feature image, wherein the color image structure is used for representing the position incidence relation among object nodes in the color image, and the depth image structure is used for representing the position incidence relation among the object nodes in the depth image;
according to the characteristics of the nodes in the color graph structure, connecting the nodes of each layer in the color graph structure by sparse connection, and obtaining a first local characteristic through the aggregation and updating operation of a GCN algorithm;
according to the characteristics of the nodes in the depth map structure, connecting the nodes of each layer in the depth map structure by sparse connection, and obtaining a second local characteristic through aggregation and update operations of a GCN algorithm;
according to the characteristics of the nodes in the color graph structure and the characteristics of the nodes in the depth graph structure, connecting the nodes of each layer in the color graph structure and the nodes of the corresponding layer in the depth graph structure by sparse connection, and obtaining a third local characteristic through the aggregation and update operations of a GCN algorithm;
and performing cascade processing and feature transformation on the first local feature, the second local feature and the third local feature to obtain a final local feature.
5. A pyramid attention-based scene recognition apparatus, comprising:
the first image acquisition module is used for acquiring a color image and a depth image of a scene to be identified, and respectively extracting features to obtain a corresponding color feature map and a corresponding depth feature map;
the global feature acquisition module is used for respectively carrying out feature transformation on the color feature map and the depth feature map to obtain corresponding color global features and depth global features;
the local feature acquisition module is used for respectively carrying out pyramid layering on the color feature map and the depth feature map, and obtaining an attention map corresponding to each layer and attention output based on attention mechanism calculation; taking the attention output of the last layer as the final feature map of the last layer, and adding the result obtained after the up-sampling of the final feature map of the last layer and the attention output of the current layer by the rest layers to obtain the final feature map of the current layer; respectively carrying out scale transformation on the attention diagram and the final characteristic diagram corresponding to each layer in the layers to obtain two new attention diagrams; taking the average value of the two new attention diagrams as a final attention diagram, and mapping the k largest positions in the final attention diagram onto the final characteristic diagram of the layer to obtain the local characteristics of the layer;
and the fusion and recognition module is used for fusing the color global features, the depth global features and the local features of all layers to obtain multi-modal features of the scene to be recognized, and recognizing the scene based on the multi-modal features.
6. An electronic device, comprising: a memory and at least one processor;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of any one of claims 1-4.
7. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-4.
CN202111372903.9A 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device Active CN113822232B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111372903.9A CN113822232B (en) 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device
US17/835,361 US11514660B1 (en) 2021-11-19 2022-06-08 Scene recognition method, training method and device based on pyramid attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111372903.9A CN113822232B (en) 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device

Publications (2)

Publication Number Publication Date
CN113822232A CN113822232A (en) 2021-12-21
CN113822232B true CN113822232B (en) 2022-02-08

Family

ID=78919297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111372903.9A Active CN113822232B (en) 2021-11-19 2021-11-19 Pyramid attention-based scene recognition method, training method and device

Country Status (2)

Country Link
US (1) US11514660B1 (en)
CN (1) CN113822232B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494276A (en) * 2022-04-18 2022-05-13 成都理工大学 Two-stage multi-modal three-dimensional instance segmentation method
US11915474B2 (en) 2022-05-31 2024-02-27 International Business Machines Corporation Regional-to-local attention for vision transformers

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8165407B1 (en) * 2006-10-06 2012-04-24 Hrl Laboratories, Llc Visual attention and object recognition system
CN103679718A (en) * 2013-12-06 2014-03-26 河海大学 Fast scenario analysis method based on saliency
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN111062386B (en) * 2019-11-28 2023-12-29 大连交通大学 Natural scene text detection method based on depth pyramid attention and feature fusion
CN111680678B (en) * 2020-05-25 2022-09-16 腾讯科技(深圳)有限公司 Target area identification method, device, equipment and readable storage medium
CN112784779A (en) * 2021-01-28 2021-05-11 武汉大学 Remote sensing image scene classification method based on feature pyramid multilevel feature fusion
CN113408590B (en) * 2021-05-27 2022-07-15 华中科技大学 Scene recognition method, training method, device, electronic equipment and program product

Also Published As

Publication number Publication date
CN113822232A (en) 2021-12-21
US11514660B1 (en) 2022-11-29


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant