LU503090B1 - A semantic segmentation system and method based on dual feature fusion for IoT sensing


Info

Publication number
LU503090B1
Authority
LU
Luxembourg
Prior art keywords
features
fusion
attention vector
feature
unit
Prior art date
Application number
LU503090A
Other languages
German (de)
Inventor
Wenxuan Tu
Xinzhong Zhu
Huiying Xu
Jianmin Zhao
Original Assignee
Univ Zhejiang Normal
Application filed by Univ Zhejiang Normal filed Critical Univ Zhejiang Normal
Application granted granted Critical
Publication of LU503090B1 publication Critical patent/LU503090B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

This application discloses a semantic segmentation system and method based on dual feature fusion for IoT perception, the method comprising the steps: S1. Feature encoding of the original image to obtain features at different scales; S2. Learning of the features at different scales by two attention refinement blocks to obtain multi-level fusion features; S3. Dimensionality reduction of the multi-level fusion features to obtain dimensionality reduction features; S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales; S5. Global pooling of the downscaled features with a global mean pooling layer to obtain global features; S6. Fusion of the global features and the local features by channel stitching to obtain multi-scale contextual fusion features; S7. Fusion of the downscaled features and the multi-scale contextual fusion features by channel splicing to obtain spliced features; and S8. The output is obtained based on the spliced features. The semantic discrepancy between multi-layer features is alleviated, and the information representation is enriched to improve the recognition accuracy.

Description

A SEMANTIC SEGMENTATION SYSTEM AND METHOD BASED ON DUAL FEATURE FUSION FOR IOT SENSING
Technical Field
The present application belongs to the field of computer vision technology and specifically relates to a system and method for semantic segmentation based on dual feature fusion for IoT perception.
Background Art
Semantic segmentation, which aims to densely assign each pixel to its corresponding predefined class, has shown good performance in many IoT applications such as autonomous driving, diabetic retinopathy screening, and image analysis. Two important factors, the way features are fused and the complexity of the network, largely determine the performance of a semantic segmentation method.
Existing semantic segmentation methods can be broadly classified into two categories: accuracy-oriented and efficiency-oriented methods. Early works mostly focused on a single perspective: either the recognition accuracy of the algorithm or the speed of its execution. In the first class of approaches, the design of semantic segmentation models focuses on integrating diverse features through a complex framework to achieve high-accuracy segmentation. For example, pyramidal structures have been proposed, such as the atrous spatial pyramid pooling module (ASPP) and the contextual pyramid module (CPM), which encode multi-scale contextual information at the tail end of the ResNet101 backbone (2048 feature maps) to deal with multi-scale variations of targets. In addition, U-shaped networks directly fuse hierarchical features through long skip connections to extract spatial information at different levels as much as possible, thus achieving accurate pixel segmentation. On the other hand, typical asymmetric encoder-decoder structures have also been studied extensively. ENet and ESPNet substantially compress the network size through pruning operations so that large-scale images can be processed online at very high speed. To improve the overall performance of semantic segmentation methods, the recent literature shows a tendency to balance the efficiency and effectiveness of segmentation networks when encoding multi-level features and multi-scale contextual information. Specifically, ERFNet employs a large number of decomposable convolutions with different dilation rates in the decoder to reduce parameter redundancy and enlarge the receptive field at the same time. In addition, BiSeNet, CANet, and ICNet have been proposed, which process the input image with several lightweight sub-networks and then fuse the multi-layer features or deep context information together. Recently, CIFReNet encodes multi-layer and multi-scale information by introducing feature refinement and context integration modules to achieve accurate and efficient scene segmentation.
Although existing semantic segmentation methods achieve good segmentation performance in terms of either high accuracy or fast speed, they have at least the following problems: 1) they rely on considerable time and computational complexity to complete the feature extraction process during multi-level information fusion, which leads to inefficient model learning and high computational cost; 2) they fuse multi-source information directly by element-wise addition or concatenation, with little consideration of how to narrow the semantic gap between multi-layer features. As a result, the interaction between multiple information sources is hindered, resulting in suboptimal segmentation accuracy.
Summary of the Invention
To address the above-mentioned problems in the prior art, this application proposes a semantic segmentation system and method based on dual feature fusion for IoT sensing, which achieves a balanced overall performance in terms of accuracy, speed, storage, and computational complexity.
A semantic segmentation method based on dual feature fusion for IoT sensing, comprising the steps of:
S1. Input the original image, and use the backbone network to encode the features of the original image to obtain the features at different scales;
S2. Learning of features at different scales by two attention refinement blocks to obtain multi- level fusion features;
S3. Perform dimensionality reduction on the multilevel fusion features to obtain the dimensionality reduction features;
S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales, respectively;
S5. Global pooling of the downscaled features with the global mean pooling layer to obtain the global features;
S6. Fusing global features as well as local features of different scales for channel splicing to obtain multi-scale context fusion features;
S7. Perform channel splicing fusion of the downscaled features and the multi-scale context fusion features to obtain the spliced features;
S8. Downsampling and upsampling the spliced features to obtain the final output.
As a preferred embodiment, step S1 is specified as:
The original image is feature encoded using the backbone network to obtain the first, second, and third features, where the first feature scale is 1/4 of the original image scale, the second feature scale is 1/8 of the original image scale, and the third feature scale is 1/16 of the original image scale.
As a preferred solution, the following steps are included in step S2:
S2.1. Fusing the first features, second features by a first attention refinement block to output semantic features;
S2.2. The semantic features as well as the third features are fused by the second attention refinement block to obtain the multi-level fused features.
As a preferred embodiment, step S2.1 specifically includes the following steps:
S2.1.1. The first feature is mapped to the same scale as the second feature through the downsampling layer to obtain the first scale feature;
S2.1.2. Mapping the channel dimension of the first scale feature to coincide with the channel dimension of the second feature by a first 1*1 convolution layer to obtain the first channel feature;
S2.1.3. Fusion of first scale features with second features for channel stitching to obtain first fusion features;
S2.1.4. Inputting the first fused features into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively;
S2.1.5. Non-linear Mapping of the first attention vector and the second attention vector by the first multilayer perception layer to output the first mixed attention vector and the second mixed attention vector, and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector;
S2.1.6. Normalize the first fused mixed attention vector to obtain the first normalized mixed attention vector;
S2.1.7. Mapping a first channel feature with a first normalized mixed attention vector weighted by a first normalized attention vector;
S2.1.8. Fusing the second features as well as the weighted first channel features to output semantic features.
As a preferred embodiment, step S2.2 specifically comprises the following step:
S2.2.1. The third feature is mapped to the same scale as the second feature through the upsampling layer to obtain the second scale feature;
S2.2.2. Mapping the channel dimension of the second scale feature by a second 1*1 convolution layer to coincide with the second feature channel dimension to obtain the second channel feature;
S2.2.3. Fusing the second scale features with the semantic features for channel splicing to obtain the second fusion features;
S2.2.4. Inputting the second fusion features into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively;
S2.2.5. Non-linear Mapping of the third attention vector and the fourth attention vector by the second multilayer perception layer to output the third mixed attention vector and the fourth mixed attention vector, and fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector;
S2.2.6. Normalize the second fused mixed attention vector to obtain the second normalized mixed attention vector;
S2.2.7. Mapping the second channel features with a second normalized mixed attention vector weighted by the second normalized attention vector;
S2.2.8. Fusing the semantic features as well as the weighted second channel features to obtain multi-level fusion features.
As a preferred embodiment, the first fusion features described in step S2.1.4 are input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively, by using the following equation:
V1 = AAP1(C[F1, F2]),
V2 = AMP1(C[F1, F2]),
Where V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature, F2 is the second feature, C[] denotes channel stitching fusion, AAP1 () denotes the first adaptive mean pooling layer, and AMP1 () denotes the first adaptive maximum pooling layer.
As a preferred option, the second fusion features described in step S2.2.4 are input into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively, by using the following equation:
V3 = AAP2(C[L1, L2]),
V4 = AMP2(C[L1, L2]),
Where V3 is the third attention vector, V4 is the fourth attention vector, L1 is the second scale feature, L2 is the semantic feature, AAP2 () denotes the second adaptive mean pooling layer, and AMP2 () denotes the second adaptive maximum pooling layer.
As the preferred embodiment, the non-linear Mapping of the first attention vector, the second attention vector to output the first blended attention vector, the second blended attention vector by the first multilayer perception layer described in step S2.1.5, and the channel stitching and fusion of the first blended attention vector, the second blended attention vector to output the first fused blended attention vector, specifically using the following equation:
Va1 = MLP1(C[V1, V2]),
Non-linear Mapping of the third attention vector, the fourth attention vector by the second multilayer perception layer as described in step S2.2.5 to output the third blended attention vector, the fourth blended attention vector, and fusing the third blended attention vector, the fourth blended attention vector to output the second fused blended attention vector, specifically using the following equation:
Va2 = MLP2(C[V3, V4]).
Where, Va1 is the first fused mixed attention vector, Va2 is the second fused mixed attention vector, MLP1 () is the first multilayer perception layer, and MLP2 () is the second multilayer perception layer.
As the preferred scheme, step S2.1.6. normalizes the first fused mixed attention vector to obtain the first normalized mixed attention vector, S2.1.7. maps the first channel features with the first normalized mixed attention vector weighted, and S2.1.8. fuses the second features as well as the weighted first channel features to output the semantic features by using the following equation:
L2 = Sig1(Va1) ⊙ F1' ⊕ F2,
Step S2.2.6 normalizes the second fused mixed attention vector to obtain the second normalized mixed attention vector, S2.2.7 maps the second channel features with the second normalized mixed attention vector weighted by the second normalized mixed attention vector, and S2.2.8 fuses the semantic features and the weighted second channel features to obtain the multi-level fused features by using the following equation:
L̃ = Sig2(Va2) ⊙ L1' ⊕ L2,
Where L2 is the semantic feature, L̃ is the multi-level fusion feature, Sig1 () denotes the first activation function, Sig2 () denotes the second activation function, F1' is the first channel feature, L1' is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map, ⊙ denotes a pixel-level dot product operation, and ⊕ denotes a pixel-level dot add operation.
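The attention refinement (proofreading) logic described by the steps and equations above can be summarized in code. The following PyTorch sketch is illustrative only: the class name ARB, the use of bilinear interpolation for the rescaling step, and the hidden size of the multilayer perception layer are assumptions not specified in this application.

```python
# Illustrative sketch of one attention refinement block (ARB), assuming PyTorch.
# Class name, MLP hidden size, and the bilinear rescaling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARB(nn.Module):
    def __init__(self, in_ch, ref_ch, mlp_hidden=None):
        super().__init__()
        # 1*1 convolution mapping the rescaled feature to the reference channel dimension (S2.1.2)
        self.channel_map = nn.Conv2d(in_ch, ref_ch, kernel_size=1, bias=False)
        self.aap = nn.AdaptiveAvgPool2d(1)   # adaptive mean pooling (S2.1.4)
        self.amp = nn.AdaptiveMaxPool2d(1)   # adaptive maximum pooling (S2.1.4)
        cat_ch = in_ch + ref_ch
        mlp_hidden = mlp_hidden or cat_ch // 4           # assumed reduction ratio
        # multilayer perception layer applied to the concatenated attention vectors (S2.1.5)
        self.mlp = nn.Sequential(
            nn.Linear(2 * cat_ch, mlp_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(mlp_hidden, ref_ch),
        )

    def forward(self, f_in, f_ref):
        # rescale f_in to the spatial size of the reference feature (S2.1.1 / S2.2.1)
        f_scale = F.interpolate(f_in, size=f_ref.shape[2:], mode="bilinear", align_corners=False)
        f_chan = self.channel_map(f_scale)               # first channel feature F1'
        fused = torch.cat([f_scale, f_ref], dim=1)       # channel stitching C[F1, F2] (S2.1.3)
        v1 = self.aap(fused).flatten(1)                  # V1 = AAP(C[F1, F2])
        v2 = self.amp(fused).flatten(1)                  # V2 = AMP(C[F1, F2])
        va = self.mlp(torch.cat([v1, v2], dim=1))        # Va = MLP(C[V1, V2])
        attn = torch.sigmoid(va)[:, :, None, None]       # Sig(Va): normalized mixed attention (S2.1.6)
        return attn * f_chan + f_ref                     # Sig(Va) ⊙ F1' ⊕ F2 (S2.1.7-S2.1.8)
```

In this reading, the first ARB would take the 1/4-scale and 1/8-scale backbone features as f_in and f_ref, and the second ARB would take the 1/16-scale feature together with the semantic features output by the first block.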
Accordingly, a semantic segmentation system based on dual feature fusion for IoT sensing is also provided, including a connected multilayer feature fusion module and a lightweight semantic pyramid module;
The multi-layer feature fusion module includes a backbone network unit, a proofreading unit;
The lightweight semantic pyramid module includes the first downscaling unit, the second downscaling unit, the third downscaling unit, the context encoding unit, the global pooling unit, the first channel splicing fusion unit, the second channel splicing fusion unit, and the upsampling unit;
Among them, the backbone network unit is connected to the proofreading unit, the proofreading unit is connected to the first downscaling unit and the second downscaling unit, the first downscaling unit is connected to the context encoding unit and the global pooling unit, the context encoding unit and the global pooling unit are connected to the first channel splicing fusion unit, the second downscaling unit and the first channel splicing fusion unit are connected to the second channel splicing fusion unit, and the second channel splicing fusion unit is also connected to the third downscaling unit, and the upsampling unit is connected to the third downscaling unit;
The mentioned backbone network unit for feature encoding of the original image using the backbone network to obtain features at different scales;
The mentioned proofreading unit for learning features at different scales by two attention refinement blocks to obtain multi-level fusion features;
Both the first dimensionality reduction unit and the second dimensionality reduction unit are used to reduce the dimensionality of the multi-level fusion features to output the first dimensionality reduction feature and the second dimensionality reduction feature, respectively; the first and the second dimensionality reduction features are the same;
Context encoding unit for context encoding the first descending features by deep decomposable convolution at different convolutional scales to obtain local features at different scales, respectively;
A global pooling unit for globally pooling the first reduced-dimensional features by a global mean pooling layer to obtain global features;
A first channel stitching and fusion unit for channel stitching and fusion of global features as well as local features at different scales to obtain multi-scale contextual fusion features;
A second channel stitching fusion unit for channel stitching fusion of a second reduced dimensional feature, a multi-scale contextual fusion feature to obtain a stitching feature;
Third dimensionality reduction unit for dimensionality reduction of the spliced features;
Up-sampling unit for up-sampling the downsampled spliced features to obtain the final output.
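For orientation, the connections between the units listed above can be illustrated with a short sketch. Every sub-module passed to the constructor below (backbone, ARBs, reduction layers, context branches, global pooling, classifier) is a placeholder standing in for the units named in the description; the class name DFFNet and the assumption that the global pooling branch returns a map of the same spatial size as the reduced features are not taken from this application.

```python
# Illustrative wiring of the units listed above, assuming PyTorch. All sub-modules
# are caller-supplied placeholders; only the connectivity follows the description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFFNet(nn.Module):
    def __init__(self, backbone, arb1, arb2, reduce1, reduce2, reduce3,
                 context_branches, global_pool, classifier):
        super().__init__()
        self.backbone = backbone                          # backbone network unit
        self.arb1, self.arb2 = arb1, arb2                 # proofreading unit (two ARBs)
        self.reduce1, self.reduce2 = reduce1, reduce2     # first / second downscaling units
        self.context_branches = nn.ModuleList(context_branches)  # context encoding unit
        self.global_pool = global_pool                    # global pooling unit (assumed to return a
                                                          # map broadcast to the feature-map size)
        self.reduce3 = reduce3                            # third downscaling unit
        self.classifier = classifier                      # e.g. a 1*1 convolution to class maps

    def forward(self, x):
        f14, f18, f116 = self.backbone(x)                 # features at 1/4, 1/8 and 1/16 scale
        sem = self.arb1(f14, f18)                         # first ARB -> semantic features
        multi = self.arb2(f116, sem)                      # second ARB -> multi-level fusion features
        d1, d2 = self.reduce1(multi), self.reduce2(multi) # two identical reduced features
        local_feats = [b(d1) for b in self.context_branches]  # local features at several scales
        g = self.global_pool(d1)                          # global features
        ctx = torch.cat(local_feats + [g], dim=1)         # first channel splicing fusion unit
        spliced = torch.cat([d2, ctx], dim=1)             # second channel splicing fusion unit
        out = self.classifier(self.reduce3(spliced))      # third downscaling unit + prediction
        return F.interpolate(out, size=x.shape[2:],
                             mode="bilinear", align_corners=False)  # upsampling unit
```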
The beneficial effects of this application are: (1) A multi-level feature fusion module (MFFM) is proposed, which uses two recursive attention refinement blocks (ARBs) to improve the effectiveness of multi-level feature fusion.
While the computational cost of the segmentation network is kept under control, the proposed ARB mitigates the semantic discrepancy among multi-level features by using the abstract semantic information of higher-order features to calibrate the spatial detail information in lower-order features. (2) A lightweight semantic pyramid module (LSPM) is proposed, which decomposes the convolutional operator, thus reducing the computational overhead of encoding contextual information.
In addition, this module fuses the multi-level fusion features with multi-scale contextual information to enrich the representation of information, thus improving the recognition accuracy.
Brief Description of the Drawings
In order to illustrate the technical solutions more clearly in the embodiments or prior art of the present application, the following is a brief description of the accompanying drawings that need to be used in the description of the embodiments or prior art. It is obvious that the accompanying drawings in the following description are only some of the embodiments of the present application, and other accompanying drawings can be obtained based on them without any creative work for a person of ordinary skill in the art.
Figure 1 is a flowchart of a semantic segmentation method based on dual feature fusion for IoT sensing as described in the present application;
Figure 2 is a schematic diagram of the structure of a semantic segmentation system based on dual feature fusion for IoT sensing as described in the present application;
Figure 3 is a schematic diagram of the structure of the attention refinement block described in this application.
Detailed Description of Embodiments
The following illustrates the embodiment of the steps of the application by specific concrete examples, and other advantages and efficacy of the present application can be readily understood by those skilled in the art as disclosed in this specification. The present application may also be implemented or applied by additionally different specific embodiments, and the details in this specification may also be modified or changed in various ways without departing from the spirit of the present application based on different views and applications. It is to be noted that the following embodiments and the features in the embodiments can be combined with each other without conflict.
Embodiment I:
Referring to Figs. 1, 2 and 3, this embodiment provides a semantic segmentation method based on dual feature fusion for IoT sensing, comprising the steps:
S1. Input the original image, and use the backbone network to encode the features of the original image to obtain the features at different scales;
S2. Learning of features at different scales by two attention refinement blocks to obtain multi- level fusion features;
S3. Perform dimensionality reduction on the multilevel fusion features to obtain the dimensionality reduction features;
S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales, respectively;
S5. Global pooling of the downscaled features with the global mean pooling layer to obtain the global features;
S6. Fusing global features as well as local features of different scales for channel splicing to obtain multi-scale context fusion features;
S7. Perform channel splicing fusion of the downscaled features, multi-scale context fusion features to obtain the splicing features;
S8. Downsampling and upsampling the spliced features to obtain the final output.
Where, step S1 is specified as:
The original image is feature encoded using the backbone network to obtain the first, second, and third features, where the first feature scale is 1/4 of the original image scale, the second feature scale is 1/8 of the original image scale, and the third feature scale is 1/16 of the original image scale.
Each layer of the backbone network has different feature expression capabilities. Shallower layers contain more spatial details but lack semantic information, while deeper layers retain rich semantic information but lose a lot of spatial detail. Intuitively, fusing multiple layers of information together has a beneficial effect on learning differentiated and comprehensive feature representations.
Based on the above observations, we first obtain features at different scales from the backbone network, denoted I1/4, I1/8 and I1/16, and then unify the scale of all feature maps to 1/8 size to reduce information loss and resource utilization. Specifically, a stride-2 average pooling layer is used to downsample I1/4 to obtain I'1/8, and a bilinear layer is used to upsample the higher-order feature map I1/16 to obtain I''1/8. Finally, the three are fused to obtain the multilevel fusion feature O.
The above process is expressed as follows:
I'1/8 = T(GAPk=2,s=2(I1/4)),
I''1/8 = Upsample(T(I1/16)),
O = T(I'1/8 ⊕ I1/8 ⊕ I''1/8),
Where GAPk=2,s=2 () denotes a mean pooling layer with a kernel size of 2 and a stride of 2, T() is defined as a channel transformation operation to change the number of feature maps, Upsample () denotes an upsampling layer, and ⊕ denotes a pixel-level point-add operation.
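For concreteness, the scale-unification step above can be written as a short helper. The sketch below assumes PyTorch; modelling the channel transformation T() as externally supplied 1*1 convolutions (t14, t116, t_out) is an assumption, since the application only states that T() changes the number of feature maps.

```python
# Minimal sketch of the scale unification described above (assumed PyTorch).
# t14, t116 and t_out stand in for the channel transformation T().
import torch.nn.functional as F

def unify_to_one_eighth(i14, i18, i116, t14, t116, t_out):
    """i14, i18, i116: backbone features at 1/4, 1/8 and 1/16 of the input scale."""
    i18_low = t14(F.avg_pool2d(i14, kernel_size=2, stride=2))       # I'1/8 = T(GAPk=2,s=2(I1/4))
    i18_high = F.interpolate(t116(i116), size=i18.shape[2:],
                             mode="bilinear", align_corners=False)  # I''1/8 = Upsample(T(I1/16))
    return t_out(i18_low + i18 + i18_high)                          # O = T(I'1/8 + I1/8 + I''1/8)
```

Here t14 and t116 are assumed to map their inputs to the channel width of I1/8 so that the element-wise addition is valid.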
Although the above feature fusion operations facilitate the mutual use of complementary information between multi-level features, direct integration of low-level features with high-level features may not be efficient and comprehensive due to the semantic variability in multi-level stages.
To solve this problem, this application designs a feature refinement strategy, called the attention proofreading block (ARB). Both blocks focus on inter-channel relationship modeling with the multi-level fusion features. With this approach, the model can emphasize the weights of neurons that are highly relevant to the target object when the current channel location contains features with valuable information.
That is, step S2 includes the following steps:
S2.1. Fusing the first features, and second features by a first attention refinement block to output semantic features;
S2.2. The semantic features as well as the third features are fused by the second attention refinement block to obtain the multi-level fused features.
Further, step S2.1 specifically comprises the step of:
S2.1.1. The first feature is mapped to the same scale as the second feature through the downsampling layer to obtain the first scale feature;
S2.1.2. Mapping the channel dimension of the first scale feature to coincide with the channel dimension of the second feature by a first 1*1 convolution layer to obtain the first channel feature;
S2.1.3. Fusion of first scale features with second features for channel stitching to obtain first fusion features:
S2.1.4. The first fused features are fed into the first adaptive average pooling layer (AAP) and the first adaptive maximum pooling layer (AMP) to output the first attention vector and the second attention vector, respectively; the adaptive average pooling layer (AAP) and the adaptive maximum pooling layer (AMP) both model the importance of each feature channel by assigning weights to all channels of the multilevel fused features. The higher the importance of the current feature channel, the higher the weight assigned to that channel.
S2.1.5. The first attention vector, the second attention vector are nonlinearly mapped by the first multilayer perception layer to output the first blended attention vector, the second blended attention vector for enhancing the nonlinearity and robustness of the features, and the first blended attention vector, the second blended attention vector are fused to output the first fused blended attention vector;
S2.1.6. Normalize the first fused mixed attention vector to obtain the first normalized mixed attention vector;
S2.1.7. Mapping a first channel feature with a first normalized mixed attention vector weighted by a first normalized attention vector;
S2.1.8. Fusing the second features as well as the weighted first channel features to output
semantic features.
Further, step S2.2 specifically comprises the following steps:
S2.2.1. The third feature is mapped to the same scale as the second feature through the upsampling layer to obtain the second scale feature;
S2.2.2. Mapping the channel dimension of the second scale feature by a second 1*1 convolution layer to coincide with the second feature channel dimension to obtain the second channel feature;
S2.2.3. Fusing the second scale features with the semantic features for channel splicing to obtain the second fusion features;
S2.2.4. Inputting the second fusion features into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively;
S2.2.5. Non-linear Mapping of the third attention vector and the fourth attention vector by the second multilayer perception layer to output the third mixed attention vector and the fourth mixed attention vector, and fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector;
S2.2.6, Normalizing the second fused mixed attention vector to obtain the second normalized mixed attention vector;
S2.2.7. Mapping the second channel features with a second normalized mixed attention vector weighted by the second normalized attention vector;
S2.2.8. Fusing the semantic features as well as the weighted second channel features to obtain multi-level fusion features.
Further, the first fusion features described in step S2.1.4 are input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively, by using the following equation:
V1 = AAP1(C[F1, F2]),
V2 = AMP1(C[F1, F2]),
Where V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature,
F2 is the second feature, C[] denotes channel stitching fusion, AAP1 () denotes the first adaptive mean pooling layer, and AMP1 () denotes the first adaptive maximum pooling layer.
The second fused features described in step S2.2.4 are input into the second adaptive mean
pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively, using the following equation:
V3 = AAP2(C[L1, L2]),
V4 = AMP2(C[L1, L2]),
Wherein, V3 is a third attention vector, V4 is a fourth attention vector, L1 is a second scale feature, L2 is a semantic feature, AAP2 () denotes a second adaptive mean pooling layer, and AMP2 () denotes a second adaptive maximum pooling layer.
Further, the non-linear Mapping of the first attention vector and the second attention vector to output the first mixed attention vector and the second mixed attention vector by the first multilayer sensing layer as described in step S2.1.5, and the channel stitching and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector, by using the following equation:
Va1 = MLP1(C[V1, V2]),
Non-linear Mapping of the third attention vector, the fourth attention vector by the second multilayer perception layer as described in step S2.2.5 to output the third blended attention vector, the fourth blended attention vector, and fusing the third blended attention vector, the fourth blended attention vector to output the second fused blended attention vector, specifically using the following equation:
Va2 = MLP2(C[V3, V4]).
Where, Va1 is the first fused mixed attention vector, Va2 is the second fused mixed attention vector, MLP1 () is the first multilayer perception layer, and MLP2 () is the second multilayer perception layer.
Further, step: S2.1.6. normalizes the first fused mixed attention vector to obtain the first normalized mixed attention vector, S2.1.7. maps the first channel features with the first normalized mixed attention vector weighted by the first normalized mixed attention vector, and S2.1.8. fuses the second features as well as the weighted first channel features to output the semantic features using the following equation:
L2 = Sig1(Va1) ⊙ F1' ⊕ F2,
Step: S2.2.6, normalizing the second fused mixed attention vector to obtain the second normalized mixed attention vector, S2.2.7, Mapping the second channel features with the second
normalized mixed attention vector weighted by the second normalized mixed attention vector, S2.2.8, fusing the semantic features and the weighted second channel features to obtain the multi-level fused features by using the following equation:
L̃ = Sig2(Va2) ⊙ L1' ⊕ L2,
Where, L2 is the semantic feature, L̃ is the multi-level fusion feature, Sig1 () denotes the first activation function, Sig2 () denotes the second activation function, F1' is the first channel feature, L1' is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map, ⊙ denotes a pixel-level dot product operation, and ⊕ denotes a pixel-level dot add operation.
Technically speaking, the design of the ARB can be regarded as an information proofreading strategy in which two attention-based paths predict the importance of each channel in a complementary manner, thus transferring more semantic information to the lower-level features to alleviate the semantic variability between different levels of features and achieve effective feature fusion. The experimental results in the following sections verify the effectiveness of this setup. It is worth noting that the ARB has only 0.03M parameters in total, and the entire multi-level feature fusion maintains a lightweight computational scale.
Further, to enhance the computational efficiency of the context extraction module, this application proposes the Deep Decomposable Convolution (DFC) operation to replace the standard convolution layer. Inspired by deep separable convolution and decomposed convolution, a major idea of lightweight feature extraction is to integrate the ideas of the above two techniques. First, the regularization layer and activation function are used as two preprocessing steps to enhance the regularity of the convolutional layer; Second, the 3x3 depth convolution is decomposed to obtain two sets of one-dimensional depth-separable convolutional layers with scales of 3x1 and 1x3, respectively.
With the above approach, the dense convolution kernels on all channels will be consistently sparse, thus reducing the computational complexity and resource overhead of convolution. Finally, the local features of all scales are fused with the global features to obtain multi-scale contextual fusion features.
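The depth-decomposable convolution described above can be sketched as follows in PyTorch. The dilation argument and the trailing 1*1 pointwise convolution are assumptions, added so that the block can mix channels and serve as a context-encoding branch; the application itself only specifies the regularization and activation preprocessing and the 3x1 / 1x3 depthwise decomposition.

```python
# Minimal sketch of a depth-decomposable convolution (DFC) block, assuming PyTorch.
# The dilation handling and the final pointwise convolution are assumptions.
import torch.nn as nn

class DepthDecomposableConv(nn.Module):
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.block = nn.Sequential(
            # regularization + activation used as preprocessing, per the description
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # 3x3 depthwise convolution decomposed into 3x1 and 1x3 depthwise convolutions
            nn.Conv2d(channels, channels, kernel_size=(3, 1),
                      padding=(dilation, 0), dilation=(dilation, 1),
                      groups=channels, bias=False),
            nn.Conv2d(channels, channels, kernel_size=(1, 3),
                      padding=(0, dilation), dilation=(1, dilation),
                      groups=channels, bias=False),
            # pointwise convolution to mix channels (assumed, as in depthwise-separable designs)
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return self.block(x)
```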
After encoding the multi-scale contexts, the final segmentation results are further predicted using the reduced-dimensional multi-level fusion features combined with global features as well as local features at different scales. The above design has two advantages: On the one hand, multi-level
information and multi-scale contextual information are integrated in a unified system for more efficient feature representation; on the other hand, the use of skip connections can encourage information transfer and gradient conduction of multi-level information at the front layer, thus improving the efficiency of recognition.
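Putting these pieces together, a lightweight semantic pyramid module along the lines just described might look as follows. The branch count, dilation rates and channel widths are assumptions, and DepthDecomposableConv refers to the sketch given earlier; only the overall structure (dimensionality reduction, parallel local branches, a global pooling branch, channel splicing, and a skip connection to the reduced features) follows the text.

```python
# Illustrative composition of the lightweight semantic pyramid module (LSPM),
# assuming PyTorch and the DepthDecomposableConv sketch above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSPM(nn.Module):
    def __init__(self, in_ch, mid_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)  # dimensionality reduction
        self.branches = nn.ModuleList(
            [DepthDecomposableConv(mid_ch, dilation=d) for d in dilations]  # local context branches
        )
        self.global_pool = nn.AdaptiveAvgPool2d(1)                          # global mean pooling branch
        fused_ch = mid_ch * (len(dilations) + 2)                            # local + global + skip
        self.project = nn.Conv2d(fused_ch, mid_ch, kernel_size=1, bias=False)

    def forward(self, x):
        d = self.reduce(x)
        local_feats = [b(d) for b in self.branches]
        g = F.interpolate(self.global_pool(d), size=d.shape[2:], mode="nearest")  # broadcast global feature
        ctx = torch.cat(local_feats + [g], dim=1)        # multi-scale contextual fusion features
        spliced = torch.cat([d, ctx], dim=1)             # skip connection with the reduced features
        return self.project(spliced)                     # further dimensionality reduction
```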
The key technical points of this application are: (1) A novel IoT-oriented dual-feature fusion real-time semantic segmentation network (DFFNet) is invented. Compared with state-of-the-art methods, DFFNet reduces FLOPs by about 2.5 times and improves model execution speed by 1.8 times, while obtaining better accuracy. (2) A multi-level feature fusion module (MFFM) is proposed that employs two recursive attentional proofreading blocks (ARBs) to efficiently improve the effectiveness of multi-level feature fusion. While the computational cost of the segmentation network is kept under control, the proposed ARB mitigates the semantic discrepancy among multi-level features by using the abstract semantic information of higher-order features to calibrate the spatial detail information in lower-order features. (3) A lightweight semantic pyramid module (LSPM) is proposed, which decomposes the convolutional operator, thus reducing the computational overhead of encoding contextual information.
In addition, this module fuses the multi-level fusion features with multi-scale contextual information to enrich the representation of information, thus enhancing the recognition accuracy.
Further, this embodiment is also compared with existing methods on multiple datasets to validate the effectiveness of the present application.
Dataset: The dataset used in this application is the recognized standard scene perception dataset
Cityscapes, consisting of 25,000 annotated images with a resolution of 2048x1024. The annotation set contains 30 classes, 19 of which are used for training and evaluation. In the experiments of this application, only the 5000 images with fine annotations were used: 2975 images for training, 500 images for validation, and 1525 images for testing.
Parameter settings: All experiments were performed on an NVIDIA 1080Ti GPU card. We randomly scale the images by a factor of 0.5 to 1.5 and randomly apply a left-right flip to all training images. In addition, the initial learning rate is set to 0.005 and the learning rate is decayed using the poly strategy. The network is trained by minimizing the pixel-wise cross-entropy loss with a stochastic gradient descent optimizer, where the momentum is 0.9 and the weight decay is 5e-4. Finally, batch normalization layers are applied before all regular or dilated convolutional layers to achieve fast convergence.
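The training recipe above maps onto a standard setup. The sketch below assumes PyTorch; the poly power (0.9), the ignore label (255) and the iteration count are assumptions, while the initial learning rate, momentum, weight decay and loss follow the text.

```python
# Sketch of the training setup described above (assumed PyTorch).
# Poly power 0.9 and ignore_index=255 are assumptions, not taken from the text.
import torch
import torch.nn as nn

def make_optimizer_and_scheduler(model, max_iters, base_lr=0.005, power=0.9):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)
    # "poly" decay: lr = base_lr * (1 - iter / max_iters) ** power
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: max(0.0, 1.0 - it / max_iters) ** power)
    return optimizer, scheduler

def train_step(model, images, labels, optimizer, scheduler,
               criterion=nn.CrossEntropyLoss(ignore_index=255)):
    # pixel-wise cross-entropy between the upsampled logits and the label map
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```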
Evaluation Metrics: This application uses four evaluation metrics recognized in the field of semantic segmentation: segmentation accuracy, inference speed, network parameters, and computational complexity.
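The segmentation-accuracy metric reported in the following tables is the mean intersection over union (mIoU). A minimal sketch of its computation is given below; the 19-class setting follows the Cityscapes protocol described above, and the ignore label of 255 is an assumption.

```python
# Minimal sketch of the mIoU accuracy metric, using NumPy.
import numpy as np

def mean_iou(pred, label, num_classes=19, ignore=255):
    """pred, label: integer class maps of identical shape."""
    mask = label != ignore
    # confusion matrix built from the valid pixels
    hist = np.bincount(num_classes * label[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(0) + hist.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    # average only over classes that actually appear
    return np.nanmean(np.where(union > 0, iou, np.nan))
```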
Multi-level feature fusion module ablation experiment:
As shown in Table 1, this application compares four multi-level feature fusion models with the benchmark model: element-wise additive fusion (EAF), average pooling attention refinement (AAR), maximized attention refinement (MAR), and the combined use of AAR and MAR. As described in the table, the performance of EAF is only 1.12% higher than the benchmark network, which indicates that direct fusion of multi-level features is a suboptimal solution. Compared with the benchmark network, AAR and MAR achieve 2.61% mIoU and 2.54% mIoU performance improvements, which indicates that modeling the interdependence between channels can reduce the semantic variability between multi-level features. The bilateral pooling attention strategy proposed in this application allows the saliency information and the global information to compensate each other. Therefore, MFFM achieves a further improvement of 0.55% mIoU and 0.62% mIoU compared to AAR and MAR.
Moreover, the additional computation added by the proposed MFFM is negligible (only 0.06M parameters and 0.11GFLOPs), and the above results validate the efficiency and effectiveness of the proposed module.
Table 1 Ablation study of the multi-level feature fusion module. Columns: Model, Speed (ms), Number of parameters (M), FLOPs (G), mIoU (%).
Experiments on ablation of lightweight semantic pyramid modules:
The experiment evaluates the performance of a lightweight semantic pyramid module. SC-SPM,
FC-SPM, DC-SPM and DFC-SPM denote methods with four semantic pyramid modules built on regular convolution, decomposed convolution, deep convolution and deep decomposable convolution,
respectively. As shown in Table 2, 1) the semantic segmentation methods with a semantic pyramid module improve the mIoU segmentation accuracy by about 1.11%-2.70% compared with the benchmark model EAF, indicating that extracting local and global contextual information can significantly improve the learning ability of the model. 2) Although SC-SPM, FC-SPM, DC-SPM and DFC-SPM achieve similar accuracy, building the semantic pyramid module on efficient convolutions achieves better efficiency (faster and less computational complexity) than building it on conventional convolutions. DFC-SPM obtains 71.02% mIoU with only 0.05M additional parameters and 0.20G FLOPs. 3) LSPM integrates contextual information and multi-level feature information by designing a short-range feature learning operation, which is used to encourage information transfer and gradient conduction of multi-level information in the front layer.
As a result, the accuracy of the DFC-SPM method is improved from 71.02% mIoU to 71.65% mIoU. The above results demonstrate the efficiency and effectiveness of the proposed LSPM.
Table 2 Ablation study of the lightweight semantic pyramid module.
Evaluation on a benchmark dataset:
The DFFNet is compared with other existing semantic segmentation methods on the Cityscapes dataset. The "-" indicates that the method does not publish the corresponding performance values.
Table 3 Comprehensive performance of the method of this application and the comparison methods on the Cityscapes dataset. Columns: Model, Resolution, Speed (FPS), FLOPs (G), Number of parameters (M), mIoU (%).
As shown in Table 3, SegNet and ENet improve speed by significantly compressing the model size at the expense of segmentation accuracy. LW-RefineNet and ERFNet design an asymmetric encoder-decoder structure to maintain a balance between accuracy and efficiency. BiSeNet, CANet and ICNet use a multi-branch structure to achieve a good balance between accuracy and speed, but introduce more additional learning parameters. In contrast, DFFNet achieves better accuracy and efficiency, especially reflected in the reduction of network parameters (1.9M parameters) and computational complexity (3.1 GFLOPs). In addition, FCN and Dilation10 use computationally expensive VGG backbone networks (e.g., VGG16 and VGG19) as feature extractors, which require more than 2 seconds to process an image. DRN, DeepLabV2, RefineNet, and PSPNet employ deep ResNet backbone networks (e.g., ResNet50 and ResNet101) for enhanced multi-scale feature representation, which requires significant computational cost and memory usage. Compared with these accuracy-oriented methods, the method of the present application processes an image with 640x360 resolution in only 12ms while achieving 71.0% mIoU segmentation accuracy.
In summary, the method of this application achieves comprehensive segmentation performance in terms of accuracy and efficiency (inference speed, network parameters, and computational complexity), which gives it greater potential for deployment on resource-limited IoT devices.
Embodiment II:
Referring to Fig. 3, the present case provides a semantic segmentation system based on dual feature fusion for IoT sensing, including a connected multilayer feature fusion module and a lightweight semantic pyramid module;
The multi-layer feature fusion module includes a backbone network unit, a proofreading unit;
The lightweight semantic pyramid module includes the first downscaling unit, the second downscaling unit, the third downscaling unit, the context encoding unit, the global pooling unit, the first channel splicing fusion unit, the second channel splicing fusion unit, and the upsampling unit;
Among them, the backbone network unit is connected to the proofreading unit, the proofreading unit is connected to the first downscaling unit and the second downscaling unit, the first downscaling unit is connected to the context encoding unit and the global pooling unit, the context encoding unit and the global pooling unit are connected to the first channel splicing fusion unit, the second downscaling unit and the first channel splicing fusion unit are connected to the second channel splicing fusion unit, and the second channel splicing fusion unit is also connected to the third downscaling unit, and the upsampling unit is connected to the third downscaling unit;
The mentioned backbone network unit for feature encoding of the original image using the backbone network to obtain features at different scales;
The mentioned proofreading unit for learning features at different scales by two attention refinement blocks to obtain multi-level fusion features;
Both the first dimensionality reduction unit as well as the second dimensionality reduction unit are used to reduce the dimensionality of the multi-level fusion features to output the first dimensionality reduction feature, the second dimensionality reduction feature, and the first dimensionality reduction feature as well as the second dimensionality reduction feature are the same;
Context encoding unit for context encoding the first descending features by deep decomposable convolution at different convolutional scales to obtain local features at different scales, respectively;
A global pooling unit for globally pooling the first reduced-dimensional features by a global mean pooling layer to obtain global features;
A first channel stitching and fusion unit for channel stitching and fusion of global features as
well as local features at different scales to obtain multi-scale contextual fusion features;
A second channel stitching fusion unit for channel stitching fusion of a second reduced dimensional feature, a multi-scale contextual fusion feature to obtain a stitching feature;
Third dimensionality reduction unit for dimensionality reduction of the spliced features;
Up-sampling unit for up-sampling the downsampled spliced features to obtain the final output.
It should be noted that this embodiment provides a semantic segmentation system based on dual feature fusion for IoT perception, which is similar to Embodiment I and will not be described again here.
The above-described embodiments are only a description of the preferred embodiment of the present application, not a limitation of the scope of the present application. Without departing from the spirit of the design of the present application, all kinds of deformations and improvements made to the technical solution of the present application by a person of ordinary skill in the art shall fall within the scope of protection of the present application.

Claims (10)

Claims
1. A semantic segmentation method based on dual feature fusion for IoT sensing, characterized in that it comprises the following steps:
S1. Input the original image, and use the backbone network to encode the features of the original image to obtain the features at different scales;
S2. Learning of features at different scales by two attention refinement blocks to obtain multi- level fusion features;
S3. Perform dimensionality reduction on the multilevel fusion features to obtain the dimensionality reduction features;
S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales, respectively;
S5. Global pooling of the downscaled features with the global mean pooling layer to obtain the global features;
S6. Fusing global features as well as local features of different scales for channel splicing to obtain multi-scale context fusion features;
S7. Perform channel splicing fusion of the downscaled features, and multi-scale context fusion features to obtain the splicing features;
S8. Downsampling and upsampling the spliced features to obtain the final output.
2. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 1, characterized in that step S1 is specified as: The original image is feature encoded using the backbone network to obtain the first, second, and third features, where the first feature scale is 1/4 of the original image scale, the second feature scale is 1/8 of the original image scale, and the third feature scale is 1/16 of the original image scale.
3. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 2, characterized in that step S2 includes the following steps:
S2.1. Fusing the first features, second features by a first attention refinement block to output semantic features;
S2.2. The semantic features as well as the third features are fused by the second attention refinement block to obtain the multi-level fused features.
4. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 3, characterized in that step S2.1 specifically comprises:
S2.1.1. The first feature is mapped to the same scale as the second feature through the downsampling layer to obtain the first scale feature;
S2.1.2. Mapping the channel dimension of the first scale feature to coincide with the channel dimension of the second feature by a first 1*1 convolution layer to obtain the first channel feature;
S2.1.3. Fusion of the first scale features with second features for channel stitching to obtain first fusion features;
S2.1.4. Inputting the first fused features into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, and the second attention vector, respectively;
S2.1.5. Non-linear Mapping of the first attention vector and the second attention vector by the first multilayer perception layer to output the first mixed attention vector and the second mixed attention vector, and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector;
S2.1.6. Normalize the first fused mixed attention vector to obtain the first normalized mixed attention vector;
S2.1.7. Mapping a first channel feature with a first normalized mixed attention vector weighted by a first normalized attention vector;
S2.1.8. Fusing the second features as well as the weighted first channel features to output semantic features.
5. A semantic segmentation method based on dual feature fusion for IoT perception according to claim 4, characterized in that step S2.2 specifically comprises the following steps:
S2.2.1. The third feature is mapped to the same scale as the second feature through the upsampling layer to obtain the second scale feature;
S2.2.2. Mapping the channel dimension of the second scale feature by a second 1*1 convolution layer to coincide with the second feature channel dimension to obtain the second channel feature;
S2.2.3. Fusing the second scale features with the semantic features for channel splicing to obtain the second fusion features;
S2.2.4. Inputting the second fusion features into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively;
S2.2.5. Non-linear Mapping of the third attention vector and the fourth attention vector by the
second multilayer perception layer to output the third mixed attention vector and the fourth mixed attention vector, and fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector;
S2.2.6. Normalize the second fused mixed attention vector to obtain the second normalized mixed attention vector;
S2.2.7. Mapping the second channel features with a second normalized mixed attention vector weighted by the second normalized attention vector;
S2.2.8. Fusing the semantic features as well as the weighted second channel features to obtain multi-level fusion features.
6. A semantic segmentation method based on dual feature fusion for IoT perception according to claim 5, is characterized in that the first fused features described in step S2.1.4 are input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively, using the following equations: V1 = AAP1(C[F1, F2]), V2 = AMP1(C[F1, F2]), Where, V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature, F2 is the second feature, C[] denotes channel stitching fusion, AAP1 () denotes the first adaptive mean pooling layer, and AMP1 () denotes the first adaptive maximum pooling layer.
7. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 6, is characterized in that the second fused features described in step S2.2.4 are input into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector, the fourth attention vector, respectively, by using the following equations: V3 = AAP2(C[L1, L2]), V4 = AMP2(C[L1, L2]), Where, V3 is a third attention vector, V4 is a fourth attention vector, L1 is a second scale feature, L2 is a semantic feature, AAP2 () denotes a second adaptive mean pooling layer, and AMP2 () denotes a second adaptive maximum pooling layer.
8. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 7, characterized in that the non-linear mapping of the first attention vector and the second attention vector by the first multilayer perception layer described in step S2.1.5, outputting the first mixed attention vector and the second mixed attention vector, and the channel splicing and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector, use the following equation:
VA1 = MLP1(C[V1, V2]);
and the non-linear mapping of the third attention vector and the fourth attention vector by the second multilayer perception layer described in step S2.2.5, outputting the third mixed attention vector and the fourth mixed attention vector, and the fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector, use the following equation:
VA2 = MLP2(C[V3, V4]);
where VA1 is the first fused mixed attention vector, VA2 is the second fused mixed attention vector, MLP1( ) is the first multilayer perception layer, and MLP2( ) is the second multilayer perception layer.
9. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 8, characterized in that steps S2.1.6, normalizing the first fused mixed attention vector to obtain the first normalized mixed attention vector, S2.1.7, weighting the first channel features with the first normalized mixed attention vector, and S2.1.8, fusing the second features as well as the weighted first channel features to output the semantic features, use the following equation:
L = Sig1(VA1) ⊙ F̂1 ⊕ F2;
and steps S2.2.6, normalizing the second fused mixed attention vector to obtain the second normalized mixed attention vector, S2.2.7, weighting the second channel features with the second normalized mixed attention vector, and S2.2.8, fusing the semantic features and the weighted second channel features to obtain the multi-level fusion features, use the following equation:
L̃ = Sig2(VA2) ⊙ L̂1 ⊕ L;
where L is the semantic feature, L̃ is the multi-level fusion feature, Sig1( ) denotes the first activation function, Sig2( ) denotes the second activation function, F̂1 is the first channel feature, L̂1 is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map, ⊙ denotes a pixel-level dot product operation, and ⊕ denotes a pixel-level addition operation.
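Read together, claims 6 to 9 define a two-stage attention computation. The block below only restates those equations in one place, in LaTeX form and with the same symbols as the claims; it introduces no additional content.

% Consolidated restatement of the equations in claims 6-9
\begin{aligned}
V_1 &= \mathrm{AAP}_1\big(C[F_1, F_2]\big), & V_2 &= \mathrm{AMP}_1\big(C[F_1, F_2]\big),\\
V_3 &= \mathrm{AAP}_2\big(C[L_1, L_2]\big), & V_4 &= \mathrm{AMP}_2\big(C[L_1, L_2]\big),\\
V_{A1} &= \mathrm{MLP}_1\big(C[V_1, V_2]\big), & V_{A2} &= \mathrm{MLP}_2\big(C[V_3, V_4]\big),\\
L &= \mathrm{Sig}_1(V_{A1}) \odot \hat{F}_1 \oplus F_2, & \tilde{L} &= \mathrm{Sig}_2(V_{A2}) \odot \hat{L}_1 \oplus L.
\end{aligned}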
10. A semantic segmentation system based on dual feature fusion for IoT sensing, characterized in that it includes a connected multi-layer feature fusion module and a lightweight semantic pyramid module;
The multi-layer feature fusion module includes a backbone network unit and a proofreading unit;
The lightweight semantic pyramid module includes a first dimensionality reduction unit, a second dimensionality reduction unit, a third dimensionality reduction unit, a context encoding unit, a global pooling unit, a first channel splicing fusion unit, a second channel splicing fusion unit, and an upsampling unit;
Among them, the backbone network unit is connected to the proofreading unit; the proofreading unit is connected to the first dimensionality reduction unit and the second dimensionality reduction unit; the first dimensionality reduction unit is connected to the context encoding unit and the global pooling unit; the context encoding unit and the global pooling unit are connected to the first channel splicing fusion unit; the second dimensionality reduction unit and the first channel splicing fusion unit are connected to the second channel splicing fusion unit; the second channel splicing fusion unit is also connected to the third dimensionality reduction unit; and the upsampling unit is connected to the third dimensionality reduction unit;
The backbone network unit is used for feature encoding of the original image by the backbone network to obtain features at different scales;
The proofreading unit is used for learning the features at different scales by two attention refinement blocks to obtain the multi-level fusion features;
Both the first dimensionality reduction unit and the second dimensionality reduction unit are used to reduce the dimensionality of the multi-level fusion features to output the first and the second reduced-dimensional features, respectively; the first and the second reduced-dimensional features are the same;
The context encoding unit is used for context encoding of the first reduced-dimensional features by depthwise separable convolutions at different convolutional scales to obtain local features at different scales, respectively;
The global pooling unit is used for globally pooling the first reduced-dimensional features by a global mean pooling layer to obtain global features;
The first channel splicing fusion unit is used for channel splicing and fusion of the global features as well as the local features at different scales to obtain multi-scale contextual fusion features;
The second channel splicing fusion unit is used for channel splicing and fusion of the second reduced-dimensional features and the multi-scale contextual fusion features to obtain spliced features;
The third dimensionality reduction unit is used for dimensionality reduction of the spliced features;
The upsampling unit is used for upsampling the dimensionality-reduced spliced features to obtain the final output.
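As a rough illustration of the lightweight semantic pyramid module of claim 10, the following PyTorch sketch follows the unit connections described above. The class and function names, the choice of three context branches, the dilation rates, the channel widths, and the number of output classes are all assumptions made for readability, not values given in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_separable(ch: int, kernel: int, dilation: int = 1) -> nn.Sequential:
    """Depthwise separable convolution as assumed for the context encoding unit."""
    pad = dilation * (kernel - 1) // 2
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel, padding=pad, dilation=dilation, groups=ch),  # depthwise
        nn.Conv2d(ch, ch, kernel_size=1),                                      # pointwise
    )


class LightweightSemanticPyramid(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int = 128, num_classes: int = 19):
        super().__init__()
        # first / second dimensionality reduction units (identical reductions of the fused features)
        self.reduce1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.reduce2 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        # context encoding unit: depthwise separable convolutions at different scales
        self.contexts = nn.ModuleList(
            [depthwise_separable(mid_ch, 3, dilation=d) for d in (1, 2, 4)]
        )
        # third dimensionality reduction unit applied to the spliced features
        spliced_ch = mid_ch * (len(self.contexts) + 2)   # local branches + global + second reduction
        self.reduce3 = nn.Conv2d(spliced_ch, num_classes, kernel_size=1)

    def forward(self, fused: torch.Tensor, out_size) -> torch.Tensor:
        r1 = self.reduce1(fused)                          # first reduced-dimensional features
        r2 = self.reduce2(fused)                          # second reduced-dimensional features
        locals_ = [ctx(r1) for ctx in self.contexts]      # local features at different scales
        g = F.adaptive_avg_pool2d(r1, 1)                  # global pooling unit
        g = F.interpolate(g, size=r1.shape[2:], mode="bilinear", align_corners=False)
        multi_scale = torch.cat(locals_ + [g], dim=1)     # first channel splicing fusion unit
        spliced = torch.cat([r2, multi_scale], dim=1)     # second channel splicing fusion unit
        logits = self.reduce3(spliced)                    # third dimensionality reduction unit
        return F.interpolate(logits, size=out_size,       # upsampling unit -> final output
                             mode="bilinear", align_corners=False)


# e.g., with the multi-level fusion features produced by the proofreading unit:
# output = LightweightSemanticPyramid(in_ch=64)(multi_level, out_size=(512, 512))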
LU503090A 2021-04-25 2022-03-17 A semantic segmentation system and method based on dual feature fusion for iot sensing LU503090B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110446945.6A CN113221969A (en) 2021-04-25 2021-04-25 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion

Publications (1)

Publication Number Publication Date
LU503090B1 true LU503090B1 (en) 2023-03-22

Family

ID=77088741

Family Applications (1)

Application Number Title Priority Date Filing Date
LU503090A LU503090B1 (en) 2021-04-25 2022-03-17 A semantic segmentation system and method based on dual feature fusion for iot sensing

Country Status (4)

Country Link
CN (1) CN113221969A (en)
LU (1) LU503090B1 (en)
WO (1) WO2022227913A1 (en)
ZA (1) ZA202207731B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221969A (en) * 2021-04-25 2021-08-06 浙江师范大学 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
CN114913325B (en) * 2022-03-24 2024-05-10 北京百度网讯科技有限公司 Semantic segmentation method, semantic segmentation device and computer program product
CN114445430B (en) * 2022-04-08 2022-06-21 暨南大学 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN116229065B (en) * 2023-02-14 2023-12-01 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN116342884B (en) * 2023-03-28 2024-02-06 阿里云计算有限公司 Image segmentation and model training method and server
CN116052007B (en) * 2023-03-30 2023-08-11 山东锋士信息技术有限公司 Remote sensing image change detection method integrating time and space information
CN116205928B (en) * 2023-05-06 2023-07-18 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium
CN116580241B (en) * 2023-05-22 2024-05-14 内蒙古农业大学 Image processing method and system based on double-branch multi-scale semantic segmentation network
CN116630386B (en) * 2023-06-12 2024-02-20 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN116721351B (en) * 2023-07-06 2024-06-18 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 Remote sensing intelligent extraction method for road environment characteristics in overhead line channel
CN116559778B (en) * 2023-07-11 2023-09-29 海纳科德(湖北)科技有限公司 Vehicle whistle positioning method and system based on deep learning
CN116612124B (en) * 2023-07-21 2023-10-20 国网四川省电力公司电力科学研究院 Transmission line defect detection method based on double-branch serial mixed attention
CN116721420B (en) * 2023-08-10 2023-10-20 南昌工程学院 Semantic segmentation model construction method and system for ultraviolet image of electrical equipment
CN116740866B (en) * 2023-08-11 2023-10-27 上海银行股份有限公司 Banknote loading and clearing system and method for self-service machine
CN117115443B (en) * 2023-08-18 2024-06-11 中南大学 Segmentation method for identifying infrared small targets
CN117636165A (en) * 2023-11-30 2024-03-01 电子科技大学 Multi-task remote sensing semantic change detection method based on token mixing
CN117809294B (en) * 2023-12-29 2024-07-19 天津大学 Text detection method based on feature correction and difference guiding attention
CN117876929B (en) * 2024-01-12 2024-06-21 天津大学 Sequential target positioning method for progressive multi-scale context learning
CN117593633B (en) * 2024-01-19 2024-06-14 宁波海上鲜信息技术股份有限公司 Ocean scene-oriented image recognition method, system, equipment and storage medium
CN117745745B (en) * 2024-02-18 2024-05-10 湖南大学 CT image segmentation method based on context fusion perception
CN118037664B (en) * 2024-02-20 2024-10-01 成都天兴山田车用部品有限公司 Deep hole surface defect detection and CV size calculation method
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision
CN117828280B (en) * 2024-03-05 2024-06-07 山东新科建工消防工程有限公司 Intelligent fire information acquisition and management method based on Internet of things
CN118052739A (en) * 2024-03-08 2024-05-17 东莞理工学院 Deep learning-based traffic image defogging method and intelligent traffic image processing system
CN117993442B (en) * 2024-03-21 2024-10-18 济南大学 Hybrid neural network method and system for fusing local and global information
CN118072357B (en) * 2024-04-16 2024-07-02 南昌理工学院 Control method and system of intelligent massage robot
CN118429808A (en) * 2024-05-10 2024-08-02 北京信息科技大学 Remote sensing image road extraction method and system based on lightweight network structure
CN118230175B (en) * 2024-05-23 2024-08-13 济南市勘察测绘研究院 Real estate mapping data processing method and system based on artificial intelligence
CN118366000A (en) * 2024-06-14 2024-07-19 陕西天润科技股份有限公司 Cultural relic health management method based on digital twinning
CN118397298B (en) * 2024-06-28 2024-09-06 杭州安脉盛智能技术有限公司 Self-attention space pyramid pooling method based on mixed pooling and related components
CN118429335B (en) * 2024-07-02 2024-09-24 新疆胜新复合材料有限公司 Online defect detection system and method for carbon fiber sucker rod
CN118470679B (en) * 2024-07-10 2024-09-24 山东省计算中心(国家超级计算济南中心) Lightweight lane line segmentation recognition method and system
CN118485835B (en) * 2024-07-16 2024-10-01 杭州电子科技大学 Multispectral image semantic segmentation method based on modal divergence difference fusion

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150104102A1 (en) * 2013-10-11 2015-04-16 Universidade De Coimbra Semantic segmentation method with second-order pooling
CN111210432B (en) * 2020-01-12 2023-07-25 湘潭大学 Image semantic segmentation method based on multi-scale multi-level attention mechanism
CN111915619A (en) * 2020-06-05 2020-11-10 华南理工大学 Full convolution network semantic segmentation method for dual-feature extraction and fusion
CN111932553B (en) * 2020-07-27 2022-09-06 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN112651973B (en) * 2020-12-14 2022-10-28 南京理工大学 Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN113221969A (en) * 2021-04-25 2021-08-06 浙江师范大学 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion

Also Published As

Publication number Publication date
WO2022227913A1 (en) 2022-11-03
ZA202207731B (en) 2022-07-27
CN113221969A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
LU503090B1 (en) A semantic segmentation system and method based on dual feature fusion for iot sensing
WO2021233342A1 (en) Neural network construction method and system
CN107480206A (en) A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN109522945A (en) One kind of groups emotion identification method, device, smart machine and storage medium
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN114463545A (en) Image semantic segmentation algorithm and system based on multi-channel depth weighted aggregation
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN111401151B (en) Accurate three-dimensional hand posture estimation method
CN116244473B (en) Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN113516133A (en) Multi-modal image classification method and system
CN111985597A (en) Model compression method and device
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN117292704A (en) Voice-driven gesture action generation method and device based on diffusion model
CN112052945B (en) Neural network training method, neural network training device and electronic equipment
CN111709415A (en) Target detection method, target detection device, computer equipment and storage medium
Lv et al. An inverted residual based lightweight network for object detection in sweeping robots
CN113657272B (en) Micro video classification method and system based on missing data completion
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN113836319A (en) Knowledge completion method and system for fusing entity neighbors
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN118038032A (en) Point cloud semantic segmentation model based on super point embedding and clustering and training method thereof
CN115375922B (en) Light-weight significance detection method based on multi-scale spatial attention
WO2023071658A1 (en) Ai model processing method and apparatus, and ai model computing method and apparatus

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20230322