LU503090B1 - A semantic segmentation system and method based on dual feature fusion for IoT sensing


Info

Publication number
LU503090B1
Authority
LU
Luxembourg
Prior art keywords
features
fusion
attention vector
feature
unit
Prior art date
Application number
LU503090A
Other languages
German (de)
Inventor
Wenxuan Tu
Xinzhong Zhu
Huiying Xu
Jianmin Zhao
Original Assignee
Univ Zhejiang Normal
Application filed by Univ Zhejiang Normal filed Critical Univ Zhejiang Normal
Application granted granted Critical
Publication of LU503090B1 publication Critical patent/LU503090B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

This application discloses a semantic segmentation system and method based on dual feature fusion for IoT perception, the method comprising the steps: S1. Feature encoding of the original image to obtain features at different scales; S2. Learning of the features at different scales by two attention refinement blocks to obtain multi-level fusion features; S3. Dimensionality reduction of the multi-level fusion features to obtain dimensionality reduction features; S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales; S5. Global pooling of the downscaled features with a global mean pooling layer to obtain global features; S6. Fusion of the global features and the local features by channel stitching to obtain multi-scale contextual fusion features; S7. Fusion of the downscaled features and the multi-scale contextual fusion features by channel splicing to obtain spliced features; and S8. The output is obtained based on the spliced features. The semantic discrepancy between multi-layer features is alleviated, and the information representation is enriched to improve the recognition accuracy.

Description

A SEMANTIC SEGMENTATION SYSTEM AND METHOD BASED ON DUAL FEATURE FUSION FOR IOT SENSING
Technical Field
The present application belongs to the field of computer vision technology and specifically relates to a system and method for semantic segmentation based on dual feature fusion for IoT perception.
Background Art
Semantic segmentation, which aims to densely assign each pixel to its corresponding predefined class, has shown good performance in many IoT applications such as autonomous driving, diabetic retinopathy screening, and image analysis. Two important factors, the way features are fused and the complexity of the network, largely determine the performance of a semantic segmentation method.
Existing semantic segmentation methods can be broadly classified into two categories: accuracy-oriented and efficiency-oriented methods. Early works mostly focused on a single perspective: either the recognition accuracy of the algorithm or the speed of its execution. In the first class of approaches, the design of semantic segmentation models focuses on integrating diverse features through a complex framework to achieve high-accuracy segmentation. For example, pyramidal structures have been proposed, such as the atrous spatial pyramid pooling module (ASPP) and the contextual pyramid module (CPM), which encode multi-scale contextual information at the tail end of the ResNet101 backbone (2048 feature maps) to deal with multi-scale variations of targets. In addition, U-shaped networks directly fuse hierarchical features through long skip connections to extract spatial information at different levels as much as possible, thus achieving accurate pixel segmentation. On the other hand, typical asymmetric encoder-decoder structures have also been studied extensively. ENet and ESPNet substantially compress the network size through pruning operations so that large-scale images can be processed online at very high speed. To improve the overall performance of semantic segmentation methods, the recent literature shows a tendency to balance the efficiency and effectiveness of segmentation networks when encoding multi-level features and multi-scale contextual information. Specifically, ERFNet employs a large number of decomposable convolutions with different dilation rates in the decoder to reduce parameter redundancy and enlarge the receptive field at the same time. In addition, BiSeNet, CANet, and ICNet have been proposed, which process the input image with several lightweight sub-networks and then fuse the multi-layer features or deep context information together. Recently, CIFReNet encodes multi-layer and multi-scale information by introducing feature refinement and context integration modules to achieve accurate and efficient scene segmentation.
Although existing semantic segmentation methods achieve good segmentation performance in terms of either high accuracy or fast speed, they have at least the following problems: 1) they rely on considerable time and computational complexity to complete the feature extraction process during multi-level information fusion, which leads to inefficient model learning and high computational cost; 2) they fuse multi-source information directly by element-wise addition or concatenation, with little consideration of how to narrow the semantic gap between multi-layer features. As a result, the interaction between multiple information sources is hindered, resulting in suboptimal segmentation accuracy.
Summary of the Invention
To address the above-mentioned problems in the prior art, this application proposes a semantic segmentation system and method based on dual feature fusion for IoT sensing, which achieves a balanced overall performance in terms of accuracy, speed, storage, and computational complexity.
A semantic segmentation method based on dual feature fusion for IoT sensing, comprising the steps of:
S1. Input the original image, and use the backbone network to encode the features of the original image to obtain the features at different scales;
S2. Learning of features at different scales by two attention refinement blocks to obtain multi- level fusion features;
S3. Perform dimensionality reduction on the multilevel fusion features to obtain the dimensionality reduction features;
S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales, respectively;
S5. Global pooling of the downscaled features with the global mean pooling layer to obtain the global features;
S6. Fusing global features as well as local features of different scales for channel splicing to obtain multi-scale context fusion features;
S7. Perform channel splicing fusion of the downscaled features and the multi-scale context fusion features to obtain the spliced features;
S8. Downsampling and upsampling the spliced features to obtain the final output.
As a preferred embodiment, step S1 is specified as:
The original image is feature encoded using the backbone network to obtain the first, second, and third features, where the first feature scale is 1/4 of the original image scale, the second feature scale is 1/8 of the original image scale, and the third feature scale is 1/16 of the original image scale.
As a preferred solution, the following steps are included in step S2:
S2.1. Fusing the first features, second features by a first attention refinement block to output semantic features;
S2.2. The semantic features as well as the third features are fused by the second attention refinement block to obtain the multi-level fused features.
As a preferred embodiment, step S2.1 specifically includes the following steps:
S2.1.1. The first feature is mapped to the same scale as the second feature through the downsampling layer to obtain the first scale feature;
S2.1.2. Mapping the channel dimension of the first scale feature to coincide with the channel dimension of the second feature by a first 1*1 convolution layer to obtain the first channel feature;
S2.1.3. Fusion of first scale features with second features for channel stitching to obtain first fusion features;
S2.1.4. Inputting the first fused features into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively;
S2.1.5. Non-linear Mapping of the first attention vector and the second attention vector by the first multilayer perception layer to output the first mixed attention vector and the second mixed attention vector, and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector;
S2.1.6. Normalize the first fused mixed attention vector to obtain the first normalized mixed attention vector;
S2.1.7. Mapping a first channel feature with a first normalized mixed attention vector weighted by a first normalized attention vector;
S2.1.8. Fusing the second features as well as the weighted first channel features to output semantic features.
As a preferred embodiment, step S2.2 specifically comprises the following step:
S2.2.1. The third feature is mapped to the same scale as the second feature through the upsampling layer to obtain the second scale feature;
S2.2.2. Mapping the channel dimension of the second scale feature by a second 1*1 convolution layer to coincide with the second feature channel dimension to obtain the second channel feature;
S2.2.3. Fusing the second scale features with the semantic features for channel splicing to obtain the second fusion features;
S2.2.4. Inputting the second fusion features into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively;
S2.2.5. Non-linear Mapping of the third attention vector and the fourth attention vector by the second multilayer perception layer to output the third mixed attention vector and the fourth mixed attention vector, and fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector;
S2.2.6. Normalize the second fused mixed attention vector to obtain the second normalized mixed attention vector;
S2.2.7. Mapping the second channel features with a second normalized mixed attention vector weighted by the second normalized attention vector;
S2.2.8. Fusing the semantic features as well as the weighted second channel features to obtain multi-level fusion features.
As a preferred embodiment, the first fusion features described in step S2.1.4 are input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively, by using the following equation:
V1 = AAP1(C[F1, F2]),
V2 = AMP1(C[F1, F2]),
Where V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature, F2 is the second feature, C[] denotes channel stitching fusion, AAP1 () denotes the first adaptive mean pooling layer, and AMP1 () denotes the first adaptive maximum pooling layer.
As a preferred option, the second fusion features described in step S2.2.4 are input into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively, by using the following equation:
V3 = AAP2(C[L1, L2]),
V4 = AMP2(C[L1, L2]),
Where V3 is the third attention vector, V4 is the fourth attention vector, L1 is the second scale feature, L2 is the semantic feature, AAP2 () denotes the second adaptive mean pooling layer, and AMP2 () denotes the second adaptive maximum pooling layer.
As the preferred embodiment, the non-linear Mapping of the first attention vector, the second attention vector to output the first blended attention vector, the second blended attention vector by the first multilayer perception layer described in step S2.1.5, and the channel stitching and fusion of the first blended attention vector, the second blended attention vector to output the first fused blended attention vector, specifically using the following equation:
Va1 = MLP1(C[V1, V2]),
Non-linear Mapping of the third attention vector, the fourth attention vector by the second multilayer perception layer as described in step S2.2.5 to output the third blended attention vector, the fourth blended attention vector, and fusing the third blended attention vector, the fourth blended attention vector to output the second fused blended attention vector, specifically using the following equation:
Va2 = MLP2(C[V3, V4]).
Where, Va1 is the first fused mixed attention vector, Va2 is the second fused mixed attention vector, MLP1 () is the first multilayer perception layer, and MLP2 () is the second multilayer perception layer.
As the preferred scheme, step S2.1.6. normalizes the first fused mixed attention vector to obtain the first normalized mixed attention vector, S2.1.7. maps the first channel features with the first normalized mixed attention vector weighted, and S2.1.8. fuses the second features as well as the weighted first channel features to output the semantic features by using the following equation:
L2 = Sig1(Va1) ⊙ F1' ⊕ F2,
Step S2.2.6 normalizes the second fused mixed attention vector to obtain the second normalized mixed attention vector, S2.2.7 maps the second channel features with the second normalized mixed attention vector weighted by the second normalized mixed attention vector, and S2.2.8 fuses the semantic features and the weighted second channel features to obtain the multi-level fused features by using the following equation:
L̃ = Sig2(Va2) ⊙ L1' ⊕ L2,
Where L2 is the semantic feature, L̃ is the multi-level fusion feature, Sig1 () denotes the first activation function, Sig2 () denotes the second activation function, F1' is the first channel feature, L1' is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map, ⊙ denotes a pixel-level dot product operation, and ⊕ denotes a pixel-level dot add operation.
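The attention refinement (proofreading) logic described by the steps and equations above can be summarized in code. The following PyTorch sketch is illustrative only: the class name ARB, the use of bilinear interpolation for the rescaling step, and the hidden size of the multilayer perception layer are assumptions not specified in this application.

```python
# Illustrative sketch of one attention refinement block (ARB), assuming PyTorch.
# Class name, MLP hidden size, and the bilinear rescaling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARB(nn.Module):
    def __init__(self, in_ch, ref_ch, mlp_hidden=None):
        super().__init__()
        # 1*1 convolution mapping the rescaled feature to the reference channel dimension (S2.1.2)
        self.channel_map = nn.Conv2d(in_ch, ref_ch, kernel_size=1, bias=False)
        self.aap = nn.AdaptiveAvgPool2d(1)   # adaptive mean pooling (S2.1.4)
        self.amp = nn.AdaptiveMaxPool2d(1)   # adaptive maximum pooling (S2.1.4)
        cat_ch = in_ch + ref_ch
        mlp_hidden = mlp_hidden or cat_ch // 4           # assumed reduction ratio
        # multilayer perception layer applied to the concatenated attention vectors (S2.1.5)
        self.mlp = nn.Sequential(
            nn.Linear(2 * cat_ch, mlp_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(mlp_hidden, ref_ch),
        )

    def forward(self, f_in, f_ref):
        # rescale f_in to the spatial size of the reference feature (S2.1.1 / S2.2.1)
        f_scale = F.interpolate(f_in, size=f_ref.shape[2:], mode="bilinear", align_corners=False)
        f_chan = self.channel_map(f_scale)               # first channel feature F1'
        fused = torch.cat([f_scale, f_ref], dim=1)       # channel stitching C[F1, F2] (S2.1.3)
        v1 = self.aap(fused).flatten(1)                  # V1 = AAP(C[F1, F2])
        v2 = self.amp(fused).flatten(1)                  # V2 = AMP(C[F1, F2])
        va = self.mlp(torch.cat([v1, v2], dim=1))        # Va = MLP(C[V1, V2])
        attn = torch.sigmoid(va)[:, :, None, None]       # Sig(Va): normalized mixed attention (S2.1.6)
        return attn * f_chan + f_ref                     # Sig(Va) ⊙ F1' ⊕ F2 (S2.1.7-S2.1.8)
```

In this reading, the first ARB would take the 1/4-scale and 1/8-scale backbone features as f_in and f_ref, and the second ARB would take the 1/16-scale feature together with the semantic features output by the first block.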
Accordingly, a semantic segmentation system based on dual feature fusion for IoT sensing is also provided, including a connected multilayer feature fusion module and a lightweight semantic pyramid module;
The multi-layer feature fusion module includes a backbone network unit, a proofreading unit;
The lightweight semantic pyramid module includes the first downscaling unit, the second downscaling unit, the third downscaling unit, the context encoding unit, the global pooling unit, the first channel splicing fusion unit, the second channel splicing fusion unit, and the upsampling unit;
Among them, the backbone network unit is connected to the proofreading unit, the proofreading unit is connected to the first downscaling unit and the second downscaling unit, the first downscaling unit is connected to the context encoding unit and the global pooling unit, the context encoding unit and the global pooling unit are connected to the first channel splicing fusion unit, the second downscaling unit and the first channel splicing fusion unit are connected to the second channel splicing fusion unit, and the second channel splicing fusion unit is also connected to the third downscaling unit, and the upsampling unit is connected to the third downscaling unit;
The mentioned backbone network unit for feature encoding of the original image using the backbone network to obtain features at different scales;
The mentioned proofreading unit for learning features at different scales by two attention refinement blocks to obtain multi-level fusion features;
Both the first dimensionality reduction unit and the second dimensionality reduction unit are used to reduce the dimensionality of the multi-level fusion features to output the first dimensionality reduction feature and the second dimensionality reduction feature, respectively; the first and the second dimensionality reduction features are the same;
Context encoding unit for context encoding the first descending features by deep decomposable convolution at different convolutional scales to obtain local features at different scales, respectively;
A global pooling unit for globally pooling the first reduced-dimensional features by a global mean pooling layer to obtain global features;
A first channel stitching and fusion unit for channel stitching and fusion of global features as well as local features at different scales to obtain multi-scale contextual fusion features;
A second channel stitching fusion unit for channel stitching fusion of a second reduced dimensional feature, a multi-scale contextual fusion feature to obtain a stitching feature;
Third dimensionality reduction unit for dimensionality reduction of the spliced features;
Up-sampling unit for up-sampling the downsampled spliced features to obtain the final output.
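For orientation, the connections between the units listed above can be illustrated with a short sketch. Every sub-module passed to the constructor below (backbone, ARBs, reduction layers, context branches, global pooling, classifier) is a placeholder standing in for the units named in the description; the class name DFFNet and the assumption that the global pooling branch returns a map of the same spatial size as the reduced features are not taken from this application.

```python
# Illustrative wiring of the units listed above, assuming PyTorch. All sub-modules
# are caller-supplied placeholders; only the connectivity follows the description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFFNet(nn.Module):
    def __init__(self, backbone, arb1, arb2, reduce1, reduce2, reduce3,
                 context_branches, global_pool, classifier):
        super().__init__()
        self.backbone = backbone                          # backbone network unit
        self.arb1, self.arb2 = arb1, arb2                 # proofreading unit (two ARBs)
        self.reduce1, self.reduce2 = reduce1, reduce2     # first / second downscaling units
        self.context_branches = nn.ModuleList(context_branches)  # context encoding unit
        self.global_pool = global_pool                    # global pooling unit (assumed to return a
                                                          # map broadcast to the feature-map size)
        self.reduce3 = reduce3                            # third downscaling unit
        self.classifier = classifier                      # e.g. a 1*1 convolution to class maps

    def forward(self, x):
        f14, f18, f116 = self.backbone(x)                 # features at 1/4, 1/8 and 1/16 scale
        sem = self.arb1(f14, f18)                         # first ARB -> semantic features
        multi = self.arb2(f116, sem)                      # second ARB -> multi-level fusion features
        d1, d2 = self.reduce1(multi), self.reduce2(multi) # two identical reduced features
        local_feats = [b(d1) for b in self.context_branches]  # local features at several scales
        g = self.global_pool(d1)                          # global features
        ctx = torch.cat(local_feats + [g], dim=1)         # first channel splicing fusion unit
        spliced = torch.cat([d2, ctx], dim=1)             # second channel splicing fusion unit
        out = self.classifier(self.reduce3(spliced))      # third downscaling unit + prediction
        return F.interpolate(out, size=x.shape[2:],
                             mode="bilinear", align_corners=False)  # upsampling unit
```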
The beneficial effects of this application are: (1) A multi-level feature fusion module (MFFM) is proposed, which uses two recursive attention refinement blocks (ARBs) to improve the effectiveness of multi-level feature fusion.
While the computational cost of the segmentation network is kept under control, the proposed ARB mitigates the semantic discrepancy among multi-level features by using the abstract semantic information of higher-order features to calibrate the spatial detail information in lower-order features. (2) A lightweight semantic pyramid module (LSPM) is proposed, which decomposes the convolutional operator, thus reducing the computational overhead of encoding contextual information.
In addition, this module fuses the multi-level fusion features with multi-scale contextual information to enrich the representation of information, thus improving the recognition accuracy.
Brief Description of the Drawings
In order to illustrate the technical solutions more clearly in the embodiments or prior art of the present application, the following is a brief description of the accompanying drawings that need to be used in the description of the embodiments or prior art. It is obvious that the accompanying drawings in the following description are only some of the embodiments of the present application, and other accompanying drawings can be obtained based on them without any creative work for a person of ordinary skill in the art.
Figure 1 is a flowchart of a semantic segmentation method based on dual feature fusion for IoT sensing as described in the present application;
Figure 2 is a schematic diagram of the structure of a semantic segmentation system based on dual feature fusion for IoT sensing as described in the present application;
Figure 3 is a schematic diagram of the structure of the attention refinement block described in this application.
Detailed Description of Embodiments
The following illustrates the embodiment of the steps of the application by specific concrete examples, and other advantages and efficacy of the present application can be readily understood by those skilled in the art as disclosed in this specification. The present application may also be implemented or applied by additionally different specific embodiments, and the details in this specification may also be modified or changed in various ways without departing from the spirit of the present application based on different views and applications. It is to be noted that the following embodiments and the features in the embodiments can be combined with each other without conflict.
Embodiment I:
Referring to Figs. 1, 2 and 3, this embodiment provides a semantic segmentation method based on dual feature fusion for IoT sensing, comprising the steps:
S1. Input the original image, and use the backbone network to encode the features of the original image to obtain the features at different scales;
S2. Learning of features at different scales by two attention refinement blocks to obtain multi- level fusion features;
S3. Perform dimensionality reduction on the multilevel fusion features to obtain the dimensionality reduction features;
S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales, respectively;
S5. Global pooling of the downscaled features with the global mean pooling layer to obtain the global features;
S6. Fusing global features as well as local features of different scales for channel splicing to obtain multi-scale context fusion features;
S7. Perform channel splicing fusion of the downscaled features, multi-scale context fusion features to obtain the splicing features;
S8. Downsampling and upsampling the spliced features to obtain the final output.
Where, step S1 is specified as:
The original image is feature encoded using the backbone network to obtain the first, second, and third features, where the first feature scale is 1/4 of the original image scale, the second feature scale is 1/8 of the original image scale, and the third feature scale is 1/16 of the original image scale.
Each layer of the backbone network has different feature expression capabilities. Shallower layers contain more spatial details but lack semantic information, while deeper layers retain rich semantic information but lose a lot of spatial detail. Intuitively, fusing multiple layers of information together has a beneficial effect on learning differentiated and comprehensive feature representations.
Based on the above observations, we first obtain features at different scales from the backbone network, denoted I1/4, I1/8 and I1/16, and then unify the scale of all feature maps to 1/8 size to reduce information loss and resource utilization. Specifically, a stride-2 average pooling layer is used to downsample I1/4 to obtain I'1/8, and a bilinear layer is used to upsample the higher-order feature map I1/16 to obtain I''1/8. Finally, the three are fused to obtain the multilevel fusion feature O.
The above process is expressed as follows:
I'1/8 = T(GAPk=2,s=2(I1/4)),
I''1/8 = Upsample(T(I1/16)),
O = T(I'1/8 ⊕ I1/8 ⊕ I''1/8),
Where GAPk=2,s=2 () denotes a mean pooling layer with a kernel size of 2 and a stride of 2, T() is defined as a channel transformation operation to change the number of feature maps, Upsample () denotes an upsampling layer, and ⊕ denotes a pixel-level point-add operation.
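For concreteness, the scale-unification step above can be written as a short helper. The sketch below assumes PyTorch; modelling the channel transformation T() as externally supplied 1*1 convolutions (t14, t116, t_out) is an assumption, since the application only states that T() changes the number of feature maps.

```python
# Minimal sketch of the scale unification described above (assumed PyTorch).
# t14, t116 and t_out stand in for the channel transformation T().
import torch.nn.functional as F

def unify_to_one_eighth(i14, i18, i116, t14, t116, t_out):
    """i14, i18, i116: backbone features at 1/4, 1/8 and 1/16 of the input scale."""
    i18_low = t14(F.avg_pool2d(i14, kernel_size=2, stride=2))       # I'1/8 = T(GAPk=2,s=2(I1/4))
    i18_high = F.interpolate(t116(i116), size=i18.shape[2:],
                             mode="bilinear", align_corners=False)  # I''1/8 = Upsample(T(I1/16))
    return t_out(i18_low + i18 + i18_high)                          # O = T(I'1/8 + I1/8 + I''1/8)
```

Here t14 and t116 are assumed to map their inputs to the channel width of I1/8 so that the element-wise addition is valid.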
Although the above feature fusion operations facilitate the mutual use of complementary information between multi-level features, direct integration of low-level features with high-level features may not be efficient and comprehensive due to the semantic variability in multi-level stages.
To solve this problem, this application designs a feature refinement strategy, called the attention proofreading block (ARB). Both blocks focus on inter-channel relationship modeling with the multi-level fusion features. With this approach, the model can emphasize the weights of neurons that are highly relevant to the target object when the current channel location contains features with valuable information.
That is, step S2 includes the following steps:
S2.1. Fusing the first features, and second features by a first attention refinement block to output semantic features;
S2.2. The semantic features as well as the third features are fused by the second attention refinement block to obtain the multi-level fused features.
Further, step S2.1 specifically comprises the step of:
S2.1.1. The first feature is mapped to the same scale as the second feature through the downsampling layer to obtain the first scale feature;
S2.1.2. Mapping the channel dimension of the first scale feature to coincide with the channel dimension of the second feature by a first 1*1 convolution layer to obtain the first channel feature;
S2.1.3. Fusion of first scale features with second features for channel stitching to obtain first fusion features:
S2.1.4. The first fused features are fed into the first adaptive average pooling layer (AAP) and the first adaptive maximum pooling layer (AMP) to output the first attention vector and the second attention vector, respectively; the adaptive average pooling layer (AAP) and the adaptive maximum pooling layer (AMP) both model the importance of each feature channel by assigning weights to all channels of the multilevel fused features. The higher the importance of the current feature channel, the higher the weight assigned to that channel.
S2.1.5. The first attention vector, the second attention vector are nonlinearly mapped by the first multilayer perception layer to output the first blended attention vector, the second blended attention vector for enhancing the nonlinearity and robustness of the features, and the first blended attention vector, the second blended attention vector are fused to output the first fused blended attention vector;
S2.1.6. Normalize the first fused mixed attention vector to obtain the first normalized mixed attention vector;
S2.1.7. Mapping a first channel feature with a first normalized mixed attention vector weighted by a first normalized attention vector;
S2.1.8. Fusing the second features as well as the weighted first channel features to output
semantic features.
Further, step S2.2 specifically comprises the following steps:
S2.2.1. The third feature is mapped to the same scale as the second feature through the upsampling layer to obtain the second scale feature;
S2.2.2. Mapping the channel dimension of the second scale feature by a second 1*1 convolution layer to coincide with the second feature channel dimension to obtain the second channel feature;
S2.2.3. Fusing the second scale features with the semantic features for channel splicing to obtain the second fusion features;
S2.2.4. Inputting the second fusion features into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively;
S2.2.5. Non-linear Mapping of the third attention vector and the fourth attention vector by the second multilayer perception layer to output the third mixed attention vector and the fourth mixed attention vector, and fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector;
S2.2.6, Normalizing the second fused mixed attention vector to obtain the second normalized mixed attention vector;
S2.2.7. Mapping the second channel features with a second normalized mixed attention vector weighted by the second normalized attention vector;
S2.2.8. Fusing the semantic features as well as the weighted second channel features to obtain multi-level fusion features.
Further, the first fusion features described in step S2.1.4 are input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively, by using the following equation:
V1 = AAP1(C[F1, F2]),
V2 = AMP1(C[F1, F2]),
Where V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature,
F2 is the second feature, C[] denotes channel stitching fusion, AAP1 () denotes the first adaptive mean pooling layer, and AMP1 () denotes the first adaptive maximum pooling layer.
The second fused features described in step S2.2.4 are input into the second adaptive mean
pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively, using the following equation:
V3 = AAP2(C[L1, L2]),
V4 = AMP2(C[L1, L2]),
Wherein, V3 is a third attention vector, V4 is a fourth attention vector, L1 is a second scale feature, L2 is a semantic feature, AAP2 () denotes a second adaptive mean pooling layer, and AMP2 () denotes a second adaptive maximum pooling layer.
Further, the non-linear Mapping of the first attention vector and the second attention vector to output the first mixed attention vector and the second mixed attention vector by the first multilayer sensing layer as described in step S2.1.5, and the channel stitching and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector, by using the following equation:
Va1 = MLP1(C[V1, V2]),
Non-linear Mapping of the third attention vector, the fourth attention vector by the second multilayer perception layer as described in step S2.2.5 to output the third blended attention vector, the fourth blended attention vector, and fusing the third blended attention vector, the fourth blended attention vector to output the second fused blended attention vector, specifically using the following equation:
Va2 = MLP2(C[V3, V4]).
Where, Va1 is the first fused mixed attention vector, Va2 is the second fused mixed attention vector, MLP1 () is the first multilayer perception layer, and MLP2 () is the second multilayer perception layer.
Further, step: S2.1.6. normalizes the first fused mixed attention vector to obtain the first normalized mixed attention vector, S2.1.7. maps the first channel features with the first normalized mixed attention vector weighted by the first normalized mixed attention vector, and S2.1.8. fuses the second features as well as the weighted first channel features to output the semantic features using the following equation:
L2 = Sig1(Va1) ⊙ F1' ⊕ F2,
Step: S2.2.6, normalizing the second fused mixed attention vector to obtain the second normalized mixed attention vector, S2.2.7, Mapping the second channel features with the second
normalized mixed attention vector weighted by the second normalized mixed attention vector, S2.2.8, fusing the semantic features and the weighted second channel features to obtain the multi-level fused features by using the following equation:
L̃ = Sig2(Va2) ⊙ L1' ⊕ L2,
Where, L2 is the semantic feature, L̃ is the multi-level fusion feature, Sig1 () denotes the first activation function, Sig2 () denotes the second activation function, F1' is the first channel feature, L1' is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map, ⊙ denotes a pixel-level dot product operation, and ⊕ denotes a pixel-level dot add operation.
Technically speaking, the design of the ARB can be regarded as an information proofreading strategy in which two attention-based paths predict the importance of each channel in a complementary manner, thus transferring more semantic information to the lower-level features to alleviate the semantic variability between different levels of features and achieve effective feature fusion. The experimental results in the following sections verify the effectiveness of this setup. It is worth noting that the ARB has only 0.03M parameters in total, and the entire multi-level feature fusion maintains a lightweight computational scale.
Further, to enhance the computational efficiency of the context extraction module, this application proposes the Deep Decomposable Convolution (DFC) operation to replace the standard convolution layer. Inspired by deep separable convolution and decomposed convolution, a major idea of lightweight feature extraction is to integrate the ideas of the above two techniques. First, the regularization layer and activation function are used as two preprocessing steps to enhance the regularity of the convolutional layer; Second, the 3x3 depth convolution is decomposed to obtain two sets of one-dimensional depth-separable convolutional layers with scales of 3x1 and 1x3, respectively.
With the above approach, the dense convolution kernels on all channels will be consistently sparse, thus reducing the computational complexity and resource overhead of convolution. Finally, the local features of all scales are fused with the global features to obtain multi-scale contextual fusion features.
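The depth-decomposable convolution described above can be sketched as follows in PyTorch. The dilation argument and the trailing 1*1 pointwise convolution are assumptions, added so that the block can mix channels and serve as a context-encoding branch; the application itself only specifies the regularization and activation preprocessing and the 3x1 / 1x3 depthwise decomposition.

```python
# Minimal sketch of a depth-decomposable convolution (DFC) block, assuming PyTorch.
# The dilation handling and the final pointwise convolution are assumptions.
import torch.nn as nn

class DepthDecomposableConv(nn.Module):
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.block = nn.Sequential(
            # regularization + activation used as preprocessing, per the description
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # 3x3 depthwise convolution decomposed into 3x1 and 1x3 depthwise convolutions
            nn.Conv2d(channels, channels, kernel_size=(3, 1),
                      padding=(dilation, 0), dilation=(dilation, 1),
                      groups=channels, bias=False),
            nn.Conv2d(channels, channels, kernel_size=(1, 3),
                      padding=(0, dilation), dilation=(1, dilation),
                      groups=channels, bias=False),
            # pointwise convolution to mix channels (assumed, as in depthwise-separable designs)
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return self.block(x)
```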
After encoding the multi-scale contexts, the final segmentation results are further predicted using the reduced-dimensional multi-level fusion features combined with global features as well as local features at different scales. The above design has two advantages: On the one hand, multi-level
information and multi-scale contextual information are integrated in a unified system for more efficient feature representation; on the other hand, the use of skip connections can encourage information transfer and gradient conduction of multi-level information at the front layer, thus improving the efficiency of recognition.
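Putting these pieces together, a lightweight semantic pyramid module along the lines just described might look as follows. The branch count, dilation rates and channel widths are assumptions, and DepthDecomposableConv refers to the sketch given earlier; only the overall structure (dimensionality reduction, parallel local branches, a global pooling branch, channel splicing, and a skip connection to the reduced features) follows the text.

```python
# Illustrative composition of the lightweight semantic pyramid module (LSPM),
# assuming PyTorch and the DepthDecomposableConv sketch above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSPM(nn.Module):
    def __init__(self, in_ch, mid_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)  # dimensionality reduction
        self.branches = nn.ModuleList(
            [DepthDecomposableConv(mid_ch, dilation=d) for d in dilations]  # local context branches
        )
        self.global_pool = nn.AdaptiveAvgPool2d(1)                          # global mean pooling branch
        fused_ch = mid_ch * (len(dilations) + 2)                            # local + global + skip
        self.project = nn.Conv2d(fused_ch, mid_ch, kernel_size=1, bias=False)

    def forward(self, x):
        d = self.reduce(x)
        local_feats = [b(d) for b in self.branches]
        g = F.interpolate(self.global_pool(d), size=d.shape[2:], mode="nearest")  # broadcast global feature
        ctx = torch.cat(local_feats + [g], dim=1)        # multi-scale contextual fusion features
        spliced = torch.cat([d, ctx], dim=1)             # skip connection with the reduced features
        return self.project(spliced)                     # further dimensionality reduction
```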
The key technical points of this application are: (1) A novel IoT-oriented dual-feature fusion real-time semantic segmentation network (DFFNet) is invented. Compared with state-of-the-art methods, DFFNet reduces FLOPs by about 2.5 times and improves model execution speed by 1.8 times, while obtaining better accuracy. (2) A multi-level feature fusion module (MFFM) is proposed that employs two recursive attentional proofreading blocks (ARBs) to efficiently improve the effectiveness of multi-level feature fusion. While the computational cost of the segmentation network is kept under control, the proposed ARB mitigates the semantic discrepancy among multi-level features by using the abstract semantic information of higher-order features to calibrate the spatial detail information in lower-order features. (3) A lightweight semantic pyramid module (LSPM) is proposed, which decomposes the convolutional operator, thus reducing the computational overhead of encoding contextual information.
In addition, this module fuses the multi-level fusion features with multi-scale contextual information to enrich the representation of information, thus enhancing the recognition accuracy.
Further, this embodiment is also compared with existing methods on multiple datasets to validate the effectiveness of the present application.
Dataset: The dataset used in this application is the recognized standard scene perception dataset
Cityscapes, consisting of 25,000 annotated images with a resolution of 2048x1024. The annotation set contains 30 classes, 19 of which are used for training and evaluation. In the experiments of this application, only the 5000 images with fine annotations were used: 2975 images for training, 500 images for validation, and 1525 images for testing.
Parameter settings: All experiments were performed on an NVIDIA 1080Ti GPU card. We randomly scale the images by a factor of 0.5 to 1.5 and randomly apply a left-right flip to all training images. In addition, the initial learning rate is set to 0.005 and the learning rate is decayed using the poly strategy. The network is trained by minimizing the pixel-wise cross-entropy loss with a stochastic gradient descent optimizer, where the momentum is 0.9 and the weight decay is 5e-4. Finally, batch normalization layers are applied before all regular or dilated convolutional layers to achieve fast convergence.
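The training recipe above maps onto a standard setup. The sketch below assumes PyTorch; the poly power (0.9), the ignore label (255) and the iteration count are assumptions, while the initial learning rate, momentum, weight decay and loss follow the text.

```python
# Sketch of the training setup described above (assumed PyTorch).
# Poly power 0.9 and ignore_index=255 are assumptions, not taken from the text.
import torch
import torch.nn as nn

def make_optimizer_and_scheduler(model, max_iters, base_lr=0.005, power=0.9):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)
    # "poly" decay: lr = base_lr * (1 - iter / max_iters) ** power
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: max(0.0, 1.0 - it / max_iters) ** power)
    return optimizer, scheduler

def train_step(model, images, labels, optimizer, scheduler,
               criterion=nn.CrossEntropyLoss(ignore_index=255)):
    # pixel-wise cross-entropy between the upsampled logits and the label map
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```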
Evaluation Metrics: This application uses four evaluation metrics recognized in the field of semantic segmentation: segmentation accuracy, inference speed, network parameters, and computational complexity.
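The segmentation-accuracy metric reported in the following tables is the mean intersection over union (mIoU). A minimal sketch of its computation is given below; the 19-class setting follows the Cityscapes protocol described above, and the ignore label of 255 is an assumption.

```python
# Minimal sketch of the mIoU accuracy metric, using NumPy.
import numpy as np

def mean_iou(pred, label, num_classes=19, ignore=255):
    """pred, label: integer class maps of identical shape."""
    mask = label != ignore
    # confusion matrix built from the valid pixels
    hist = np.bincount(num_classes * label[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(0) + hist.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    # average only over classes that actually appear
    return np.nanmean(np.where(union > 0, iou, np.nan))
```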
Multi-level feature fusion module ablation experiment:
As shown in Table 1, this application compares four multi-level feature fusion models with the benchmark model: element-wise additive fusion (EAF), average pooling attention refinement (AAR), maximized attention refinement (MAR), and the combined use of AAR and MAR. As described in the table, the performance of EAF is only 1.12% higher than the benchmark network, which indicates that direct fusion of multi-level features is a suboptimal solution. Compared with the benchmark network, AAR and MAR achieve 2.61% mIoU and 2.54% mIoU performance improvements, which indicates that modeling the interdependence between channels can reduce the semantic variability between multi-level features. The bilateral pooling attention strategy proposed in this application allows the saliency information and the global information to compensate each other. Therefore, MFFM achieves a further improvement of 0.55% mIoU and 0.62% mIoU compared to AAR and MAR.
Moreover, the additional computation added by the proposed MFFM is negligible (only 0.06M parameters and 0.11GFLOPs), and the above results validate the efficiency and effectiveness of the proposed module.
Table 1 Ablation study of the multi-level feature fusion module. Columns: Model, Speed (ms), Number of parameters (M), FLOPs (G), mIoU (%).
Experiments on ablation of lightweight semantic pyramid modules:
The experiment evaluates the performance of a lightweight semantic pyramid module. SC-SPM,
FC-SPM, DC-SPM and DFC-SPM denote methods with four semantic pyramid modules built on regular convolution, decomposed convolution, deep convolution and deep decomposable convolution,
respectively. As shown in Table 2, 1) the semantic segmentation methods with a semantic pyramid module improve the mIoU segmentation accuracy by about 1.11%-2.70% compared with the benchmark model EAF, indicating that extracting local and global contextual information can significantly improve the learning ability of the model. 2) Although SC-SPM, FC-SPM, DC-SPM and DFC-SPM achieve similar accuracy, building the semantic pyramid module on efficient convolutions achieves better efficiency (faster and less computational complexity) than building it on conventional convolutions. DFC-SPM obtains 71.02% mIoU with only 0.05M additional parameters and 0.20G FLOPs. 3) LSPM integrates contextual information and multi-level feature information by designing a short-range feature learning operation, which is used to encourage information transfer and gradient conduction of multi-level information in the front layer.
As a result, the accuracy of the DFC-SPM method is improved from 71.02% mIoU to 71.65% mIoU. The above results demonstrate the efficiency and effectiveness of the proposed LSPM.
Table 2 Ablation study of the lightweight semantic pyramid module.
Evaluation on a benchmark dataset:
The DFFNet is compared with other existing semantic segmentation methods on the Cityscapes dataset. The "-" indicates that the method does not publish the corresponding performance values.
Table 3 Comprehensive performance of the method of this application and the comparison methods on the Cityscapes dataset. Columns: Model, Resolution, Speed (FPS), FLOPs (G), Number of parameters (M), mIoU (%).
As shown in Table 3, SegNet and ENet improve speed by significantly compressing the model size at the expense of segmentation accuracy. LW-RefineNet and ERFNet design an asymmetric encoder-decoder structure to maintain a balance between accuracy and efficiency. BiSeNet, CANet and ICNet use a multi-branch structure to achieve a good balance between accuracy and speed, but introduce more additional learning parameters. In contrast, DFFNet achieves better accuracy and efficiency, especially reflected in the reduction of network parameters (1.9M parameters) and computational complexity (3.1 GFLOPs). In addition, FCN and Dilation10 use computationally expensive VGG backbone networks (e.g., VGG16 and VGG19) as feature extractors, which require more than 2 seconds to process an image. DRN, DeepLabV2, RefineNet, and PSPNet employ deep ResNet backbone networks (e.g., ResNet50 and ResNet101) for enhanced multi-scale feature representation, which requires significant computational cost and memory usage. Compared with these accuracy-oriented methods, the method of the present application processes an image with 640x360 resolution in only 12ms while achieving 71.0% mIoU segmentation accuracy.
In summary, the method of this application achieves comprehensive segmentation performance in terms of accuracy and efficiency (inference speed, network parameters, and computational complexity), which gives it greater potential for deployment on resource-limited IoT devices.
Embodiment II:
Referring to Fig. 3, the present case provides a semantic segmentation system based on dual feature fusion for IoT sensing, including a connected multilayer feature fusion module and a lightweight semantic pyramid module;
The multi-layer feature fusion module includes a backbone network unit, a proofreading unit;
The lightweight semantic pyramid module includes the first downscaling unit, the second downscaling unit, the third downscaling unit, the context encoding unit, the global pooling unit, the first channel splicing fusion unit, the second channel splicing fusion unit, and the upsampling unit;
Among them, the backbone network unit is connected to the proofreading unit, the proofreading unit is connected to the first downscaling unit and the second downscaling unit, the first downscaling unit is connected to the context encoding unit and the global pooling unit, the context encoding unit and the global pooling unit are connected to the first channel splicing fusion unit, the second downscaling unit and the first channel splicing fusion unit are connected to the second channel splicing fusion unit, and the second channel splicing fusion unit is also connected to the third downscaling unit, and the upsampling unit is connected to the third downscaling unit;
The mentioned backbone network unit for feature encoding of the original image using the backbone network to obtain features at different scales;
The mentioned proofreading unit for learning features at different scales by two attention refinement blocks to obtain multi-level fusion features;
Both the first dimensionality reduction unit as well as the second dimensionality reduction unit are used to reduce the dimensionality of the multi-level fusion features to output the first dimensionality reduction feature, the second dimensionality reduction feature, and the first dimensionality reduction feature as well as the second dimensionality reduction feature are the same;
Context encoding unit for context encoding the first descending features by deep decomposable convolution at different convolutional scales to obtain local features at different scales, respectively;
A global pooling unit for globally pooling the first reduced-dimensional features by a global mean pooling layer to obtain global features;
A first channel stitching and fusion unit for channel stitching and fusion of global features as
well as local features at different scales to obtain multi-scale contextual fusion features;
A second channel stitching fusion unit for channel stitching fusion of a second reduced dimensional feature, a multi-scale contextual fusion feature to obtain a stitching feature;
Third dimensionality reduction unit for dimensionality reduction of the spliced features;
Up-sampling unit for up-sampling the downsampled spliced features to obtain the final output.
It should be noted that this embodiment provides a semantic segmentation system based on dual feature fusion for IoT perception, which is similar to Embodiment I and will not be described again here.
The above-described embodiments are only a description of the preferred embodiment of the present application, not a limitation of the scope of the present application. Without departing from the spirit of the design of the present application, all kinds of deformations and improvements made to the technical solution of the present application by a person of ordinary skill in the art shall fall within the scope of protection of the present application.

Claims (10)

Claims
1. A semantic segmentation method based on dual feature fusion for IoT sensing, characterized in that it comprises the following steps:
S1. Input the original image, and use the backbone network to encode the features of the original image to obtain the features at different scales;
S2. Learning of features at different scales by two attention refinement blocks to obtain multi- level fusion features;
S3. Perform dimensionality reduction on the multilevel fusion features to obtain the dimensionality reduction features;
S4. Contextual encoding of the downscaled features with depth-decomposable convolutions at different convolutional scales to obtain local features at different scales, respectively;
S5. Global pooling of the downscaled features with the global mean pooling layer to obtain the global features;
S6. Fusing global features as well as local features of different scales for channel splicing to obtain multi-scale context fusion features;
S7. Perform channel splicing fusion of the downscaled features, and multi-scale context fusion features to obtain the splicing features;
S8. Downsampling and upsampling the spliced features to obtain the final output.
2. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 1, characterized in that step S1 is specified as: The original image is feature encoded using the backbone network to obtain the first, second, and third features, where the first feature scale is 1/4 of the original image scale, the second feature scale is 1/8 of the original image scale, and the third feature scale is 1/16 of the original image scale.
3. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 2, characterized in that step S2 includes the following steps:
S2.1. Fusing the first features, second features by a first attention refinement block to output semantic features;
S2.2. The semantic features as well as the third features are fused by the second attention refinement block to obtain the multi-level fused features.
4. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 3, characterized in that step S2.1 specifically comprises:
S2.1.1. The first feature is mapped to the same scale as the second feature through the downsampling layer to obtain the first scale feature;
S2.1.2. Mapping the channel dimension of the first scale feature to coincide with the channel dimension of the second feature by a first 1*1 convolution layer to obtain the first channel feature;
S2.1.3. Fusion of the first scale features with second features for channel stitching to obtain first fusion features;
S2.1.4. Inputting the first fused features into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, and the second attention vector, respectively;
S2.1.5. Non-linear Mapping of the first attention vector and the second attention vector by the first multilayer perception layer to output the first mixed attention vector and the second mixed attention vector, and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector;
S2.1.6. Normalize the first fused mixed attention vector to obtain the first normalized mixed attention vector;
S2.1.7. Mapping a first channel feature with a first normalized mixed attention vector weighted by a first normalized attention vector;
S2.1.8. Fusing the second features as well as the weighted first channel features to output semantic features.
5. A semantic segmentation method based on dual feature fusion for IoT perception according to claim 4, characterized in that step S2.2 specifically comprises the following steps:
S2.2.1. The third feature is mapped to the same scale as the second feature through the upsampling layer to obtain the second scale feature;
S2.2.2. Mapping the channel dimension of the second scale feature by a second 1*1 convolution layer to coincide with the second feature channel dimension to obtain the second channel feature;
S2.2.3. Fusing the second scale features with the semantic features for channel splicing to obtain the second fusion features;
S2.2.4. Inputting the second fusion features into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector and the fourth attention vector, respectively;
S2.2.5. Non-linear Mapping of the third attention vector and the fourth attention vector by the
second multilayer perception layer to output the third mixed attention vector and the fourth mixed attention vector, and fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector;
S2.2.6. Normalize the second fused mixed attention vector to obtain the second normalized mixed attention vector;
S2.2.7. Mapping the second channel features with a second normalized mixed attention vector weighted by the second normalized attention vector;
S2.2.8. Fusing the semantic features as well as the weighted second channel features to obtain multi-level fusion features.
6. A semantic segmentation method based on dual feature fusion for IoT perception according to claim 5, is characterized in that the first fused features described in step S2.1.4 are input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer, respectively, to output the first attention vector, the second attention vector, respectively, using the following equations: V1 = AAP1(C[F1, F2]), V2 = AMP1(C[F1, F2]), Where, V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature, F2 is the second feature, C[] denotes channel stitching fusion, AAP1 () denotes the first adaptive mean pooling layer, and AMP1 () denotes the first adaptive maximum pooling layer.
7. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 6, is characterized in that the second fused features described in step S2.2.4 are input into the second adaptive mean pooling layer and the second adaptive maximum pooling layer, respectively, to output the third attention vector, the fourth attention vector, respectively, by using the following equations: V3 = AAP2(C[L1, L2]), V4 = AMP2(C[L1, L2]), Where, V3 is a third attention vector, V4 is a fourth attention vector, L1 is a second scale feature, L2 is a semantic feature, AAP2 () denotes a second adaptive mean pooling layer, and AMP2 () denotes a second adaptive maximum pooling layer.
8. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 7, characterized in that the non-linear mapping of the first attention vector and the second attention vector by the first multilayer perception layer described in step S2.1.5, outputting the first mixed attention vector and the second mixed attention vector, and the channel splicing and fusion of the first mixed attention vector and the second mixed attention vector to output the first fused mixed attention vector, use the following equation:
VA1 = MLP1(C[V1, V2]);
and the non-linear mapping of the third attention vector and the fourth attention vector by the second multilayer perception layer described in step S2.2.5, outputting the third mixed attention vector and the fourth mixed attention vector, and the fusion of the third mixed attention vector and the fourth mixed attention vector to output the second fused mixed attention vector, use the following equation:
VA2 = MLP2(C[V3, V4]);
where VA1 is the first fused mixed attention vector, VA2 is the second fused mixed attention vector, MLP1( ) is the first multilayer perception layer, and MLP2( ) is the second multilayer perception layer.
9. A semantic segmentation method based on dual feature fusion for IoT sensing according to claim 8, characterized in that steps S2.1.6, normalizing the first fused mixed attention vector to obtain the first normalized mixed attention vector, S2.1.7, weighting the first channel features with the first normalized mixed attention vector, and S2.1.8, fusing the second features as well as the weighted first channel features to output the semantic features, use the following equation:
L = Sig1(VA1) ⊙ F̂1 ⊕ F2;
and steps S2.2.6, normalizing the second fused mixed attention vector to obtain the second normalized mixed attention vector, S2.2.7, weighting the second channel features with the second normalized mixed attention vector, and S2.2.8, fusing the semantic features and the weighted second channel features to obtain the multi-level fusion features, use the following equation:
L̃ = Sig2(VA2) ⊙ L̂1 ⊕ L;
where L is the semantic feature, L̃ is the multi-level fusion feature, Sig1( ) denotes the first activation function, Sig2( ) denotes the second activation function, F̂1 is the first channel feature, L̂1 is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map, ⊙ denotes a pixel-level dot product operation, and ⊕ denotes a pixel-level addition operation.
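Read together, claims 6 to 9 define a two-stage attention computation. The block below only restates those equations in one place, in LaTeX form and with the same symbols as the claims; it introduces no additional content.

% Consolidated restatement of the equations in claims 6-9
\begin{aligned}
V_1 &= \mathrm{AAP}_1\big(C[F_1, F_2]\big), & V_2 &= \mathrm{AMP}_1\big(C[F_1, F_2]\big),\\
V_3 &= \mathrm{AAP}_2\big(C[L_1, L_2]\big), & V_4 &= \mathrm{AMP}_2\big(C[L_1, L_2]\big),\\
V_{A1} &= \mathrm{MLP}_1\big(C[V_1, V_2]\big), & V_{A2} &= \mathrm{MLP}_2\big(C[V_3, V_4]\big),\\
L &= \mathrm{Sig}_1(V_{A1}) \odot \hat{F}_1 \oplus F_2, & \tilde{L} &= \mathrm{Sig}_2(V_{A2}) \odot \hat{L}_1 \oplus L.
\end{aligned}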
10. A semantic segmentation system based on dual feature fusion for IoT sensing, characterized in that it includes a connected multi-layer feature fusion module and a lightweight semantic pyramid module;
The multi-layer feature fusion module includes a backbone network unit and a proofreading unit;
The lightweight semantic pyramid module includes a first dimensionality reduction unit, a second dimensionality reduction unit, a third dimensionality reduction unit, a context encoding unit, a global pooling unit, a first channel splicing fusion unit, a second channel splicing fusion unit, and an upsampling unit;
Among them, the backbone network unit is connected to the proofreading unit; the proofreading unit is connected to the first dimensionality reduction unit and the second dimensionality reduction unit; the first dimensionality reduction unit is connected to the context encoding unit and the global pooling unit; the context encoding unit and the global pooling unit are connected to the first channel splicing fusion unit; the second dimensionality reduction unit and the first channel splicing fusion unit are connected to the second channel splicing fusion unit; the second channel splicing fusion unit is also connected to the third dimensionality reduction unit; and the upsampling unit is connected to the third dimensionality reduction unit;
The backbone network unit is used for feature encoding of the original image by the backbone network to obtain features at different scales;
The proofreading unit is used for learning the features at different scales by two attention refinement blocks to obtain the multi-level fusion features;
Both the first dimensionality reduction unit and the second dimensionality reduction unit are used to reduce the dimensionality of the multi-level fusion features to output the first and the second reduced-dimensional features, respectively; the first and the second reduced-dimensional features are the same;
The context encoding unit is used for context encoding of the first reduced-dimensional features by depthwise separable convolutions at different convolutional scales to obtain local features at different scales, respectively;
The global pooling unit is used for globally pooling the first reduced-dimensional features by a global mean pooling layer to obtain global features;
The first channel splicing fusion unit is used for channel splicing and fusion of the global features as well as the local features at different scales to obtain multi-scale contextual fusion features;
The second channel splicing fusion unit is used for channel splicing and fusion of the second reduced-dimensional features and the multi-scale contextual fusion features to obtain spliced features;
The third dimensionality reduction unit is used for dimensionality reduction of the spliced features;
The upsampling unit is used for upsampling the dimensionality-reduced spliced features to obtain the final output.
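As a rough illustration of the lightweight semantic pyramid module of claim 10, the following PyTorch sketch follows the unit connections described above. The class and function names, the choice of three context branches, the dilation rates, the channel widths, and the number of output classes are all assumptions made for readability, not values given in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_separable(ch: int, kernel: int, dilation: int = 1) -> nn.Sequential:
    """Depthwise separable convolution as assumed for the context encoding unit."""
    pad = dilation * (kernel - 1) // 2
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel, padding=pad, dilation=dilation, groups=ch),  # depthwise
        nn.Conv2d(ch, ch, kernel_size=1),                                      # pointwise
    )


class LightweightSemanticPyramid(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int = 128, num_classes: int = 19):
        super().__init__()
        # first / second dimensionality reduction units (identical reductions of the fused features)
        self.reduce1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.reduce2 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        # context encoding unit: depthwise separable convolutions at different scales
        self.contexts = nn.ModuleList(
            [depthwise_separable(mid_ch, 3, dilation=d) for d in (1, 2, 4)]
        )
        # third dimensionality reduction unit applied to the spliced features
        spliced_ch = mid_ch * (len(self.contexts) + 2)   # local branches + global + second reduction
        self.reduce3 = nn.Conv2d(spliced_ch, num_classes, kernel_size=1)

    def forward(self, fused: torch.Tensor, out_size) -> torch.Tensor:
        r1 = self.reduce1(fused)                          # first reduced-dimensional features
        r2 = self.reduce2(fused)                          # second reduced-dimensional features
        locals_ = [ctx(r1) for ctx in self.contexts]      # local features at different scales
        g = F.adaptive_avg_pool2d(r1, 1)                  # global pooling unit
        g = F.interpolate(g, size=r1.shape[2:], mode="bilinear", align_corners=False)
        multi_scale = torch.cat(locals_ + [g], dim=1)     # first channel splicing fusion unit
        spliced = torch.cat([r2, multi_scale], dim=1)     # second channel splicing fusion unit
        logits = self.reduce3(spliced)                    # third dimensionality reduction unit
        return F.interpolate(logits, size=out_size,       # upsampling unit -> final output
                             mode="bilinear", align_corners=False)


# e.g., with the multi-level fusion features produced by the proofreading unit:
# output = LightweightSemanticPyramid(in_ch=64)(multi_level, out_size=(512, 512))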
LU503090A 2021-04-25 2022-03-17 A semantic segmentation system and method based on dual feature fusion for iot sensing LU503090B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110446945.6A CN113221969A (en) 2021-04-25 2021-04-25 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion

Publications (1)

Publication Number Publication Date
LU503090B1 true LU503090B1 (en) 2023-03-22

Family

ID=77088741

Family Applications (1)

Application Number Title Priority Date Filing Date
LU503090A LU503090B1 (en) 2021-04-25 2022-03-17 A semantic segmentation system and method based on dual feature fusion for iot sensing

Country Status (4)

Country Link
CN (1) CN113221969A (en)
LU (1) LU503090B1 (en)
WO (1) WO2022227913A1 (en)
ZA (1) ZA202207731B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221969A (en) * 2021-04-25 2021-08-06 浙江师范大学 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
CN114913325B (en) * 2022-03-24 2024-05-10 北京百度网讯科技有限公司 Semantic segmentation method, semantic segmentation device and computer program product
CN114445430B (en) * 2022-04-08 2022-06-21 暨南大学 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN116229065B (en) * 2023-02-14 2023-12-01 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN116342884B (en) * 2023-03-28 2024-02-06 阿里云计算有限公司 Image segmentation and model training method and server
CN116052007B (en) * 2023-03-30 2023-08-11 山东锋士信息技术有限公司 Remote sensing image change detection method integrating time and space information
CN116205928B (en) * 2023-05-06 2023-07-18 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium
CN116580241B (en) * 2023-05-22 2024-05-14 内蒙古农业大学 Image processing method and system based on double-branch multi-scale semantic segmentation network
CN116630386B (en) * 2023-06-12 2024-02-20 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN116721351B (en) * 2023-07-06 2024-06-18 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 Remote sensing intelligent extraction method for road environment characteristics in overhead line channel
CN116559778B (en) * 2023-07-11 2023-09-29 海纳科德(湖北)科技有限公司 Vehicle whistle positioning method and system based on deep learning
CN116612124B (en) * 2023-07-21 2023-10-20 国网四川省电力公司电力科学研究院 Transmission line defect detection method based on double-branch serial mixed attention
CN116721420B (en) * 2023-08-10 2023-10-20 南昌工程学院 Semantic segmentation model construction method and system for ultraviolet image of electrical equipment
CN116740866B (en) * 2023-08-11 2023-10-27 上海银行股份有限公司 Banknote loading and clearing system and method for self-service machine
CN117115443B (en) * 2023-08-18 2024-06-11 中南大学 Segmentation method for identifying infrared small targets
CN117636165A (en) * 2023-11-30 2024-03-01 电子科技大学 Multi-task remote sensing semantic change detection method based on token mixing
CN117809294B (en) * 2023-12-29 2024-07-19 天津大学 Text detection method based on feature correction and difference guiding attention
CN117876929B (en) * 2024-01-12 2024-06-21 天津大学 Sequential target positioning method for progressive multi-scale context learning
CN117593633B (en) * 2024-01-19 2024-06-14 宁波海上鲜信息技术股份有限公司 Ocean scene-oriented image recognition method, system, equipment and storage medium
CN117745745B (en) * 2024-02-18 2024-05-10 湖南大学 CT image segmentation method based on context fusion perception
CN118037664B (en) * 2024-02-20 2024-10-01 成都天兴山田车用部品有限公司 Deep hole surface defect detection and CV size calculation method
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision
CN117828280B (en) * 2024-03-05 2024-06-07 山东新科建工消防工程有限公司 Intelligent fire information acquisition and management method based on Internet of things
CN118052739A (en) * 2024-03-08 2024-05-17 东莞理工学院 Deep learning-based traffic image defogging method and intelligent traffic image processing system
CN117993442B (en) * 2024-03-21 2024-10-18 济南大学 Hybrid neural network method and system for fusing local and global information
CN118072357B (en) * 2024-04-16 2024-07-02 南昌理工学院 Control method and system of intelligent massage robot
CN118429808A (en) * 2024-05-10 2024-08-02 北京信息科技大学 Remote sensing image road extraction method and system based on lightweight network structure
CN118230175B (en) * 2024-05-23 2024-08-13 济南市勘察测绘研究院 Real estate mapping data processing method and system based on artificial intelligence
CN118366000A (en) * 2024-06-14 2024-07-19 陕西天润科技股份有限公司 Cultural relic health management method based on digital twinning
CN118397298B (en) * 2024-06-28 2024-09-06 杭州安脉盛智能技术有限公司 Self-attention space pyramid pooling method based on mixed pooling and related components
CN118429335B (en) * 2024-07-02 2024-09-24 新疆胜新复合材料有限公司 Online defect detection system and method for carbon fiber sucker rod
CN118470679B (en) * 2024-07-10 2024-09-24 山东省计算中心(国家超级计算济南中心) Lightweight lane line segmentation recognition method and system
CN118485835B (en) * 2024-07-16 2024-10-01 杭州电子科技大学 Multispectral image semantic segmentation method based on modal divergence difference fusion

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150104102A1 (en) * 2013-10-11 2015-04-16 Universidade De Coimbra Semantic segmentation method with second-order pooling
CN111210432B (en) * 2020-01-12 2023-07-25 湘潭大学 Image semantic segmentation method based on multi-scale multi-level attention mechanism
CN111915619A (en) * 2020-06-05 2020-11-10 华南理工大学 Full convolution network semantic segmentation method for dual-feature extraction and fusion
CN111932553B (en) * 2020-07-27 2022-09-06 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN112651973B (en) * 2020-12-14 2022-10-28 南京理工大学 Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN113221969A (en) * 2021-04-25 2021-08-06 浙江师范大学 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion

Also Published As

Publication number Publication date
WO2022227913A1 (en) 2022-11-03
ZA202207731B (en) 2022-07-27
CN113221969A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
LU503090B1 (en) A semantic segmentation system and method based on dual feature fusion for iot sensing
WO2021233342A1 (en) Neural network construction method and system
CN107480206A (en) A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN109522945A (en) One kind of groups emotion identification method, device, smart machine and storage medium
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN114463545A (en) Image semantic segmentation algorithm and system based on multi-channel depth weighted aggregation
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN111401151B (en) Accurate three-dimensional hand posture estimation method
CN116244473B (en) Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN113516133A (en) Multi-modal image classification method and system
CN111985597A (en) Model compression method and device
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN117292704A (en) Voice-driven gesture action generation method and device based on diffusion model
CN112052945B (en) Neural network training method, neural network training device and electronic equipment
CN111709415A (en) Target detection method, target detection device, computer equipment and storage medium
Lv et al. An inverted residual based lightweight network for object detection in sweeping robots
CN113657272B (en) Micro video classification method and system based on missing data completion
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN113836319A (en) Knowledge completion method and system for fusing entity neighbors
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN118038032A (en) Point cloud semantic segmentation model based on super point embedding and clustering and training method thereof
CN115375922B (en) Light-weight significance detection method based on multi-scale spatial attention
WO2023071658A1 (en) Ai model processing method and apparatus, and ai model computing method and apparatus

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20230322