CN116797789A - A scene semantic segmentation method based on attention architecture - Google Patents


Info

Publication number
CN116797789A
CN116797789A
Authority
CN
China
Prior art keywords
training
attention
model
semantic segmentation
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310698684.6A
Other languages
Chinese (zh)
Inventor
黄丹丹
王贵贤
王英志
陈广秋
许鹤
白昱
薛泓垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202310698684.6A priority Critical patent/CN116797789A/en
Publication of CN116797789A publication Critical patent/CN116797789A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision, and in particular relates to a scene semantic segmentation method based on an attention architecture, comprising the following steps. Step one: data preprocessing, which prepares the data for subsequent network model training. Step two: model training, in which the constructed network model is trained; throughout training, a hybrid loss supervises the learning of the network parameters, and the loss is continuously reduced to optimize them, yielding the optimal network weights for the attention-based scene semantic segmentation method. Step three: model testing, in which new image data acquired by an external sensor is fed through the network using the trained weights to evaluate the segmentation results. To enhance the feature representation of pixels, the invention uses a dual-attention module to model context information in the spatial and channel dimensions respectively, improving the feature representation capability of the whole model.

Description

A scene semantic segmentation method based on an attention architecture

Technical Field

The present invention relates to the technical field of computer vision, and specifically to a scene semantic segmentation method based on an attention architecture.

Background

In an era of rapid development of the artificial intelligence industry, autonomous driving technology is becoming ever more a part of daily life. In autonomous driving, it is essential that computers help the vehicle understand the scene it is in: only if the system can perceive the objects and people in the surrounding environment can it make correct, safe decisions, and misjudging people or objects in the environment may lead to very serious consequences.

Autonomous driving based on traditional algorithms first collects data about the surroundings through various sensors, then analyzes the data with hand-crafted algorithms, and finally makes control decisions for the vehicle. Such pipelines suffer from low efficiency, inability to run end to end, and low accuracy. In recent years, with the development of neural networks and the growth of computing power, autonomous driving based on deep learning has advanced rapidly: a camera captures the surrounding environment, and a deep learning model performs feature extraction, image segmentation, and vehicle decision-making end to end, improving both processing speed and accuracy. Compared with expensive lidar sensors, images from inexpensive cameras can significantly reduce cost and further promote the deployment of autonomous driving. To guarantee driving safety, autonomous driving places high accuracy requirements on the perception of the surroundings.

The purpose of image semantic segmentation is to classify each pixel according to its semantic category; compared with traditional segmentation, semantic segmentation is classification at the pixel level. A semantic segmentation result contains not only the location of each semantic category but also detailed boundary and pose information, so such fine-grained output makes the judgment of the vehicle's drivable area, and of object categories and shapes, more accurate. Since the main scenes in autonomous driving today are urban scenes, semantic segmentation of urban scenes is an important field.

Current mainstream semantic segmentation frameworks have largely evolved from fully convolutional neural networks. However, when image semantic segmentation algorithms are used in autonomous driving systems, several problems remain:

(1) Object sizes in autonomous driving scenes vary widely. Existing algorithms segment targets of different sizes with different accuracy and handle small objects poorly.

(2) Autonomous driving scenes are complex, with large variations in illumination and heavy mutual occlusion; targets are hard to recognize and their edges are blurred, and many current algorithms are unsuited to delineating target edges.

This work therefore aims to solve the above problems with advanced attention mechanisms, improving the accuracy of semantic segmentation and providing a new solution for autonomous driving.

Summary of the Invention

(1) Technical Problems Solved

In view of the shortcomings of the existing technology, the present invention provides a scene semantic segmentation method based on an attention architecture, which solves the problems raised in the background section above.

(2) Technical Solution

To achieve the above objective, the present invention adopts the following technical solution:

A scene semantic segmentation method based on an attention architecture, comprising the following steps:

Step 1: Data preprocessing, which prepares the data for subsequent network model training.

Step 2: Model training. The constructed network model is trained; throughout training, a hybrid loss supervises the learning of the network parameters, and the loss is continuously reduced to optimize them, yielding the optimal network weights for the attention-based scene semantic segmentation method.

Step 3: Model testing. New image data collected by an external sensor is fed through the network using the weights obtained in training to evaluate the semantic segmentation results.

Further, the data preprocessing in Step 1 comprises:

Randomly and arbitrarily cropping the original input data for data augmentation, then placing the crops in a newly generated folder that contains only the cropped sample images used for training; the final crop size is 768×768.

Further, the model training in Step 2 comprises the following steps:

The prepared sample images are fed into the network model for training. The network model comprises three parts: a residual network (ResNet) with a dilation strategy, a lightweight symmetric dual-attention module combining channel attention and spatial attention, and an adaptive selection interaction module that fuses low-level features with high-level features.

Further, the model testing in Step 3 comprises:

Using the trained weight parameters to test the segmentation results on newly acquired sensor images.

(3) Beneficial Effects

Compared with the existing technology, the present invention provides a scene semantic segmentation method based on an attention architecture with the following beneficial effects:

To enhance the feature representation of pixels, the invention uses a dual-attention module to model context information in the spatial and channel dimensions respectively, improving the feature representation capability of the whole model.

Although high-level feature maps have low resolution, they always contain rich semantic information, so they can guide the generation of low-level feature maps with more semantic content; conversely, low-level feature maps carry more spatial information than high-level ones and can provide spatial guidance for them. Effective feature fusion can therefore further improve the segmentation results.

Brief Description of the Drawings

Figure 1 is the overall network structure diagram of the present invention;

Figure 2 is a structural diagram of the feature fusion module of the present invention;

Figure 3 is a structural diagram of the lightweight dual-attention module of the present invention;

Figure 4 shows the segmentation results of each module of the present invention on the data set;

Figure 5 is an evaluation chart of the present invention on the Cityscapes test data set.

Detailed Description of Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

Embodiment

As shown in Figures 1-5, an embodiment of the present invention proposes a scene semantic segmentation method based on an attention architecture, comprising the following steps:

Step 1: Data preprocessing, which prepares the data for subsequent network model training.

The data preprocessing of the present invention comprises the following steps:

The original input data is randomly and arbitrarily cropped for data augmentation and placed in a newly generated folder that contains only the cropped sample images used for training; the final crop size is 768×768.
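A minimal sketch of this random-crop augmentation (in NumPy; the input array shapes and the uniform sampling of the crop position are assumptions, since the patent only fixes the final 768×768 size):

```python
import numpy as np

def random_crop(image, label, size=768):
    """Randomly crop an image/label pair to size x size.
    Sketch of the preprocessing step; assumes the input is at least
    size x size and that the crop origin is sampled uniformly."""
    h, w = image.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return (image[top:top + size, left:left + size],
            label[top:top + size, left:left + size])

# Cityscapes frames are 1024x2048; crop to 768x768 for training.
img = np.zeros((1024, 2048, 3), dtype=np.uint8)
lbl = np.zeros((1024, 2048), dtype=np.uint8)
crop_img, crop_lbl = random_crop(img, lbl)
```

In practice each crop would be written to the regenerated training folder described above.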

Step 2: Model training. The constructed network model is trained; throughout training, a hybrid loss supervises the learning of the network parameters, and the loss is continuously reduced to optimize them, yielding the optimal network weights for the attention-based scene semantic segmentation method.
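The patent does not specify the composition of the hybrid loss. As an illustration only, the sketch below assumes a common mixture of pixel-wise cross-entropy and Dice loss with an assumed weight `alpha`; the actual loss used by the invention may differ:

```python
import numpy as np

def hybrid_loss(probs, target, eps=1e-6, alpha=0.5):
    """Hedged sketch of a mixed segmentation loss: cross-entropy plus Dice.
    The CE+Dice combination and the weight alpha are assumptions made for
    illustration, not taken from the patent.
    probs:  (N, C, H, W) softmax probabilities
    target: (N, H, W) integer class labels
    """
    n, c, h, w = probs.shape
    onehot = np.eye(c)[target].transpose(0, 3, 1, 2)   # (N, C, H, W)
    ce = -np.mean(np.sum(onehot * np.log(probs + eps), axis=1))
    inter = np.sum(probs * onehot, axis=(0, 2, 3))     # per-class overlap
    union = np.sum(probs + onehot, axis=(0, 2, 3))
    dice = 1.0 - np.mean((2 * inter + eps) / (union + eps))
    return alpha * ce + (1 - alpha) * dice

probs = np.full((1, 2, 4, 4), 0.5)       # uniform two-class prediction
target = np.zeros((1, 4, 4), dtype=int)  # all pixels belong to class 0
loss = hybrid_loss(probs, target)
```

During training this scalar would be minimized with a standard optimizer to update the network weights.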

The model training in Step 2 of the present invention comprises the following steps:

The prepared sample images are fed into the network model for training. The network model comprises three parts: a residual network (ResNet) with a dilation strategy, a lightweight symmetric dual-attention module combining channel attention and spatial attention, and an adaptive selection interaction module that fuses low-level features with high-level features.

The first part is the residual network ResNet used for feature extraction. On top of the original residual network, the invention applies a dilation strategy, removing the downsampling operations of the last two stages so that more detail is retained for segmentation; the final feature extraction network outputs feature maps at 1/8 of the original image resolution. Feature extraction on the input image ultimately yields four levels of features, Res-1 to Res-4.
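The resolution arithmetic behind the dilation strategy can be sketched as follows. The 1/32 output stride of a standard ResNet and its stage-stride layout are general facts about ResNet rather than details taken from the patent; removing the last two stride-2 downsamples (replacing them with dilated, stride-1 convolutions) stops the shrinkage at 1/8:

```python
def output_size(input_size, stage_strides):
    """Spatial size of the output feature map after the given per-stage
    strides (stem conv, max pool, then residual stages Res-1..Res-4)."""
    size = input_size
    for s in stage_strides:
        size //= s
    return size

standard = [2, 2, 1, 2, 2, 2]   # plain ResNet: 1/32 output stride
dilated  = [2, 2, 1, 2, 1, 1]   # last two downsamples removed: 1/8

full = output_size(768, standard)   # 768 / 32 = 24
ours = output_size(768, dilated)    # 768 / 8  = 96
```

For the 768×768 training crops this gives 96×96 feature maps, preserving the detail needed for sharp segmentation boundaries.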

The second part is a lightweight symmetric dual-attention structure for enhancing high-level feature representations. The output of the highest stage of the ResNet backbone is passed to a hybrid attention mechanism over the spatial and channel dimensions, which fully exploits the correspondences among high-level features and appropriately integrates spatial and semantic information. The model replaces self-attention, which computes correlations between pixels at all positions, with average pooling and max pooling, avoiding the high computational cost and GPU memory footprint of self-attention and reducing the model's dependence on hardware.

The details are as follows:

The spatial pooling attention model first obtains two new feature maps from the input X ∈ R^(C×H×W) through parallel adaptive average pooling and adaptive max pooling along the channel dimension:

X_avg^s ∈ R^(1×H×W) and X_max^s ∈ R^(1×H×W).

The two are then concatenated along the channel dimension to obtain the fused feature F = [X_avg^s; X_max^s] ∈ R^(2×H×W).

Next, a 1×1 convolution reduces the feature channels to 1, and a sigmoid activation yields the spatial attention weight map A_s ∈ R^(1×H×W).

Finally, the discriminative features at each position are weighted by the spatial attention map, and the weighted result is added to X to obtain the final output E_spatial; the operation is defined as

E_spatial = X + X ⊗ A_s = X + X ⊗ σ(f^(1×1)([X_avg^s; X_max^s])),

where f^(1×1) denotes the 1×1 convolution, σ the sigmoid function, and ⊗ element-wise multiplication.
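The spatial branch can be sketched in NumPy as follows, assuming pooling along the channel axis and a placeholder 1×1-convolution weight `w` (the learned parameter values and bias handling are simplifications, not the patent's trained weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_pool_attention(x, w):
    """Sketch of the spatial pooling attention branch.
    x: (C, H, W) feature map; w: (1, 2) placeholder weights of the 1x1
    convolution that reduces the 2-channel pooled stack to one channel."""
    avg = x.mean(axis=0, keepdims=True)            # (1, H, W) channel-avg pool
    mx = x.max(axis=0, keepdims=True)              # (1, H, W) channel-max pool
    f = np.concatenate([avg, mx], axis=0)          # (2, H, W) fused feature
    a = sigmoid(np.tensordot(w, f, axes=([1], [0])))  # (1, H, W) weight map
    return x + x * a                               # E_spatial

x = np.random.rand(8, 4, 4)
w = np.random.rand(1, 2)
e_spatial = spatial_pool_attention(x, w)
```

Because only two pooled maps and one 1×1 convolution are involved, the cost is linear in H×W rather than quadratic as in full self-attention.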

The channel pooling attention model, for its part, extracts discriminative channel information at low computational complexity. The input feature map X ∈ R^(C×H×W) likewise passes through adaptive average pooling and adaptive max pooling over the spatial dimensions to generate two new feature vectors X_avg^c ∈ R^(C×1×1) and X_max^c ∈ R^(C×1×1).

Then, to reduce computation, a 1×1 convolution → ReLU → 1×1 convolution bottleneck first reduces and then restores the channel dimension, producing two new feature vectors Y_avg ∈ R^(C×1×1) and Y_max ∈ R^(C×1×1).

After summing the two channel descriptors, a sigmoid activation yields the channel attention weight map A_c = σ(Y_avg + Y_max) ∈ R^(C×1×1). Finally, the discriminative features of each channel of X are weighted by the channel attention map, and the result is added to X to obtain the final output E_channel:

E_channel = X + X ⊗ A_c.
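A corresponding sketch of the channel branch, with placeholder bottleneck weights `w_down` and `w_up` standing in for the 1×1 conv → ReLU → 1×1 conv (the reduction ratio and the weight values are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_pool_attention(x, w_down, w_up):
    """Sketch of the channel pooling attention branch.
    x: (C, H, W); w_down: (C//r, C) and w_up: (C, C//r) stand in for the
    1x1 conv -> ReLU -> 1x1 conv bottleneck with reduction ratio r."""
    avg = x.mean(axis=(1, 2))                      # (C,) spatial-avg pool
    mx = x.max(axis=(1, 2))                        # (C,) spatial-max pool
    def bottleneck(v):
        return w_up @ np.maximum(w_down @ v, 0)    # reduce, ReLU, restore
    a = sigmoid(bottleneck(avg) + bottleneck(mx))  # (C,) channel weights
    return x + x * a[:, None, None]                # E_channel

x = np.random.rand(8, 4, 4)
w_down = np.random.rand(2, 8)   # reduction ratio r = 4 assumed
w_up = np.random.rand(8, 2)
e_channel = channel_pool_attention(x, w_down, w_up)
```

Sharing the bottleneck across the two pooled descriptors keeps the parameter count small, which is what makes the module lightweight.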

Finally, the feature map produced by the channel attention path is connected to the spatial attention path. Since channel attention focuses on what an object is while spatial attention focuses on where it is, the dual-attention module combines the two into a hybrid attention mechanism that effectively uses high-level semantic information to enhance the network's pixel representations.

The third part is the adaptive selection interaction structure, which aims to combine deep semantic information with shallow contour information more effectively, generating more "semantic" low-level feature maps and more accurate high-level feature maps. It contains a channel attention module that effectively bridges the gap between high-level feature maps rich in semantic information and low-level feature maps with precise spatial information.

This module learns the importance of each feature channel to compensate for the differences between high- and low-level features, guiding their fusion rather than simply summing them. First, upsampling brings the high-level and low-level feature maps to the same size. Global average pooling provides the largest acceptable receptive field for global context: it compresses each feature map into a feature vector that is fed to a fully connected layer to learn the weight of each feature channel. The fused feature map is normalized to (0, 1) with a sigmoid function to obtain the channel attention weight map. Multiplying the original feature maps by these channel attention weights adaptively selects important information, so the low-level feature map acquires more high-level semantic information, yielding an enhanced low-level semantic feature map. At the same time, under the influence of the channel attention weights, the geometric information of the low-level feature map guides the high-level features and gradually restores edge details. Finally, the enhanced high-level and low-level features are combined by pixel-wise summation.
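The fusion steps above can be sketched as follows; nearest-neighbour upsampling, a single fully connected layer `w_fc`, and the exact placement of the sigmoid are simplifying assumptions rather than details fixed by the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(low, high, w_fc):
    """Sketch of the adaptive selection interaction module.
    low:  (C, H, W) low-level features
    high: (C, H/2, W/2) high-level features
    w_fc: (C, C) placeholder weights of the fully connected layer."""
    # 1. Upsample high-level features to the low-level resolution.
    up = high.repeat(2, axis=1).repeat(2, axis=2)    # (C, H, W)
    # 2. Fuse, squeeze with global average pooling, learn channel weights.
    fused = low + up
    gap = fused.mean(axis=(1, 2))                    # (C,) global context
    a = sigmoid(w_fc @ gap)                          # (C,) channel weights
    # 3. Reweight both streams and combine by pixel-wise summation.
    low_enh = low * a[:, None, None]   # low level gains semantic guidance
    high_enh = up * a[:, None, None]   # high level gains spatial guidance
    return low_enh + high_enh

low = np.random.rand(8, 4, 4)
high = np.random.rand(8, 2, 2)
out = adaptive_fusion(low, high, np.random.rand(8, 8))
```

The channel weights are learned from the fused map itself, which is what lets the module select important information adaptively instead of simply adding the two streams.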

Step 3: Model testing. New image data collected by an external sensor is fed through the network using the weights obtained in training to evaluate the semantic segmentation results.

Testing of the model of the present invention comprises:

Using the trained weight parameters to test the segmentation results on new image data.

Because autonomous driving scenes are diverse, object sizes vary widely across scenes, occlusion is heavy, and targets are hard to identify, so segmentation performance depends strongly on scene complexity. Moreover, the series of convolution and pooling operations in existing semantic segmentation algorithms causes "resolution loss", leading to insufficient segmentation accuracy for small objects and blurred segmentation edges. First, the present invention uses a dual-attention module with spatial and channel attention to model context information in the spatial and channel dimensions respectively, improving the overall feature representation of the model. Second, exploiting the fact that high-level features are semantically rich but lack geometric detail while low-level features contain precise contours but lack semantics, an adaptive selection interaction module more fully guides the fusion of high-level and low-level features. Comparative experiments show that the invention not only extracts long-range contextual information but also promotes the combination of semantic and contour information more effectively, ultimately striking a balance between segmentation accuracy and model complexity.

Cityscapes, a data set commonly used in the field of autonomous driving, is used to train the network model, and the training effect of the method is evaluated with the Cityscapes evaluation tools. The data in Table 1 show that the weights trained with the scene semantic segmentation algorithm proposed by the present invention perform better on the test data than those trained with other algorithms.

Figure 5 shows the evaluation on the Cityscapes test data set.

To convey the effectiveness of the present invention relative to existing algorithms more intuitively, the segmentation results of the lightweight dual-attention structure and the adaptive selection interaction structure are further visualized, as shown in Figure 4. Because the dual-attention structure models contextual information more effectively and strengthens the correlation between pixels, adding it to the original network avoids some intra-class classification errors. When the adaptive selection interaction structure is further added, low-level and high-level feature representations are enhanced by suppressing uncorrelated channels, achieving better fusion. The fourth row of Figure 4 shows that after the adaptive selection interaction structure fuses the low-level features, lost details are recovered and the edges between different categories are segmented more finely.

Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of the technical features with equivalents. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (4)

1. A scene semantic segmentation method based on an attention architecture, characterized in that the method comprises the following steps:

Step 1: Data preprocessing, which prepares the data for subsequent network model training;

Step 2: Model training, in which the constructed network model is trained; throughout training, a hybrid loss supervises the learning of the network parameters, and the loss is continuously reduced to optimize them, yielding the optimal network weights for the attention-based scene semantic segmentation method;

Step 3: Model testing, in which new image data collected by an external sensor is fed through the network using the weights obtained in training to evaluate the semantic segmentation results.

2. The scene semantic segmentation method based on an attention architecture according to claim 1, characterized in that the data preprocessing in Step 1 comprises: randomly and arbitrarily cropping the original input data for data augmentation, then placing the crops in a newly generated folder that contains only the cropped sample images used for training, the final crop size being 768×768.

3. The scene semantic segmentation method based on an attention architecture according to claim 1, characterized in that the model training in Step 2 comprises: feeding the prepared sample images into the network model for training, the network model comprising three parts: a residual network (ResNet) with a dilation strategy, a lightweight symmetric dual-attention module combining channel attention and spatial attention, and an adaptive selection interaction module that fuses low-level features with high-level features.

4. The scene semantic segmentation method based on an attention architecture according to claim 1, characterized in that the model testing in Step 3 comprises: using the trained weight parameters to test the segmentation results on newly acquired sensor images.
CN202310698684.6A 2023-06-13 2023-06-13 A scene semantic segmentation method based on attention architecture Pending CN116797789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310698684.6A CN116797789A (en) 2023-06-13 2023-06-13 A scene semantic segmentation method based on attention architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310698684.6A CN116797789A (en) 2023-06-13 2023-06-13 A scene semantic segmentation method based on attention architecture

Publications (1)

Publication Number Publication Date
CN116797789A true CN116797789A (en) 2023-09-22

Family

ID=88035508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310698684.6A Pending CN116797789A (en) 2023-06-13 2023-06-13 A scene semantic segmentation method based on attention architecture

Country Status (1)

Country Link
CN (1) CN116797789A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117951648A (en) * 2024-03-26 2024-04-30 成都正扬博创电子技术有限公司 Airborne multisource information fusion method and system
CN117951648B (en) * 2024-03-26 2024-06-07 成都正扬博创电子技术有限公司 Airborne multisource information fusion method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination