CN116740364A - Image semantic segmentation method based on reference mechanism - Google Patents

Image semantic segmentation method based on reference mechanism

Info

Publication number
CN116740364A
Authority
CN
China
Prior art keywords
module
image
features
feature extraction
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311029652.3A
Other languages
Chinese (zh)
Other versions
CN116740364B (en)
Inventor
李念峰
申向峰
李昕原
刘钱
孙立岩
丁天娇
王春湘
关彤
柴滕飞
肖治国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University
Original Assignee
Changchun University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University filed Critical Changchun University
Priority to CN202311029652.3A priority Critical patent/CN116740364B/en
Publication of CN116740364A publication Critical patent/CN116740364A/en
Application granted granted Critical
Publication of CN116740364B publication Critical patent/CN116740364B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an image semantic segmentation method based on a reference mechanism and belongs to the field of computer vision. The method combines a global feature extraction module, a spatial feature extraction module and a reference answer module for image semantic segmentation. Specifically, the Cityscapes dataset is preprocessed; images are sent to the spatial feature extraction module, the global feature extraction module and the reference answer module to extract features; the feature maps extracted by the three modules are upsampled and then sent to a feature fusion module; and within the global feature extraction module and the reference answer module, the features of each stage are optimized by an attention refinement module. In the image segmentation task the method achieves a better balance between segmentation precision and segmentation speed, and other segmentation models can also adopt it to improve their performance.

Description

Image semantic segmentation method based on reference mechanism
Technical Field
The application belongs to the field of computer vision, and particularly relates to an image semantic segmentation method based on a reference mechanism.
Background
The background of image semantic segmentation can be traced back to the evolutionary course of the fields of computer vision and pattern recognition. Early computer vision tasks focused primarily on image classification, i.e., classifying the entire image into different categories, such as identifying animals, vehicles, people, etc. in the image. However, image classification ignores pixel level details in an image and does not provide semantic information for each pixel in the image.
To better understand the semantic information and local structure of images, image semantic segmentation techniques have evolved. Its main goal is to assign each pixel in the image to a corresponding semantic category, thereby enabling semantic understanding at the pixel level. This means that each pixel is given a semantic label, making the understanding of the image finer and more accurate.
Early image semantic segmentation methods relied primarily on manually designed features and traditional image processing techniques. Although these methods achieved some success in certain scenarios, they were far less effective when faced with complex image structures and fine semantic distinctions.
With the advent of deep learning, and in particular the successful application of Convolutional Neural Networks (CNNs) to image classification tasks, researchers began exploring the application of deep learning methods to image semantic segmentation. The strong representation and feature-learning capabilities of deep learning have significantly advanced image semantic segmentation.
In 2014, the proposal of Fully Convolutional Networks (FCN) marked a brand new stage for image semantic segmentation. FCN applied convolutional neural networks to pixel-level semantic segmentation for the first time, replacing the fully connected layers with convolutional layers so that the network can accept an input image of any size and output a semantic segmentation result of the same size. This laid the foundation for efficient and accurate image semantic segmentation.
Subsequently, many improved models based on FCN emerged, such as U-Net, DeepLab and SegNet. These models all employ an encoder-decoder architecture: the encoder extracts image features and the decoder predicts the pixel-level semantic segmentation. Mechanisms such as spatial attention and skip connections were also introduced, further improving segmentation precision and stability.
Despite the great progress of image semantic segmentation models, significant problems remain: current models often have a large number of parameters, which makes them very large. In particular, some advanced semantic segmentation models, such as DeepLab and U-Net, require substantial computational resources and GPU memory for training and inference. Such large models cannot meet real-time requirements.
Disclosure of Invention
The application aims to provide an image semantic segmentation method based on a reference mechanism, which aims to solve the technical problems of complex segmentation model, low segmentation precision, low segmentation speed and the like in the prior art.
In order to achieve the above purpose and solve the above technical problems, the image semantic segmentation method based on the reference mechanism of the present application adopts the following steps:
step one: acquiring a data set, and preprocessing the data set;
specifically, a Cityscape data set is obtained, a training set, a testing set and a verification set are divided, and data preprocessing is carried out on the training set;
step two: building a segmentation model, which comprises five modules: the system comprises a spatial feature extraction module, a global feature extraction module, a reference answer module, an attention refinement module and a feature fusion module, wherein the spatial feature extraction module, the global feature extraction module and the reference answer module respectively provide three different features, and then the three provided different features are subjected to the feature fusion module to obtain a segmentation prediction graph;
step three: and selecting a proper loss function, setting the training round number to 150 by back propagation optimization parameters, and only storing the optimal (minimum) model structure and model parameters in the training process.
Further, in step one the dataset is Cityscapes, divided into 2975 training images and labels, 1525 test images and labels, and 500 validation images and labels. Because the image labels in the Cityscapes dataset contain not only color label maps but also instance label maps and depth maps, the labels are processed separately and only the color maps are taken as image labels. After the images and labels are paired, data preprocessing is performed: the images are first cropped to 736×736, and then data enhancement is applied, including random scaling and brightness and contrast adjustment with a factor of 0.5; the Cityscapes labels are also re-encoded from the original 35 classes to 19 classes.
Further, the data preprocessing of the training set in step one comprises reading the data from the folder, center cropping to the target size, applying data enhancement, encoding the labels, converting the Numpy arrays into Tensors, and instantiating the dataset class;
the data enhancement here specifically includes random scaling, adjusting image contrast and brightness.
Further, the segmentation model built in step two mainly comprises five modules: the first module is a spatial feature extraction module for extracting the spatial features of the image; the second module is a global feature extraction module for extracting the global features of the image; the third module is a reference answer module for extracting the overall features of the image; the fourth module is an attention refinement module for optimizing the features of different stages; and the fifth module is a feature fusion module for fusing the features provided by the spatial feature extraction module, the global feature extraction module and the reference answer module. Finally, the feature map is restored to the original image size through an upsampling operation to obtain the prediction segmentation map.
Further, the first module is the spatial feature extraction module for extracting the spatial features of the image. The image passes through five convolution blocks to obtain the spatial features, which are 1/8 of the size of the original input image and are denoted spatial_feature. Specifically, after data preprocessing the image enters the spatial feature extraction module; every convolution block consists of a convolution layer, a normalization layer and a ReLU activation layer. In convolution block one, the input channels in_channels are 3, the output channels out_channels are 16, the kernel_size is 3×3, the stride is 2 and the padding is 1; the normalization layer uses BatchNorm2d and the activation layer uses the ReLU activation function, giving spatial feature SP1 of shape (1, 16, 368, 368), i.e. 1/2 of the original image size. SP1 passes through convolution block two (input channels 16, output channels 32, kernel 3×3, stride 2, padding 1) to give SP2 of shape (1, 32, 184, 184); SP2 passes through convolution block three (input channels 32, output channels 64, kernel 3×3, stride 2, padding 1) to give SP3 of shape (1, 64, 92, 92); SP3 passes through convolution block four (input channels 64, output channels 128, kernel 3×3, stride 1, padding 1) to give SP4 of shape (1, 128, 92, 92); and SP4 passes through convolution block five (input channels 128, output channels 256, kernel 3×3, stride 1, padding 1) to give SP5 of shape (1, 256, 92, 92). SP5 is the feature map extracted by the spatial feature extraction module.
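For concreteness, a minimal PyTorch sketch of this spatial path, following the channel, stride and padding settings listed above (class and variable names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU, the structure of every block in the spatial path."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


class SpatialPath(nn.Module):
    """Five convolution blocks; the first three halve the resolution (1/8 overall)."""
    def __init__(self):
        super().__init__()
        self.block1 = ConvBlock(3, 16, stride=2)     # SP1: (1, 16, 368, 368)
        self.block2 = ConvBlock(16, 32, stride=2)    # SP2: (1, 32, 184, 184)
        self.block3 = ConvBlock(32, 64, stride=2)    # SP3: (1, 64, 92, 92)
        self.block4 = ConvBlock(64, 128, stride=1)   # SP4: (1, 128, 92, 92)
        self.block5 = ConvBlock(128, 256, stride=1)  # SP5: (1, 256, 92, 92)

    def forward(self, x):
        return self.block5(self.block4(self.block3(self.block2(self.block1(x)))))


# Example: sp5 = SpatialPath()(torch.randn(1, 3, 736, 736))  # -> (1, 256, 92, 92)
```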
Further, the second module is the global feature extraction module for extracting the global features of the image. Its structure is similar to a U-shaped network: the image is downsampled five times by convolution to obtain feature maps of sizes 1/2, 1/4, 1/8, 1/16 and 1/32; the 1/8, 1/16 and 1/32 feature maps are retained and denoted feature2, feature3 and feature4 respectively, and the feature map obtained by the last downsampling is globally average pooled and denoted tail, so that a sufficiently large receptive field is retained.
Specifically, after data preprocessing the image enters the global feature extraction module. It first passes through a convolution layer (input channels 3, output channels 64, kernel 7×7, stride 2, padding 3) to give global feature CP1 of shape (1, 64, 368, 368); CP1 passes through a pooling layer (kernel 3×3, stride 2, padding 1) to give global feature CP2 of shape (1, 64, 184, 184); CP2 passes through two convolution layers (input channels 64, output channels 64, kernel 3×3, stride 1, padding 1) to give feature1 of shape (1, 64, 184, 184); feature1 passes through two convolution layers (input channels 64, output channels 128, kernel 3×3, stride 2, padding 1) to give feature2 of shape (1, 128, 92, 92); feature2 passes through two convolution layers (input channels 128, output channels 256, kernel 3×3, stride 2, padding 1) to give feature3 of shape (1, 256, 46, 46); feature3 passes through two convolution layers (input channels 256, output channels 512, kernel 3×3, stride 2, padding 1) to give feature4 of shape (1, 512, 32, 32); and feature4 is globally average pooled (output size (1, 1)) to give tail of shape (1, 512, 1, 1). feature2, feature3 and feature4 pass through the attention refinement module and are then concatenated with tail to obtain the global feature CP.
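A simplified PyTorch sketch of this global path is given below. The channel widths and the resulting resolutions follow the description above; whether both convolutions in each stage or only the first carries stride 2 is not fully specified, so the sketch assumes the first one does (which reproduces the listed feature-map sizes).

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_ch, out_ch, kernel, stride, padding):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))


class GlobalPath(nn.Module):
    """7x7 stem, pooling, four stages of two 3x3 convolutions, and a global-average-pooled tail."""
    def __init__(self):
        super().__init__()
        self.stem = conv_bn_relu(3, 64, 7, stride=2, padding=3)        # CP1, 1/2
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)               # CP2, 1/4
        self.stage1 = nn.Sequential(conv_bn_relu(64, 64, 3, 1, 1),
                                    conv_bn_relu(64, 64, 3, 1, 1))     # feature1, 1/4
        self.stage2 = nn.Sequential(conv_bn_relu(64, 128, 3, 2, 1),
                                    conv_bn_relu(128, 128, 3, 1, 1))   # feature2, 1/8
        self.stage3 = nn.Sequential(conv_bn_relu(128, 256, 3, 2, 1),
                                    conv_bn_relu(256, 256, 3, 1, 1))   # feature3, 1/16
        self.stage4 = nn.Sequential(conv_bn_relu(256, 512, 3, 2, 1),
                                    conv_bn_relu(512, 512, 3, 1, 1))   # feature4, 1/32
        self.gap = nn.AdaptiveAvgPool2d(1)                             # tail

    def forward(self, x):
        f1 = self.stage1(self.pool(self.stem(x)))
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        tail = self.gap(f4)
        # feature2/3/4 are refined by the attention refinement module and fused with tail.
        return f2, f3, f4, tail
```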
Furthermore, the third module is the reference answer module for extracting the overall features of the image. It consists of an image encoder and an image decoder and plays the role of a reference answer: the prediction map it produces strengthens the learning ability of the model through the subsequent feature fusion module. The reference answer is provided by a trained large model; this module is used during training and is not used during testing and validation.
Specifically, after preprocessing the image enters the image encoder, which is implemented with an MAE pre-trained ViT; the encoder outputs embeddings at 1/16 of the input size. In detail: the original image (736×736) is scaled to 1024×1024 and passed through a convolution layer (kernel 16×16, stride 16) to obtain a 1×64×64×768 embedding tensor; the tensor is flattened and fed into a multi-layer Transformer encoder; the vectors output by ViT pass through two convolution layers (kernel sizes 1×1 and 3×3 respectively, each followed by normalization) to obtain 256-dimensional feature vectors, i.e. the image encoder produces features of shape (256×64×64). The decoder first applies self-attention to the features obtained by the image encoder and then passes them through a multi-layer perceptron to obtain the prediction result, i.e. the reference answer RA. The decoder uses a Transformer decoder, which can effectively map the image embeddings to the mask. Through the reference answer module, a prediction map of the same size as the original input image is obtained.
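A much simplified sketch of this encoder-decoder structure follows. A faithful implementation would load MAE pre-trained ViT weights; here a plain Transformer encoder stands in for the ViT backbone, and the decoder is reduced to a single self-attention layer followed by an MLP head, so the layer counts and hidden sizes below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferenceAnswerModule(nn.Module):
    """ViT-style encoder + light Transformer-style decoder producing a reference prediction map."""
    def __init__(self, num_classes=19, embed_dim=768, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # 16x16 patches
        enc_layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)      # stand-in for MAE ViT
        self.neck = nn.Sequential(                      # 1x1 then 3x3 conv, each with normalization
            nn.Conv2d(embed_dim, 256, 1, bias=False), nn.BatchNorm2d(256),
            nn.Conv2d(256, 256, 3, padding=1, bias=False), nn.BatchNorm2d(256))
        self.self_attn = nn.MultiheadAttention(256, heads, batch_first=True)   # decoder self-attention
        self.mlp_head = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, num_classes))

    def forward(self, x):
        h0, w0 = x.shape[-2:]
        x = F.interpolate(x, size=(1024, 1024), mode="bilinear", align_corners=False)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, 64*64, 768)
        tokens = self.encoder(tokens)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)
        feat = self.neck(tokens.transpose(1, 2).reshape(b, c, h, w))   # (B, 256, 64, 64)
        seq = feat.flatten(2).transpose(1, 2)
        seq, _ = self.self_attn(seq, seq, seq)
        logits = self.mlp_head(seq).transpose(1, 2).reshape(b, -1, h, w)
        # Restore the prediction map to the original input size, as the description requires.
        return F.interpolate(logits, size=(h0, w0), mode="bilinear", align_corners=False)
```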
Further, the fourth module is the attention refinement module for optimizing the features of different stages. It is used on the global features and guides the features of different stages with global average pooling; it consists of an AdaptiveAvgPool2d pooling layer, a convolution layer with kernel size 1×1, a normalization layer and a Sigmoid activation layer. feature2, feature3 and feature4 obtained by the global feature extraction module, as well as the features obtained by the reference answer module, all need to be optimized by the attention refinement module. The module is divided into two paths: the first path applies global pooling, a 1×1 convolution, normalization and activation, and the result is multiplied with the second path, which performs no operation, thereby optimizing the global features of different stages.
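A minimal sketch of such an attention refinement module, assuming the two-path interpretation above (the refined output is the element-wise product of the input features and their pooled channel-attention weights):

```python
import torch
import torch.nn as nn


class AttentionRefinementModule(nn.Module):
    """Global average pool -> 1x1 conv -> BatchNorm -> Sigmoid, multiplied back onto the input."""
    def __init__(self, channels):
        super().__init__()
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze the spatial dimensions
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid())

    def forward(self, x):
        # First path: channel attention weights; second path: the untouched features.
        return x * self.attention(x)


# Example: refined = AttentionRefinementModule(128)(torch.randn(1, 128, 92, 92))
```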
Further, the fifth module is the feature fusion module, which fuses the features obtained by the spatial feature extraction module, the global feature extraction module and the reference answer module. In this module, the spatial feature SP5, the global feature CP and the reference answer RA are first concatenated and passed through a convolution block, and the result is then split into three paths: the first path applies global pooling, a 1×1 convolution, ReLU activation, another 1×1 convolution and Sigmoid activation, and is multiplied with the second path; the product is then added to the third path to obtain the final model features, which are upsampled by a factor of 8 to restore the original resolution and produce the prediction result.
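A sketch of this feature fusion module follows. The inputs are assumed to have already been brought to a common 1/8 spatial resolution, and the channel-reduction factor inside the attention branch as well as the final classification convolution are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusionModule(nn.Module):
    """Concatenate SP5, CP and RA, re-weight with a channel-attention branch, add a residual path."""
    def __init__(self, in_channels, out_channels, num_classes=19):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.attention = nn.Sequential(                    # global pool -> 1x1 -> ReLU -> 1x1 -> Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // 4, out_channels, 1), nn.Sigmoid())
        self.classifier = nn.Conv2d(out_channels, num_classes, 1)

    def forward(self, sp5, cp, ra):
        fused = self.conv_block(torch.cat([sp5, cp, ra], dim=1))
        out = fused * self.attention(fused) + fused        # attention-weighted path plus identity path
        out = F.interpolate(out, scale_factor=8, mode="bilinear", align_corners=False)
        return self.classifier(out)
```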
Further, in step three the log_softmax loss function is selected, the parameters are optimized by back propagation with the number of training epochs set to 150, and only the optimal model structure and model parameters are saved during training.
Further, in step three the loss is computed with the log_softmax function in the torch.nn.functional library, and the cross entropy loss is calculated against the real labels. The specific calculation is as follows:
The raw scores are converted into a probability distribution using the softmax function:
softmax(x_i) = exp(x_i) / sum(exp(x)) (1)
Applying the softmax function to the logits yields the probability distribution:
probs = softmax(logits) = exp(logits) / sum(exp(logits)) (2)
The probabilities are converted to log probabilities using the natural logarithm:
log_p = log(p) (3)
Applying the log function to the probability distribution probs obtained in formula (2) gives the log probabilities:
log_probs = log(probs) = log_softmax(logits) (4)
The cross entropy loss is computed against the real labels:
loss = -sum(y_true * log_probs) / batch_size (5)
In formulas (1) to (5), the model output is denoted logits, batch_size is the number of samples in a batch, exp(x) denotes taking the exponential of each element of x, sum(exp(x)) denotes summing all the exponential terms, and y_true is the real label. In the present application, the number of training epochs of the segmentation model is set to 150, and the model parameters are continuously optimized through the log_softmax loss function.
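In PyTorch terms, formulas (1) to (5) correspond to the usual log_softmax plus negative-log-likelihood combination; a minimal sketch follows (the ignore index for unlabeled pixels is an assumption):

```python
import torch
import torch.nn.functional as F


def segmentation_loss(logits, y_true, ignore_index=255):
    """Cross entropy computed via log_softmax, matching formulas (1)-(5) above.

    logits: (batch_size, num_classes, H, W) raw model outputs
    y_true: (batch_size, H, W) integer class labels
    """
    log_probs = F.log_softmax(logits, dim=1)                          # formulas (1)-(4)
    return F.nll_loss(log_probs, y_true, ignore_index=ignore_index)   # formula (5), batch mean


# Equivalent in a single call:
# loss = F.cross_entropy(logits, y_true, ignore_index=255)
```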
The image semantic segmentation method based on the reference mechanism has the following advantages:
(1) Aiming at the problems of existing image semantic segmentation models, such as model complexity, low segmentation speed and low segmentation precision, the application provides an image semantic segmentation method based on a reference mechanism, which adds a module on top of the original BiSeNet architecture and modifies the network layers, effectively improving segmentation precision and speed.
(2) Aiming at the excessive computation and long inference time of large image semantic segmentation models, the method of the application adds a reference answer module, which uses a trained large model to provide a segmentation prediction map as a reference answer. The module is only used during training and requires no computation in the inference stage, so model accuracy is improved while real-time performance is preserved.
(3) In the model, convolution layers are added to the spatial feature extraction module and the global feature extraction module so that these modules can better capture fine-grained features. Meanwhile, the global feature extraction module uses global average pooling to optimize the global features of different stages during downsampling, and the feature map from the last downsampling retains a sufficiently large receptive field through global average pooling, so segmentation precision is improved without a large increase in computation.
Drawings
Fig. 1 is a network architecture diagram of an image semantic segmentation method based on a reference mechanism.
Fig. 2 is a block diagram of a spatial feature extraction module of an image semantic segmentation method based on a reference mechanism.
Fig. 3 is a block diagram of a global feature extraction module of the image semantic segmentation method based on a reference mechanism.
Fig. 4 is a reference answer module structure diagram of an image semantic segmentation method based on a reference mechanism.
Fig. 5 is a block diagram of an attention refinement module of the image semantic segmentation method based on a reference mechanism.
Fig. 6 is a block diagram of a feature fusion module of the image semantic segmentation method based on a reference mechanism.
Description of the embodiments
In order to better understand the purpose, structure and function of the present application, the image semantic segmentation method based on the reference mechanism is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the segmentation method based on the reference mechanism preserves the extraction of spatial and global features while providing additional features for the model to learn from; it reduces the complexity of the model, enhances the model's learning capacity, and better balances the segmentation speed and segmentation precision of the model.
Examples
Fig. 1 shows an architecture diagram of the image semantic segmentation method based on the reference mechanism according to an embodiment of the present application. It should be noted that the diagram only shows the logical sequence of the method of this embodiment; provided there is no conflict in logic, in other possible embodiments of the present application the steps shown or described may be completed in a sequence different from that shown in fig. 1.
Referring to fig. 1, the image semantic segmentation method based on the reference mechanism specifically includes the following steps:
step one: and acquiring a data set, and dividing and preprocessing the data set.
The dataset in step one is the Cityscapes dataset; the subset used here is specially designed for the image semantic segmentation task and contains high-quality images from different city streets, each with detailed pixel-level annotations identifying the semantic category of every pixel. The images in the training and test sets undergo data preprocessing such as random cropping, scaling and contrast adjustment, and the labels are re-encoded from the original 35 categories to 19 categories. The size of the image after data preprocessing is 736×736.
Step two: building a segmentation model, which comprises five modules: the system comprises a spatial feature extraction module, a global feature extraction module, a reference answer module, an attention refinement module and a feature fusion module, wherein the spatial feature extraction module, the global feature extraction module and the reference answer module respectively provide three different features, and a segmentation prediction graph is obtained after the features are fused;
the first module in the second step is a spatial feature extraction module, which is used for extracting the spatial features of the image. The image enters a spatial feature extraction module after data preprocessing, firstly, the image passes through a convolution block, the convolution block consists of a convolution layer, a normalization layer and a ReLU activation layer, the convolution input channel of the first convolution block is 3, the output channel is 16, the convolution kernel size is 3*3, the step size is 2, the filling is 1, the obtained feature space SP1 is (1, 16, 368, 368), the spatial feature SP2 obtained by the feature SP1 passing through a convolution block II (the input channel is 16, the output channel is 32, the convolution kernel size is 3*3, the step size is 2, the filling is 1) is (1, 32, 184, the feature SP2 passes through a convolution block III (the input channel is 32, the output channel is 64, the convolution kernel size is 3*3, the step size is 2, the filling is 1) is (1, 64, 92, the feature SP3 passes through a convolution block IV (the input channel is 64, the output channel is 128, the step size is 3*3, the step size is 1, the filling is 1), the spatial feature SP4 obtained by the convolution block II (1, the convolution kernel size is 32, the convolution kernel size is 184, the filling is 1), the feature SP2 passes through the convolution block III (input channel is 32, the convolution kernel size is 256), the feature SP2 is obtained by the convolution block III (32, the step size is 256), the feature SP4 is 5, the filling is 5, namely the feature is 5, the feature is obtained by the extraction module is 5, and the feature is 5, the feature is obtained by the space is 256 is 5.
The second module is the global feature extraction module, which extracts the global features of the image. After data preprocessing the image enters the global feature extraction module and first passes through a convolution layer (input channels 3, output channels 64, kernel 7×7, stride 2, padding 3) to give global feature CP1 of shape (1, 64, 368, 368); CP1 passes through a pooling layer (kernel 3×3, stride 2, padding 1) to give global feature CP2 of shape (1, 64, 184, 184); CP2 passes through two convolution layers (input channels 64, output channels 64, kernel 3×3, stride 1, padding 1) to give feature1 of shape (1, 64, 184, 184); feature1 passes through two convolution layers (input channels 64, output channels 128, kernel 3×3, stride 2, padding 1) to give feature2 of shape (1, 128, 92, 92); feature2 passes through two convolution layers (input channels 128, output channels 256, kernel 3×3, stride 2, padding 1) to give feature3 of shape (1, 256, 46, 46); feature3 passes through two convolution layers (input channels 256, output channels 512, kernel 3×3, stride 2, padding 1) to give feature4 of shape (1, 512, 32, 32); and feature4 is globally average pooled (output size (1, 1)) to give tail of shape (1, 512, 1, 1). feature2, feature3 and feature4 pass through the attention refinement module and are then concatenated with tail to obtain the global feature CP.
The third module in step two is the reference answer module, which extracts the overall features of the image. After preprocessing the image enters the image encoder, implemented with an MAE pre-trained ViT: the original image (736×736) is scaled to 1024×1024 and passed through a convolution layer (kernel 16×16, stride 16) to obtain a 1×64×64×768 embedding tensor; the tensor is flattened and fed into a multi-layer Transformer encoder; the vectors output by ViT pass through two convolution layers (kernel sizes 1×1 and 3×3 respectively, each followed by normalization) to obtain 256-dimensional feature vectors, i.e. the image encoder produces features of shape (256×64×64). The decoder first applies self-attention to the features obtained by the image encoder and then passes them through a multi-layer perceptron to obtain the prediction result, i.e. the reference answer RA.
The fourth module in step two is the attention refinement module, which optimizes the features of different stages. It is divided into two paths: the first path applies global pooling, a 1×1 convolution, normalization and activation, and the result is multiplied with the second path, thereby optimizing the global features of different stages.
The fifth module in step two is the feature fusion module, which fuses the spatial features, the global features and the reference answer. In this module, the spatial feature SP5, the global feature CP and the reference answer RA are first concatenated and passed through a convolution block, and the result is then split into three paths: the first path applies global pooling, a 1×1 convolution, ReLU activation, another 1×1 convolution and Sigmoid activation, and is multiplied with the second path; the product is then added to the third path to obtain the final model features, which are upsampled by a factor of 8 to restore the original resolution and produce the prediction result.
Step three: and selecting a proper loss function, setting the training round number to 150 by back propagation optimization parameters, and only storing the optimal model structure and model parameters in the training process.
In step three, the log_softmax loss function is selected as the loss function, the parameters are optimized by back propagation with the number of training epochs set to 150, and only the optimal model structure and model parameters are saved during training, i.e. only the model structure and parameters with the minimum loss value are kept, as sketched in the training-loop example below.
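The optimizer, learning rate and data-loading details in this skeleton are assumptions; the patent only specifies 150 training epochs, back propagation and saving the minimum-loss model.

```python
import torch


def train(model, train_loader, criterion, device="cuda", epochs=150, ckpt_path="best_model.pth"):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed optimizer
    best_loss = float("inf")
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                       # back propagation to optimize the parameters
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(train_loader)
        if epoch_loss < best_loss:                # keep only the minimum-loss model
            best_loss = epoch_loss
            torch.save({"model": model, "state_dict": model.state_dict()}, ckpt_path)
```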
It will be understood that the application has been described in terms of several embodiments, and that various changes and equivalents may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the application. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the application without departing from the essential scope thereof. Therefore, it is intended that the application not be limited to the particular embodiment disclosed, but that the application will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. The image semantic segmentation method based on the reference mechanism is characterized by comprising the following steps of:
step one, acquiring the Cityscapes dataset, dividing it into a training set, a test set and a validation set, and preprocessing the data of the training set;
step two, constructing a segmentation model by using a Pytorch framework, wherein the segmentation model comprises a spatial feature extraction module, a global feature extraction module, a reference answer module, an attention refinement module and a feature fusion module; the space feature extraction module, the global feature extraction module and the reference answer module respectively provide three different features, and the three different features are subjected to a feature fusion module to obtain a segmentation prediction graph; the reference answer module provides a reference answer by using a trained large model, the module is used during training, and the module is not used during testing and verification;
selecting a proper loss function, training the segmentation model through back propagation optimization parameters, calculating loss by using a log_softmax function in a torch.nn.functional library, and determining whether to save the model and parameters obtained through training by comparing the magnitude of the loss value; only the model structure and the model parameters with the minimum loss value are saved in the training process.
2. The image semantic segmentation method based on a reference mechanism according to claim 1, wherein the image labels in the Cityscapes dataset comprise not only color label maps but also instance label maps and depth maps; the instance label maps are processed separately, and only the color maps are taken as image labels;
after the images and labels are paired, the images are first cropped to 736×736, then random scaling and brightness and contrast adjustment with a factor of 0.5 are applied, and the Cityscapes labels are re-encoded from the original 35 classes to 19 classes.
3. The method of claim 1, wherein the data preprocessing of the training set in step one comprises reading the data from the folder, center cropping to the target size, applying data enhancement, encoding the labels, converting the Numpy arrays into Tensors, and instantiating the dataset class.
4. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein in the second step, a segmentation model is built, wherein the first module is a spatial feature extraction module for extracting spatial features of an image; the second module is a global feature extraction module for extracting global features of the image; the third module is a reference answer module and is used for extracting all the characteristics of the image; the fourth module is an attention refinement module, which is used for optimizing the characteristics of different stages; the fifth module is a feature fusion module, which is used for fusing the spatial feature extraction module, the global feature extraction module and the reference answer module; finally, obtaining a prediction segmentation map through an up-sampling operation.
5. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein the spatial feature extraction module is used for extracting spatial features of the image; the image is subjected to five convolution blocks to obtain spatial characteristics, wherein the convolution blocks consist of a convolution layer, a normalization layer and an activation layer, and the obtained spatial characteristics are 1/8 of the size of the original input image and are marked as spatial_feature.
6. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein the global feature extraction module is configured to extract global features of the image; its structure is similar to a U-shaped network and performs five downsamplings by convolution to obtain feature maps of sizes 1/2, 1/4, 1/8, 1/16 and 1/32; the 1/8, 1/16 and 1/32 feature maps are retained and denoted feature2, feature3 and feature4 respectively, and the feature map obtained by the last downsampling is globally average pooled and denoted tail.
7. The reference mechanism-based image semantic segmentation method according to claim 1, wherein the reference answer module is used for extracting the overall features of the image and is composed of an image encoder and a decoder; the image encoder uses an MAE pre-trained ViT, and the encoder outputs embeddings at 1/16 of the input size; the decoder uses a Transformer decoder; and a prediction map of the same size as the original input image is obtained through the reference answer module.
8. The reference mechanism-based image semantic segmentation method according to claim 6, wherein the attention refinement module is used for optimizing the features of different stages; the attention refinement module is used on the global features, guides the features of different stages with global average pooling, and consists of an AdaptiveAvgPool2d pooling layer, a convolution layer with kernel size 1×1, a normalization layer and a Sigmoid activation layer;
feature2, feature3 and feature4 obtained by the global feature extraction module, as well as the features obtained by the reference answer module, all need to be optimized by the attention refinement module.
9. The image semantic segmentation method based on a reference mechanism according to claim 1, wherein the feature fusion module is used for fusing the features obtained by the spatial feature extraction module, the global feature extraction module and the reference answer module; the feature fusion module first concatenates the spatial feature SP5, the global feature CP and the reference answer RA and passes them through a convolution block, and the result is then split into three paths: the first path applies global pooling, a 1×1 convolution, ReLU activation, another 1×1 convolution and Sigmoid activation, and is multiplied with the second path; the product is then added to the third path to obtain the final model features.
10. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein in the third step, the loss is calculated by using a log_softmax function in a torch.nn.functional library, and the cross entropy loss is calculated with a real label, and the specific calculation formula is as follows:
the raw scores are converted into a probability distribution using the softmax function:
softmax(x_i) = exp(x_i) / sum(exp(x)) (1)
applying the softmax function to the logits yields the probability distribution:
probs = softmax(logits) = exp(logits) / sum(exp(logits)) (2)
the probabilities are converted to log probabilities using the natural logarithm:
log_p = log(p) (3)
applying the log function to the probability distribution probs obtained in formula (2) gives the log probabilities:
log_probs = log(probs) = log_softmax(logits) (4)
the cross entropy loss is computed against the real labels:
loss = -sum(y_true * log_probs) / batch_size (5)
in formulas (1) to (5), the model output is denoted logits, batch_size is the number of samples in the batch, exp(x) denotes taking the exponential of each element of x, sum(exp(x)) denotes summing all the exponential terms, and y_true is the real label; the model parameters are continuously optimized through the log_softmax loss function.
CN202311029652.3A 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism Active CN116740364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029652.3A CN116740364B (en) 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311029652.3A CN116740364B (en) 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism

Publications (2)

Publication Number Publication Date
CN116740364A true CN116740364A (en) 2023-09-12
CN116740364B CN116740364B (en) 2023-10-27

Family

ID=87901622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029652.3A Active CN116740364B (en) 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism

Country Status (1)

Country Link
CN (1) CN116740364B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542121A (en) * 2023-12-06 2024-02-09 河北双学教育科技有限公司 Computer vision-based intelligent training and checking system and method
CN118071865A (en) * 2024-04-17 2024-05-24 英瑞云医疗科技(烟台)有限公司 Cross-modal synthesis method and device for medical images from brain peduncles CT to T1

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822284A (en) * 2021-09-24 2021-12-21 北京邮电大学 RGBD image semantic segmentation method based on boundary attention
CN114398999A (en) * 2022-01-19 2022-04-26 上海大学 Low-contrast image semantic segmentation method based on global semantic feature fusion
CN115797931A (en) * 2023-02-13 2023-03-14 山东锋士信息技术有限公司 Remote sensing image semantic segmentation method based on double-branch feature fusion
CN115984701A (en) * 2023-02-07 2023-04-18 无锡学院 Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure
CN116310305A (en) * 2022-11-29 2023-06-23 湘潭大学 Coding and decoding structure semantic segmentation model based on tensor and second-order covariance attention mechanism
CN116524189A (en) * 2023-05-05 2023-08-01 大连海事大学 High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN116543155A (en) * 2023-05-08 2023-08-04 海南大学 Semantic segmentation method and device based on context cascading and multi-scale feature refinement
CN116580195A (en) * 2023-04-26 2023-08-11 齐鲁工业大学(山东省科学院) Remote sensing image semantic segmentation method and system based on ConvNeXt convolution

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822284A (en) * 2021-09-24 2021-12-21 北京邮电大学 RGBD image semantic segmentation method based on boundary attention
CN114398999A (en) * 2022-01-19 2022-04-26 上海大学 Low-contrast image semantic segmentation method based on global semantic feature fusion
CN116310305A (en) * 2022-11-29 2023-06-23 湘潭大学 Coding and decoding structure semantic segmentation model based on tensor and second-order covariance attention mechanism
CN115984701A (en) * 2023-02-07 2023-04-18 无锡学院 Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure
CN115797931A (en) * 2023-02-13 2023-03-14 山东锋士信息技术有限公司 Remote sensing image semantic segmentation method based on double-branch feature fusion
CN116580195A (en) * 2023-04-26 2023-08-11 齐鲁工业大学(山东省科学院) Remote sensing image semantic segmentation method and system based on ConvNeXt convolution
CN116524189A (en) * 2023-05-05 2023-08-01 大连海事大学 High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN116543155A (en) * 2023-05-08 2023-08-04 海南大学 Semantic segmentation method and device based on context cascading and multi-scale feature refinement

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHANGQIAN YU ET AL.: "BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation", ARXIV, pages 1 - 17 *
WEIDE LIU ET AL.: "CRNet: Cross-Reference Networks for Few-Shot Segmentation", ARXIV, pages 1 - 9 *
HU XUEGANG ET AL.: "High-speed semantic segmentation with dual-path feature fusion encoder-decoder structure", Journal of Computer-Aided Design & Computer Graphics, vol. 34, no. 12, pages 1911-1919 *
HAN HUIHUI ET AL.: "Semantic segmentation with encoder-decoder structure", Journal of Image and Graphics, vol. 25, no. 5, pages 255-266 *


Also Published As

Publication number Publication date
CN116740364B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
US10325371B1 (en) Method and device for segmenting image to be used for surveillance using weighted convolution filters for respective grid cells by converting modes according to classes of areas to satisfy level 4 of autonomous vehicle, and testing method and testing device using the same
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN116740364B (en) Image semantic segmentation method based on reference mechanism
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN114398855A (en) Text extraction method, system and medium based on fusion pre-training
CN115565071A (en) Hyperspectral image transform network training and classifying method
CN113034506A (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN112598076A (en) Motor vehicle attribute identification method and system
CN114677515A (en) Weak supervision semantic segmentation method based on inter-class similarity
US20230186436A1 (en) Method for fine-grained detection of driver distraction based on unsupervised learning
CN116071553A (en) Weak supervision semantic segmentation method and device based on naive VisionTransformer
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN116524261A (en) Image classification method and product based on multi-mode small sample continuous learning
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
Jain et al. Flynet–neural network model for automatic building detection from satellite images
CN112966569B (en) Image processing method and device, computer equipment and storage medium
Yuan et al. A plug-and-play image enhancement model for end-to-end object detection in low-light condition
CN114758128B (en) Scene panorama segmentation method and system based on controlled pixel embedding characterization explicit interaction
CN112396006B (en) Building damage identification method and device based on machine learning and computing equipment
Jadhav et al. [Re] CLRNet: Cross Layer Refinement Network for Lane Detection
CN115661463A (en) Semi-supervised semantic segmentation method based on scale perception attention
CN116342980A (en) Spliced image recognition method, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant