CN116740364A - Image semantic segmentation method based on reference mechanism - Google Patents

Image semantic segmentation method based on reference mechanism

Info

Publication number
CN116740364A
Authority
CN
China
Prior art keywords
module
image
features
feature extraction
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311029652.3A
Other languages
Chinese (zh)
Other versions
CN116740364B (en)
Inventor
李念峰
申向峰
李昕原
刘钱
孙立岩
丁天娇
王春湘
关彤
柴滕飞
肖治国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University
Original Assignee
Changchun University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University filed Critical Changchun University
Priority to CN202311029652.3A priority Critical patent/CN116740364B/en
Publication of CN116740364A publication Critical patent/CN116740364A/en
Application granted granted Critical
Publication of CN116740364B publication Critical patent/CN116740364B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an image semantic segmentation method based on a reference mechanism and belongs to the field of computer vision. The method combines a global feature extraction module, a spatial feature extraction module and a reference answer module for image semantic segmentation. Specifically, the Cityscapes dataset is preprocessed; images are sent to the spatial feature extraction module, the global feature extraction module and the reference answer module to extract features; the feature maps extracted by the three modules are upsampled and then sent to a feature fusion module; and within the global feature extraction module and the reference answer module, the features of each stage are optimized by an attention refinement module. In the image segmentation task the method achieves a better balance between segmentation precision and segmentation speed, and other segmentation models can also adopt it to improve their performance.

Description

Image semantic segmentation method based on reference mechanism
Technical Field
The application belongs to the field of computer vision, and particularly relates to an image semantic segmentation method based on a reference mechanism.
Background
The background of image semantic segmentation can be traced back to the evolutionary course of the fields of computer vision and pattern recognition. Early computer vision tasks focused primarily on image classification, i.e., classifying the entire image into different categories, such as identifying animals, vehicles, people, etc. in the image. However, image classification ignores pixel level details in an image and does not provide semantic information for each pixel in the image.
To better understand the semantic information and local structure of images, image semantic segmentation techniques have evolved. Its main goal is to assign each pixel in the image to a corresponding semantic category, thereby enabling semantic understanding at the pixel level. This means that each pixel is given a semantic label, making the understanding of the image finer and more accurate.
Early image semantic segmentation methods relied primarily on manually designed features and traditional image processing techniques. Although these methods achieved some success in certain scenarios, they were far less effective when faced with complex image structures and fine semantic distinctions.
With the advent of deep learning, and in particular the successful application of Convolutional Neural Networks (CNNs) to image classification tasks, researchers began exploring the application of deep learning methods to image semantic segmentation. The strong representation and feature-learning capabilities of deep learning have significantly advanced image semantic segmentation.
In 2014, the proposal of Fully Convolutional Networks (FCN) marked a brand new stage for image semantic segmentation. FCN applied convolutional neural networks to pixel-level semantic segmentation for the first time, replacing the fully connected layers with convolutional layers so that the network can accept an input image of any size and output a semantic segmentation result of the same size. This laid the foundation for efficient and accurate image semantic segmentation.
Subsequently, many improved models based on FCN emerged, such as U-Net, DeepLab and SegNet. These models all employ an encoder-decoder architecture: the encoder extracts image features and the decoder predicts the pixel-level semantic segmentation. Mechanisms such as spatial attention and skip connections were also introduced, further improving segmentation precision and stability.
Despite the great progress of image semantic segmentation models, significant problems remain: current models often have a large number of parameters, which makes them very large. In particular, some advanced semantic segmentation models, such as DeepLab and U-Net, require substantial computational resources and GPU memory for training and inference. Such large models cannot meet real-time requirements.
Disclosure of Invention
The application aims to provide an image semantic segmentation method based on a reference mechanism, which aims to solve the technical problems of complex segmentation model, low segmentation precision, low segmentation speed and the like in the prior art.
In order to achieve the above purpose and solve the above technical problems, the image semantic segmentation method based on the reference mechanism of the present application adopts the following steps:
step one: acquiring a data set, and preprocessing the data set;
specifically, a Cityscape data set is obtained, a training set, a testing set and a verification set are divided, and data preprocessing is carried out on the training set;
step two: building a segmentation model, which comprises five modules: the system comprises a spatial feature extraction module, a global feature extraction module, a reference answer module, an attention refinement module and a feature fusion module, wherein the spatial feature extraction module, the global feature extraction module and the reference answer module respectively provide three different features, and then the three provided different features are subjected to the feature fusion module to obtain a segmentation prediction graph;
step three: and selecting a proper loss function, setting the training round number to 150 by back propagation optimization parameters, and only storing the optimal (minimum) model structure and model parameters in the training process.
Further, in step one the dataset is Cityscapes, divided into 2975 training images and labels, 1525 test images and labels, and 500 validation images and labels. Because the image labels in the Cityscapes dataset contain not only color label maps but also instance label maps and depth maps, the labels are processed separately and only the color maps are taken as image labels. After the images and labels are paired, data preprocessing is performed: the images are first cropped to 736×736, and then data enhancement is applied, including random scaling and brightness and contrast adjustment with a factor of 0.5; the Cityscapes labels are also re-encoded from the original 35 classes to 19 classes.
Further, the data preprocessing of the training set in step one comprises reading the data from the folder, center cropping to the target size, applying data enhancement, encoding the labels, converting the Numpy arrays into Tensors, and instantiating the dataset class;
the data enhancement here specifically includes random scaling, adjusting image contrast and brightness.
Further, the segmentation model built in step two mainly comprises five modules: the first module is a spatial feature extraction module for extracting the spatial features of the image; the second module is a global feature extraction module for extracting the global features of the image; the third module is a reference answer module for extracting the overall features of the image; the fourth module is an attention refinement module for optimizing the features of different stages; and the fifth module is a feature fusion module for fusing the features provided by the spatial feature extraction module, the global feature extraction module and the reference answer module. Finally, the feature map is restored to the original image size through an upsampling operation to obtain the prediction segmentation map.
Further, the first module is the spatial feature extraction module for extracting the spatial features of the image. The image passes through five convolution blocks to obtain the spatial features, which are 1/8 of the size of the original input image and are denoted spatial_feature. Specifically, after data preprocessing the image enters the spatial feature extraction module; every convolution block consists of a convolution layer, a normalization layer and a ReLU activation layer. In convolution block one, the input channels in_channels are 3, the output channels out_channels are 16, the kernel_size is 3×3, the stride is 2 and the padding is 1; the normalization layer uses BatchNorm2d and the activation layer uses the ReLU activation function, giving spatial feature SP1 of shape (1, 16, 368, 368), i.e. 1/2 of the original image size. SP1 passes through convolution block two (input channels 16, output channels 32, kernel 3×3, stride 2, padding 1) to give SP2 of shape (1, 32, 184, 184); SP2 passes through convolution block three (input channels 32, output channels 64, kernel 3×3, stride 2, padding 1) to give SP3 of shape (1, 64, 92, 92); SP3 passes through convolution block four (input channels 64, output channels 128, kernel 3×3, stride 1, padding 1) to give SP4 of shape (1, 128, 92, 92); and SP4 passes through convolution block five (input channels 128, output channels 256, kernel 3×3, stride 1, padding 1) to give SP5 of shape (1, 256, 92, 92). SP5 is the feature map extracted by the spatial feature extraction module.
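For concreteness, a minimal PyTorch sketch of this spatial path, following the channel, stride and padding settings listed above (class and variable names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU, the structure of every block in the spatial path."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


class SpatialPath(nn.Module):
    """Five convolution blocks; the first three halve the resolution (1/8 overall)."""
    def __init__(self):
        super().__init__()
        self.block1 = ConvBlock(3, 16, stride=2)     # SP1: (1, 16, 368, 368)
        self.block2 = ConvBlock(16, 32, stride=2)    # SP2: (1, 32, 184, 184)
        self.block3 = ConvBlock(32, 64, stride=2)    # SP3: (1, 64, 92, 92)
        self.block4 = ConvBlock(64, 128, stride=1)   # SP4: (1, 128, 92, 92)
        self.block5 = ConvBlock(128, 256, stride=1)  # SP5: (1, 256, 92, 92)

    def forward(self, x):
        return self.block5(self.block4(self.block3(self.block2(self.block1(x)))))


# Example: sp5 = SpatialPath()(torch.randn(1, 3, 736, 736))  # -> (1, 256, 92, 92)
```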
Further, the second module is the global feature extraction module for extracting the global features of the image. Its structure is similar to a U-shaped network: the image is downsampled five times by convolution to obtain feature maps of sizes 1/2, 1/4, 1/8, 1/16 and 1/32; the 1/8, 1/16 and 1/32 feature maps are retained and denoted feature2, feature3 and feature4 respectively, and the feature map obtained by the last downsampling is globally average pooled and denoted tail, so that a sufficiently large receptive field is retained.
Specifically, after data preprocessing the image enters the global feature extraction module. It first passes through a convolution layer (input channels 3, output channels 64, kernel 7×7, stride 2, padding 3) to give global feature CP1 of shape (1, 64, 368, 368); CP1 passes through a pooling layer (kernel 3×3, stride 2, padding 1) to give global feature CP2 of shape (1, 64, 184, 184); CP2 passes through two convolution layers (input channels 64, output channels 64, kernel 3×3, stride 1, padding 1) to give feature1 of shape (1, 64, 184, 184); feature1 passes through two convolution layers (input channels 64, output channels 128, kernel 3×3, stride 2, padding 1) to give feature2 of shape (1, 128, 92, 92); feature2 passes through two convolution layers (input channels 128, output channels 256, kernel 3×3, stride 2, padding 1) to give feature3 of shape (1, 256, 46, 46); feature3 passes through two convolution layers (input channels 256, output channels 512, kernel 3×3, stride 2, padding 1) to give feature4 of shape (1, 512, 32, 32); and feature4 is globally average pooled (output size (1, 1)) to give tail of shape (1, 512, 1, 1). feature2, feature3 and feature4 pass through the attention refinement module and are then concatenated with tail to obtain the global feature CP.
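A simplified PyTorch sketch of this global path is given below. The channel widths and the resulting resolutions follow the description above; whether both convolutions in each stage or only the first carries stride 2 is not fully specified, so the sketch assumes the first one does (which reproduces the listed feature-map sizes).

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_ch, out_ch, kernel, stride, padding):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))


class GlobalPath(nn.Module):
    """7x7 stem, pooling, four stages of two 3x3 convolutions, and a global-average-pooled tail."""
    def __init__(self):
        super().__init__()
        self.stem = conv_bn_relu(3, 64, 7, stride=2, padding=3)        # CP1, 1/2
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)               # CP2, 1/4
        self.stage1 = nn.Sequential(conv_bn_relu(64, 64, 3, 1, 1),
                                    conv_bn_relu(64, 64, 3, 1, 1))     # feature1, 1/4
        self.stage2 = nn.Sequential(conv_bn_relu(64, 128, 3, 2, 1),
                                    conv_bn_relu(128, 128, 3, 1, 1))   # feature2, 1/8
        self.stage3 = nn.Sequential(conv_bn_relu(128, 256, 3, 2, 1),
                                    conv_bn_relu(256, 256, 3, 1, 1))   # feature3, 1/16
        self.stage4 = nn.Sequential(conv_bn_relu(256, 512, 3, 2, 1),
                                    conv_bn_relu(512, 512, 3, 1, 1))   # feature4, 1/32
        self.gap = nn.AdaptiveAvgPool2d(1)                             # tail

    def forward(self, x):
        f1 = self.stage1(self.pool(self.stem(x)))
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        tail = self.gap(f4)
        # feature2/3/4 are refined by the attention refinement module and fused with tail.
        return f2, f3, f4, tail
```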
Furthermore, the third module is the reference answer module for extracting the overall features of the image. It consists of an image encoder and an image decoder and plays the role of a reference answer: the prediction map it produces strengthens the learning ability of the model through the subsequent feature fusion module. The reference answer is provided by a trained large model; this module is used during training and is not used during testing and validation.
Specifically, after preprocessing the image enters the image encoder, which is implemented with an MAE pre-trained ViT; the encoder outputs embeddings at 1/16 of the input size. In detail: the original image (736×736) is scaled to 1024×1024 and passed through a convolution layer (kernel 16×16, stride 16) to obtain a 1×64×64×768 embedding tensor; the tensor is flattened and fed into a multi-layer Transformer encoder; the vectors output by ViT pass through two convolution layers (kernel sizes 1×1 and 3×3 respectively, each followed by normalization) to obtain 256-dimensional feature vectors, i.e. the image encoder produces features of shape (256×64×64). The decoder first applies self-attention to the features obtained by the image encoder and then passes them through a multi-layer perceptron to obtain the prediction result, i.e. the reference answer RA. The decoder uses a Transformer decoder, which can effectively map the image embeddings to the mask. Through the reference answer module, a prediction map of the same size as the original input image is obtained.
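A much simplified sketch of this encoder-decoder structure follows. A faithful implementation would load MAE pre-trained ViT weights; here a plain Transformer encoder stands in for the ViT backbone, and the decoder is reduced to a single self-attention layer followed by an MLP head, so the layer counts and hidden sizes below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferenceAnswerModule(nn.Module):
    """ViT-style encoder + light Transformer-style decoder producing a reference prediction map."""
    def __init__(self, num_classes=19, embed_dim=768, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # 16x16 patches
        enc_layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)      # stand-in for MAE ViT
        self.neck = nn.Sequential(                      # 1x1 then 3x3 conv, each with normalization
            nn.Conv2d(embed_dim, 256, 1, bias=False), nn.BatchNorm2d(256),
            nn.Conv2d(256, 256, 3, padding=1, bias=False), nn.BatchNorm2d(256))
        self.self_attn = nn.MultiheadAttention(256, heads, batch_first=True)   # decoder self-attention
        self.mlp_head = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, num_classes))

    def forward(self, x):
        h0, w0 = x.shape[-2:]
        x = F.interpolate(x, size=(1024, 1024), mode="bilinear", align_corners=False)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, 64*64, 768)
        tokens = self.encoder(tokens)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)
        feat = self.neck(tokens.transpose(1, 2).reshape(b, c, h, w))   # (B, 256, 64, 64)
        seq = feat.flatten(2).transpose(1, 2)
        seq, _ = self.self_attn(seq, seq, seq)
        logits = self.mlp_head(seq).transpose(1, 2).reshape(b, -1, h, w)
        # Restore the prediction map to the original input size, as the description requires.
        return F.interpolate(logits, size=(h0, w0), mode="bilinear", align_corners=False)
```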
Further, the fourth module is the attention refinement module for optimizing the features of different stages. It is used on the global features and guides the features of different stages with global average pooling; it consists of an AdaptiveAvgPool2d pooling layer, a convolution layer with kernel size 1×1, a normalization layer and a Sigmoid activation layer. feature2, feature3 and feature4 obtained by the global feature extraction module, as well as the features obtained by the reference answer module, all need to be optimized by the attention refinement module. The module is divided into two paths: the first path applies global pooling, a 1×1 convolution, normalization and activation, and the result is multiplied with the second path, which performs no operation, thereby optimizing the global features of different stages.
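A minimal sketch of such an attention refinement module, assuming the two-path interpretation above (the refined output is the element-wise product of the input features and their pooled channel-attention weights):

```python
import torch
import torch.nn as nn


class AttentionRefinementModule(nn.Module):
    """Global average pool -> 1x1 conv -> BatchNorm -> Sigmoid, multiplied back onto the input."""
    def __init__(self, channels):
        super().__init__()
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze the spatial dimensions
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid())

    def forward(self, x):
        # First path: channel attention weights; second path: the untouched features.
        return x * self.attention(x)


# Example: refined = AttentionRefinementModule(128)(torch.randn(1, 128, 92, 92))
```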
Further, the fifth module is the feature fusion module, which fuses the features obtained by the spatial feature extraction module, the global feature extraction module and the reference answer module. In this module, the spatial feature SP5, the global feature CP and the reference answer RA are first concatenated and passed through a convolution block, and the result is then split into three paths: the first path applies global pooling, a 1×1 convolution, ReLU activation, another 1×1 convolution and Sigmoid activation, and is multiplied with the second path; the product is then added to the third path to obtain the final model features, which are upsampled by a factor of 8 to restore the original resolution and produce the prediction result.
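A sketch of this feature fusion module follows. The inputs are assumed to have already been brought to a common 1/8 spatial resolution, and the channel-reduction factor inside the attention branch as well as the final classification convolution are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusionModule(nn.Module):
    """Concatenate SP5, CP and RA, re-weight with a channel-attention branch, add a residual path."""
    def __init__(self, in_channels, out_channels, num_classes=19):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.attention = nn.Sequential(                    # global pool -> 1x1 -> ReLU -> 1x1 -> Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // 4, out_channels, 1), nn.Sigmoid())
        self.classifier = nn.Conv2d(out_channels, num_classes, 1)

    def forward(self, sp5, cp, ra):
        fused = self.conv_block(torch.cat([sp5, cp, ra], dim=1))
        out = fused * self.attention(fused) + fused        # attention-weighted path plus identity path
        out = F.interpolate(out, scale_factor=8, mode="bilinear", align_corners=False)
        return self.classifier(out)
```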
Further, in step three the log_softmax loss function is selected, the parameters are optimized by back propagation with the number of training epochs set to 150, and only the optimal model structure and model parameters are saved during training.
Further, in step three the loss is computed with the log_softmax function in the torch.nn.functional library, and the cross entropy loss is calculated against the real labels. The specific calculation is as follows:
The raw scores are converted into a probability distribution using the softmax function:
softmax(x_i) = exp(x_i) / sum(exp(x)) (1)
Applying the softmax function to the logits yields the probability distribution:
probs = softmax(logits) = exp(logits) / sum(exp(logits)) (2)
The probabilities are converted to log probabilities using the natural logarithm:
log_p = log(p) (3)
Applying the log function to the probability distribution probs obtained in formula (2) gives the log probabilities:
log_probs = log(probs) = log_softmax(logits) (4)
The cross entropy loss is computed against the real labels:
loss = -sum(y_true * log_probs) / batch_size (5)
In formulas (1) to (5), the model output is denoted logits, batch_size is the number of samples in a batch, exp(x) denotes taking the exponential of each element of x, sum(exp(x)) denotes summing all the exponential terms, and y_true is the real label. In the present application, the number of training epochs of the segmentation model is set to 150, and the model parameters are continuously optimized through the log_softmax loss function.
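In PyTorch terms, formulas (1) to (5) correspond to the usual log_softmax plus negative-log-likelihood combination; a minimal sketch follows (the ignore index for unlabeled pixels is an assumption):

```python
import torch
import torch.nn.functional as F


def segmentation_loss(logits, y_true, ignore_index=255):
    """Cross entropy computed via log_softmax, matching formulas (1)-(5) above.

    logits: (batch_size, num_classes, H, W) raw model outputs
    y_true: (batch_size, H, W) integer class labels
    """
    log_probs = F.log_softmax(logits, dim=1)                          # formulas (1)-(4)
    return F.nll_loss(log_probs, y_true, ignore_index=ignore_index)   # formula (5), batch mean


# Equivalent in a single call:
# loss = F.cross_entropy(logits, y_true, ignore_index=255)
```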
The image semantic segmentation method based on the reference mechanism has the following advantages:
(1) Aiming at the problems of existing image semantic segmentation models, such as model complexity, low segmentation speed and low segmentation precision, the application provides an image semantic segmentation method based on a reference mechanism, which adds a module on top of the original BiSeNet architecture and modifies the network layers, effectively improving segmentation precision and speed.
(2) Aiming at the excessive computation and long inference time of large image semantic segmentation models, the method of the application adds a reference answer module, which uses a trained large model to provide a segmentation prediction map as a reference answer. The module is only used during training and requires no computation in the inference stage, so model accuracy is improved while real-time performance is preserved.
(3) In the model, convolution layers are added to the spatial feature extraction module and the global feature extraction module so that these modules can better capture fine-grained features. Meanwhile, the global feature extraction module uses global average pooling to optimize the global features of different stages during downsampling, and the feature map from the last downsampling retains a sufficiently large receptive field through global average pooling, so segmentation precision is improved without a large increase in computation.
Drawings
Fig. 1 is a network architecture diagram of an image semantic segmentation method based on a reference mechanism.
Fig. 2 is a block diagram of a spatial feature extraction module of an image semantic segmentation method based on a reference mechanism.
Fig. 3 is a block diagram of a global feature extraction module of the image semantic segmentation method based on a reference mechanism.
Fig. 4 is a reference answer module structure diagram of an image semantic segmentation method based on a reference mechanism.
Fig. 5 is a block diagram of an attention refinement module of the image semantic segmentation method based on a reference mechanism.
Fig. 6 is a block diagram of a feature fusion module of the image semantic segmentation method based on a reference mechanism.
Description of the embodiments
In order to better understand the purpose, structure and function of the present application, the image semantic segmentation method based on the reference mechanism is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the segmentation method based on the reference mechanism preserves the extraction of spatial and global features while providing additional features for the model to learn from; it reduces the complexity of the model, enhances the model's learning capacity, and better balances the segmentation speed and segmentation precision of the model.
Examples
Fig. 1 shows an architecture diagram of the image semantic segmentation method based on the reference mechanism according to an embodiment of the present application. It should be noted that the diagram only shows the logical sequence of the method of this embodiment; provided there is no conflict in logic, in other possible embodiments of the present application the steps shown or described may be completed in a sequence different from that shown in fig. 1.
Referring to fig. 1, the image semantic segmentation method based on the reference mechanism specifically includes the following steps:
step one: and acquiring a data set, and dividing and preprocessing the data set.
The dataset in step one is the Cityscapes dataset; the subset used here is specially designed for the image semantic segmentation task and contains high-quality images from different city streets, each with detailed pixel-level annotations identifying the semantic category of every pixel. The images in the training and test sets undergo data preprocessing such as random cropping, scaling and contrast adjustment, and the labels are re-encoded from the original 35 categories to 19 categories. The size of the image after data preprocessing is 736×736.
Step two: building a segmentation model, which comprises five modules: the system comprises a spatial feature extraction module, a global feature extraction module, a reference answer module, an attention refinement module and a feature fusion module, wherein the spatial feature extraction module, the global feature extraction module and the reference answer module respectively provide three different features, and a segmentation prediction graph is obtained after the features are fused;
the first module in the second step is a spatial feature extraction module, which is used for extracting the spatial features of the image. The image enters a spatial feature extraction module after data preprocessing, firstly, the image passes through a convolution block, the convolution block consists of a convolution layer, a normalization layer and a ReLU activation layer, the convolution input channel of the first convolution block is 3, the output channel is 16, the convolution kernel size is 3*3, the step size is 2, the filling is 1, the obtained feature space SP1 is (1, 16, 368, 368), the spatial feature SP2 obtained by the feature SP1 passing through a convolution block II (the input channel is 16, the output channel is 32, the convolution kernel size is 3*3, the step size is 2, the filling is 1) is (1, 32, 184, the feature SP2 passes through a convolution block III (the input channel is 32, the output channel is 64, the convolution kernel size is 3*3, the step size is 2, the filling is 1) is (1, 64, 92, the feature SP3 passes through a convolution block IV (the input channel is 64, the output channel is 128, the step size is 3*3, the step size is 1, the filling is 1), the spatial feature SP4 obtained by the convolution block II (1, the convolution kernel size is 32, the convolution kernel size is 184, the filling is 1), the feature SP2 passes through the convolution block III (input channel is 32, the convolution kernel size is 256), the feature SP2 is obtained by the convolution block III (32, the step size is 256), the feature SP4 is 5, the filling is 5, namely the feature is 5, the feature is obtained by the extraction module is 5, and the feature is 5, the feature is obtained by the space is 256 is 5.
The second module is the global feature extraction module, which extracts the global features of the image. After data preprocessing the image enters the global feature extraction module and first passes through a convolution layer (input channels 3, output channels 64, kernel 7×7, stride 2, padding 3) to give global feature CP1 of shape (1, 64, 368, 368); CP1 passes through a pooling layer (kernel 3×3, stride 2, padding 1) to give global feature CP2 of shape (1, 64, 184, 184); CP2 passes through two convolution layers (input channels 64, output channels 64, kernel 3×3, stride 1, padding 1) to give feature1 of shape (1, 64, 184, 184); feature1 passes through two convolution layers (input channels 64, output channels 128, kernel 3×3, stride 2, padding 1) to give feature2 of shape (1, 128, 92, 92); feature2 passes through two convolution layers (input channels 128, output channels 256, kernel 3×3, stride 2, padding 1) to give feature3 of shape (1, 256, 46, 46); feature3 passes through two convolution layers (input channels 256, output channels 512, kernel 3×3, stride 2, padding 1) to give feature4 of shape (1, 512, 32, 32); and feature4 is globally average pooled (output size (1, 1)) to give tail of shape (1, 512, 1, 1). feature2, feature3 and feature4 pass through the attention refinement module and are then concatenated with tail to obtain the global feature CP.
The third module in step two is the reference answer module, which extracts the overall features of the image. After preprocessing the image enters the image encoder, implemented with an MAE pre-trained ViT: the original image (736×736) is scaled to 1024×1024 and passed through a convolution layer (kernel 16×16, stride 16) to obtain a 1×64×64×768 embedding tensor; the tensor is flattened and fed into a multi-layer Transformer encoder; the vectors output by ViT pass through two convolution layers (kernel sizes 1×1 and 3×3 respectively, each followed by normalization) to obtain 256-dimensional feature vectors, i.e. the image encoder produces features of shape (256×64×64). The decoder first applies self-attention to the features obtained by the image encoder and then passes them through a multi-layer perceptron to obtain the prediction result, i.e. the reference answer RA.
The fourth module in step two is the attention refinement module, which optimizes the features of different stages. It is divided into two paths: the first path applies global pooling, a 1×1 convolution, normalization and activation, and the result is multiplied with the second path, thereby optimizing the global features of different stages.
The fifth module in step two is the feature fusion module, which fuses the spatial features, the global features and the reference answer. In this module, the spatial feature SP5, the global feature CP and the reference answer RA are first concatenated and passed through a convolution block, and the result is then split into three paths: the first path applies global pooling, a 1×1 convolution, ReLU activation, another 1×1 convolution and Sigmoid activation, and is multiplied with the second path; the product is then added to the third path to obtain the final model features, which are upsampled by a factor of 8 to restore the original resolution and produce the prediction result.
Step three: and selecting a proper loss function, setting the training round number to 150 by back propagation optimization parameters, and only storing the optimal model structure and model parameters in the training process.
In step three, the log_softmax loss function is selected as the loss function, the parameters are optimized by back propagation with the number of training epochs set to 150, and only the optimal model structure and model parameters are saved during training, i.e. only the model structure and parameters with the minimum loss value are kept, as sketched in the training-loop example below.
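The optimizer, learning rate and data-loading details in this skeleton are assumptions; the patent only specifies 150 training epochs, back propagation and saving the minimum-loss model.

```python
import torch


def train(model, train_loader, criterion, device="cuda", epochs=150, ckpt_path="best_model.pth"):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed optimizer
    best_loss = float("inf")
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                       # back propagation to optimize the parameters
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(train_loader)
        if epoch_loss < best_loss:                # keep only the minimum-loss model
            best_loss = epoch_loss
            torch.save({"model": model, "state_dict": model.state_dict()}, ckpt_path)
```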
It will be understood that the application has been described in terms of several embodiments, and that various changes and equivalents may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the application. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the application without departing from the essential scope thereof. Therefore, it is intended that the application not be limited to the particular embodiment disclosed, but that the application will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. The image semantic segmentation method based on the reference mechanism is characterized by comprising the following steps of:
step one, acquiring the Cityscapes dataset, dividing it into a training set, a test set and a validation set, and preprocessing the data of the training set;
step two, constructing a segmentation model by using a Pytorch framework, wherein the segmentation model comprises a spatial feature extraction module, a global feature extraction module, a reference answer module, an attention refinement module and a feature fusion module; the space feature extraction module, the global feature extraction module and the reference answer module respectively provide three different features, and the three different features are subjected to a feature fusion module to obtain a segmentation prediction graph; the reference answer module provides a reference answer by using a trained large model, the module is used during training, and the module is not used during testing and verification;
selecting a proper loss function, training the segmentation model through back propagation optimization parameters, calculating loss by using a log_softmax function in a torch.nn.functional library, and determining whether to save the model and parameters obtained through training by comparing the magnitude of the loss value; only the model structure and the model parameters with the minimum loss value are saved in the training process.
2. The image semantic segmentation method based on a reference mechanism according to claim 1, wherein the image labels in the Cityscapes dataset comprise not only color label maps but also instance label maps and depth maps; the instance label maps are processed separately, and only the color maps are taken as image labels;
after the images and labels are paired, the images are first cropped to 736×736, then random scaling and brightness and contrast adjustment with a factor of 0.5 are applied, and the Cityscapes labels are re-encoded from the original 35 classes to 19 classes.
3. The method of claim 1, wherein the data preprocessing of the training set in step one comprises reading the data from the folder, center cropping to the target size, applying data enhancement, encoding the labels, converting the Numpy arrays into Tensors, and instantiating the dataset class.
4. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein in the second step, a segmentation model is built, wherein the first module is a spatial feature extraction module for extracting spatial features of an image; the second module is a global feature extraction module for extracting global features of the image; the third module is a reference answer module and is used for extracting all the characteristics of the image; the fourth module is an attention refinement module, which is used for optimizing the characteristics of different stages; the fifth module is a feature fusion module, which is used for fusing the spatial feature extraction module, the global feature extraction module and the reference answer module; finally, obtaining a prediction segmentation map through an up-sampling operation.
5. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein the spatial feature extraction module is used for extracting spatial features of the image; the image is subjected to five convolution blocks to obtain spatial characteristics, wherein the convolution blocks consist of a convolution layer, a normalization layer and an activation layer, and the obtained spatial characteristics are 1/8 of the size of the original input image and are marked as spatial_feature.
6. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein the global feature extraction module is configured to extract global features of the image; its structure is similar to a U-shaped network and performs five downsamplings by convolution to obtain feature maps of sizes 1/2, 1/4, 1/8, 1/16 and 1/32; the 1/8, 1/16 and 1/32 feature maps are retained and denoted feature2, feature3 and feature4 respectively, and the feature map obtained by the last downsampling is globally average pooled and denoted tail.
7. The reference mechanism-based image semantic segmentation method according to claim 1, wherein the reference answer module is used for extracting the overall features of the image and is composed of an image encoder and a decoder; the image encoder uses an MAE pre-trained ViT, and the encoder outputs embeddings at 1/16 of the input size; the decoder uses a Transformer decoder; and a prediction map of the same size as the original input image is obtained through the reference answer module.
8. The reference mechanism-based image semantic segmentation method according to claim 6, wherein the attention refinement module is used for optimizing the features of different stages; the attention refinement module is used on the global features, guides the features of different stages with global average pooling, and consists of an AdaptiveAvgPool2d pooling layer, a convolution layer with kernel size 1×1, a normalization layer and a Sigmoid activation layer;
feature2, feature3 and feature4 obtained by the global feature extraction module, as well as the features obtained by the reference answer module, all need to be optimized by the attention refinement module.
9. The image semantic segmentation method based on a reference mechanism according to claim 1, wherein the feature fusion module is used for fusing the features obtained by the spatial feature extraction module, the global feature extraction module and the reference answer module; the feature fusion module first concatenates the spatial feature SP5, the global feature CP and the reference answer RA and passes them through a convolution block, and the result is then split into three paths: the first path applies global pooling, a 1×1 convolution, ReLU activation, another 1×1 convolution and Sigmoid activation, and is multiplied with the second path; the product is then added to the third path to obtain the final model features.
10. The image semantic segmentation method based on the reference mechanism according to claim 1, wherein in the third step, the loss is calculated by using a log_softmax function in a torch.nn.functional library, and the cross entropy loss is calculated with a real label, and the specific calculation formula is as follows:
the raw scores are converted into a probability distribution using the softmax function:
softmax(x_i) = exp(x_i) / sum(exp(x)) (1)
applying the softmax function to the logits yields the probability distribution:
probs = softmax(logits) = exp(logits) / sum(exp(logits)) (2)
the probabilities are converted to log probabilities using the natural logarithm:
log_p = log(p) (3)
applying the log function to the probability distribution probs obtained in formula (2) gives the log probabilities:
log_probs = log(probs) = log_softmax(logits) (4)
the cross entropy loss is computed against the real labels:
loss = -sum(y_true * log_probs) / batch_size (5)
in formulas (1) to (5), the model output is denoted logits, batch_size is the number of samples in the batch, exp(x) denotes taking the exponential of each element of x, sum(exp(x)) denotes summing all the exponential terms, and y_true is the real label; the model parameters are continuously optimized through the log_softmax loss function.
CN202311029652.3A 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism Active CN116740364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029652.3A CN116740364B (en) 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311029652.3A CN116740364B (en) 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism

Publications (2)

Publication Number Publication Date
CN116740364A true CN116740364A (en) 2023-09-12
CN116740364B CN116740364B (en) 2023-10-27

Family

ID=87901622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029652.3A Active CN116740364B (en) 2023-08-16 2023-08-16 Image semantic segmentation method based on reference mechanism

Country Status (1)

Country Link
CN (1) CN116740364B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542121A (en) * 2023-12-06 2024-02-09 河北双学教育科技有限公司 Computer vision-based intelligent training and checking system and method
CN118071865A (en) * 2024-04-17 2024-05-24 英瑞云医疗科技(烟台)有限公司 Cross-modal synthesis method and device for medical images from brain peduncles CT to T1

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822284A (en) * 2021-09-24 2021-12-21 北京邮电大学 RGBD image semantic segmentation method based on boundary attention
CN114398999A (en) * 2022-01-19 2022-04-26 上海大学 Low-contrast image semantic segmentation method based on global semantic feature fusion
CN115797931A (en) * 2023-02-13 2023-03-14 山东锋士信息技术有限公司 Remote sensing image semantic segmentation method based on double-branch feature fusion
CN115984701A (en) * 2023-02-07 2023-04-18 无锡学院 Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure
CN116310305A (en) * 2022-11-29 2023-06-23 湘潭大学 Coding and decoding structure semantic segmentation model based on tensor and second-order covariance attention mechanism
CN116524189A (en) * 2023-05-05 2023-08-01 大连海事大学 High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN116543155A (en) * 2023-05-08 2023-08-04 海南大学 Semantic segmentation method and device based on context cascading and multi-scale feature refinement
CN116580195A (en) * 2023-04-26 2023-08-11 齐鲁工业大学(山东省科学院) Remote sensing image semantic segmentation method and system based on ConvNeXt convolution

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822284A (en) * 2021-09-24 2021-12-21 北京邮电大学 RGBD image semantic segmentation method based on boundary attention
CN114398999A (en) * 2022-01-19 2022-04-26 上海大学 Low-contrast image semantic segmentation method based on global semantic feature fusion
CN116310305A (en) * 2022-11-29 2023-06-23 湘潭大学 Coding and decoding structure semantic segmentation model based on tensor and second-order covariance attention mechanism
CN115984701A (en) * 2023-02-07 2023-04-18 无锡学院 Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure
CN115797931A (en) * 2023-02-13 2023-03-14 山东锋士信息技术有限公司 Remote sensing image semantic segmentation method based on double-branch feature fusion
CN116580195A (en) * 2023-04-26 2023-08-11 齐鲁工业大学(山东省科学院) Remote sensing image semantic segmentation method and system based on ConvNeXt convolution
CN116524189A (en) * 2023-05-05 2023-08-01 大连海事大学 High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN116543155A (en) * 2023-05-08 2023-08-04 海南大学 Semantic segmentation method and device based on context cascading and multi-scale feature refinement

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHANGQIAN YU ET AL.: "BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation", ARXIV, pages 1 - 17 *
WEIDE LIU ET AL.: "CRNet: Cross-Reference Networks for Few-Shot Segmentation", ARXIV, pages 1 - 9 *
HU XUEGANG ET AL.: "High-speed semantic segmentation with dual-path feature fusion encoder-decoder structure", Journal of Computer-Aided Design & Computer Graphics, vol. 34, no. 12, pages 1911-1919 *
HAN HUIHUI ET AL.: "Semantic segmentation with encoder-decoder structure", Journal of Image and Graphics, vol. 25, no. 5, pages 255-266 *


Also Published As

Publication number Publication date
CN116740364B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
US10325371B1 (en) Method and device for segmenting image to be used for surveillance using weighted convolution filters for respective grid cells by converting modes according to classes of areas to satisfy level 4 of autonomous vehicle, and testing method and testing device using the same
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN116740364B (en) Image semantic segmentation method based on reference mechanism
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN114398855A (en) Text extraction method, system and medium based on fusion pre-training
CN115565071A (en) Hyperspectral image transform network training and classifying method
CN113034506A (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN112598076A (en) Motor vehicle attribute identification method and system
CN114677515A (en) Weak supervision semantic segmentation method based on inter-class similarity
US20230186436A1 (en) Method for fine-grained detection of driver distraction based on unsupervised learning
CN116071553A (en) Weak supervision semantic segmentation method and device based on naive VisionTransformer
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN116524261A (en) Image classification method and product based on multi-mode small sample continuous learning
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
Jain et al. Flynet–neural network model for automatic building detection from satellite images
CN112966569B (en) Image processing method and device, computer equipment and storage medium
Yuan et al. A plug-and-play image enhancement model for end-to-end object detection in low-light condition
CN114758128B (en) Scene panorama segmentation method and system based on controlled pixel embedding characterization explicit interaction
CN112396006B (en) Building damage identification method and device based on machine learning and computing equipment
Jadhav et al. [Re] CLRNet: Cross Layer Refinement Network for Lane Detection
CN115661463A (en) Semi-supervised semantic segmentation method based on scale perception attention
CN116342980A (en) Spliced image recognition method, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant