CN116543161A - Semantic segmentation method, semantic segmentation device, computer equipment and storage medium - Google Patents

Semantic segmentation method, semantic segmentation device, computer equipment and storage medium

Info

Publication number
CN116543161A
CN116543161A (Application CN202310533096.7A)
Authority
CN
China
Prior art keywords
scale
channel
attention
feature map
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310533096.7A
Other languages
Chinese (zh)
Inventor
曹连雨 (Cao Lianyu)
张小璐 (Zhang Xiaolu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Hongyun Technology Nanjing Co ltd
Original Assignee
Zhongke Hongyun Technology Nanjing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Hongyun Technology Nanjing Co ltd filed Critical Zhongke Hongyun Technology Nanjing Co ltd

Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/806: Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks


Abstract

The present disclosure relates to the field of image processing technologies, and in particular to a semantic segmentation method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring single-scale feature maps extracted from an image to be segmented at different resolutions; inputting each single-scale feature map into an attention channel module, which performs feature extraction along the channel dimension and the spatial dimension of each single-scale feature map to obtain a first fusion feature map; inputting the first fusion feature map into an atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer; and generating a semantic segmentation result based on the first fusion feature map and the multi-scale information gain. The method can improve the semantic segmentation accuracy of remote sensing images.

Description

Semantic segmentation method, semantic segmentation device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a semantic segmentation method, apparatus, computer device, and storage medium.
Background
Semantic segmentation plays a key role in many image processing applications. Image semantic segmentation refers to recognizing an image at the pixel level, i.e., labeling each pixel in the image with the class of the object it belongs to. Semantic segmentation of remote sensing images must account for noise, and targets of different sizes in the image must be identified and labeled so as to separate them from the large-scale background and obtain a segmentation result.
However, remote sensing images have complex backgrounds, many targets of inconsistent size, and large spatial extent, all of which make their semantic segmentation highly challenging; improvement is therefore needed.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a semantic segmentation method, apparatus, computer device, and storage medium that can improve the semantic segmentation accuracy of remote sensing images.
In a first aspect, the present application provides a semantic segmentation method, the method comprising:
acquiring single-scale feature maps extracted from an image to be segmented at different resolutions;
inputting each single-scale feature map into an attention channel module, and performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module to obtain a first fusion feature map;
inputting the first fusion feature map into an atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer;
and generating a semantic segmentation result based on the first fusion feature map and the multi-scale information gain.
In one embodiment, acquiring the single-scale feature maps extracted from the image to be segmented at different resolutions includes: performing feature extraction on the image to be segmented at different resolutions with the ResNet, DenseNet, and VGG backbone network models, respectively, to obtain single-scale feature maps of different scales.
In one embodiment, inputting each single-scale feature map into the attention channel module and performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module to obtain the first fusion feature map includes:
inputting each single-scale feature map into a channel learning sub-module in the attention channel module, and learning the channel attention of each single-scale feature map through the channel learning sub-module to obtain multi-scale channel attention features;
inputting each single-scale feature map into a spatial learning sub-module in the attention channel module, and learning the spatial attention of each single-scale feature map through the spatial learning sub-module to obtain multi-scale spatial attention features;
and obtaining the first fusion feature map based on the multi-scale channel attention features and the multi-scale spatial attention features.
In one embodiment, inputting each single-scale feature map into the channel learning sub-module in the attention channel module and learning the channel attention of each single-scale feature map through the channel learning sub-module to obtain the multi-scale channel attention features includes:
for any single-scale feature map, aggregating the convolution features with a global average pooling layer without dimensionality reduction, adaptively determining the size of a convolution kernel through the channel learning sub-module, and performing one-dimensional convolution based on that kernel;
learning the inter-channel sub-features corresponding to the single-scale feature map through an activation function;
passing the single-scale feature map through an adaptive pooling layer to obtain the channel pooling sub-features corresponding to the single-scale feature map;
and obtaining the multi-scale channel attention features based on the inter-channel sub-features and the channel pooling sub-features corresponding to each single-scale feature map.
In one embodiment, inputting each single-scale feature map into the spatial learning sub-module in the attention channel module and learning the spatial attention of each single-scale feature map through the spatial learning sub-module to obtain the multi-scale spatial attention features includes:
constructing a multi-scale attention mechanism with convolution blocks of different sizes in the spatial learning sub-module;
predicting the region of interest in each single-scale feature map based on the multi-scale attention mechanism, and taking the features of the region of interest as the multi-scale spatial attention features.
In one embodiment, inputting the first fusion feature map into the atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer includes:
performing convolution sampling on the first fusion feature map with atrous convolution kernels of different dilation rates to obtain a plurality of gain sub-feature maps;
connecting the sampled gain sub-feature maps with a densely connected structure;
and taking the connected gain sub-feature maps as the multi-scale information gain output by the atrous convolution layer.
In a second aspect, the present application further provides a semantic segmentation network to which the above semantic segmentation method is applied, the semantic segmentation network comprising:
a feature extraction sub-network, comprising a ResNet network model, a DenseNet network model, and a VGG network model, for extracting single-scale feature maps of the image to be segmented at different scales;
an attention channel sub-network, comprising ECA-Net, adaptive pooling, and MSAM network models, for performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map to obtain the first fusion feature map;
an atrous convolution sub-network, comprising a DenseASPP model, for inputting the first fusion feature map into the atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer;
and an output layer, comprising a fully connected layer, for generating the semantic segmentation result based on the first fusion feature map and the multi-scale information gain.
In a third aspect, the present application further provides a semantic segmentation apparatus, comprising:
an acquisition module for acquiring the single-scale feature maps extracted from the image to be segmented at different resolutions;
an attention module for inputting each single-scale feature map into the attention channel module and performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module to obtain the first fusion feature map;
a gain module for inputting the first fusion feature map into the atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer;
and a segmentation module for generating the semantic segmentation result based on the first fusion feature map and the multi-scale information gain.
In a fourth aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring single-scale feature maps extracted from an image to be segmented at different resolutions;
inputting each single-scale feature map into an attention channel module, and performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module to obtain a first fusion feature map;
inputting the first fusion feature map into an atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer;
and generating a semantic segmentation result based on the first fusion feature map and the multi-scale information gain.
In a fifth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring single-scale feature maps extracted from an image to be segmented at different resolutions;
inputting each single-scale feature map into an attention channel module, and performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module to obtain a first fusion feature map;
inputting the first fusion feature map into an atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer;
and generating a semantic segmentation result based on the first fusion feature map and the multi-scale information gain.
According to the semantic segmentation method, apparatus, computer device, and storage medium described above, multi-scale feature extraction is performed on the image to be segmented at different resolutions, and feature extraction is performed on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module, so that the first fusion feature map contains both the channel-dimension and the spatial-dimension attention features of each feature map. The purpose of the channel attention module is to make the input features more meaningful: the network computes a weight for each channel of the input, so that channels containing key information receive more attention and channels with little important information receive less, thereby improving the feature representation capability. In addition, atrous convolution is used to enlarge the receptive field without losing information, which yields a multi-scale information gain. With the first fusion feature map, which contains the multi-scale features of the image to be segmented, together with the multi-scale information gain, the semantic segmentation method can adapt to target recognition at different scales, improving the recognition accuracy of semantic segmentation.
Drawings
FIG. 1 is a flow diagram of a semantic segmentation method in one embodiment;
FIG. 2 is a flow chart of obtaining a first fusion feature map according to an embodiment;
FIG. 3 is a network block diagram of an ECAP module in one embodiment;
FIG. 4 is a network architecture diagram of an MSAM in one embodiment;
FIG. 5 is a network structure diagram of DenseASPP in one embodiment;
FIG. 6 is a network block diagram of a semantic segmentation network in one embodiment;
FIG. 7 is a block diagram of a semantic segmentation device according to one embodiment;
FIG. 8 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a semantic segmentation method is provided; the method is executed by an electronic device and specifically includes the following steps:
S101, acquiring single-scale feature maps extracted from the image to be segmented at different resolutions.
The image to be segmented may be a remote sensing image. Specifically, the three backbone network models ResNet, DenseNet, and VGG are used to perform feature extraction on the image at different resolutions, yielding single-scale feature maps of different scales.
Optionally, the multi-scale feature extraction uses cascaded image input: the input image is supplied at three resolutions, high, medium, and low, with scales of 1, 1/2, and 1/4 respectively, and features are extracted from these images by the three backbone networks ResNet101, DenseNet169, and VGG16 respectively.
ResNet, also called the residual neural network, avoids the vanishing-gradient and network-degradation problems of ordinary deep networks by introducing shortcut connections.
DenseNet takes a different approach: instead of connecting each layer only to the next layer as in a traditional neural network, it densely connects each layer to all subsequent layers so as to maximize the information flow between layers. ResNet is characterized by (1) residual learning, (2) shortcut connections, and (3) deepening the network without degradation. DenseNet is characterized by (1) dense shortcut connections, (2) feature reuse, and (3) the introduction of transition layers. Although DenseNet requires fewer parameters than ResNet, its feature maps are much larger, so its convolutions are far more computationally intensive, it occupies more memory, and it must read memory frequently; its training is therefore slower than ResNet's. The number of input/output channels of each DenseNet convolution, and the number of fully connected layer parameters, are also much smaller than in ResNet. Weighing memory consumption, speed, parameter count, and related factors, the feature extraction networks for scale = 1 and scale = 1/2 therefore adopt ResNet101 and DenseNet169 respectively, while at the smaller 1/4 scale, VGG16, which has relatively few layers and parameters, is used for feature extraction.
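The cascaded input described above can be sketched in plain Python. The backbone names and the 1, 1/2, 1/4 scales are taken from the text; the stride-slicing downsampling and the toy image are illustrative assumptions, not the patented implementation, and the backbones are represented by name only.

```python
def downsample(image, factor):
    """Naive stride-based downsampling of a 2-D image given as a list of rows."""
    return [row[::factor] for row in image[::factor]]

def cascade_inputs(image):
    """Pair each resolution level with the backbone the text assigns to it."""
    return {
        "ResNet101":   downsample(image, 1),   # scale 1   (high resolution)
        "DenseNet169": downsample(image, 2),   # scale 1/2 (medium resolution)
        "VGG16":       downsample(image, 4),   # scale 1/4 (low resolution)
    }

img = [[x + 8 * y for x in range(8)] for y in range(8)]  # toy 8x8 "image"
pyramid = cascade_inputs(img)
```

Each backbone thus sees the same scene at a different resolution, which is what allows the later stages to fuse features of different scales.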
S102, inputting each single-scale feature map into the attention channel module, and performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module to obtain the first fusion feature map.
The purpose of the channel attention module is to make the input features more meaningful by computing, through the network, a weight for each channel of the input. Specifically, channels containing key information receive more attention and channels containing little important information receive less, thereby improving the feature representation capability.
The first fusion feature map comprises the attention features of each feature map in the channel dimension and the attention features of each feature map in the spatial dimension. Optionally, the channel attention module may be a SCANet network.
S103, inputting the first fusion feature map into the atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer.
Atrous (dilated, or "hole") convolution enlarges the receptive field without losing information, and ASPP obtains a multi-scale information gain by stacking atrous convolutions with different dilation rates in parallel or in cascade.
S104, generating a semantic segmentation result based on the first fusion feature map and the multi-scale information gain.
Specifically, the first fusion feature map undergoes a 1×1 convolution, and after the 1×1 convolution the information gains at multiple scales are input to the fully connected layer to obtain the prediction result, which is the semantic segmentation result.
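A 1×1 convolution, as used in this prediction step, is simply a per-pixel linear map over the channel dimension. The following minimal sketch (pure Python, nested lists instead of tensors, hypothetical weights) illustrates that behavior; it is not the patented head, only the standard 1×1 operation.

```python
def conv1x1(feature_map, weights):
    """feature_map: [C_in][H][W]; weights: [C_out][C_in].
    Returns a [C_out][H][W] map: at every pixel, a dot product over channels."""
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for w_row in weights:
        plane = [[sum(w_row[c] * feature_map[c][i][j] for c in range(c_in))
                  for j in range(w)]
                 for i in range(h)]
        out.append(plane)
    return out
```

Because no spatial neighborhood is involved, a 1×1 convolution changes the channel count (e.g., to the number of classes) while leaving the spatial layout untouched.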
In this semantic segmentation method, multi-scale feature extraction is performed on the image to be segmented at different resolutions, and feature extraction is performed on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module, so that the first fusion feature map contains both the channel-dimension and the spatial-dimension attention features of each feature map. The purpose of the channel attention module is to make the input features more meaningful: the network computes a weight for each channel of the input, so that channels containing key information receive more attention and channels with little important information receive less, thereby improving the feature representation capability. In addition, atrous convolution enlarges the receptive field without losing information and yields a multi-scale information gain. With the first fusion feature map, which contains the multi-scale features of the image, together with the multi-scale information gain, the method can adapt to target recognition at different scales, improving the recognition accuracy of semantic segmentation.
As shown in FIG. 2, this embodiment provides an optional way of inputting each single-scale feature map into the attention channel module and performing feature extraction on the channel dimension and the spatial dimension of each single-scale feature map through the attention channel module to obtain the first fusion feature map, i.e., a refinement of S102. The specific implementation may include the following steps:
s201, respectively inputting each single-scale feature map to a channel learning submodule in the attention channel module, and learning the channel attention of each single-scale feature map through the channel learning submodule to obtain multi-scale channel attention features.
In this embodiment, the channel learning sub-module (ECAP module) combines the lightweight channel attention mechanism ECA-Net and the Pooling module Pooling, as shown in fig. 3, and fig. 3 is a network structure diagram of the ECAP module.
Specifically, the ECAP module determines the channel attention of each single-scale feature map as follows:
ECA-Net: for any single-scale feature map, after the convolution features are aggregated with a global average pooling layer without dimensionality reduction, the size k of a convolution kernel is determined adaptively, and a one-dimensional convolution is performed with that kernel; the inter-channel sub-features corresponding to the single-scale feature map are then learned through the Sigmoid activation function.
It can be appreciated that in SENet, the weight learning of channel attention is impaired by the dimensionality reduction in the two FC layers after GAP. After channel-wise global average pooling without dimensionality reduction, ECA-Net captures local cross-channel interaction information by considering each channel together with its k neighbors. ECA-Net thus preserves model and computational efficiency while achieving better results than SENet.
For the aggregated feature y ∈ R^C obtained without dimensionality reduction, ECA-Net can learn the channel attention as
w = σ(W(y)),
where W is a C×C parameter matrix. In the SE-Var2 variant, W is a diagonal matrix containing C parameters; in SE-Var3, W is a full matrix containing C×C parameters.
The key difference is that SE-Var3 allows cross-channel interaction while SE-Var2 does not, which is why SE-Var3 performs better.
In ECA-Net, another way of obtaining local cross-channel interaction is explored to ensure both efficiency and effectiveness: a band matrix W_k is used to learn the channel attention,
w = σ(C1D_k(y)),
where C1D_k denotes a one-dimensional convolution with kernel size k.
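The adaptive choice of k and the 1-D convolution over the GAP descriptor can be sketched as follows. The kernel-size heuristic k = |log2(C)/γ + b/γ| rounded to an odd number, with γ = 2 and b = 1, follows the published ECA-Net design; the all-ones averaging kernel below merely stands in for the learned band matrix W_k, so this is an illustrative assumption rather than the trained module.

```python
import math

GAMMA, B = 2, 1  # coefficients from the ECA-Net kernel-size heuristic

def adaptive_kernel_size(channels):
    """k = |log2(C)/gamma + b/gamma|, rounded down to the nearest odd number."""
    t = int(abs(math.log2(channels) / GAMMA + B / GAMMA))
    return t if t % 2 else t + 1

def eca_weights(channel_descriptor, kernel):
    """1-D convolution over the GAP channel descriptor followed by a sigmoid.
    An all-ones averaging kernel stands in for the learned band matrix W_k;
    borders are zero-padded so each channel sees its k-neighborhood."""
    c = len(channel_descriptor)
    pad = kernel // 2
    padded = [0.0] * pad + list(channel_descriptor) + [0.0] * pad
    conv = [sum(padded[i:i + kernel]) / kernel for i in range(c)]
    return [1 / (1 + math.exp(-v)) for v in conv]  # sigmoid -> per-channel weight
```

Each channel's weight thus depends only on itself and its k neighbors, which is the local cross-channel interaction the text describes.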
Further, the single-scale feature map passes through an adaptive pooling layer to obtain the channel pooling sub-features corresponding to that feature map; the multi-scale channel attention features are then obtained from the inter-channel sub-features and the channel pooling sub-features corresponding to each single-scale feature map.
It can be understood that average pooling takes the mean of each rectangular region, so the information of all features in the feature map is propagated to the next layer, unlike max pooling, which retains only the largest value. Because of the particular nature of remote sensing images, which contain much useful information and relatively little background, average pooling preserves more of the image's information.
Adaptive Pooling is distinctive in that the size of the output tensor is specified directly. The principle is as follows: if the pooling layer's kernel_size, padding, and stride and the input tensor size input_size are known, the output tensor size output_size is
output_size = (input_size + 2 × padding - kernel_size) / stride + 1,
and therefore
kernel_size = (input_size + 2 × padding) - (output_size - 1) × stride.
Accordingly, the Pooling module adopts an adaptive average pooling design with output feature map sizes of 1×1, 2×2, 3×3, and 6×6, and the number of channels is unchanged.
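The two formulas above can be checked numerically. The sketch below assumes stride = input_size // output_size, a common adaptive-pooling convention that is not stated in the text, and verifies that the derived kernel size reproduces each of the 1, 2, 3, 6 target output sizes for a toy 6×6 input.

```python
def pool_output_size(input_size, kernel_size, padding, stride):
    """The forward pooling equation from the text (integer arithmetic)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def adaptive_kernel(input_size, output_size, padding=0):
    """Solve the pooling equation for kernel_size given a target output size,
    assuming stride = input_size // output_size."""
    stride = input_size // output_size
    kernel = (input_size + 2 * padding) - (output_size - 1) * stride
    return kernel, stride
```

For input_size = 6 this yields (kernel, stride) pairs (6, 6), (3, 3), (2, 2), and (1, 1) for the four target sizes, each of which round-trips through the forward equation.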
S202, inputting each single-scale feature map into the spatial learning sub-module in the attention channel module, and learning the spatial attention of each single-scale feature map through the spatial learning sub-module to obtain the multi-scale spatial attention features.
Specifically, a multi-scale attention mechanism is constructed with convolution blocks of different sizes in the spatial learning sub-module; based on this mechanism, the region of interest in each single-scale feature map is predicted, and the features of the region of interest are taken as the multi-scale spatial attention features.
It will be appreciated that the spatial attention here is a multi-scale attention mechanism whose essence is to locate the information of interest and suppress irrelevant information. Not all regions of an image contribute equally to a task; only the task-relevant regions, such as the main subject being classified, need attention, and the spatial attention model finds the most important parts of the input for the network to process.
As shown in FIG. 4, the spatial learning sub-module in this embodiment is the MSAM, which builds a multi-scale spatial attention module from 3×3, 5×5, and 7×7 convolution blocks and, borrowing the idea of factorization into small convolutions from Inception V3, splits each larger two-dimensional convolution into two smaller one-dimensional convolutions. Specifically, the 3×3, 5×5, and 7×7 convolutions are split into 1×3 and 3×1, 1×5 and 5×1, and 1×7 and 7×1 convolutions, respectively. This convolution splitting saves a large number of parameters, accelerates computation, and reduces overfitting, while the extra layer of nonlinearity increases the model's expressive capacity. Compared with splitting into several identical small square kernels, splitting into asymmetric convolutions handles more and richer spatial features and increases feature diversity.
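The parameter saving from this factorization is easy to quantify: a k×k kernel has k² weights per input-output channel pair, while the 1×k plus k×1 pair has only 2k. The sketch below simply counts weights for the three kernel sizes the MSAM uses; channel counts are omitted since they multiply both sides equally.

```python
def kernel_weights(k):
    """Weights in one k x k convolution kernel (per in/out channel pair)."""
    return k * k

def split_weights(k):
    """Weights in the 1 x k plus k x 1 pair that replaces the k x k kernel."""
    return 2 * k

# fraction of weights saved for each kernel size used by the MSAM
savings = {k: 1 - split_weights(k) / kernel_weights(k) for k in (3, 5, 7)}
```

The saving grows with kernel size (about one third for 3×3 and over two thirds for 7×7), which is why the larger blocks benefit most from the split.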
S203, obtaining the first fusion feature map based on the multi-scale channel attention features and the multi-scale spatial attention features.
Specifically, the multi-scale channel attention features and the multi-scale spatial attention features obtained from each single-scale feature map are fused to obtain the first fusion feature map.
This embodiment provides an optional way of inputting the first fusion feature map into the atrous convolution layer to obtain the multi-scale information gain output by the atrous convolution layer, i.e., a refinement of S103. The specific implementation may include the following steps: performing convolution sampling on the first fusion feature map with atrous convolution kernels of different dilation rates to obtain a plurality of gain sub-feature maps; connecting the sampled gain sub-feature maps with a densely connected structure; and taking the connected gain sub-feature maps as the multi-scale information gain output by the atrous convolution layer.
Specifically, as shown in fig. 5, the hole convolution layer in this embodiment is Densespasp.
It is understood that Densseospp combines the ideas of Densenet (Dense Convolutional Network) and the hole space convolution pooling pyramid (ASPP, atrous Spatial Pyramid Pooling). The first fusion characteristic diagram is subjected to convolution sampling through the Kong Juanji kernels with different expansion rates to obtain a gain sub-characteristic diagram; while these sampled gain sub-profiles use a densely connected structure that combines the convolved profile of each layer with all subsequent layers, each layer's profile is also a combination of all previous layer profiles.
Target image features of different scales are acquired with dilated convolutions of different dilation rates and then combined by dense connection. The dense connection yields a larger receptive field than a single dilated convolution or a simple pyramid of several dilated convolutions, and markedly better feature reuse than ASPP or an encoder-decoder model.
DenseASPP achieves a larger receptive field than conventional ASPP. In ASPP, the atrous convolution layers work in parallel, and the sub-branches do not share any information in the feed-forward process. In contrast, the dilated convolution layers in DenseASPP share information through skip connections. Layers with small and large dilation rates work interdependently, so the feed-forward process not only builds a denser feature pyramid but also produces larger effective filters that perceive wider context information. The maximum receptive field of ASPP with dilation rates (3, 6, 12, 18, 24) is:
R_max = max[R_(3,3), R_(3,6), R_(3,12), R_(3,18), R_(3,24)] = R_(3,24) = 51,
whereas the maximum receptive field of DenseASPP with the same dilation rates is:
R_max = R_(3,3) + R_(3,6) + R_(3,12) + R_(3,18) + R_(3,24) - 4 = 128.
Meanwhile, because feature maps of different scales are densely connected and combined with one another, the problem that dilated convolution kernels with larger dilation rates leave sampling gaps and therefore miss detailed information can be alleviated. To control the model size and prevent the network from becoming too wide, following DenseNet, a 1×1 convolution layer is added before each dilated convolution layer in DenseASPP to reduce the number of channels in the feature map to half its original size. The network also retains an advantage of DenseNet: it can alleviate the vanishing-gradient problem of deep networks.
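The channel bookkeeping implied by this design can be sketched as follows: each dilated branch sees the input concatenated with all earlier branch outputs, a 1×1 convolution first halves the channels, and the branch appends its output to the running concatenation. The growth rate of 64 channels per branch is an illustrative assumption, not stated in the text:

```python
def dense_aspp_channels(c_in, growth, rates=(3, 6, 12, 18, 24)):
    """Track channel counts through a DenseASPP stack: each dilated branch
    receives the input concatenated with all earlier branch outputs; a 1x1
    conv first halves the channels, then the 3x3 dilated conv emits
    `growth` channels that are appended to the running concatenation."""
    concat = c_in
    for d in rates:
        reduced = concat // 2  # 1x1 conv halves the channels fed to this branch
        # a 3x3 conv with dilation d maps `reduced` -> `growth` channels here
        concat += growth       # dense connection: append the branch output
    return concat

print(dense_aspp_channels(c_in=512, growth=64))  # 512 + 5*64 = 832
```

Without the 1×1 reduction, the input width of the later branches (and hence the parameter count) would grow much faster, which is exactly the width explosion the design guards against.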
The embodiment provides a semantic segmentation network, which applies the semantic segmentation method, as shown in fig. 6, and includes:
the feature extraction sub-network Backbone comprises a ResNet network model, a DenseNet network model and a VGG network model, and is used for extracting single-scale feature maps of the image to be detected at different scales;
the attention channel sub-network Attention comprises ECA-Net, adaptive pooling and MSAM network models, and is used for extracting features of the channel dimension and the spatial dimension of each single-scale feature map to obtain the first fusion feature map;
the dilated convolution sub-network DenseASPP comprises a DenseASPP model, which is used for inputting the first fusion feature map into the dilated convolution layer to obtain the multi-scale information gain output by that layer;
and the output layer Concat comprises a full connection layer and is used for generating a semantic segmentation result based on the first fusion characteristic and the multi-scale information gain.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential: they may be performed in turn or in alternation with at least part of the other steps, or of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a semantic segmentation device for realizing the above-mentioned semantic segmentation method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the semantic segmentation device provided below may refer to the limitation of the semantic segmentation method described above, and will not be repeated here.
In one embodiment, as shown in fig. 7, there is provided a semantic segmentation apparatus 1 including: an acquisition module 11, an attention module 12, a gain module 13 and a segmentation module 14, wherein:
the acquisition module 11 is used for acquiring single-scale feature images extracted from the image to be detected under different resolutions;
the attention module 12 is configured to input each single-scale feature map to an attention channel module, and perform feature extraction on a channel dimension and a space dimension of each single-scale feature map through the attention channel module to obtain a first fused feature map;
the gain module 13 is used for inputting the first fusion feature map into the cavity convolution layer to obtain multi-scale information gain output by the cavity convolution layer;
the segmentation module 14 is configured to generate a semantic segmentation result based on the first fusion feature and the multi-scale information gain.
In one embodiment, the obtaining module 11 is configured to extract features of the images to be detected at different resolutions with the three backbone network models ResNet, DenseNet and VGG respectively, so as to obtain single-scale feature maps at different scales.
In one embodiment, the attention module 12 includes:
the first attention sub-module is used for respectively inputting each single-scale feature map to the channel learning sub-module in the attention channel module, and the channel learning sub-module learns the channel attention of each single-scale feature map to obtain multi-scale channel attention features;
the second attention sub-module is used for respectively inputting each single-scale feature map to the space learning sub-module in the attention channel module, and learning the space attention of each single-scale feature map through the space learning sub-module to obtain multi-scale space attention features;
and the third attention sub-module is used for obtaining a first fusion characteristic diagram based on the multi-scale channel attention characteristic and the multi-scale space attention characteristic.
In one embodiment, the first attention sub-module is further configured to: for any single-scale feature map, the channel learning sub-module aggregates the convolution features with a global average pooling layer without dimension reduction, adaptively determines the size of a convolution kernel, and performs one-dimensional convolution based on that kernel;
learning, through an activation function, the inter-channel sub-features corresponding to the single-scale feature map;
the single-scale feature map is subjected to a self-adaptive pooling layer to obtain channel pooling sub-features corresponding to the single-scale feature map;
and obtaining the multi-scale channel attention characteristic based on the inter-channel sub-characteristic and the channel pooling sub-characteristic corresponding to each single-scale characteristic graph.
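The behavior described here matches ECA-Net, whose published rule picks the one-dimensional convolution kernel size adaptively from the channel count C as k = |log2(C)/γ + b/γ| rounded to the nearest odd integer. The sketch below assumes the standard ECA-Net constants γ = 2 and b = 1:

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive 1-D conv kernel size from ECA-Net: k grows with log2 of the
    channel count and is forced to the nearest odd integer."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1

for c in (64, 256, 512):
    print(c, eca_kernel_size(c))  # 64 -> 3, 256 -> 5, 512 -> 5
```

Because the kernel size is derived from the channel count rather than learned, the module adds almost no parameters, which is why ECA-style channel attention can drop the dimension-reduction bottleneck entirely.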
In one embodiment, the second attention sub-module is further configured to construct a multi-scale attention mechanism by convolving blocks of different sizes in the spatial learning sub-module;
based on a multi-scale attention mechanism, predicting and obtaining a region of interest in each single-scale feature map, and taking the features of the region of interest as multi-scale space attention features.
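As a toy illustration of predicting a region of interest from a spatial attention map: here the channel-mean scoring is only a stand-in for the MSAM's factorized multi-scale convolutions, and the 0.5 threshold is an assumption for the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, thresh=0.5):
    """Toy spatial attention: collapse channels by mean, squash to (0, 1)
    with a sigmoid, and treat high-response positions as the region of
    interest. A real MSAM would produce the score map with factorized
    multi-scale convolutions instead of a plain channel mean."""
    score = sigmoid(feat.mean(axis=0))  # (H, W) spatial attention map
    mask = score > thresh               # predicted region of interest
    return feat * score[None, :, :], mask

feat = np.zeros((2, 4, 4))
feat[:, 1:3, 1:3] = 3.0                 # a bright central object
attended, roi = spatial_attention(feat)
print(roi.sum())                        # 4 positions in the region of interest
```

The attended features keep the input shape; only the responses outside the region of interest are attenuated, which is the effect the spatial branch contributes to the later fusion.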
In one embodiment, the gain module 13 is further configured to: perform convolution sampling on the first fusion feature map with dilated convolution kernels of different dilation rates to obtain a plurality of gain sub-feature maps;
connecting the sampled gain sub-feature graphs by using a densely connected structure;
and taking the connected gain sub-feature diagram as a multi-scale information gain output by the cavity convolution layer.
The respective modules in the above-described semantic segmentation apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data of the semantic segmentation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a semantic segmentation method.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring single-scale feature images extracted from images to be detected under different resolutions;
respectively inputting each single-scale feature map to an attention channel module, and carrying out feature extraction on the channel dimension and the space dimension of each single-scale feature map through the attention channel module to obtain a first fusion feature map;
inputting the first fusion feature map into the cavity convolution layer to obtain multi-scale information gain output by the cavity convolution layer;
and generating a semantic segmentation result based on the first fusion feature and the multi-scale information gain.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring single-scale feature images extracted from images to be detected under different resolutions;
respectively inputting each single-scale feature map to an attention channel module, and carrying out feature extraction on the channel dimension and the space dimension of each single-scale feature map through the attention channel module to obtain a first fusion feature map;
inputting the first fusion feature map into the cavity convolution layer to obtain multi-scale information gain output by the cavity convolution layer;
and generating a semantic segmentation result based on the first fusion feature and the multi-scale information gain.
Those skilled in the art will appreciate that all or part of the processes of the methods described above may be implemented by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of semantic segmentation, the method comprising:
acquiring single-scale feature images extracted from images to be detected under different resolutions;
inputting each single-scale feature map to an attention channel module respectively, and extracting features of channel dimensions and space dimensions of each single-scale feature map through the attention channel module to obtain a first fusion feature map;
inputting the first fusion feature map into a cavity convolution layer to obtain multi-scale information gain output by the cavity convolution layer;
and generating a semantic segmentation result based on the first fusion feature and the multi-scale information gain.
2. The method according to claim 1, wherein the obtaining the single-scale feature map extracted from the image to be measured at different resolutions includes: and respectively extracting features of the images to be detected with different resolutions by using the ResNet, denseNet and VGG backbone network models to obtain single-scale feature images with different scales.
3. The method according to claim 1, wherein the inputting each single-scale feature map into the attention channel module, and performing feature extraction on the channel dimension and the space dimension of each single-scale feature map by the attention channel module, to obtain a first fusion feature, includes:
the method comprises the steps of respectively inputting each single-scale feature map to a channel learning sub-module in an attention channel module, and learning the channel attention of each single-scale feature map through the channel learning sub-module to obtain multi-scale channel attention features;
the method comprises the steps of respectively inputting each single-scale feature map to a space learning submodule in an attention channel module, and learning the space attention of each single-scale feature map through the space learning submodule to obtain multi-scale space attention features;
and obtaining a first fusion feature map based on the multi-scale channel attention feature and the multi-scale space attention feature.
4. A method according to claim 3, wherein the inputting each single-scale feature map into the channel learning sub-module in the attention channel module, and learning the channel attention of each single-scale feature map by the channel learning sub-module to obtain the multi-scale channel attention feature comprises:
for any single-scale feature map, the channel learning submodule is used for adaptively determining the size of a convolution kernel after the convolution features are aggregated by using a global average pooling layer without dimension reduction, and one-dimensional convolution is carried out based on the convolution kernel;
learning, through an activation function, the inter-channel sub-features corresponding to the single-scale feature map;
the single-scale feature map is subjected to a self-adaptive pooling layer to obtain channel pooling sub-features corresponding to the single-scale feature map;
and obtaining the multi-scale channel attention characteristic based on the inter-channel sub-characteristic and the channel pooling sub-characteristic corresponding to each single-scale characteristic graph.
5. A method according to claim 3, wherein the inputting each single-scale feature map into the spatial learning sub-module in the attention channel module, respectively, learns the spatial attention of each single-scale feature map by the spatial learning sub-module, and obtains the multi-scale spatial attention feature, includes:
constructing a multi-scale attention mechanism through convolution blocks with different sizes in the space learning sub-module;
and predicting and obtaining the region of interest in each single-scale feature map based on the multi-scale attention mechanism, and taking the features of the region of interest as multi-scale space attention features.
6. The method of claim 1, wherein inputting the first fused feature map to the hole convolution layer, to obtain the multi-scale information gain output by the hole convolution layer, comprises:
performing convolution sampling on the first fusion feature map with dilated convolution kernels of different dilation rates to obtain a plurality of gain sub-feature maps;
connecting the sampled gain sub-feature graphs by using a densely connected structure;
and taking the connected gain sub-feature diagram as the multi-scale information gain output by the cavity convolution layer.
7. A semantic segmentation network to which the semantic segmentation method according to any one of claims 1-6 is applied, characterized in that the semantic segmentation network comprises:
the feature extraction sub-network comprises a ResNet network model, a DenseNet network model and a VGG network model and is used for extracting single-scale feature images of images to be detected under different scales;
the attention channel sub-network comprises ECA-Net, self-adaptive pooling and MSAM network models, and is used for extracting the characteristics of the channel dimension and the space dimension of each single-scale characteristic map to obtain a first fusion characteristic map;
the cavity convolution sub-network comprises a DenseASPP model, and is used for inputting the first fusion feature map into the cavity convolution layer to obtain the multi-scale information gain output by the cavity convolution layer;
and the output layer comprises a full connection layer and is used for generating a semantic segmentation result based on the first fusion characteristic and the multi-scale information gain.
8. A semantic segmentation apparatus, the apparatus comprising:
the acquisition module is used for acquiring single-scale feature images extracted from the image to be detected under different resolutions;
the attention module is used for inputting each single-scale feature map to the attention channel module respectively, and extracting features of channel dimensions and space dimensions of each single-scale feature map through the attention channel module to obtain a first fusion feature map;
the gain module is used for inputting the first fusion feature map into the cavity convolution layer to obtain multi-scale information gain output by the cavity convolution layer;
and the segmentation module is used for generating a semantic segmentation result based on the first fusion feature and the multi-scale information gain.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202310533096.7A 2023-05-12 2023-05-12 Semantic segmentation method, semantic segmentation device, computer equipment and storage medium Pending CN116543161A (en)



Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20230804