CN116189180A - Urban streetscape advertisement image segmentation method - Google Patents

Urban streetscape advertisement image segmentation method

Info

Publication number: CN116189180A
Authority: CN (China)
Prior art keywords: attention, representing, module, cswin, transformer
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202310473810.8A
Other languages: Chinese (zh)
Inventors: 王浚丞, 张婕
Current Assignee: Qindao University Of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Qindao University Of Technology
Application filed by Qindao University Of Technology
Priority to CN202310473810.8A
Publication of CN116189180A

Classifications

    • G06V 20/70 — Scenes; scene-specific elements; labelling scene content, e.g. deriving syntactic or semantic representations
    • G06N 3/084 — Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06V 10/26 — Image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/806 — Processing image or video features in feature spaces; fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/176 — Terrestrial scenes; urban or other man-made structures
    • G06V 20/39 — Categorising the entire scene; outdoor scenes; urban scenes
    • Y02T 10/40 — Climate change mitigation technologies related to transportation; road transport of goods or passengers; internal combustion engine [ICE] based vehicles; engine management systems

Abstract

The invention belongs to the technical field of image segmentation and discloses an urban streetscape advertisement image segmentation method, which specifically comprises the following steps: collecting an urban streetscape advertisement image data set; preprocessing the images; constructing the model; training the model; and evaluating segmentation performance. The method models the global context information of urban streetscape advertisement images, realizing more accurate urban streetscape advertisement image segmentation. It addresses the high computational complexity and cost of advertisement image segmentation by introducing the CSWin Transformer to construct the encoder for feature extraction, reducing the computational overhead while modeling global information. The invention provides a feature fusion module that better fuses the detail features from the encoder with the semantic information of the decoder; an ASPP multi-scale fusion module is provided at the skip connections, which benefits the extraction of deep semantic information; and the enhanced segmentation head module helps improve segmentation accuracy.

Description

Urban streetscape advertisement image segmentation method
Technical Field
The invention belongs to the technical field of image segmentation, and particularly relates to an urban streetscape advertisement image segmentation method.
Background
Urban streetscape images serve as an important background element of travel advertisements and play an important role in advertising: they act as a geographic identifier that makes an advertisement more concrete and intuitive, add geographic sentiment, and raise its cultural value. Segmenting urban streetscape images, that is, performing pixel-level class labelling, therefore has important application value in the field of advertising. Image segmentation can separate the different elements of an advertisement so that they can be edited and composited in post-production, making advertisement production and delivery more accurate and efficient.
In recent years, with the development of unmanned aerial vehicle technology and modern satellite remote sensing, urban streetscape images have advanced in resolution, observation scale and imaging mode; they exhibit complex backgrounds, higher resolution and richer spatial detail and texture information, which makes accurate segmentation of urban streetscape images feasible. However, because ground objects in urban streetscapes vary greatly in scale, different classes are highly similar and objects occlude one another, segmenting urban streetscape advertisement images remains difficult.
Currently, with the development of hardware such as chips and graphics processing units, deep learning has achieved remarkable results in image processing fields such as image segmentation. Convolutional neural networks (CNNs) have a strong ability to capture detailed localization information and to represent hierarchical image features, and have become the mainstream technology for urban streetscape image segmentation. However, the limited receptive field of the convolution operation makes it difficult to model the global context information of an image or to build long-range semantic dependencies, so the segmentation results are unsatisfactory for urban streetscape advertisement images with complex backgrounds, blurred ground-object semantics and high resolution. The Transformer and the Swin Transformer have strong global modeling capability, extracting and modeling the global information of an image, and have opened a new line of research in computer vision. Although the Transformer and Swin Transformer models can effectively model global information, their computational complexity is high, which severely limits their applicability to high-resolution urban streetscape images.
Based on the above analysis, the invention provides an urban streetscape advertisement image segmentation method that models the global information of urban streetscape advertisement images while keeping the computational complexity within the limits the task can tolerate. The model adopts a U-shaped network structure: the CSWin Transformer, with low computational complexity and strong global modeling capability, is used to construct the encoder for feature extraction; a CNN-based decoder is used to recover the feature maps; and skip connections are added between the encoder and the decoder. In particular, to better fuse the local semantic features from the encoder with the global semantic features from the deep network, a feature fusion module is designed at each stage of the CNN decoder; an atrous spatial pyramid pooling (Atrous Spatial Pyramid Pooling, ASPP) multi-scale feature fusion module is designed at the skip connections to facilitate global semantic understanding; finally, an enhanced segmentation head is proposed to achieve efficient segmentation with a lightweight attention mechanism.
The prior art has the following disadvantages:
1) Take the classical Unet as an example of an existing CNN-architecture network model. Because the Unet model uses CNN convolution operations, whose receptive field is limited, it is difficult to model the global context information of urban streetscape advertisement images and impossible to build long-range semantic dependencies. 2) The prior art SwinUnet also has shortcomings. This technique improves on the Unet model by introducing the Swin Transformer in the encoder stage to model the global image and obtain global image context information; the decoder stage is also built with the Swin Transformer to perform up-sampling recovery of the feature maps. Its disadvantages are as follows: A. The computational complexity of the Swin Transformer is very high; building the entire network with it makes the model huge and difficult to train. B. The Swin Transformer has strong modeling capability for global information and can extract deep semantic information, but its extraction of local information is weaker than a CNN, and using it for up-sampling recovery in the decoder stage is less efficient than a CNN. C. Because SwinUnet was designed for the medical image field, the model is not tailored to the urban streetscape advertisement image scenario and cannot handle the large scale differences, high similarity and mutual occlusion of ground objects in urban streets, so it performs poorly when applied to urban streetscape advertisement images.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides an urban streetscape advertisement image segmentation method.
The invention is realized in such a way that an urban streetscape advertisement image segmentation method specifically comprises the following steps:
S1: collecting an urban streetscape advertisement image data set;
S2: preprocessing the images;
S3: constructing an image model based on the CSWin Transformer;
S4: training the model;
S5: evaluating urban streetscape advertisement image segmentation performance.
Further, S1 specifically includes:
selecting the aerial remote sensing high-resolution image datasets of the Vaihingen and Potsdam regions of Germany provided by ISPRS. The images in the datasets come with manually annotated ground-object class label maps, with five foreground classes (impervious surfaces, buildings, low vegetation, trees and cars) and one background class. Vaihingen is a small, scattered village; its dataset contains 33 urban streetscape images of different sizes with an average size of 2494×2064 pixels. The images with IDs 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35 and 38 are selected as the test set, and the remaining 16 images form the training set. Potsdam is a typical historic city with large building complexes, narrow streets and dense building structures; its dataset contains 38 urban streetscape images of the same size, 6000×6000 pixels each. The images numbered 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15 and 7_13 are selected as the test set, and the remaining 24 images form the training set.
Further, S2 specifically includes:
S201: image cropping: because the image sizes are inconsistent, to facilitate subsequent network training the urban streetscape pictures are first cropped; the training-set data images are cropped with a 256×256 window;
S202: data enhancement: to improve the robustness and generalization ability of the model, random scaling, random vertical flipping and random horizontal flipping are applied to all pictures in the training data set.
Further, S3 specifically includes:
the urban streetscape advertisement image segmentation method adopts a simple and effective U-shaped network structure overall, mainly comprising four parts: an encoder, a decoder, skip connections and a segmentation head;
the CSWin Transformer-based image model comprises a CSWin Transformer module, a feature fusion module, an ASPP multi-scale feature fusion module and an enhanced segmentation head module.
Further, the overall architecture of the urban streetscape advertisement image segmentation method is as follows:
For a given urban streetscape image $X \in \mathbb{R}^{H \times W \times 3}$, a token embedding layer in stage 1, consisting of a 7×7 convolution with stride 4, first produces a picture-block token sequence of spatial size $\frac{H}{4} \times \frac{W}{4}$ with C channels, and global information is then learned by the CSWin Transformer module. To obtain a multi-scale, hierarchical feature representation, the encoder is divided into four stages, each comprising a downsampling module consisting of a 3×3 convolution with stride 2 and a CSWin Transformer module built from CSWin Transformer blocks; the number of CSWin Transformer blocks in stage $i$ is $N_i$. The downsampling module reduces the number of tokens and doubles the number of channels. Thus, for the $i$-th stage, the feature map formed by the corresponding tokens has spatial size $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ and $2^{i-1}C$ channels, which is consistent with the backbone architecture of other common convolutional neural networks. After the four encoder stages, a feature map of size $\frac{H}{32} \times \frac{W}{32}$ is obtained and fed into the decoder stage. The decoder mirrors the encoder and also comprises four stages, each containing a CNN upsampling module and a feature fusion module. The CNN upsampling module consists of a deconvolution that doubles the feature-map size and halves the number of channels. The feature fusion module uses convolutions to design a lightweight attention mechanism that fuses the low-level detail features from the encoder and the high-level semantic features in an adaptive-weight manner. In the four corresponding encoder and decoder stages, following the classical Unet network design, four skip connections are added to assist the recovery of positional and other detail information. Because the features in stage 3 and stage 4 have larger receptive fields and contain rich deep semantic features, understanding the deep semantic information at multiple scales helps the model better understand multi-scale object information; therefore, in the skip connections of stage 3 and stage 4, an ASPP multi-scale feature fusion module is designed based on the attention mechanism. Finally, the outputs of the four decoder stages are upsampled to a uniform size and jointly fed, as inputs, into the enhanced segmentation head, which outputs, through a convolution and a ReLU activation function, a segmentation map with the same resolution as the original input image.
Further, the CSWin Transformer module is as follows:
The CSWin Transformer serves as the encoder backbone network of the urban streetscape advertisement image segmentation network. The network has a cross-shaped window self-attention mechanism, which can effectively model global context information while effectively reducing the computational cost; the cross-shaped window is formed by stripe windows in the horizontal and vertical directions. For the horizontal direction, the input $X \in \mathbb{R}^{(H \times W) \times C}$ is divided into $M$ non-overlapping horizontal stripes, i.e. $X = \left[X^1, X^2, \dots, X^M\right]$, where each stripe contains $sw \times W$ tokens; in particular, the stripe width sw in each stage can be adjusted according to the computational complexity and the model configuration, and its size is not fixed. Suppose the dimension of the query Q, key K and value V in the Transformer is $d_k$ and the number of attention heads is $K$; then the horizontal attention result $\text{H-Attention}_k(X)$ is defined as follows:

$X = \left[X^1, X^2, \dots, X^M\right], \quad X^i \in \mathbb{R}^{(sw \times W) \times C}, \quad M = H / sw$ (1)

$Y_k^i = \text{Attention}\left(X^i W_k^Q,\; X^i W_k^K,\; X^i W_k^V\right), \quad i = 1, \dots, M$ (2)

$\text{H-Attention}_k(X) = \left[Y_k^1, Y_k^2, \dots, Y_k^M\right]$ (3)

where $X$ denotes the input feature map, $Y_k^i$ denotes the self-attention result of the $i$-th stripe in the $k$-th head, $W_k^Q, W_k^K, W_k^V \in \mathbb{R}^{C \times d_k}$ denote the query, key and value projection matrices of the $k$-th attention head, $d_k$ is set to $C / K$, sw denotes the width of each stripe, W denotes the width of the input feature map, M denotes the number of stripes into which the feature map is divided, and H denotes the height of the feature map. Correspondingly, the vertical attention result, denoted $\text{V-Attention}_k(X)$, is defined analogously to the horizontal direction. Finally, the attentions in the two directions are concatenated to form the self-attention result $\text{CSWin-Attention}$:

$\text{CSWin-Attention}(X) = \text{Concat}\left(\text{head}_1, \dots, \text{head}_K\right) W^O$ (4)

$\text{head}_k = \begin{cases} \text{H-Attention}_k(X), & k = 1, \dots, K/2 \\ \text{V-Attention}_k(X), & k = K/2 + 1, \dots, K \end{cases}$ (5)

where Concat denotes the concatenation operation, $\text{head}_k$ denotes the $k$-th attention head, $K$ denotes the number of attention heads, $W^O \in \mathbb{R}^{C \times C}$ is the projection matrix that maps the self-attention result to the target dimension C, $\text{H-Attention}_k$ denotes the horizontal attention result, and $\text{V-Attention}_k$ denotes the vertical attention result. From this, the computation of the CSWin Transformer block in the encoder is obtained as:

$\hat{X}^l = \text{CSWin-Attention}\left(\text{LN}\left(X^{l-1}\right)\right) + X^{l-1}$ (6)

$X^l = \text{MLP}\left(\text{LN}\left(\hat{X}^l\right)\right) + \hat{X}^l$ (7)

where LN denotes layer normalization, MLP denotes a multi-layer perceptron, $\hat{X}^l$ denotes the output feature of the CSWin-Attention self-attention, and $X^l$ denotes the output feature of the multi-layer perceptron.
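The stripe-window attention of equations (1)–(5) can be illustrated with the following PyTorch sketch. It is a simplified illustration written for this description, not the patent's implementation: the class and function names (`CSWinAttention`, `stripe_attention`) are assumptions, the locally-enhanced positional encoding used in the published CSWin Transformer is omitted, and the feature-map height and width are assumed to be divisible by the stripe width sw.

```python
import torch
import torch.nn as nn


def stripe_attention(x, w_q, w_k, w_v, sw, horizontal=True):
    """Self-attention inside horizontal or vertical stripes of width sw.

    x: (B, H, W, C) feature map; w_q/w_k/w_v: (C, d_k) projection matrices.
    Returns a tensor of shape (B, H, W, d_k).
    """
    b, h, w, c = x.shape
    if horizontal:                      # stripes of shape (sw, W)
        x = x.reshape(b, h // sw, sw * w, c)
    else:                               # stripes of shape (H, sw)
        x = x.permute(0, 2, 1, 3).reshape(b, w // sw, sw * h, c)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    y = attn @ v                        # per-stripe attention result
    if horizontal:
        y = y.reshape(b, h, w, -1)
    else:
        y = y.reshape(b, w, h, -1).permute(0, 2, 1, 3)
    return y


class CSWinAttention(nn.Module):
    """Cross-shaped window attention: half of the heads attend in horizontal
    stripes, the other half in vertical stripes; the results are concatenated
    and projected back to dimension C (equations (4)-(5))."""

    def __init__(self, dim, num_heads, sw):
        super().__init__()
        assert num_heads % 2 == 0
        self.num_heads, self.sw = num_heads, sw
        d_k = dim // num_heads
        self.wq = nn.Parameter(torch.randn(num_heads, dim, d_k) * 0.02)
        self.wk = nn.Parameter(torch.randn(num_heads, dim, d_k) * 0.02)
        self.wv = nn.Parameter(torch.randn(num_heads, dim, d_k) * 0.02)
        self.proj = nn.Linear(dim, dim)   # W^O

    def forward(self, x):                 # x: (B, H, W, C)
        heads = []
        for k in range(self.num_heads):
            horizontal = k < self.num_heads // 2
            heads.append(stripe_attention(x, self.wq[k], self.wk[k],
                                           self.wv[k], self.sw, horizontal))
        return self.proj(torch.cat(heads, dim=-1))


if __name__ == "__main__":
    attn = CSWinAttention(dim=64, num_heads=4, sw=2)
    print(attn(torch.randn(1, 8, 8, 64)).shape)   # torch.Size([1, 8, 8, 64])
```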
Further, the feature fusion module uses a lightweight attention mechanism to adaptively select and enhance either the low-level detail features or the high-level semantic information, so as to better fuse the low-level detail features from the encoder with the high-level semantic information. The feature fusion module takes both the low-level detail information from the encoder and the high-level semantic information from the decoder as input. For the $i$-th stage, the low-level detail information $F_l$ passes through a convolution and a batch normalization (BatchNorm, BN) layer to obtain the output $F_l^{'}$; the high-level semantic information $F_h$ serves as the input to two branches: one branch passes through a convolution, a batch normalization layer and a Sigmoid activation layer to generate the high-level semantic weight $W_h$, which is multiplied with the low-level detail output $F_l^{'}$; the other branch passes through a convolution and a batch normalization layer to obtain $F_h^{'}$. After the low-level detail branch result $F_l^{'}$ is multiplied by the semantic weight $W_h$, the product is added to $F_h^{'}$ to obtain the final output $F_{out}$ of the feature fusion module. The specific formulas are as follows:

$F_l^{'} = \text{BN}\left(\text{Conv}\left(F_l\right)\right)$ (8)

$W_h = \sigma\left(\text{BN}\left(\text{Conv}\left(F_h\right)\right)\right)$ (9)

$F_h^{'} = \text{BN}\left(\text{Conv}\left(F_h\right)\right)$ (10)

$F_{out} = F_l^{'} \otimes W_h + F_h^{'}$ (11)

where $\sigma$ denotes the Sigmoid activation function, BN denotes batch normalization, Conv denotes the convolution operation, $F_l$ denotes the low-level detail information, $F_l^{'}$ denotes the intermediate result of the low-level detail branch, $F_h$ denotes the high-level semantic information, $W_h$ denotes the high-level semantic weight, $F_h^{'}$ denotes the output of the high-level semantic branch, and $F_{out}$ denotes the final output of the feature fusion module.
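A minimal PyTorch sketch of the feature fusion module of equations (8)–(11) might look as follows; the kernel sizes of the convolutions and the module and argument names are assumptions, since the description here does not fix them.

```python
import torch
import torch.nn as nn


class FeatureFusionModule(nn.Module):
    """Fuses low-level encoder detail features with high-level decoder
    semantic features using a lightweight attention weight (eqs. (8)-(11))."""

    def __init__(self, channels):
        super().__init__()
        # F_l' = BN(Conv(F_l))            -- low-level detail branch
        self.low_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels))
        # W_h = Sigmoid(BN(Conv(F_h)))    -- high-level semantic weight
        self.weight_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.Sigmoid())
        # F_h' = BN(Conv(F_h))            -- high-level semantic branch
        self.high_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels))

    def forward(self, f_low, f_high):
        f_low_out = self.low_branch(f_low)
        w_high = self.weight_branch(f_high)
        f_high_out = self.high_branch(f_high)
        # F_out = F_l' * W_h + F_h'
        return f_low_out * w_high + f_high_out


if __name__ == "__main__":
    ffm = FeatureFusionModule(channels=64)
    out = ffm(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```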
Further, the ASPP multi-scale fusion module uses attention maps to adaptively weight the multi-scale feature maps: for the target ground object, the feature map whose receptive field matches it is enhanced while the other feature maps are suppressed, specifically as follows. First, the input feature map $F$ passes through an ASPP pyramid structure with five branches, namely a convolution branch, three dilated convolution branches with different dilation coefficients (rate = 6, 8, 12), and a global average pooling branch. After the feature map passes through the five branches, five feature maps $F_i$ with the same resolution but different receptive fields are output. Each feature map passes through the attention fusion module, is multiplied by the attention map $A_i$ generated by the attention fusion module, and is added to the original input to obtain the feature map $\hat{F}_i$, as follows:

$\hat{F}_i = A_i \otimes F_i + F_i$

where $F_i$ denotes the feature map produced by one of the five branches of the ASPP pyramid, $\hat{F}_i$ denotes the feature map output by the attention fusion module, and $A_i$ denotes the attention map, which lets each pixel focus more on the pixels related to it. $A_i$ is defined as follows:

$A_i = \text{Sigmoid}\left(\text{BN}\left(\text{Conv}\left(F_i\right)\right)\right)$

where $A_i$ denotes the attention map, Conv denotes a point-wise convolution operation, BN denotes batch normalization, Sigmoid denotes the activation function, and $\otimes$ denotes element-wise multiplication of matrix elements. The formula shows that $A_i$ is differentiable; attention is generated in the channel dimension by the activation function, so the attention module can be given different weights not only in the spatial dimension but also in the channel dimension. Finally, the output feature maps of the five branches are concatenated to produce the output $F_{ASPP}$.
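The ASPP multi-scale fusion module can be sketched roughly as below. The dilation rates 6, 8 and 12 and the point-wise convolution in the attention branch follow the description; the 3×3 kernel of the dilated branches, the branch channel width and the class names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFusion(nn.Module):
    """A_i = Sigmoid(BN(Conv1x1(F_i))); output = A_i * F_i + F_i."""

    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),  # point-wise conv
            nn.BatchNorm2d(channels),
            nn.Sigmoid())

    def forward(self, x):
        return self.attn(x) * x + x


class MultiscaleASPPFusion(nn.Module):
    """ASPP pyramid with five branches (a plain conv, three dilated convs with
    rates 6/8/12, global average pooling), each followed by attention fusion;
    the five outputs are concatenated."""

    def __init__(self, in_channels, branch_channels):
        super().__init__()
        def conv_bn(dilation=1, k=3):
            pad = 0 if k == 1 else dilation
            return nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, k,
                          padding=pad, dilation=dilation),
                nn.BatchNorm2d(branch_channels), nn.ReLU(inplace=True))
        self.branch1 = conv_bn(k=1)                      # convolution branch
        self.branch2 = conv_bn(dilation=6)               # dilated conv, rate 6
        self.branch3 = conv_bn(dilation=8)               # dilated conv, rate 8
        self.branch4 = conv_bn(dilation=12)              # dilated conv, rate 12
        self.pool_branch = nn.Sequential(                # global average pooling
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, branch_channels, 1), nn.ReLU(inplace=True))
        self.fusions = nn.ModuleList(
            [AttentionFusion(branch_channels) for _ in range(5)])

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.branch1(x), self.branch2(x), self.branch3(x),
                 self.branch4(x),
                 F.interpolate(self.pool_branch(x), size=(h, w),
                               mode="bilinear", align_corners=False)]
        fused = [fuse(f) for fuse, f in zip(self.fusions, feats)]
        return torch.cat(fused, dim=1)   # concatenate the five branch outputs


if __name__ == "__main__":
    maspp = MultiscaleASPPFusion(in_channels=256, branch_channels=64)
    print(maspp(torch.randn(1, 256, 16, 16)).shape)  # torch.Size([1, 320, 16, 16])
```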
Further, after the feature fusion module and the ASPP multi-scale fusion module, the feature map of each stage in the decoder contains rich spatial position information and semantic information, both of which are critical for remote sensing urban scene images. In the enhanced segmentation head module, the low-resolution feature maps of the four decoder stages are first upsampled to the same high resolution and added element-wise, and the number of channels is then adjusted by two convolution layers to generate the final semantic segmentation map.
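A hedged sketch of the enhanced segmentation head follows. The per-stage projection convolutions (needed so feature maps with different channel counts can be summed), the kernel sizes and the channel counts are assumptions; only the upsample, element-wise add and two-convolution structure follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EnhancedSegmentationHead(nn.Module):
    """Upsamples the four decoder-stage feature maps to a common resolution,
    sums them element-wise, and produces the segmentation map with two
    convolution layers."""

    def __init__(self, stage_channels, mid_channels, num_classes):
        super().__init__()
        # project each stage to a common channel count so they can be summed
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=1) for c in stage_channels])
        self.head = nn.Sequential(
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, num_classes, kernel_size=1))

    def forward(self, feats, out_size):
        summed = 0
        for proj, f in zip(self.projs, feats):
            summed = summed + F.interpolate(proj(f), size=out_size,
                                            mode="bilinear", align_corners=False)
        return self.head(summed)


if __name__ == "__main__":
    head = EnhancedSegmentationHead([64, 128, 256, 512], 64, num_classes=6)
    feats = [torch.randn(1, c, s, s) for c, s in
             zip([64, 128, 256, 512], [64, 32, 16, 8])]
    print(head(feats, out_size=(64, 64)).shape)  # torch.Size([1, 6, 64, 64])
```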
Further, S4 specifically includes: during training, the pictures in the training set and their corresponding annotation masks are input into the CSWin Transformer-based urban streetscape advertisement image segmentation model for training, and the network model is optimized.
Training uses an NVIDIA 3090Ti GPU; the CSWin Transformer is initialized with network parameters pre-trained on the ImageNet data set; the model is optimized with the AdamW optimizer; the learning rate is set to 6e-4 and adjusted with a cosine strategy. The Dice loss $L_{dice}$ and the cross-entropy loss $L_{ce}$ jointly supervise the training of the model, and the total loss $L$ is calculated as:

$L = L_{dice} + L_{ce}$

where $L$ denotes the total loss, $L_{dice}$ denotes the Dice loss, and $L_{ce}$ denotes the cross-entropy loss.
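The joint Dice and cross-entropy supervision can be expressed as the following PyTorch sketch; the class name and the Dice smoothing constant are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiceCELoss(nn.Module):
    """Joint supervision with Dice loss and cross-entropy loss: L = L_dice + L_ce."""

    def __init__(self, num_classes, smooth=1.0):
        super().__init__()
        self.num_classes = num_classes
        self.smooth = smooth
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) with class indices
        ce_loss = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)
        intersection = (probs * one_hot).sum(dims)
        cardinality = probs.sum(dims) + one_hot.sum(dims)
        dice = (2.0 * intersection + self.smooth) / (cardinality + self.smooth)
        dice_loss = 1.0 - dice.mean()
        return dice_loss + ce_loss


if __name__ == "__main__":
    criterion = DiceCELoss(num_classes=6)
    logits = torch.randn(2, 6, 64, 64, requires_grad=True)
    target = torch.randint(0, 6, (2, 64, 64))
    loss = criterion(logits, target)
    loss.backward()
    print(loss.item())
```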
Further, S5 specifically includes:
the mean intersection over union (mIoU) and the overall accuracy (OA) are mainly adopted as evaluation indexes for urban scene remote sensing image segmentation performance:

$\text{mIoU} = \frac{TP}{TP + FP + FN}$

where mIoU denotes the mean intersection over union (the per-class intersection over union averaged over all classes), $TP$ denotes the number of correctly classified building pixels, $FP$ denotes the number of misclassified background pixels, and $FN$ denotes the number of misclassified building pixels.

$\text{OA} = \frac{TP + TN}{TP + TN + FP + FN}$

where OA denotes the overall accuracy, $TP$ denotes the number of correctly classified building pixels, $TN$ denotes the number of correctly classified background pixels, $FP$ denotes the number of misclassified background pixels, and $FN$ denotes the number of misclassified building pixels.
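For illustration, mIoU and OA can be computed from a per-class confusion matrix as in the following sketch; the function names and the use of NumPy are assumptions, while the formulas follow the definitions above.

```python
import numpy as np


def confusion_matrix(pred, label, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from
    predicted and ground-truth label maps (integer arrays of equal shape)."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)


def miou_and_oa(conf):
    """mIoU = mean over classes of TP / (TP + FP + FN); OA = correct / total."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as the class but labelled otherwise
    fn = conf.sum(axis=1) - tp   # labelled as the class but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)
    miou = iou.mean()
    oa = tp.sum() / np.maximum(conf.sum(), 1)
    return miou, oa


if __name__ == "__main__":
    pred = np.random.randint(0, 6, (256, 256))
    label = np.random.randint(0, 6, (256, 256))
    conf = confusion_matrix(pred, label, num_classes=6)
    print(miou_and_oa(conf))
```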
In combination with the technical scheme described above and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:
First, with regard to the technical problems in the prior art and the difficulty of solving them, the technical problems solved by the technical scheme of the invention are analysed in detail, in close combination with the technical scheme to be protected and the results and data obtained during research and development, and the technical effects brought about after solving these problems are creative. Specifically: A. An urban streetscape advertisement image segmentation method is provided.
B. Because the method uses the CSWin Transformer as the basic unit of the encoder to construct a U-shaped semantic segmentation network structure, global modeling can be performed using the global context information of the urban streetscape advertisement image, realizing more accurate semantic segmentation of urban streetscape images.
C. Because the image or feature map is divided into stripe windows in the CSWin Transformer module and self-attention is computed within the stripe windows, the high resolution and high processing complexity of urban streetscape images can be handled, reducing the amount of computation and the complexity.
D. For the characteristics of urban streetscape images, a feature fusion module, an ASPP multi-scale feature fusion module and an enhanced segmentation head are designed, improving the segmentation accuracy.
Secondly, considering the technical scheme as a whole or from the perspective of a product, the technical scheme to be protected has the following technical effects and advantages:
A. Global information is modeled by using the global context information of the urban streetscape advertisement image, realizing more accurate urban streetscape advertisement image segmentation.
B. The method addresses the high computational complexity and high computational overhead of the Swin Transformer by introducing the CSWin Transformer method to construct the encoder for feature extraction, reducing the computational overhead while modeling global information.
The following three targeted modules are designed for the characteristics of urban streetscape advertisement images.
C. In the decoder stage, a feature fusion module is provided that better fuses the detail features from the encoder with the deep semantic information of the decoder.
D. An ASPP multi-scale fusion module is provided at the skip connections, which benefits the extraction of deep semantic information.
E. The enhanced segmentation head module is more suitable for the segmentation task of urban streetscape advertisement images and helps improve segmentation accuracy.
Thirdly, as supplementary evidence of the inventiveness of the claims of the present invention, the following important aspects are also presented:
(1) The technical scheme of the invention fills a technical gap in the industry at home and abroad:
current image segmentation methods are mainly applied to medical images, remote sensing images, indoor object images and the like, and an effective segmentation method for urban streetscape advertisement images is lacking. Advertisement images usually have high resolution, blurred backgrounds and unclear boundary semantics, so directly applying image segmentation methods from other fields gives unsatisfactory results. The invention analyses the characteristics of urban streetscape advertisement images one by one and accordingly designs a CSWin Transformer-based urban streetscape advertisement image segmentation method, which can accurately and effectively extract the urban streetscape in advertisement images and can be trained efficiently, filling the gap in this field in the industry at home and abroad.
(2) Whether the technical scheme of the invention solves a technical problem that people have always wanted to solve but have never succeeded in solving:
currently, there are two main approaches to the task of high-resolution image segmentation. The first is the CNN-based network architecture, which has low computational complexity but cannot model global information, making it difficult to solve problems such as semantic ambiguity and unclear boundary segmentation in high-resolution image segmentation. The other is the Transformer-based network architecture, which can model global information and effectively alleviate the problems of blurred semantics and hard-to-infer categories, but the model is too large and the computational complexity is high.
1) CNN-based network architecture: it is mainly built with fully convolutional networks (FCN), adopts a U-shaped symmetric encoder-decoder structure, and adds skip connections between the encoder and decoder for feature concatenation to assist the recovery of positional information. Although convolutional networks are the mainstream method for image segmentation, the convolution receptive field is limited, so the global context information of an image cannot be captured well and the problem of semantic ambiguity in image segmentation cannot be solved.
2) Transformer-based network architecture: techniques at home and abroad that model the global context information of an image have built U-shaped network architectures with ViT, but because ViT encodes global context information, its computational complexity is high and it cannot be applied directly to high-resolution image segmentation tasks. Furthermore, ViT extracts features from a single-scale feature map without multi-scale feature information, so its segmentation of objects of various sizes in an image is poor. Later, researchers proposed constructing semantic segmentation networks with the Swin Transformer; compared with ViT, the Swin Transformer can extract multi-scale feature information and reduces the computational complexity to a certain extent, but the complexity still exceeds that of CNN architectures.
The invention constructs an image segmentation network based on the CSWin Transformer and combines the CSWin Transformer with a CNN architecture, which can both model global information for accurate semantic segmentation and effectively reduce the computational complexity of the model; it is a network model with good segmentation performance and low computational complexity, and an effective solution for balancing segmentation performance and computational complexity.
Drawings
FIG. 1 is a flow chart of the urban streetscape advertisement image segmentation method provided by an embodiment of the present invention;
FIG. 2 is an overall architecture diagram of the model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the CSWin Transformer cross-shaped window attention mechanism provided by an embodiment of the present invention;
FIG. 4 is a block diagram of the feature fusion module provided by an embodiment of the present invention;
FIG. 5 is a block diagram of the ASPP multi-scale fusion module provided by an embodiment of the present invention;
FIG. 6 is a block diagram of the enhanced segmentation head provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order that those skilled in the art can fully understand how the invention is implemented, this section gives illustrative embodiments of the invention as claimed.
As shown in fig. 1, an embodiment of the present invention provides an urban streetscape advertisement image segmentation method, which specifically comprises:
S1: collecting an urban streetscape advertisement image data set;
S2: preprocessing the images;
S3: constructing an image model based on the CSWin Transformer;
S4: training the model;
S5: evaluating urban streetscape advertisement image segmentation performance.
(1) Urban streetscape advertisement image data set collection.
The aerial remote sensing high-resolution image datasets of the Vaihingen and Potsdam regions of Germany provided by ISPRS are selected. The images in the datasets come with manually annotated ground-object class label maps, with five foreground classes (impervious surfaces, buildings, low vegetation, trees and cars) and one background class.
Vaihingen is a small, scattered village; the dataset contains 33 urban streetscape images of different sizes with an average size of 2494×2064 pixels. The images with IDs 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35 and 38 are selected as the test set, and the remaining 16 images form the training set.
Potsdam is a typical historic city with large building complexes, narrow streets and dense building structures. The dataset contains 38 urban streetscape images of the same size, 6000×6000 pixels each. The images numbered 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15 and 7_13 are selected as the test set, and the remaining 24 images form the training set.
(2) Image preprocessing.
Image preprocessing mainly comprises the following steps:
Image cropping: because the image sizes are inconsistent, to facilitate subsequent network training the urban streetscape pictures are first cropped; the training-set data images are cropped with a 256×256 window.
Data enhancement: to improve the robustness and generalization ability of the model, data enhancement techniques such as random scaling (with scales [0.5, 0.75, 1.0, 1.25, 1.5]), random vertical flipping and random horizontal flipping are applied to all pictures in the training data set.
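A possible implementation of the cropping and augmentation steps using Pillow is sketched below; the function names, the non-overlapping tiling strategy and the 0.5 flip probabilities are assumptions, while the 256×256 window and the scale set [0.5, 0.75, 1.0, 1.25, 1.5] follow the description.

```python
import random
import numpy as np
from PIL import Image


def crop_tiles(image, label, tile=256):
    """Cut an image and its label map into non-overlapping tile x tile patches."""
    w, h = image.size
    patches = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            box = (left, top, left + tile, top + tile)
            patches.append((image.crop(box), label.crop(box)))
    return patches


def augment(image, label, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Random scaling and random vertical/horizontal flips, applied jointly
    to the image and its label map."""
    s = random.choice(scales)
    size = (int(image.width * s), int(image.height * s))
    image = image.resize(size, Image.BILINEAR)
    label = label.resize(size, Image.NEAREST)   # labels must not be interpolated
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_TOP_BOTTOM)
        label = label.transpose(Image.FLIP_TOP_BOTTOM)
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
        label = label.transpose(Image.FLIP_LEFT_RIGHT)
    return image, label


if __name__ == "__main__":
    img = Image.fromarray(np.zeros((512, 512, 3), dtype=np.uint8))
    lbl = Image.fromarray(np.zeros((512, 512), dtype=np.uint8))
    tiles = crop_tiles(img, lbl)
    aug_img, aug_lbl = augment(*tiles[0])
    print(len(tiles), aug_img.size)   # 4 tiles; the size depends on the random scale
```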
(3) Overall structure of the urban streetscape advertisement image segmentation method.
The urban streetscape advertisement image segmentation method adopts a simple and effective U-shaped network structure overall, mainly comprising an encoder, a decoder, skip connections and a segmentation head. The overall architecture is first introduced, as shown in fig. 2, and then the four key modules in the model, namely the CSWin Transformer module, the feature fusion module, the ASPP multi-scale feature fusion module and the enhanced segmentation head module, are introduced in turn.
A. Overall architecture.
For a given urban streetscape image $X \in \mathbb{R}^{H \times W \times 3}$, a token embedding layer in stage 1, consisting of a 7×7 convolution with stride 4, first produces a picture-block token sequence of spatial size $\frac{H}{4} \times \frac{W}{4}$ with C channels, and global information is then learned by the CSWin Transformer module. To obtain a multi-scale, hierarchical feature representation, the encoder is divided into four stages, each comprising a downsampling module consisting of a 3×3 convolution with stride 2 and a CSWin Transformer module built from CSWin Transformer blocks; the number of CSWin Transformer blocks in stage $i$ is $N_i$. The downsampling module reduces the number of tokens and doubles the number of channels. Thus, for the $i$-th stage, the feature map formed by the corresponding tokens has spatial size $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ and $2^{i-1}C$ channels, which is consistent with other common CNN backbone architectures.
After the four encoder stages, a feature map of size $\frac{H}{32} \times \frac{W}{32}$ is obtained and fed into the decoder stage. The decoder mirrors the encoder and also comprises four stages, each containing a CNN upsampling module and a feature fusion module. The CNN upsampling module consists of a deconvolution that doubles the feature-map size and halves the number of channels. The feature fusion module uses convolutions to design a lightweight attention mechanism that fuses the low-dimensional detail features and high-dimensional semantic features from the encoder in an adaptive-weight manner.
In the four corresponding encoder and decoder stages, following the classical Unet network design, four skip connections are added to assist the recovery of positional and other detail information. Because the features in stage 3 and stage 4 have larger receptive fields and contain rich deep semantic features, understanding the deep semantic information at multiple scales helps the model better understand multi-scale object information. Therefore, in the skip connections of stage 3 and stage 4, an ASPP multi-scale feature fusion module is designed herein based on the attention mechanism.
Finally, the outputs of the four decoder stages are upsampled to a uniform size and jointly fed, as inputs, into the enhanced segmentation head, which outputs, through a convolution and a ReLU activation function, a segmentation map with the same resolution as the original input image.
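The overall wiring of the U-shaped network can be summarised by the following structural sketch. It is not the patent's implementation: `CSWinStage` uses a plain convolution as a placeholder for the stack of CSWin Transformer blocks, the 1×1 `fuse` convolutions stand in for the feature fusion module, the MASPP module in the skip connections of stages 3 and 4 and the enhanced segmentation head are omitted or simplified, and the channel widths are assumptions; only the stage and stride structure (stride-4 embedding, three stride-2 downsamplings, symmetric decoder with skip connections, final ×4 upsampling) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSWinStage(nn.Module):
    """Downsampling convolution followed by a placeholder for CSWin blocks."""

    def __init__(self, in_ch, out_ch, downsample):
        super().__init__()
        stride = 2 if downsample else 4            # stage 1 embeds with stride 4
        k = 3 if downsample else 7
        self.down = nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2)
        self.blocks = nn.Sequential(               # placeholder for CSWin blocks
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU())

    def forward(self, x):
        return self.blocks(self.down(x))


class UShapedCSWinNet(nn.Module):
    def __init__(self, num_classes=6, c=64):
        super().__init__()
        chs = [c, 2 * c, 4 * c, 8 * c]
        ins = [3] + chs[:-1]
        self.enc = nn.ModuleList(
            [CSWinStage(i, o, downsample=(n > 0))
             for n, (i, o) in enumerate(zip(ins, chs))])
        self.up = nn.ModuleList(                   # doubles size, halves channels
            [nn.ConvTranspose2d(chs[i], chs[i - 1], 2, stride=2)
             for i in range(3, 0, -1)])
        self.fuse = nn.ModuleList(                 # placeholder for the FFM
            [nn.Conv2d(chs[i - 1], chs[i - 1], 1) for i in range(3, 0, -1)])
        self.head = nn.Conv2d(c, num_classes, 1)   # placeholder for the ESegH

    def forward(self, x):
        feats = []
        for stage in self.enc:
            x = stage(x)
            feats.append(x)
        y = feats[-1]
        for up, fuse, skip_feat in zip(self.up, self.fuse, feats[-2::-1]):
            y = fuse(up(y) + skip_feat)            # skip connection + fusion
        y = self.head(y)
        return F.interpolate(y, scale_factor=4, mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    net = UShapedCSWinNet()
    print(net(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 6, 256, 256])
```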
B. CSWin Transformer module.
The CSWin Transformer is used as the encoder backbone network of the urban streetscape advertisement image segmentation network. The network has a cross-shaped window self-attention mechanism, which can effectively model global context information while effectively reducing the computational cost. It is divided into a cross-shaped window by stripe windows in the horizontal and vertical directions, as shown in fig. 3.
For the horizontal direction, the input $X \in \mathbb{R}^{(H \times W) \times C}$ is divided into $M$ non-overlapping horizontal stripes, i.e. $X = \left[X^1, X^2, \dots, X^M\right]$, where each stripe contains $sw \times W$ tokens. In particular, the stripe width sw in each stage can be adjusted according to the computational complexity and the model configuration, and its size is not fixed. Suppose the dimension of the query Q, key K and value V in the Transformer is $d_k$ and the number of attention heads is $K$; then the horizontal attention result $\text{H-Attention}_k(X)$ is defined as follows:

$X = \left[X^1, X^2, \dots, X^M\right], \quad X^i \in \mathbb{R}^{(sw \times W) \times C}, \quad M = H / sw$ (1)

$Y_k^i = \text{Attention}\left(X^i W_k^Q,\; X^i W_k^K,\; X^i W_k^V\right), \quad i = 1, \dots, M$ (2)

$\text{H-Attention}_k(X) = \left[Y_k^1, Y_k^2, \dots, Y_k^M\right]$ (3)

where $X$ denotes the input feature map, $Y_k^i$ denotes the self-attention result of the $i$-th stripe in the $k$-th head, $W_k^Q, W_k^K, W_k^V \in \mathbb{R}^{C \times d_k}$ denote the query, key and value projection matrices of the $k$-th attention head, $d_k$ is set to $C / K$, sw denotes the width of each stripe, W denotes the width of the input feature map, and M denotes the number of stripes into which the feature map is divided. Correspondingly, the vertical attention result, denoted $\text{V-Attention}_k(X)$, is defined analogously to the horizontal direction. Finally, the attentions in the two directions are concatenated to form the self-attention result $\text{CSWin-Attention}$:

$\text{CSWin-Attention}(X) = \text{Concat}\left(\text{head}_1, \dots, \text{head}_K\right) W^O$ (4)

$\text{head}_k = \begin{cases} \text{H-Attention}_k(X), & k = 1, \dots, K/2 \\ \text{V-Attention}_k(X), & k = K/2 + 1, \dots, K \end{cases}$ (5)

where Concat denotes the concatenation operation, $\text{head}_k$ denotes the $k$-th attention head, $K$ denotes the number of attention heads, $W^O \in \mathbb{R}^{C \times C}$ is the projection matrix that maps the self-attention result to the target dimension C, $\text{H-Attention}_k$ denotes the horizontal attention result, and $\text{V-Attention}_k$ denotes the vertical attention result. From this, the computation of the CSWin Transformer block in the encoder is obtained as:

$\hat{X}^l = \text{CSWin-Attention}\left(\text{LN}\left(X^{l-1}\right)\right) + X^{l-1}$ (6)

$X^l = \text{MLP}\left(\text{LN}\left(\hat{X}^l\right)\right) + \hat{X}^l$ (7)

where LN denotes layer normalization, MLP denotes a multi-layer perceptron, $\hat{X}^l$ denotes the output feature of the CSWin-Attention self-attention, and $X^l$ denotes the output feature of the multi-layer perceptron.
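A minimal sketch of the block-level computation of equations (6) and (7), with layer normalization, the cross-shaped window attention, residual connections and an MLP; the module names and the MLP expansion ratio are assumptions for illustration, and the attention argument stands in for the CSWin-Attention described above.

```python
import torch
import torch.nn as nn


class CSWinBlock(nn.Module):
    """Block-level computation of equations (6) and (7):
    X_hat = CSWin-Attention(LN(X)) + X;  X_out = MLP(LN(X_hat)) + X_hat."""

    def __init__(self, dim, attention, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attention                 # e.g. the CSWinAttention sketch above
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                     # x: (B, H, W, C) token map
        x = x + self.attn(self.norm1(x))      # equation (6)
        x = x + self.mlp(self.norm2(x))       # equation (7)
        return x


if __name__ == "__main__":
    # an identity attention keeps the sketch self-contained; in the model it
    # would be the cross-shaped window attention
    block = CSWinBlock(dim=64, attention=nn.Identity())
    print(block(torch.randn(1, 8, 8, 64)).shape)  # torch.Size([1, 8, 8, 64])
```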
C. Feature fusion module (Feature Fusion Module, FFM).
In stages 1, 2 and 3 of the decoder, a feature fusion module is designed to adaptively select and enhance either low-level detail features or high-level semantic information through a lightweight attention mechanism, so as to better fuse the low-level detail features and the high-level semantic information from the encoder; the structure of the module is shown in fig. 4.
The feature fusion module takes both the low-level detail information from the encoder and the high-level semantic information from the decoder as input. For the $i$-th stage, the low-level detail information $F_l$ passes through a convolution and a batch normalization (BN) layer to obtain the output $F_l^{'}$. The high-level semantic information $F_h$ serves as the input to two branches: one branch passes through a convolution, a batch normalization layer and a Sigmoid activation layer to generate the semantic weight $W_h$, which is multiplied with the low-level detail output $F_l^{'}$; the other branch passes through a convolution and a batch normalization layer to obtain $F_h^{'}$, which is added to the product from the previous step to obtain the final output $F_{out}$. The specific formulas are as follows:

$F_l^{'} = \text{BN}\left(\text{Conv}\left(F_l\right)\right)$ (8)

$W_h = \sigma\left(\text{BN}\left(\text{Conv}\left(F_h\right)\right)\right)$ (9)

$F_h^{'} = \text{BN}\left(\text{Conv}\left(F_h\right)\right)$ (10)

$F_{out} = F_l^{'} \otimes W_h + F_h^{'}$ (11)

where $\sigma$ denotes the Sigmoid activation function, BN denotes BatchNorm batch normalization, Conv denotes the convolution operation, $F_l$ denotes the low-level detail information, $F_l^{'}$ denotes the intermediate result of the low-level detail branch, $F_h$ denotes the high-level semantic information, $W_h$ denotes the high-level semantic weight, $F_h^{'}$ denotes the output of the high-level semantic branch, and $F_{out}$ denotes the output result of the final feature fusion module.
D. ASPP multi-scale fusion module (Multiscale ASPP Fusion Module, MASPP).
To help the model better understand deep semantic information, the method introduces the pyramid pooling model ASPP, which obtains multi-scale receptive fields, in the skip connections corresponding to stages 3 and 4, and on the basis of the ASPP model an attention-based ASPP multi-scale fusion module is proposed that can adaptively weight the multi-scale feature maps with attention maps. Specifically, for the target ground object, the feature map whose receptive field matches it is enhanced while the other feature maps are suppressed; the specific structure is shown in fig. 5.
First, the input feature map $F$ passes through an ASPP pyramid structure with five branches, namely a convolution branch, three dilated convolution branches with different dilation coefficients (rate = 6, 8, 12), and a global average pooling branch. After the feature map passes through the five branches, five feature maps $F_i$ with different individual receptive fields but the same resolution are output. Each feature map passes through the attention fusion module, is multiplied by the attention map $A_i$ generated by the attention fusion module, and is added to the original input to obtain the feature map $\hat{F}_i$, as follows:

$\hat{F}_i = A_i \otimes F_i + F_i$

where $F_i$ denotes the feature map produced by one of the five branches of the ASPP pyramid, $\hat{F}_i$ denotes the feature map output by the attention fusion module, and $A_i$ denotes the attention map, which lets each pixel focus more on the pixels related to it. $A_i$ is defined as follows:

$A_i = \text{Sigmoid}\left(\text{BN}\left(\text{Conv}\left(F_i\right)\right)\right)$

where Conv denotes a point-wise convolution operation, BN denotes batch normalization, Sigmoid denotes the activation function, and $\otimes$ denotes element-wise multiplication of matrix elements. The formula shows that $A_i$ is differentiable, and attention can be generated in the channel dimension by the Sigmoid function; therefore, the attention module can be given different weights not only in the spatial dimension but also in the channel dimension.
Finally, the output feature maps of the five branches are concatenated to produce the output $F_{ASPP}$.
E. Enhanced segmentation head module (Enhanced Segmentation Head Module, ESegH).
After the feature fusion module and the ASPP multi-scale fusion module, the feature map of each stage in the decoder contains rich spatial position information and semantic information, which is important for remote sensing urban scene images. An enhanced segmentation head module is therefore provided: first, the low-resolution feature maps of the four decoder stages are upsampled to the same high resolution and added element-wise, and then the number of channels is adjusted by two convolution layers to generate the final semantic segmentation map.
(4) Model training.
During training, the pictures in the training set and their corresponding annotation masks are input into the CSWin Transformer-based urban streetscape advertisement image segmentation model for training, and the network model is optimized. Training uses an NVIDIA 3090Ti GPU; the CSWin Transformer is initialized with network parameters pre-trained on the ImageNet data set; the model is optimized with the AdamW optimizer; the learning rate is set to 6e-4 and adjusted with a cosine strategy. The Dice loss $L_{dice}$ and the cross-entropy loss $L_{ce}$ jointly supervise the training of the model, and the total loss $L$ is calculated as:

$L = L_{dice} + L_{ce}$

where $L$ denotes the total loss, $L_{dice}$ denotes the Dice loss, and $L_{ce}$ denotes the cross-entropy loss.
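A sketch of the training loop with the stated optimizer settings (AdamW, learning rate 6e-4, cosine schedule); the batch size, weight decay, epoch count and the toy stand-in model and data are assumptions.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset


def train(model, criterion, dataset, epochs=50, lr=6e-4, device="cpu"):
    """AdamW optimisation with a cosine learning-rate schedule, as described."""
    model.to(device).train()
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        scheduler.step()
        print(f"epoch {epoch}: loss {loss.item():.4f}, "
              f"lr {scheduler.get_last_lr()[0]:.2e}")


if __name__ == "__main__":
    # toy stand-ins: a 1x1-conv "model" and random data in place of the real
    # segmentation network, the pretrained weights and the Vaihingen/Potsdam tiles
    model = nn.Conv2d(3, 6, kernel_size=1)
    data = TensorDataset(torch.randn(16, 3, 64, 64),
                         torch.randint(0, 6, (16, 64, 64)))
    train(model, nn.CrossEntropyLoss(), data, epochs=2)
```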
(5) Urban streetscape advertisement image segmentation performance evaluation.
The mean intersection over union (mIoU) and the overall accuracy (OA) are mainly adopted as evaluation indexes for urban scene remote sensing image segmentation performance:

$\text{mIoU} = \frac{TP}{TP + FP + FN}$

where mIoU denotes the mean intersection over union (the per-class intersection over union averaged over all classes), $TP$ denotes the number of correctly classified building pixels, $TN$ denotes the number of correctly classified background pixels, $FP$ denotes the number of misclassified background pixels, and $FN$ denotes the number of misclassified building pixels.

$\text{OA} = \frac{TP + TN}{TP + TN + FP + FN}$

where OA denotes the overall accuracy, $TP$ denotes the number of correctly classified building pixels, $TN$ denotes the number of correctly classified background pixels, $FP$ denotes the number of misclassified background pixels, and $FN$ denotes the number of misclassified building pixels.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto; any modifications, equivalents, improvements and alternatives that a person skilled in the art can readily conceive within the spirit and principles of the present invention shall fall within the scope of the present invention.

Claims (11)

1. An urban streetscape advertisement image segmentation method, characterized by comprising the following steps:
S1: collecting an urban streetscape advertisement image data set;
S2: preprocessing the images;
S3: constructing an image model based on the CSWin Transformer;
S4: training the model;
S5: evaluating urban streetscape advertisement image segmentation performance.
2. The urban streetscape advertisement image segmentation method according to claim 1, wherein S1 specifically comprises:
selecting the aerial remote sensing high-resolution image datasets of the Vaihingen and Potsdam regions of Germany provided by ISPRS; the images in the datasets have manually annotated ground-object class label maps, with five foreground classes (impervious surfaces, buildings, low vegetation, trees and cars) and one background class;
Vaihingen is a small, scattered village; the dataset contains 33 urban streetscape images of different sizes with an average size of 2494×2064 pixels; the images with IDs 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35 and 38 are selected as the test set, and the remaining 16 images form the training set;
Potsdam is a typical historic city with large building complexes, narrow streets and dense building structures; the dataset contains 38 urban streetscape images of the same size, 6000×6000 pixels each; the images numbered 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15 and 7_13 are selected as the test set, and the remaining 24 images form the training set.
3. The urban streetscape advertisement image segmentation method according to claim 1, wherein S2 specifically comprises:
S201: image cropping: because the image sizes are inconsistent, to facilitate subsequent network training the urban streetscape pictures are first cropped, and the training-set data images are cropped with a 256×256 window;
S202: data enhancement: to improve the robustness and generalization ability of the model, random scaling, random vertical flipping and random horizontal flipping are applied to all pictures in the training data set.
4. The urban streetscape advertisement image segmentation method according to claim 1, wherein S3 specifically comprises:
the CSWin Transformer-based urban streetscape advertisement image segmentation method adopts a simple and effective U-shaped network structure overall, mainly comprising an encoder, a decoder, skip connections and a segmentation head;
the CSWin Transformer-based image model comprises a CSWin Transformer module, a feature fusion module, an ASPP multi-scale feature fusion module and an enhanced segmentation head module.
5. The urban street view advertisement image segmentation method according to claim 3, wherein the urban street view advertisement image segmentation method based on CSWin Transformer has the following overall architecture:
for a given city street view image $X \in \mathbb{R}^{H \times W \times 3}$, the sequence mapping layer of stage 1, consisting of a 7 × 7 convolution with stride 4, is first applied to obtain a patch-token sequence of size $\frac{H}{4} \times \frac{W}{4}$ with channel number C, and global information is then learned through a CSWin Transformer module;
to obtain a multi-scale, hierarchical feature representation, the encoder is divided into four stages; each stage comprises a downsampling module consisting of a 3 × 3 convolution with stride 2 and a CSWin Transformer module composed of CSWin Transformer blocks, the number of CSWin Transformer blocks per stage being denoted $[N_1, N_2, N_3, N_4]$; the downsampling module is used to reduce the number of tokens and double the number of channels;
after the four encoder stages, a low-resolution feature map is obtained and then fed into the decoder stage; the decoder has a structure symmetric to the encoder and also comprises four stages, each containing a CNN upsampling module and a feature fusion module;
the CNN upsampling module consists of a stride-2 transposed convolution (deconvolution) used to double the size of the feature map and halve the number of channels; the feature fusion module designs a lightweight attention mechanism and fuses the low-dimensional detail features and the high-dimensional semantic features from the encoder in an adaptive-weight manner;
in the four corresponding encoder and decoder stages, following the classic UNet design, 4 skip connections are added to assist in recovering detail information such as position; because the features of stage 3 and stage 4 have larger receptive fields and contain rich deep semantic features, understanding this deep semantic information at multiple scales helps the model better understand multi-scale object information; an attention-based ASPP multi-scale feature fusion module is therefore designed in the skip connections of stage 3 and stage 4;
finally, the outputs of the four decoder stages are upsampled to a uniform size and all fed as inputs into the enhanced segmentation head, which, through a convolution and a ReLU activation function, outputs a segmentation map with the same resolution as the original input image.
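The overall layout of claim 5 can be illustrated with a structural sketch. The code below is not the patented implementation: the CSWin blocks, the feature fusion module and the ASPP module are replaced by identity or additive stand-ins, and the base channel count, class count and head kernel sizes are assumptions; it only demonstrates how the 7 × 7 stride-4 embedding, the 3 × 3 stride-2 downsampling, the stride-2 transposed-convolution upsampling and the skip connections fit together in terms of tensor shapes.

```python
# Structural sketch of the U-shaped encoder/decoder of claim 5 (shapes only).
import torch
import torch.nn as nn


class UShapedSkeleton(nn.Module):
    def __init__(self, in_ch=3, base_ch=64, num_classes=6):
        super().__init__()
        chs = [base_ch, base_ch * 2, base_ch * 4, base_ch * 8]
        # stage-1 sequence mapping: 7x7 conv, stride 4 -> H/4 x W/4, C channels
        self.embed = nn.Conv2d(in_ch, chs[0], 7, stride=4, padding=3)
        # downsampling for later stages: 3x3 conv, stride 2, channels doubled
        self.down = nn.ModuleList(
            [nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1) for i in range(3)]
        )
        self.enc_blocks = nn.ModuleList([nn.Identity() for _ in range(4)])  # CSWin block placeholders
        # decoder: stride-2 transposed conv doubles resolution, halves channels
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2) for i in reversed(range(3))]
        )
        self.head = nn.Sequential(
            nn.Conv2d(chs[0], num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        feats = []
        x = self.enc_blocks[0](self.embed(x))      # H/4
        feats.append(x)
        for i in range(3):                          # H/8, H/16, H/32
            x = self.enc_blocks[i + 1](self.down[i](x))
            feats.append(x)
        for i, up in enumerate(self.up):            # decoder with additive skip connections
            x = up(x) + feats[-(i + 2)]             # stand-in for the feature fusion / ASPP modules
        return self.head(x)                         # back to the input resolution


if __name__ == "__main__":
    print(UShapedSkeleton()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 6, 256, 256])
```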
6. A method for segmenting urban street view advertising images as claimed in claim 3, wherein the CSWin Transformer module is as follows:
CSWin Transformer is used as the encoder backbone of the urban street view advertisement image segmentation network; the network is equipped with a cross-shaped window self-attention mechanism, which can not only model global context information effectively but also effectively reduce the computational cost; the cross-shaped window is formed by strip windows divided along the horizontal and vertical directions;
for the horizontal direction, the input $X$ is divided into $M$ non-overlapping horizontal strips, i.e. $X = [X^1, X^2, \ldots, X^M]$, where each strip contains $sw \times W$ tokens; in particular, the strip width sw in each stage can be adjusted according to the computational complexity and the model, and its size is not fixed; suppose the dimension of the query Q, key K and value V in the Transformer is $d_k$ and the number of multi-head attention heads is $K$; the horizontal attention result $\text{H-Attention}_k(X)$ is defined as follows:
$X = \left[X^1, X^2, \ldots, X^M\right], \quad X^i \in \mathbb{R}^{(sw \times W) \times C}, \quad M = H / sw$ (1)
$Y_k^i = \text{Attention}\left(X^i W_k^Q, X^i W_k^K, X^i W_k^V\right), \quad i = 1, \ldots, M$ (2)
$\text{H-Attention}_k(X) = \left[Y_k^1, Y_k^2, \ldots, Y_k^M\right]$ (3)
wherein $X$ represents the input feature map, $Y_k^i$ represents the self-attention result of the $i$-th strip $X^i$, $W_k^Q$, $W_k^K$, $W_k^V$ respectively represent the mapping matrices of the query Q, key K and value V for the $k$-th attention head, $d_k$ is set to $C/K$, sw represents the width of each strip, W represents the width of the input feature map, and M represents the number of strips into which the feature map is divided; the vertical attention result $\text{V-Attention}_k(X)$ is defined analogously to the horizontal direction; finally, the attention results of the two directions are concatenated to form the self-attention result CSWin-Attention:
$\text{head}_k = \begin{cases} \text{H-Attention}_k(X), & k = 1, \ldots, K/2 \\ \text{V-Attention}_k(X), & k = K/2 + 1, \ldots, K \end{cases}$ (4)
$\text{CSWin-Attention}(X) = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_K\right) W^O$ (5)
wherein Concat represents the concatenation operation, $\text{head}_k$ represents the $k$-th attention head, $K$ represents the number of attention heads, $W^O$ is a projection matrix mapping the self-attention result to the target dimension C, $\text{H-Attention}$ represents the horizontal attention result, and $\text{V-Attention}$ represents the vertical attention result; the CSWin Transformer block in the encoder is computed as follows:
$\hat{X}^l = \text{CSWin-Attention}\left(\text{LN}\left(X^{l-1}\right)\right) + X^{l-1}$ (6)
$X^l = \text{MLP}\left(\text{LN}\left(\hat{X}^l\right)\right) + \hat{X}^l$ (7)
where LN represents layer normalization, MLP represents a multi-layer perceptron, $\hat{X}^l$ represents the output feature of the CSWin-Attention self-attention, and $X^l$ represents the output feature of the multi-layer perceptron.
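For illustration, a single-head reading of the horizontal strip attention in equations (1)-(3) is sketched below: the feature map is split into M = H / sw horizontal strips and ordinary scaled dot-product attention is run inside each strip; the vertical branch would do the same with H and W exchanged. The projection matrices wq, wk, wv and the demo strip width sw = 2 are illustrative assumptions, not the patented code.

```python
# Single-head horizontal strip attention, a simplified reading of equations (1)-(3).
import torch
import torch.nn.functional as F


def horizontal_strip_attention(x: torch.Tensor, wq, wk, wv, sw: int = 2) -> torch.Tensor:
    """x: (B, H, W, C); wq/wk/wv: (C, d) projection matrices; sw: strip width."""
    B, H, W, C = x.shape
    M = H // sw                                        # number of horizontal strips
    strips = x.reshape(B, M, sw * W, C)                # X = [X^1, ..., X^M]
    q, k, v = strips @ wq, strips @ wk, strips @ wv    # per-strip Q, K, V
    attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    y = F.softmax(attn, dim=-1) @ v                    # Y^i = Attention(Q^i, K^i, V^i)
    return y.reshape(B, H, W, -1)                      # strips concatenated back


if __name__ == "__main__":
    C, d = 32, 32
    x = torch.randn(2, 8, 8, C)
    w = [torch.randn(C, d) for _ in range(3)]
    print(horizontal_strip_attention(x, *w).shape)     # torch.Size([2, 8, 8, 32])
```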
7. The urban street view advertisement image segmentation method according to claim 4, wherein the feature fusion module is configured to adaptively select and enhance low-level detail features or high-level semantic information by using a lightweight attention mechanism so as to better fuse the low-level detail features and the high-level semantic information from the encoder;
the feature fusion module takes as input both the low-level detail information from the encoder and the high-level semantic information from the decoder; in each stage, the low-level detail information $F_{low}$ passes through a convolution and batch normalization (BatchNorm, BN) layer to obtain the output $F'_{low}$; the high-level semantic information $F_{high}$ serves as the input of two branches: one branch passes through a convolution, a batch normalization layer and a Sigmoid activation layer to generate the semantic weight $\alpha$, which is multiplied with the low-level output $F'_{low}$; the other branch passes through a convolution and batch normalization layer to obtain $F'_{high}$; after the low-level detail branch result $F'_{low}$ is multiplied by the semantic weight $\alpha$, the product is added to $F'_{high}$ to obtain the final output $F_{out}$ of the feature fusion module; the specific formulas are as follows:
$F'_{low} = \text{BN}\left(\text{Conv}\left(F_{low}\right)\right)$ (8)
$\alpha = \text{Sigmoid}\left(\text{BN}\left(\text{Conv}\left(F_{high}\right)\right)\right)$ (9)
$F'_{high} = \text{BN}\left(\text{Conv}\left(F_{high}\right)\right)$ (10)
$F_{out} = F'_{low} \times \alpha + F'_{high}$ (11)
wherein Sigmoid represents the Sigmoid activation function, BN represents batch normalization, Conv represents the convolution operation, $F_{low}$ represents the low-level detail information, $F'_{low}$ represents the intermediate result of the low-level detail branch, $F_{high}$ represents the high-level semantic information, $\alpha$ represents the high-level semantic weight, $F'_{high}$ represents the output of the high-level semantic branch, and $F_{out}$ represents the final output of the feature fusion module.
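A minimal sketch of formulas (8)-(11) follows: the low-level encoder feature is gated by a Sigmoid weight computed from the high-level decoder feature and then added to a projected high-level feature. The 1 × 1 kernel size is an assumption, since the kernel sizes appear only in the patent figures.

```python
# Feature fusion module sketch implementing formulas (8)-(11).
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.low_proj = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                  nn.BatchNorm2d(channels), nn.Sigmoid())
        self.high_proj = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        f_low_p = self.low_proj(f_low)      # (8)  F'_low  = BN(Conv(F_low))
        alpha = self.gate(f_high)           # (9)  alpha   = Sigmoid(BN(Conv(F_high)))
        f_high_p = self.high_proj(f_high)   # (10) F'_high = BN(Conv(F_high))
        return f_low_p * alpha + f_high_p   # (11) F_out


if __name__ == "__main__":
    m = FeatureFusion(64)
    print(m(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)).shape)
```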
8. The method of claim 4, wherein the ASPP multi-scale fusion module adaptively weights multi-scale feature maps by means of attention, so that for the target ground object the feature maps with matching receptive fields are enhanced and the other feature maps are suppressed, as follows: first, the input feature map is fed into an ASPP pyramid structure with 5 branches, namely an ordinary convolution branch, 3 dilated convolution branches with different dilation rates, and a global average pooling branch; after passing through the 5 branches, 5 feature maps $F_i$ with the same resolution but different single receptive fields are output; each feature map passes through the attention fusion module, is multiplied by the attention map $A_i$ generated by the attention fusion module, and is added to the original input to obtain the feature map $\hat{F}_i$, with the formula:
$\hat{F}_i = F_i \otimes A_i + F_i$
wherein $F_i$ represents the feature maps generated by the five branches of the ASPP pyramid, $\hat{F}_i$ represents the feature map output by the attention fusion module, and $A_i$ represents the attention map, which makes each pixel focus more on the pixels related to it; $A_i$ is defined as follows:
$A_i = \text{Sigmoid}\left(\text{BN}\left(\text{Conv}\left(F_i\right)\right)\right)$
wherein $A_i$ represents the attention map, Conv represents the convolution operation, BN represents batch normalization, Sigmoid represents the activation function, and $\otimes$ represents element-wise multiplication; the formula is differentiable, attention can be generated in the channel dimension through the activation function, and the attention module can assign different weights in both the spatial dimension and the channel dimension;
finally, the output feature maps of the five branches are concatenated to form the output of the module.
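A minimal sketch of the attention-weighted ASPP module of claim 8 follows: five parallel branches (a plain convolution, three dilated convolutions and global average pooling), each branch output multiplied by its Sigmoid attention map and residually added to itself, then concatenated. The dilation rates (6, 12, 18) and the final 1 × 1 projection back to the input channel count are assumptions added for the example.

```python
# Attention-weighted ASPP sketch: F_hat_i = F_i * A_i + F_i, then concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnASPP(nn.Module):
    def __init__(self, ch: int, rates=(6, 12, 18)):
        super().__init__()
        branches = [nn.Conv2d(ch, ch, 1)]
        branches += [nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates]
        self.branches = nn.ModuleList(branches)
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1))
        # attention map A_i = Sigmoid(BN(Conv(F_i))), one per branch
        self.attn = nn.ModuleList([
            nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.Sigmoid())
            for _ in range(len(rates) + 2)
        ])
        self.project = nn.Conv2d(ch * (len(rates) + 2), ch, 1)  # fuse concatenated branches

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        h, w = f.shape[-2:]
        feats = [b(f) for b in self.branches]
        feats.append(F.interpolate(self.pool(f), size=(h, w), mode="bilinear", align_corners=False))
        fused = [fi * a(fi) + fi for fi, a in zip(feats, self.attn)]  # F_i * A_i + F_i
        return self.project(torch.cat(fused, dim=1))


if __name__ == "__main__":
    print(AttnASPP(64)(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```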
9. The method for segmenting urban street view advertisement images according to claim 4, wherein the enhanced segmentation head module is obtained as follows: the low-resolution feature maps of the four decoder stages, after passing through the feature fusion module and the ASPP multi-scale fusion module, are upsampled to the same high resolution and added element-wise, and the number of channels is then adjusted by two convolution layers; the feature map of each stage in the decoder contains rich spatial position information and semantic information, which are vital for remote sensing urban scene images.
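A minimal sketch of the enhanced segmentation head of claim 9 follows: the four decoder feature maps are brought to a common channel count, upsampled to the same resolution, summed element-wise and passed through two convolutions. The intermediate channel count, kernel sizes and class count are assumptions.

```python
# Enhanced segmentation head sketch: upsample, sum, then two convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EnhancedSegHead(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), mid_ch=64, num_classes=6):
        super().__init__()
        # project every stage to a common channel count so the maps can be added
        self.proj = nn.ModuleList([nn.Conv2d(c, mid_ch, 1) for c in in_channels])
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, num_classes, 1),
        )

    def forward(self, feats, out_size):
        """feats: list of 4 decoder feature maps (fine to coarse); out_size: (H, W)."""
        summed = sum(
            F.interpolate(p(f), size=out_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        )
        return self.fuse(summed)


if __name__ == "__main__":
    feats = [torch.randn(1, c, 256 // s, 256 // s)
             for c, s in zip((64, 128, 256, 512), (4, 8, 16, 32))]
    print(EnhancedSegHead()(feats, (256, 256)).shape)  # torch.Size([1, 6, 256, 256])
```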
10. The method for segmenting urban street view advertisement images according to claim 1, wherein the step S4 specifically comprises:
in the training process, the images in the training set and their corresponding annotation masks are input into the CSWin Transformer-based urban streetscape advertisement image segmentation model for training, and the network model is optimized; during training, an NVIDIA 3090Ti GPU graphics card is used, the CSWin Transformer network parameters pre-trained on the ImageNet dataset are adopted, the model is optimized with the AdamW optimizer, the learning rate is set to 6e-4, and the learning rate is adjusted with a cosine strategy;
the Dice loss $L_{Dice}$ and the cross-entropy loss $L_{CE}$ are used to jointly supervise the training of the model, and the total loss $L$ is calculated as follows:
$L = L_{Dice} + L_{CE}$
wherein $L$ represents the total loss, $L_{Dice}$ represents the Dice loss, and $L_{CE}$ represents the cross-entropy loss.
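As an illustration of the joint supervision in claim 10, the sketch below assumes the total loss is the plain sum L = L_Dice + L_CE; the exact combination appears only in the patent figure, so the equal weighting is an assumption.

```python
# Joint Dice + cross-entropy loss sketch, usable inside a standard AdamW training loop.
import torch
import torch.nn.functional as F


def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """logits: (B, C, H, W); target: (B, H, W) integer class map."""
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()


def total_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # assumed joint supervision: L = L_Dice + L_CE
    return dice_loss(logits, target) + F.cross_entropy(logits, target)


if __name__ == "__main__":
    logits = torch.randn(2, 6, 64, 64, requires_grad=True)
    target = torch.randint(0, 6, (2, 64, 64))
    total_loss(logits, target).backward()
```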
11. The method for segmenting urban street view advertisement images according to claim 1, wherein the step S5 specifically comprises:
the mean intersection over union mIoU and the overall accuracy OA are mainly adopted as evaluation indexes for urban scene remote sensing image segmentation performance evaluation:
$\text{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN}\right)$
wherein mIoU represents the mean intersection over union, $TP$ represents the number of correctly classified building pixels, $TN$ represents the number of correctly classified background pixels, $FP$ represents the number of misclassified background pixels, and $FN$ represents the number of misclassified building pixels;
$\text{OA} = \frac{TP + TN}{TP + TN + FP + FN}$
wherein OA represents the overall accuracy, $TP$ represents the number of correctly classified building pixels, $TN$ represents the number of correctly classified background pixels, $FP$ represents the number of misclassified background pixels, and $FN$ represents the number of misclassified building pixels.
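As an illustration of the evaluation indexes in claim 11, the sketch below computes per-class IoU from a confusion matrix, averages it into mIoU, and computes OA as the fraction of correctly classified pixels; it generalizes the two-class formulas above to an arbitrary number of classes.

```python
# mIoU and OA computed from a confusion matrix.
import numpy as np


def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    idx = gt.astype(int) * num_classes + pred.astype(int)
    return np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(num_classes, num_classes)


def miou_and_oa(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    cm = confusion_matrix(pred, gt, num_classes)
    tp = np.diag(cm)
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp + 1e-10)  # per class: TP / (TP + FP + FN)
    oa = tp.sum() / (cm.sum() + 1e-10)                          # correctly classified / all pixels
    return iou.mean(), oa


if __name__ == "__main__":
    gt = np.random.randint(0, 6, (256, 256))
    pred = np.random.randint(0, 6, (256, 256))
    print(miou_and_oa(pred, gt, num_classes=6))
```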
CN202310473810.8A 2023-04-28 2023-04-28 Urban streetscape advertisement image segmentation method Withdrawn CN116189180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310473810.8A CN116189180A (en) 2023-04-28 2023-04-28 Urban streetscape advertisement image segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310473810.8A CN116189180A (en) 2023-04-28 2023-04-28 Urban streetscape advertisement image segmentation method

Publications (1)

Publication Number Publication Date
CN116189180A true CN116189180A (en) 2023-05-30

Family

ID=86452713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310473810.8A Withdrawn CN116189180A (en) 2023-04-28 2023-04-28 Urban streetscape advertisement image segmentation method

Country Status (1)

Country Link
CN (1) CN116189180A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036613A (en) * 2023-08-18 2023-11-10 武汉大学 Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network
CN117036613B (en) * 2023-08-18 2024-04-02 武汉大学 Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network
CN117346846A (en) * 2023-09-20 2024-01-05 中山大学 Automatic correction type water measuring weir flow photographic monitoring method and device
CN117456530A (en) * 2023-12-20 2024-01-26 山东大学 Building contour segmentation method, system, medium and equipment based on remote sensing image
CN117456530B (en) * 2023-12-20 2024-04-12 山东大学 Building contour segmentation method, system, medium and equipment based on remote sensing image
CN117649666A (en) * 2024-01-30 2024-03-05 中国海洋大学 Image semantic segmentation method and system based on dynamic multi-scale information query
CN117649666B (en) * 2024-01-30 2024-04-26 中国海洋大学 Image semantic segmentation method and system based on dynamic multi-scale information query
CN117765378A (en) * 2024-02-22 2024-03-26 成都信息工程大学 Method and device for detecting forbidden articles in complex environment with multi-scale feature fusion
CN117765378B (en) * 2024-02-22 2024-04-26 成都信息工程大学 Method and device for detecting forbidden articles in complex environment with multi-scale feature fusion
CN117889867A (en) * 2024-03-18 2024-04-16 南京师范大学 Path planning method based on local self-attention moving window algorithm
CN117889867B (en) * 2024-03-18 2024-05-24 南京师范大学 Path planning method based on local self-attention moving window algorithm
CN118195361A (en) * 2024-05-17 2024-06-14 国网吉林省电力有限公司经济技术研究院 Big data-based energy management method and system

Similar Documents

Publication Publication Date Title
CN116189180A (en) Urban streetscape advertisement image segmentation method
CN111563508B (en) Semantic segmentation method based on spatial information fusion
CN113850825B (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
CN111612008B (en) Image segmentation method based on convolution network
CN110084850B (en) Dynamic scene visual positioning method based on image semantic segmentation
CN113780149B (en) Remote sensing image building target efficient extraction method based on attention mechanism
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN112308860A (en) Earth observation image semantic segmentation method based on self-supervision learning
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN112434723B (en) Day/night image classification and object detection method based on attention network
CN116453121B (en) Training method and device for lane line recognition model
CN112766136A (en) Space parking space detection method based on deep learning
CN114445430A (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN115661505A (en) Semantic perception image shadow detection method
CN116051977A (en) Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm
CN116468895A (en) Similarity matrix guided few-sample semantic segmentation method and system
CN116310916A (en) Semantic segmentation method and system for high-resolution remote sensing city image
CN113361528B (en) Multi-scale target detection method and system
Mazhar et al. Block attention network: a lightweight deep network for real-time semantic segmentation of road scenes in resource-constrained devices
Fan et al. Combining swin transformer with unet for remote sensing image semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230530

WW01 Invention patent application withdrawn after publication