CN116189180A

CN116189180A - Urban streetscape advertisement image segmentation method

Info

Publication number: CN116189180A
Application number: CN202310473810.8A
Authority: CN
Inventors: 王浚丞; 张婕
Original assignee: Qindao University Of Technology
Current assignee: Qindao University Of Technology
Priority date: 2023-04-28
Filing date: 2023-04-28
Publication date: 2023-05-30

Abstract

The invention belongs to the technical field of image segmentation, and discloses a city street advertisement image segmentation method, which specifically comprises the following steps: collecting a city streetscape advertisement image data set; preprocessing an image; constructing a model; training a model; and (5) evaluating segmentation performance. According to the method, modeling is carried out by utilizing global context information of the urban street advertisement image, so that more accurate urban street advertisement image segmentation is realized; the problems of high complexity and high cost of advertisement image segmentation calculation are solved, an encoder is constructed by introducing a CSWin Transformer method to extract characteristics, and the calculation cost is reduced while global information is modeled; the invention provides a feature fusion module which can better fuse the detail features from the encoder and the semantic information of the decoder; at the jump joint, an ASPP multi-scale fusion module is provided, which is beneficial to extracting deep semantic information; the enhanced segmentation head module is beneficial to improving segmentation precision.

Description

Urban streetscape advertisement image segmentation method

Technical Field

The invention belongs to the technical field of image segmentation, and particularly relates to a city street advertisement image segmentation method.

Background

The city street view image is used as an important background element of the travel advertisement and plays an important role in the advertisement, so that the city street view image can be used as a geographic identifier of the advertisement, the advertisement is more specific and visual, more geographic emotions can be added to the advertisement, and the cultural value of the advertisement is improved. The urban street view image is segmented, namely the class labeling of the pixel level is carried out on the urban street view image, and the method has important application value in the field of advertising science. Different elements in the advertisement can be separated through an image segmentation technology so as to edit and synthesize in post-production, so that the advertisement production and delivery are more accurate and efficient.

In recent years, with the development of unmanned plane technology and modern satellite remote sensing technology, urban street view images are further developed in resolution, observation scale and imaging modes, and the characteristics of complex background, higher resolution and richer space detail and texture information are presented, so that the possibility of accurately segmenting the urban street view images is improved. However, due to the characteristics of large scale change of features, high similarity among the classes and mutual shielding of features in the urban streetscape, the difficulty of image segmentation on urban streetscape advertisement images is high.

Currently, with the development of hardware such as chips and graphics processing units, deep learning has achieved remarkable achievements in the fields of image processing such as image segmentation. The Convolutional Neural Network (CNN) has strong capability of capturing detail positioning information, can be used for representing image features of a hierarchical structure, and has become a mainstream technology for urban street view image segmentation. However, due to the limitation of the convolution operation receptive field, the modeling of the global context information of the image is difficult, the long Cheng Yuyi dependency relationship cannot be constructed, and the segmentation effect is not ideal for the urban street view advertisement image with complex background, blurred ground feature semantics and high resolution. The transducer and the Swin transducer have strong global modeling capability, so that the global information of the image is extracted and modeled, and a new research thought is opened for the research in the field of computer vision. Although the transducer and Swin transducer models can effectively model global information, the complexity of model calculation is high, and the possibility of application of the model on a high-resolution city street image is seriously affected.

Based on the analysis, the invention provides a city street advertisement image segmentation method, which can model global information of a city street advertisement image and control the calculation complexity within the limits of task tolerance. The model adopts a U-shaped network structure, uses CSWin Transformer construction encoder with low computational complexity and strong global modeling capability for feature extraction, uses CNN as a main body to construct decoder for feature graph recovery, and adds jump connection between the encoder and the decoder. In particular, in order to better fuse the local semantic features from the encoder with the global semantic features from the deep network, a feature fusion module is designed in each stage of the CNN decoder; and a hole space pyramid pooling (Atrous Spatial Pyramid Pooling, ASPP) multi-scale feature fusion module is designed at the jump connection part so as to facilitate global semantic understanding; finally, an enhanced segmentation head is proposed to achieve efficient segmentation with a lightweight attention mechanism.

The prior art has the following disadvantages:

1) At present, a classical network Unet is taken as an example of a network model of the existing CNN architecture. Because CNN convolution operation is adopted in the Unet model, the convolution operation receptive field is limited, the global context information of the urban street view advertisement image is difficult to model, and long Cheng Yuyi dependency relationship cannot be constructed. 2) The prior art SwinUnet has the defects. The technology is improved based on a Unet model, and Swin Tranformer is introduced in an encoder stage to model a global image, so that global image context information is obtained. The decoding stage is also constructed by using a Swin Tranformer to perform up-sampling reduction of the feature map. The disadvantages are as follows: the calculation complexity of the Swin Tranformer is very high, the whole network is constructed by the Swin Tranformer, and the model is huge and difficult to train. Swin transducer has strong modeling capability on global information, can extract deep semantic information, but has a lower extraction effect on local information than a CNN network. Meanwhile, up-sampling reduction is performed in the decoder stage by using a Swin Transformer construction, and the effect is not as efficient as that of CNN. C. Because SwinUnet is applied to the field of medical images, the model is not designed aiming at the application scene of the urban street advertisement image, and the problems of large difference in dimension of ground features, high similarity and mutual shielding in the urban street can not be solved, so that the effect of being applied to the urban street advertisement image is poor.

Disclosure of Invention

Aiming at the problems existing in the prior art, the invention provides a city street advertisement image segmentation method.

The invention is realized in such a way that a city street advertisement image segmentation method specifically comprises the following steps:

s1: collecting a city streetscape advertisement image data set;

s2: preprocessing an image;

s3: constructing an image model based on CSWin Transformer;

s4: training a model;

s5: urban street view advertisement image segmentation performance evaluation.

Further, the S1 specifically includes:

selecting an aerial remote sensing high-resolution image dataset provided by ISPRS in Germany Vaihingen region and Potsdam region; the image in the data set is provided with a manually marked ground object type label graph, and five foreground types and a background type are provided, wherein the five foreground types are respectively an opaque water surface, a building, low vegetation, trees and an automobile; the Vaihingen is a small and scattered village, the data set comprises 33 city street images with different sizes, the average size of the images is 2494 multiplied by 2064 pixels, the serial numbers ID of 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35 and 38 are selected as test sets, and the other 16 images are selected as training sets; the Potsdam is a typical historic city, has huge building groups, narrow streets and dense building structures, contains 38 city street scenery images with the same size in a data set, has the image size of 6000 multiplied by 6000 pixels, and selects the images with the numbers of 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15 and 7_13 as test sets and the rest 24 images as training sets.

Further, the S2 specifically includes:

s201: image clipping: because the sizes of the images are inconsistent, in order to facilitate the subsequent network training, firstly, the city street pictures are cut, and the training set data images are cut by adopting a window with the size of 256 multiplied by 256;

s202: data enhancement: in order to improve the robustness and generalization capability of the model, all pictures in the training data set are subjected to random scaling, random vertical overturn and random horizontal overturn data enhancement technologies.

Further, the step S3 specifically includes:

a city street advertisement image segmentation method, the whole adopts simple and effective U-shaped network structure, mainly include encoder, decoder, jump connect, divide the four parts of the head;

the image model of CSWin Transformer comprises a CSWin Transformer module, a feature fusion module, an ASPP multi-scale feature fusion module and an enhanced segmentation head module.

Further, the whole framework of the urban street view advertisement image segmentation method is specifically as follows:

for a given city street view image

Firstly, the sequence mapping layer consisting of convolution with the size of 7 multiplied by 7 and the step length of 4 in the stage 1 is processed to obtain +.>

The picture of the size is segmented into sequences, the number of channels is C, and global information is learned through a CSWin Transformer module; to obtain a multi-scale, hierarchical representation of features, the encoder is divided into four stages, each stage comprising a step size of 3 x 3 A downsampling block consisting of convolutions of 2 and a CSWin Transformer block consisting of CSWin Transformer blocks.

CSWin Transformer blocks per stage are of the number of

The downsampling module is used for reducing the number of picture sequences and doubling the number of channels. Thus, for the ith stage, the feature map consisting of the corresponding number of sequences has a size +.>

The number of channels is +.>

This is consistent with backbone network architecture of other common convolutional neural networks; through four stages of encoder stages, get +.>

A feature map of size, which is then fed into the decoder stage; the decoder and the encoder are in a symmetrical structure and also comprise four stages, and each stage comprises a CNN up-sampling module and a feature fusion module; the CNN up-sampling module consists of +.>

The size deconvolution is used for doubling the size of the feature map and halving the number of channels; the feature fusion module adopts +.>

Convolution designs a lightweight attention mechanism, and fuses low-level (low-level) detail features and high-level (high-level) semantic features from an encoder in an adaptive weight manner; in four stages corresponding to the encoder and the decoder, following the classical Unet network design, adding 4-hop connection for assisting the recovery of the detail information such as the position and the like; because the features in the stage 3 and the stage 4 have larger receptive fields and contain rich deep semantic features, if the deep semantic information can be understood in a multi-scale manner, the model is facilitated to better understand the multi-scale object information; skip in stage 3 and stage 4 In the jump connection, an ASPP multi-scale feature fusion module is designed based on an attention mechanism; finally, the outputs of the four stages of the decoder are up-sampled to a uniform size, all as inputs, fed into the enhanced segmentation head, via +.>

The size convolution and ReLU activation functions output a segmentation map of the same size as the resolution of the original input image.

Further, the CSWin Transformer module is specifically as follows:

CSWin Transformer is taken as an encoder backbone network of the urban street view advertisement image segmentation network, and the network is provided with a self-attention mechanism of a cross window, so that not only can global context information be effectively modeled, but also the calculation cost can be effectively reduced, and the cross window is formed by dividing strip-shaped windows in horizontal and vertical directions; for horizontal direction, input

Divided into +.>

Horizontal strips, i.e.

Wherein each band comprises +>

The sequence, in particular, the sw width in each stage can be adjusted according to the calculation complexity and the model condition, and the size is not fixed; suppose that the dimensions of query Q, key value K and value V in the transducer are +.>

The number of the multi-head attention heads is +. >

Then the attention result in the horizontal direction +.>

The definition is as follows:

(1)

(2)

(3)

wherein the method comprises the steps of

Representing an input feature map, Y represents->

As a result of self-attention, < >>

，

Respectively represent +.>

The attention header queries the mapping matrix of Q, key value K and value V,/for>

Set to->

Sw represents the width of each stripe, W represents the width of the input feature map, M represents the number of stripes into which the feature map is divided, and H represents the height of the feature map. Correspondingly, the result of attention in the vertical direction is similar to the definition in the horizontal direction, meaning +.>

Finally, the two directions of attention are connected together to form a self-attention result +.>

:

(4)

(5)

Wherein Concat stands for the splicing operation,head _k representing multi-head attention, k representing the number of heads of multi-head attention,

is a projection matrix mapping self-attention results to the target dimension C, +.>

Represents the result of attention in the horizontal direction, +.>

Representing the vertical direction attention results. From this, the calculation method of CSWin transformer module in encoder can be obtained as follows:

(6)

(7)

where LN represents layer normalization (layer normalization, LN), MLP represents multi-layer perceptron (MLP),

output features representing self-Attention CSWin-Attention, +. >

Representing the output characteristics of the multi-layer sensor.

Further, the feature fusion module is used for realizing self-adaptive selection of enhanced low-level detail features by a lightweight attention mechanismOr higher level semantic information to better fuse lower level detail features and higher level semantic information from the encoder; the feature fusion module takes as input both low-level detail information from the encoder and high-level semantic information from the decoder for the first

For each stage, low-level detail information +.>

By->

A convolution and batch normalization (BatchNorm, BN) layer to obtain an output result +.>

High-level semantic information->

As input to two branches, one branch passes +.>

Convolution, batch normalization layer and Sigmoid activation layer, generating high-level semantic weight +.>

Output result with low-level detail information +.>

Multiplying; the other branch passes->

Convolving and batching normalization layers to obtain +.>

Low level detail branch outcome->

And semantic weight +.>

After multiplication, the result obtained and +.>

Adding to obtain the final output result of the feature fusion module +.>

The specific formula is as follows:

(8)

(9)

(10)

(11) The method comprises the steps of carrying out a first treatment on the surface of the Wherein->

Representing Sigmoid activation function, BN represents batch normalization, conv represents +.>

Convolution operation->

Representing low-level detail information- >

Representing the intermediate result of a low-level detail branch, +.>

Representing high-level semantic information,/->

Representing high-level semantic weights, +.>

Representing high-level semanticsOutput of branch->

And the representative feature fusion module outputs a final result.

Further, the ASPP multi-scale fusion module may use attention to adaptively weight the multi-scale feature map, and for the target ground object, the feature map with the matching receptive field is enhanced, while other feature maps are suppressed, specifically as follows: first, a feature map is input

ASPP pyramid structures with 5 branches are respectively +.>

Convolution branch, 3 different expansion coefficients +.>

The expansion convolution branch (expansion coefficient rate=6, 8, 12) and a global average pooling branch, after the feature map passes through 5 branches, 5 feature maps with the same resolution and different receptive fields are output->

Each feature map passes through the attention fusion module and is subjected to attention force striving generated by the attention fusion module>

Multiplying and adding with original input to obtain a characteristic diagram +.>

The formula is as follows:

the method comprises the steps of carrying out a first treatment on the surface of the Wherein->

Feature map representing the generation of five branches of ASPP pyramid,/->

Representing the feature map output through the attention fusion module, < >>

Representing attention force diagram, pixel points can be more focused on the pixel points related to the pixel points, and the pixel points are +. >

The definition is as follows:

Representing an attention deficit map, conv represents +.>

Point-wise convolution operation, BN stands for batch normalization, sigmoid stands for activation function, ++>

Representing matrix elements by cross-product. The formula indicates->

The system is micro, attention can be generated in the channel dimension through activating the function, different weights can be given to the attention module in the space dimension, and different weights can be given to the attention module in the channel dimension; finally, the output characteristic diagrams of the five branches are spliced and output +.>

。

Further, after the enhanced segmentation head module passes through the feature fusion module and the ASPP multi-scale fusion module, the feature map of each stage in the decoder contains rich spatial position information and semantic information, which are obtained by being critical to the remote sensing urban scene image, firstly, the feature map of the low resolution of four stages of the decoder is sampled to the same high resolution, elements are added, and then the number of channels is adjusted through two-layer convolution, so that a final semantic segmentation map is generated.

Further, the step S4 specifically includes: in the training process, the pictures in the training set and the labeling masks corresponding to the pictures are input into a CSWin Transformer-based city streetscape advertisement image segmentation model for training, and a network model is optimized;

During training, an English-to-Chinese 3090TiGPU graphic card is adopted, CSWin Transformer network parameters obtained by pre-training on an ImageNet data set are adopted, an AdamW optimizer optimization model is adopted, the learning rate is set to be 6e-4, and the learning rate is adjusted by a cosine strategy (cosine strategy); using Dice loss

And Cross entropy loss->

Training of joint supervision model, total loss->

The calculation formula is as follows:

Representing total loss->

Representing the Dice loss, < >>

Representing cross entropy loss.

Further, the step S5 specifically includes:

the method mainly adopts an average intersection ratio mIoU and Overall Accuracy (OA) as evaluation indexes for urban scene remote sensing image segmentation performance evaluation:

the method comprises the steps of carrying out a first treatment on the surface of the Wherein, mIoU represents average cross ratio, < ->

Representing the number of building pixels with correct classification, < >>

The number of background pixels representing classification errors, +.>

The number of pixels in the building that are misclassified is indicated.

The method comprises the steps of carrying out a first treatment on the surface of the Wherein OA represents the overall accuracy, < >>

Representing the number of building pixels with correct classification, < >>

Number of background pixels representing correct classification, +.>

The number of background pixels representing classification errors, +.>

The number of pixels in the building that are misclassified is indicated.

In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:

First, aiming at the technical problems in the prior art and the difficulty in solving the problems, the technical problems solved by the technical proposal of the invention are analyzed in detail and deeply by tightly combining the technical proposal to be protected, the results and data in the research and development process, and the like, and some technical effects brought after the problems are solved have creative technical effects. The specific description is as follows: A. a city streetscape advertisement image segmentation method.

B. Because the method utilizes CSWin Transformer to construct a U-shaped semantic segmentation network structure as an encoder basic unit, global modeling can be performed by utilizing global context information of the urban street view advertisement image, and more accurate semantic segmentation of the urban street view image is realized.

C. Because the image or the feature map is divided into the strip-shaped windows in the CSWin Transformer module and self-attention is carried out in the strip-shaped windows, the problems of high resolution and high image processing complexity of the urban street view image can be solved, and the calculation amount and the complexity are reduced.

D. For the characteristics of the city street view image, a feature fusion module, an ASPP multi-scale feature fusion module and an enhanced segmentation head are designed, so that the segmentation precision is improved.

Secondly, the technical scheme is regarded as a whole or from the perspective of products, and the technical scheme to be protected has the following technical effects and advantages:

A. and modeling global information by using the global context information of the urban street view advertisement image, so as to realize more accurate urban street view advertisement image segmentation.

B. The method solves the problems of high computational complexity and high computational overhead of the Swin Transformer, introduces a CSWin Transformer method to construct an encoder for feature extraction, and reduces the computational overhead while modeling global information.

The following three targeted modules are designed for the characteristics of the urban street view advertisement image.

C. In the decoder stage, a feature fusion module is provided, so that the detail features from the encoder and the deep semantic information of the decoder can be fused better.

D. An ASPP multi-scale fusion module is provided at the jump joint, which is beneficial to the extraction of deep semantic information.

E. The enhanced segmentation head module is more suitable for the segmentation task of the urban street advertisement image, and is beneficial to improving the segmentation precision.

Thirdly, as inventive supplementary evidence of the claims of the present invention, the following important aspects are also presented:

(1) The technical scheme of the invention fills the technical blank in the domestic and foreign industries:

the current image segmentation technique method is mainly applied to medical images, remote sensing images, indoor object images and the like, and lacks an effective segmentation method for urban street advertisement images. The advertisement image is usually high in resolution, blurred in background and unclear in boundary semantics, and is directly applied to image segmentation methods in other fields, so that the effect is not ideal. The method for segmenting the urban street view advertisement image based on CSWin Transformer is designed by analyzing the characteristics of the urban street view advertisement image one by one in a targeted manner, can accurately and effectively extract the urban street view in the advertisement image, can efficiently perform network training, and fills the gap in the domestic and foreign industry technology in the field.

(2) Whether the technical scheme of the invention solves the technical problems that people want to solve all the time but fail to obtain success all the time is solved:

currently, there are two main approaches to the task of segmentation of high resolution images. Firstly, the network architecture based on CNN has low computational complexity, but can not model global information, and is difficult to solve the problems of semantic ambiguity, unclear boundary segmentation and the like in high-resolution image segmentation. The other is a network architecture based on a transducer, global information can be modeled, and the problems that the semantics are fuzzy and the category is difficult to infer are effectively solved, but the model is too huge, and the calculation complexity is higher.

1) CNN-based network architecture: the method mainly adopts a full convolutional neural network (FCN) to construct, adopts a U-shaped symmetrical structure of an encoder and a decoder, adds jump connection between the encoder and the decoder to perform characteristic splicing, and assists in recovering position information. Although the convolution network is a mainstream method for image segmentation, the convolution receptive field is limited, so that global context information of an image cannot be captured well, and the problem of semantic blurring of image segmentation cannot be solved.

2) Transformer-based network architecture: the technology for modeling the global context information of the image in domestic and foreign industries constructs a U-shaped network architecture by ViT, but because ViT encodes the global context information, the technology has high computational complexity and cannot be directly applied to high-resolution image segmentation tasks. Further, viT is to extract features from a single-scale feature map, and has no multi-scale feature information, and has poor segmentation effect on objects of various sizes in an image. Then, a scholars put forward a Swin Transformer to construct a semantic segmentation network, compared with ViT, the Swin Transformer can extract multi-scale characteristic information, so that the computational complexity is reduced to a certain extent, but the computational complexity still exceeds the CNN network architecture.

According to the invention, an image segmentation network is constructed based on CSWin Transformer, and CSWin Transformer and CNN network architectures are combined, so that not only can global information be modeled and accurate semantic segmentation be realized, but also the computational complexity of the model is effectively reduced, and the method is a network model with good segmentation performance and low computational complexity, and is an effective solution for balancing the segmentation performance and the computational complexity.

Drawings

FIG. 1 is a diagram of an embodiment of the present invention a flow chart of a city streetscape advertisement image segmentation method;

FIG. 2 is an overall architecture diagram of a model provided by an embodiment of the present invention;

FIG. 3 is a schematic illustration of the present invention provided by the examples CSWin Transformer cross window attention mechanism drawing;

FIG. 4 is a block diagram of a feature fusion module provided by an embodiment of the present invention;

FIG. 5 is a block diagram of an ASPP multi-scale fusion module according to an embodiment of the present invention;

FIG. 6 is a block diagram of an enhanced split head provided by an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In order to fully understand how the invention may be embodied by those skilled in the art, this section is an explanatory embodiment of the invention, which is described in the following claims.

As shown in fig. 1, an embodiment of the present invention provides a method for dividing an advertisement image of a city street, which specifically includes:

s1: collecting a city streetscape advertisement image data set;

s2: preprocessing an image;

s3: constructing an image model based on CSWin Transformer;

s4: training a model;

s5: urban street view advertisement image segmentation performance evaluation.

(1) Urban street view advertising image dataset collection.

Aerial remote sensing high resolution image datasets were selected for the Vaihingen region and the Potsdam region of Germany provided by ISPRS. The data set image has a manually marked ground object category label map, and has five foreground categories (opaque water surface, building, low vegetation, tree and automobile) and one background category.

Vaihingen is a small and scattered village, the dataset contains 33 city street images of different sizes, the average size of the images is 2494 x 2064 pixels, the

numbers ID

2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35, 38 are selected as the test set, and the remaining 16 images are the training set.

Potsdam is a typical historic city with large building clusters, narrow streets, and dense building structures. The dataset contained 38 city scene images of the same size, the dataset image size being 6000 x 6000 pixels. The test sets with the numbers of 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15 and 7_13 are selected, and the rest 24 pieces are training sets.

(2) And (5) preprocessing an image.

The image preprocessing mainly comprises the following steps:

image clipping: because the sizes of the images are inconsistent, in order to facilitate the subsequent network training, firstly, the city street pictures are cut, and the training set data images are cut by adopting a window with the size of 256 multiplied by 256.

Data enhancement: in order to improve the robustness and generalization capability of the model, all pictures in the training data set are subjected to data enhancement technologies such as random scaling (scaling is [0.5,0.75,1.0,1.25,1.5 ]), random vertical inversion, random horizontal inversion and the like.

(3) Urban streetscape advertisement image the whole structure of the segmentation method.

A simple and effective U-shaped network structure is adopted as a whole in the urban street view advertisement image segmentation method, and mainly comprises an encoder, a decoder, jump connection and a segmentation head. Next, the overall architecture is first introduced, as shown in fig. 2, and then four key modules in the model, as well as CSWin Transformer module, feature fusion module, ASPP multi-scale feature fusion module, and enhanced segmentation head module, are sequentially introduced.

A. An overall architecture.

For a given city street view image

Firstly, obtaining +.A.A token mapping layer consisting of convolution with the size of 7 multiplied by 7 and the step length of 4 in stage1 is passed>

And a picture block sequence with the size, wherein the number of channels is C, and then the global information is learned through a CSWin Transformer module. To obtain a multi-scale, hierarchical representation of the features, the encoder is divided into four stages, each stage comprising a downsampling module consisting of a 3 x 3 convolution of step size 2 and a CSWin Transformer module consisting of CSWin Transformer block, the number of CSWin Transformer block per stage being ∈ ->

. The downsampling module is used for reducing the number of the tokens and doubling the number of channels. Thus, for the ith stage, the feature map consisting of the corresponding number of token has a size of +.>

The number of channels is +.>

This is consistent with other common CNN backbone network architectures.

Through four steps stage braidingA coder stage, obtaining

The feature map of the size is then fed into the decoder stage. The decoder is in a symmetrical structure with the encoder and also comprises four stages, and each stage comprises a CNN up-sampling module and a characteristic and characteristic fusion module. The CNN up-sampling module is composed of +.>

The size deconvolution is used for doubling the size of the characteristic diagram and halving the number of channels. The feature fusion module adopts- >

Convolution designs a lightweight attention mechanism that fuses low-dimensional detail features and high-dimensional semantic features from the encoder in an adaptive weight manner.

In four stages corresponding to the encoder and the decoder, a classical Unet network design is followed, and 4-hop connection is added for assisting in recovering detailed information such as position and the like. Because the features in the stage 3 and the stage 4 have larger receptive fields and contain rich deep semantic features, if the deep semantic information can be understood in a multi-scale manner, the model is facilitated to better understand the multi-scale object information. Thus, in the jump connection of stage 3 and stage 4, an ASPP multi-scale feature fusion module is designed herein based on the attention mechanism.

Finally, the outputs of the four stages of the decoder are up-sampled to a uniform size, all as inputs, fed into the enhanced segmentation head, passed through

Cswin transducer module.

CSWin Transformer is used as an encoder backbone network of the urban street view advertisement image segmentation network, and the network is provided with a self-attention mechanism of a cross window, so that not only can global context information be effectively modeled, but also the calculation cost can be effectively reduced. It is divided into a cross-shaped window by a strip-shaped window in both horizontal and vertical directions, as shown in fig. 3.

For horizontal direction, input

Divided into +.>

Horizontal strips, i.e.

Wherein each band comprises +>

And a token. In particular, the sw width in each stage can be adjusted according to the computational complexity and model conditions, and is not fixed in size. Suppose that the dimensions of query Q, key value K and value V in the transducer are +.>

The number of the multi-head attention heads is +.>

Then there is a horizontally oriented attention result

The definition is as follows:

(1)

(2)

(3)

wherein the method comprises the steps of

Representing the input characteristic map of the object,y represents the result of self-attention, < ->

，

Respectively represent +.>

The individual attention header queries the mapping matrix of Q, key value K and value V,

set to->

Sw represents the width of each stripe, W represents the width of the input feature map, and M represents the number of stripes into which the feature map is divided. Correspondingly, the vertical attention results are defined similarly to the horizontal direction, and represent

. Finally, the attention in two directions are connected together to form a self-attention result

:

(4)

(5)

Wherein Concat stands for splicing operation,head _k representing multi-head attention, k representing the number of heads of multi-head attention,

Represents the result of attention in the horizontal direction, +. >

(6)

(7)

output features representing self-Attention CSWin-Attention, +.>

Representing a multi-layer perceptron is provided.

C. Feature fusion module (Feature Fusion Module, FFM).

In stages 1, 2 and 3 of the decoder, a feature fusion module is designed to implement adaptive selection of enhanced low-level (low-level) detail features or high-level (high-level) semantic information by a lightweight attention mechanism so as to better fuse the low-level detail features and the high-level semantic information from the encoder, and the structure of the module is shown in fig. 4.

The feature fusion module takes as input both low-level detail information from the encoder and high-level semantic information from the decoder. For the first

For each stage, low-level detail information +.>

By passing through and (2)>

Roll and Batch Normalization (BN) layer,obtain output result->

. High-level semantic information->

As input to two branches, one branch passes +.>

Convolution, batch normalization layer and Sigmoid activation layer, generating semantic weights +. >

Output result with low-level detail information +.>

Multiplying; the other branch passes->

Convolving and batching normalization layers to obtain +.>

Adding the multiplication result of the previous step to obtain a final output result +.>

The specific formula is as follows:

(8)

(9)

(10)

(11)

wherein the method comprises the steps of

Representing Sigmoid activation function, BN representing BatchNorm batch normalization, conv representing +.>

Convolution operation->

Representing low-level detail information->

Representing the intermediate result of a low-level detail branch, +.>

Representing high-level semantic information,/->

Representing high-level semantic weights, +.>

Output representing high-level semantic branches, +.>

Representing the output result of the final feature fusion module.

Aspp multiscale fusion module (Multiscale ASPP Fusion Module, MASPP).

In order to help the model to better understand deep semantic information, in the method, a pyramid pooling model ASPP for obtaining a multiscale receptive field is introduced in jump connection corresponding to the stages 3 and 4, and an ASPP multiscale fusion module based on an attention mechanism is provided on the basis of the ASPP model, so that the multiscale feature map can be adaptively weighted by using attention force diagram. Specifically, for the target ground object, the feature map with the matching receptive field is enhanced, while the other feature maps are suppressed, and the specific structure is shown in fig. 5.

First, a feature map is input

ASPP pyramid structures with 5 branches are respectively +.>

Integrating branches, 3 different expansion coefficients +.>

The expansion convolution branch (expansion coefficient rate=6, 8, 12), and also a global average pooling branch. After the feature map passes through 5 minutes, 5 feature maps with different single receptive fields and the same resolution are output +.>

. Each feature map passes through the attention fusion module and is in force with the attention generated by the attention fusion module>

The formula is as follows:

Feature map representing the generation of five branches of ASPP pyramid,/->

Feature map representing output through attention fusion module +.>

Representing attention force diagram, pixel points can be more focused on the pixel points related to the pixel points, and the pixel points are +.>

The definition is as follows:

the method comprises the steps of carrying out a first treatment on the surface of the Wherein Conv represents->

Point-wise convolution operation, BN stands for batch normalization, sigmoid stands for activation function, +.>

Represents the cross multiplication of matrix elements, the formula shows +.>

Is microscopic. Attention can be generated in the channel dimension by the Sigmoid function. Therefore, the attention module can be given different weights not only in the space dimension, but also in the channel dimension.

Finally, splicing the output characteristic diagrams of the five branches, and outputting

。

E. An enhanced split head module (Enhanced Segmentation Head Module, ESegH).

After the feature fusion module and the ASPP multi-scale fusion module, the feature map of each stage in the decoder contains rich space position information and semantic information, which is important to the remote sensing city scene image. The enhanced segmentation head module is provided, firstly, the feature images with low resolution in four stages of a decoder are up-sampled to the same high resolution, element addition is carried out, and then the number of channels is adjusted through two-layer convolution, so that a final semantic segmentation image is generated.

(4) And (5) model training.

In the training process, pictures in a training set and labeling masks corresponding to the pictures are input into a CSWin Transformer-based city streetscape advertisement image segmentation model for training, and a network model is optimized. During training, an Inlet-Weida 3090TiGPU graphic card is adopted, network parameters obtained by pre-training on an ImageNet data set are adopted by CSWin Transformer, an AdamW optimizer optimization model is adopted, the learning rate is set to be 6e-4, and the learning rate is adjusted by a cosine strategy (cosine strategy). Using Dice loss

And Cross loss->

Training of joint supervision model, total loss->

The calculation formula is as follows:

;/>

wherein represents total loss, < > >

Representing the Dice loss, < >>

Representing cross entropy loss.

(5) Urban street view advertisement image segmentation performance evaluation.

The average intersection ratio mIoU and the Overall Accuracy (OA) are mainly adopted as evaluation indexes for urban scene remote sensing image segmentation performance evaluation

Wherein mIoU represents average cross ratio, < >>

Representing the number of building pixels with correct classification, < >>

Indicating the number of correctly classified background pixels. />

The number of background pixels representing classification errors, +.>

The number of pixels in the building that are misclassified is indicated.

Wherein OA represents the overall accuracy, +.>

Representing the number of building pixels with correct classification, < >>

Indicating the number of correctly classified background pixels. />

The number of background pixels representing classification errors, +.>

The number of pixels in the building that are misclassified is indicated.

It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.

The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims

1. A city streetscape advertisement image segmentation method is characterized by comprising the following steps:

s1: collecting a city streetscape advertisement image data set;

s2: preprocessing an image;

s3: constructing an image model based on CSWin Transformer;

s4: training a model;

s5: urban street view advertisement image segmentation performance evaluation.

2. The method for segmenting urban street view advertisement images according to claim 1, wherein the step S1 specifically comprises:

selecting an aerial remote sensing high-resolution image dataset provided by ISPRS in Germany Vaihingen region and Potsdam region; the image in the data set is provided with a manually marked ground object type label graph, and five foreground types and a background type are provided, wherein the five foreground types are respectively an opaque water surface, a building, low vegetation, trees and an automobile;

the Vaihingen is a small and scattered village, the data set comprises 33 city street images with different sizes, the average size of the images is 2494 multiplied by 2064 pixels, the serial numbers ID of 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35 and 38 are selected as test sets, and the other 16 images are selected as training sets;

The Potsdam is a typical historic city, has huge building groups, narrow streets and dense building structures, contains 38 city scene images with the same size in a data set, has the image size of 6000 multiplied by 6000 pixels, and selects the numbers of 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15 and 7_13 as test sets and the rest 24 as training sets.

3. The method for segmenting urban street view advertisement images according to claim 1, wherein the step S2 specifically comprises:

4. The method for segmenting urban street view advertisement images according to claim 1, wherein the step S3 specifically comprises:

the city street advertisement image segmentation method based on CSWin Transformer adopts a simple and effective U-shaped network structure as a whole and mainly comprises a coder, a decoder, jump connection and a segmentation head;

5. The urban street view advertisement image segmentation method according to claim 3, wherein the urban street view advertisement image segmentation method based on CSWin Transformer has the following overall architecture:

for a given city street view image

A picture block sequence with the size and the channel number of C, and then learning global information through a CSWin Transformer module;

to obtain a multi-scale, hierarchical representation of features, the encoder is divided into four stages, eachEach stage comprises a downsampling module consisting of convolution of 3×3 size and step size of 2 and a CSWin Transformer module consisting of CSWin Transformer block, and the number of CSWin Transformer block of each stage is

The downsampling module is used for reducing the number of tokens and doubling the number of channels;

through four stages of encoder stages, obtain

A feature map of size, then feeding the feature degree to the decoder stage; the decoder is in a symmetrical structure with the encoder and also comprises four stages, wherein each stage comprises a CNN up-sampling module and a feature fusion module;

The CNN up-sampling module consists of

Designing a lightweight attention mechanism, and fusing low-dimensional detail features and high-dimensional semantic features from an encoder in a self-adaptive weight mode;

in four stages corresponding to the encoder and the decoder, following the classical Unet network design, adding 4-hop connection for assisting the recovery of the detail information such as the position and the like; because the features in the stage 3 and the stage 4 have larger receptive fields and contain rich deep semantic features, if the deep semantic information can be understood in a multi-scale manner, the model is facilitated to better understand the multi-scale object information; in the jump connection of the stage 3 and the stage 4, designing an ASPP multi-scale feature fusion module based on an attention mechanism;

6. A method for segmenting urban street view advertising images as claimed in claim 3, wherein the CSWin Transformer module is as follows:

CSWin Transformer is taken as an encoder backbone network of the urban street view advertisement image segmentation network, and the network is provided with a self-attention mechanism of a cross window, so that not only can global context information be effectively modeled, but also the calculation cost can be effectively reduced, and the cross window is formed by dividing strip-shaped windows in horizontal and vertical directions;

for horizontal direction, input

Divided into +.>

Horizontal strips, i.e.

Wherein each band comprises +>

The number of the multi-head attention heads is +.>

Horizontal attention outcome->

The definition is as follows:

(1)

(2)

(3)

wherein the method comprises the steps of

Represents an input feature map, Y represents the pair +.>

As a result of self-attention, < >>

，

Respectively represent +.>

Set to->

Sw represents the width of each strip, W represents the width of the input feature map, M represents the number of strips into which the feature map is divided, and the vertical attention result is similarly defined as the horizontal direction, representing- >

:

(4)

(5)

Represents the result of attention in the horizontal direction, +.>

Representing the vertical attention result, the calculation method of the CSWin transformer module in the encoder is as follows:

(6)/>

(7)

output representing self-Attention CSWin-AttentionCharacteristic(s)>

Representing the output characteristics of the multi-layer sensor.

7. The urban street view advertisement image segmentation method according to claim 4, wherein the feature fusion module is configured to adaptively select and enhance low-level detail features or high-level semantic information by using a lightweight attention mechanism so as to better fuse the low-level detail features and the high-level semantic information from the encoder;

the feature fusion module takes as input both low-level detail information from the encoder and high-level semantic information from the decoder for the first

For each stage, low-level detail information +. >

By->

High-level semantic information->

As input to two branches, one branch passes +.>

Convolution, batch normalization layer and Sigmoid activation layer, generating semantic weights +.>

Output result with low-level detail information +.>

Multiplying; the other branch passes->

Convolving and batching normalization layers to obtain +.>

Low level byte branch outcome->

And semantic weight +.>

After multiplication, the result and->

Adding to obtain the final output result of the feature fusion module +.>

The specific formula is as follows:

(8)

(9)

(10)

(11)

wherein the method comprises the steps of

Convolution operation->

Representing low-level detail information->

Representing the intermediate result of a low-level detail branch, +.>

Representing high-level semantic information,/->

Representing high-level semantic weights, +.>

Output representing high-level semantic branches, +.>

And the representative feature fusion module outputs a final result.

8. The method of claim 4, wherein the ASPP multiscale fusion module adaptively weights multiscale feature images by using attention force, the feature images with matching receptive fields are enhanced and other feature images are suppressed for the target ground object, and the method is as follows: first, a feature map is input

ASPP pyramid structures with 5 branches are respectively +.>

Convolution branch, 3 different expansion coefficients +.>

Expanded convolution branches, also a global averagePooling branches, after the feature map passes 5 branches, outputting 5 feature maps with the same resolution and different single receptive fields +.>

The formula is as follows:

wherein->

Feature map representing the generation of five branches of ASPP pyramid,/->

Representing the feature map output through the attention fusion module, < >>

The definition is as follows:

wherein->

Representing an attention deficit map, conv represents +.>

In operation, BN stands for batch normalization, sigmoid stands for activation function, +.>

Represents the cross multiplication of matrix elements, the formula shows +.>

The system is micro, attention can be generated in the channel dimension through activating the function, different weights can be given to the attention module in the space dimension, and different weights can be given to the attention module in the channel dimension;

。/>

9. The method for segmenting urban street view advertisement images according to claim 4, wherein the enhanced segmentation head module is obtained by sampling the feature map of the low resolution of four stages of the decoder to the same high resolution and adding elements, and adjusting the channel number by two-layer convolution after feature fusion module and ASPP multi-scale fusion module, wherein the feature map of each stage in the decoder contains abundant spatial position information and semantic information, which are vital to the remote sensing urban scene image.

10. The method for segmenting urban street view advertisement images according to claim 1, wherein the step S4 specifically comprises:

in the training process, the pictures in the training set and the labeling masks corresponding to the pictures are input into a CSWin Transformer-based city streetscape advertisement image segmentation model for training, and a network model is optimized; during training, an English-Weida 3090TiGPU graphic card is adopted, CSWin Transformer network parameters obtained by pre-training on an ImageNet data set are adopted, an AdamW optimizer optimization model is adopted, the learning rate is set to be 6e-4, and the learning rate is adjusted through a cosine strategy;

using Dice loss

And Cross loss->

Training of joint supervision model, total loss->

The calculation formula is as follows:

Representing total loss->

Representing the Dice loss, < >>

Representing cross entropy loss.

11. The method for segmenting urban street view advertisement images according to claim 1, wherein the step S5 specifically comprises:

the average intersection ratio mIoU and the overall precision are mainly adopted as evaluation indexes for urban scene remote sensing image segmentation performance evaluation: