CN118096922A - Method for generating map based on style migration and remote sensing image - Google Patents

Method for generating map based on style migration and remote sensing image

Info

Publication number
CN118096922A
Authority
CN
China
Prior art keywords
map
image
loss
remote sensing
sensing image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410299406.8A
Other languages
Chinese (zh)
Inventor
王奔
丁志鹏
孙水发
冯阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Normal University
Original Assignee
Hangzhou Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Normal University filed Critical Hangzhou Normal University
Priority to CN202410299406.8A priority Critical patent/CN118096922A/en
Publication of CN118096922A publication Critical patent/CN118096922A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a method for generating a map based on style migration and remote sensing images. The constructed map generation network model comprises an encoder, a style conversion module and a decoder. A fusion of a multi-head self-attention mechanism and a residual module serves as the style converter, capturing long-range dependencies among map features. In the upsampling stage of the decoder, the method combines conventional transposed convolution with the Carafe operator so that neighborhood information is better exploited and upsampling quality is improved. The map generation network model is optimized and trained, a remote sensing image is input into the trained optimal model, and the corresponding map image is output. The invention provides more accurate feature information for the upsampling operation, which is carried out through upsampling kernel prediction and feature reassembly, so that the generated map shows a clear visual improvement in roads, buildings, edge details and the color saturation of map content. The problems of lost detail and unclear content in prior-art map generation are thereby solved.

Description

Method for generating map based on style migration and remote sensing image
Technical Field
The invention belongs to the technical field of map making, and relates to a method for generating a map based on style migration and remote sensing images.
Background
Maps play an important and indispensable role in people's daily life and work. They provide not only spatial positioning and navigation functions but also rich geographic information and spatial data resources. Conventional map-making methods typically rely on manual surveys and vehicle-mounted GPS track data; however, these methods have inherent limitations in the map-updating process. First, conventional map making requires substantial human resources and time, so maps are updated slowly. Second, manual surveys may introduce human error, which in turn leads to differences between the map and the actual ground. In addition, vehicle-mounted GPS trajectory data may be limited by the environment and the equipment and therefore cannot reflect the real situation completely and accurately. Considering the frequent rebuilding of ground structures and roads and the occurrence of natural disasters, actual ground conditions often no longer match the existing map. Thus, there is a need for a map generation method that is both fast and accurate.
The idea of style migration offers a solution to the above problems. In recent years, with improvements in the computing performance of hardware and the rapid development of deep learning, deep learning has been widely applied across computer vision: network models based on deep learning keep emerging, various algorithms are continuously optimized and upgraded, and great results have been achieved in practical applications. Thanks to this, many scholars have begun to use neural networks for style migration of images.
In recent years, many researchers have proposed map generation methods based on GAN networks, for example: the fully supervised model Pix2Pix, a conditional generative adversarial network that realizes one-to-one image style migration through supervised training on paired images, but relies on paired data and even semantic labels; in many practical scenarios, however, it is very difficult to obtain accurately paired data and labels. The unsupervised model CycleGAN ensures the consistency and accuracy of the conversion through a cycle-consistency loss function and needs no paired training data, but, lacking supervised training on paired data, its actual generation results are not very satisfactory. The semi-supervised model SmapGAN, based on a semi-supervised generative adversarial network, realizes style migration between regional remote sensing images and maps; it designs an image gradient L1 loss and an image gradient structure loss to generate stylized map tiles with a global topological relation and detailed object edge curves. Although SmapGAN combines the advantages of supervised and unsupervised models, its ResBlock-based style converter mainly uses local receptive fields to capture spatially local relations in the input data, so long-range dependencies remain poorly handled. In addition, conventional transposed convolution is affected by the local receptive field and padding of the convolution kernel, which easily causes blurring and information loss during upsampling.
Because of these defects of the supervised, unsupervised and semi-supervised models, problems such as lost map detail and unclear content arise when generating maps from remote sensing images. Therefore, a method for generating a map based on style migration and remote sensing images is provided.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method for generating a map based on style migration and remote sensing images.
The method specifically comprises the following steps:
Step 1: Firstly, divide any public remote sensing image dataset into a training set, a verification set and a test set according to a set proportion. The dataset comprises a number of paired images, each containing a remote sensing image and the corresponding map image. Data enhancement processing is applied to the training set and the verification set but not to the test set; the data enhancement includes flip and rotate operations. The training set is used to learn features, the verification set assists in tuning the model, and the test set evaluates the accuracy of the model;
Step 2: constructing a map generation network model, wherein the model comprises an encoder, a style conversion module and a decoder;
The encoder comprises 1 convolution layer with a convolution kernel size of 7×7, a batch normalization layer, a ReLU activation function and 2 downsampling layers, where each downsampling layer comprises a convolution layer with a convolution kernel size of 3×3, a batch normalization layer and a ReLU activation function;
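For illustration only, a minimal PyTorch sketch of an encoder with this layout might look as follows (the 3-channel input and the 64-channel base width are assumptions, not values fixed by this description):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """7x7 conv + BN + ReLU, then two stride-2 3x3 downsampling blocks (assumed widths)."""
        def __init__(self, in_ch=3, base=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, base, kernel_size=7, stride=1, padding=3),
                nn.BatchNorm2d(base), nn.ReLU(inplace=True),
                nn.Conv2d(base, base * 2, kernel_size=3, stride=2, padding=1),      # downsampling layer 1
                nn.BatchNorm2d(base * 2), nn.ReLU(inplace=True),
                nn.Conv2d(base * 2, base * 4, kernel_size=3, stride=2, padding=1),  # downsampling layer 2
                nn.BatchNorm2d(base * 4), nn.ReLU(inplace=True),
            )

        def forward(self, x):      # x: (B, 3, 256, 256)
            return self.net(x)     # Feature_R: (B, 256, 64, 64)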
The style conversion module combines residual blocks with a multi-head self-attention mechanism as the main structure of the style converter. The feature map Feature_R obtained after encoder processing passes sequentially through 9 residual blocks, each composed of a 3×3 convolution layer, a ReLU activation function and a 3×3 convolution layer, to perform the preliminary style conversion. In the last layer of the style converter, a 1×1 convolution layer, a ReLU activation function, a 4-head self-attention module, a ReLU activation function, a 1×1 convolution layer and a ReLU activation function are used in sequence to output Feature_M.
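Under the same caveat, the style converter (9 residual blocks followed by the 1×1 conv / 4-head self-attention stack) could be sketched as below; treating every spatial position as a token for nn.MultiheadAttention is an implementation assumption:

    import torch.nn as nn

    class ResBlock(nn.Module):
        """3x3 conv -> ReLU -> 3x3 conv, with a skip connection."""
        def __init__(self, ch=256):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1),
            )

        def forward(self, x):
            return x + self.body(x)

    class StyleConverter(nn.Module):
        """9 ResBlocks, then 1x1 conv, ReLU, 4-head self-attention, ReLU, 1x1 conv, ReLU."""
        def __init__(self, ch=256, heads=4):
            super().__init__()
            self.res = nn.Sequential(*[ResBlock(ch) for _ in range(9)])
            self.pre = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))
            self.attn = nn.MultiheadAttention(embed_dim=ch, num_heads=heads, batch_first=True)
            self.post = nn.Sequential(nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))

        def forward(self, x):
            x = self.pre(self.res(x))
            b, c, h, w = x.shape
            seq = x.flatten(2).transpose(1, 2)   # (B, H*W, C): one token per pixel
            seq, _ = self.attn(seq, seq, seq)    # long-range dependencies across all positions
            x = seq.transpose(1, 2).reshape(b, c, h, w)
            return self.post(x)                  # Feature_M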
The decoder comprises an upsampling module and a convolution layer with a convolution kernel size of 7×7. A conventional 3×3 transposed convolution taking Feature_M as input is used as the upsampling method of the first layer, expressed as: F_1 = TC(Feature_M), where TC denotes the transposed-convolution operation and F_1 denotes the result after the 3×3 transposed convolution. The Carafe upsampling operator is used at the second layer; the Carafe module consists of three convolution layers and is divided into an upsampling kernel prediction module and a feature reassembly module.
The Carafe module processes features as follows:
First, the number of channels of the input feature map is compressed from C to C_n using a 1×1 convolution layer, expressed as: F_2 = C_1(F_1), where C_1 denotes the channel compression operation and F_2 denotes the result of channel-compressing F_1.
Next, upsampling kernel prediction is performed with a 3×3 convolution kernel, where the input is of size H×W×C_n and the output is of size H×W×(σ²·k_up²). The parameters are set to σ = 2 and k_up = 3, where σ denotes the upsampling factor and k_up the reassembly kernel size. This operation enlarges the receptive field of the encoder so that context information can be fully exploited over a larger area. The channel dimension is then unfolded into the spatial dimension to obtain a tensor of shape σH×σW×k_up², and each reassembly kernel of size k_up×k_up is spatially normalized. This is expressed as: F_3 = Softmax(Unfolding(KPM(F_2))), where KPM denotes the upsampling kernel prediction operation, Unfolding denotes spatially expanding its output, Softmax denotes normalizing each kernel, and F_3 denotes the result of upsampling kernel prediction.
In the feature reassembly module, a dot product is taken between each predicted upsampling kernel and the k_up×k_up region centered on the corresponding feature point of the input feature map, realizing feature reassembly. Finally, the channels are compressed using a convolution layer with a 1×1 kernel and Feature_CM is output. This is expressed as: Feature_CM = C_2(CARM(N(F_1i, k_up), F_3i)), where CARM denotes the feature reassembly operation, N(F_1i, k_up) denotes the k_up×k_up region centered on feature point F_1i of the input feature map, F_3i denotes the upsampling kernel predicted for that point by the kernel prediction module, C_2 denotes the channel compression operation, and Feature_CM denotes the result output by TC-Carafe.
Finally, Feature_CM is fed into a 7×7 convolution layer followed by a Tanh activation to output the generated map image.
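A hedged PyTorch sketch of this TC-Carafe path is shown below; σ = 2 and k_up = 3 follow the text, while the channel widths (256 in, halved by the transposed convolution) and the compressed width C_n = 64 are assumptions:

    import torch.nn as nn
    import torch.nn.functional as F

    class TCCarafe(nn.Module):
        """Layer 1: 3x3 transposed conv. Layer 2: Carafe (kernel prediction + feature reassembly)."""
        def __init__(self, ch=256, cn=64, sigma=2, k_up=3):
            super().__init__()
            self.tc = nn.ConvTranspose2d(ch, ch // 2, 3, stride=2, padding=1, output_padding=1)
            c = ch // 2
            self.compress = nn.Conv2d(c, cn, 1)                                   # C -> C_n
            self.kpm = nn.Conv2d(cn, sigma * sigma * k_up * k_up, 3, padding=1)   # kernel prediction
            self.out = nn.Conv2d(c, c // 2, 1)                                    # final 1x1 compression
            self.sigma, self.k_up = sigma, k_up

        def forward(self, feat_m):
            f1 = self.tc(feat_m)                                # F_1 = TC(Feature_M)
            s, k = self.sigma, self.k_up
            b, c, h, w = f1.shape
            kernels = self.kpm(self.compress(f1))               # (B, s^2*k^2, H, W)
            kernels = F.pixel_shuffle(kernels, s)               # unfold to space: (B, k^2, sH, sW)
            kernels = F.softmax(kernels, dim=1)                 # normalize each k x k kernel
            patches = F.unfold(f1, k, padding=k // 2).view(b, c * k * k, h, w)
            patches = F.interpolate(patches, scale_factor=s, mode='nearest')
            patches = patches.view(b, c, k * k, s * h, s * w)
            f_cm = (patches * kernels.unsqueeze(1)).sum(dim=2)  # dot product: feature reassembly
            return self.out(f_cm)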
Step 3: inputting the training set and the verification set in the step 1 into the network model built in the step 2 for optimization training to obtain an optimal model;
The map output in step 2 is judged by a discriminator. The discriminator adopts PatchGAN, which decomposes the task of judging the whole image into judging a plurality of local regions (patches) of the image; this helps the network understand the local structure and details of the image more carefully;
The loss function used in the training process is as follows:
1) Topology consistency loss: L_topo = L_graL1 + L_grastr, where L_graL1 is the image gradient L1 loss and L_grastr is the image gradient structure loss.
Here, L_graL1 = E_{x~p(x)} ||G(G_X→Y(x)) - G(y)||_1, and L_grastr = E_{x~p(x)} [1 - (1/(M+N)) (Σ_{j=1..N} (cov(G_j(y), G_j(G_X→Y(x))) + C_1) / (σ(G_j(y))·σ(G_j(G_X→Y(x))) + C_1) + Σ_{i=1..M} (cov(G_i(y), G_i(G_X→Y(x))) + C_2) / (σ(G_i(y))·σ(G_i(G_X→Y(x))) + C_2))]. In these formulas, x~p(x) denotes sampling from the remote sensing image samples; C_1 and C_2 are constant terms; M and N denote that the input image has M rows and N columns; cov(G_j(y), G_j(G_X→Y(x))) is the covariance of G_j(y) and G_j(G_X→Y(x)); σ(G_j(y)) and σ(G_j(G_X→Y(x))) are their standard deviations; the row terms cov(G_i(y), G_i(G_X→Y(x))), σ(G_i(y)) and σ(G_i(G_X→Y(x))) are defined likewise. For the 255×255 gradient images of the real map and the generated map, the pixel matrices G(y) and G(G_X→Y(x)) have 255 rows and 255 columns. For the j-th column (i-th row), the pixel values of the points on that column (row) form an M-dimensional (N-dimensional) random variable G_j (G_i). G(y) is the gradient image of the real map y, and G(G_X→Y(x)) is the gradient image of the generated fake map G_X→Y(x).
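As an illustrative reading of these terms, the two gradient losses are sketched below; the forward-difference gradient operator, the constants c1 = c2 = 1e-4, and the simplified averaging of the column/row terms are assumptions:

    import torch
    import torch.nn.functional as F

    def gradient(img):
        """Forward-difference gradient magnitude; a 256x256 map gives a 255x255 gradient image."""
        dx = img[..., 1:, 1:] - img[..., 1:, :-1]
        dy = img[..., 1:, 1:] - img[..., :-1, 1:]
        return dx.abs() + dy.abs()

    def gradient_l1_loss(fake, real):
        """L_graL1: pixel-wise L1 between gradient images of generated and real maps."""
        return F.l1_loss(gradient(fake), gradient(real))

    def gradient_structure_loss(fake, real, c1=1e-4, c2=1e-4):
        """L_grastr sketch: covariance similarity of gradient-image columns and rows,
        with the (1/(M+N)) weighting simplified to an average of the two terms."""
        gf, gr = gradient(fake), gradient(real)

        def term(a, b, dim, c):
            mu_a, mu_b = a.mean(dim, keepdim=True), b.mean(dim, keepdim=True)
            cov = ((a - mu_a) * (b - mu_b)).mean(dim)
            return ((cov + c) / (a.std(dim) * b.std(dim) + c)).mean()

        # columns vary along the row axis (-2); rows vary along the column axis (-1)
        return 1.0 - 0.5 * (term(gr, gf, -2, c1) + term(gr, gf, -1, c2))

    def topology_loss(fake, real):
        """L_topo = L_graL1 + L_grastr."""
        return gradient_l1_loss(fake, real) + gradient_structure_loss(fake, real)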
2) Content loss: the aim is to ensure that the generated map is similar in content to the Ground Truth. Here L_cycX and L_cycY are the cycle losses, and L_dirX and L_dirY are the direct losses. In the unsupervised phase, the cycle losses are employed; in the supervised phase, the direct losses are employed.
Here, L_cycX = λ·E_{x~p(x)} ||G_Y→X(G_X→Y(x)) - x||_1 represents the cycle loss of the remote sensing image, where λ is the fine-tuning coefficient and L1u represents the L1 loss in the unsupervised stage; the cycle loss computes the pixel-wise L1 difference so that the generated map and the remote sensing image keep cycle consistency in content. x~p(x) represents sampling from the remote sensing image samples, and G_Y→X(G_X→Y(x)) - x computes the cycle loss between the fake remote sensing image, obtained by passing the fake map image G_X→Y(x) through G_Y→X, and the real remote sensing image x.
Here, L_cycY = E_{y~p(y)} ||G_X→Y(G_Y→X(y)) - y||_1 + L_topo represents the cycle loss of the map, where y~p(y) represents sampling from the map samples. The topology consistency loss L_topo is introduced to keep the topology of the generated image cyclically consistent with that of the target image. G_X→Y(G_Y→X(y)) - y computes the cycle loss between the fake map image, obtained by passing the fake remote sensing image G_Y→X(y) through G_X→Y, and the real map image y.
The direct loss of map to remote sensing image, L_dirX = λ·E_{(x,y)} ||G_Y→X(y) - x||_1, maintains content consistency through the L1 loss function, where λ is the fine-tuning coefficient, L1 represents the L1 loss, and the term is the pixel L1 loss. G_Y→X(y) - x computes the loss between the generated fake remote sensing image and the real remote sensing image.
The direct loss of remote sensing image to map, L_dirY = λ·E_{(x,y)} ||G_X→Y(x) - y||_1 + L_topo, keeps the generated map consistent with the remote sensing image in topology through the L_topo loss. G_X→Y(x) - y computes the loss between the generated fake map image and the real map image.
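These content losses can be sketched as follows; the generators G_xy and G_yx are placeholders, λ = 10 is an assumed value of the fine-tuning coefficient, and topo_loss stands for the topology consistency loss sketched earlier:

    import torch.nn.functional as F

    def cycle_losses(x, y, G_xy, G_yx, topo_loss, lam=10.0):
        """Unsupervised stage: L_cycX + L_cycY on x (remote sensing) and y (map)."""
        fake_map = G_xy(x)
        fake_rs = G_yx(y)
        l_cyc_x = lam * F.l1_loss(G_yx(fake_map), x)                        # ||G_yx(G_xy(x)) - x||_1
        l_cyc_y = F.l1_loss(G_xy(fake_rs), y) + topo_loss(G_xy(fake_rs), y)
        return l_cyc_x + l_cyc_y

    def direct_losses(x, y, G_xy, G_yx, topo_loss, lam=10.0):
        """Supervised stage: L_dirX + L_dirY on a paired sample (x, y)."""
        l_dir_x = lam * F.l1_loss(G_yx(y), x)                               # map -> remote sensing
        l_dir_y = lam * F.l1_loss(G_xy(x), y) + topo_loss(G_xy(x), y)       # remote sensing -> map
        return l_dir_x + l_dir_y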
3) Adversarial loss: the discriminator distinguishes the difference between the generated image and the real image. The purpose of the generator G is to minimize the loss function value, and the purpose of the discriminator D is to maximize it. The formulas are as follows:
The adversarial loss of remote sensing image to map is L_GAN(G_X→Y, D_Y) = E_{y~p(y)}[log D_Y(y)] + E_{x~p(x)}[log(1 - D_Y(G_X→Y(x)))], where G_X→Y is the generator from remote sensing images to maps and the generated image is input to the discriminator D_Y for discrimination.
The adversarial loss of map to remote sensing image is L_GAN(G_Y→X, D_X) = E_{x~p(x)}[log D_X(x)] + E_{y~p(y)}[log(1 - D_X(G_Y→X(y)))], where G_Y→X is the generator from maps to remote sensing images and the generated image is sent to the discriminator D_X for discrimination.
4) Identity loss: used to ensure consistency between the converted image and the original image. For example, a map fed into the generator G_X→Y should remain as consistent as possible with the input map, i.e. preserve the content and color of the map itself. The formula is as follows: L_identity = E_{y~p(y)} ||G_X→Y(y) - y||_1 + E_{x~p(x)} ||G_Y→X(x) - x||_1.
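The adversarial and identity terms can likewise be sketched; reading the log-loss formulas as binary cross-entropy over the PatchGAN logits is an implementation assumption:

    import torch
    import torch.nn.functional as F

    def adversarial_loss_D(D, real, fake):
        """Discriminator side: push real patches toward 1 and generated patches toward 0."""
        pred_real, pred_fake = D(real), D(fake.detach())
        return (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) +
                F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))

    def adversarial_loss_G(D, fake):
        """Generator side: make the discriminator score generated patches as real."""
        pred_fake = D(fake)
        return F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))

    def identity_loss(x, y, G_xy, G_yx):
        """A map fed to G_xy (and a remote sensing image fed to G_yx) should pass through unchanged."""
        return F.l1_loss(G_xy(y), y) + F.l1_loss(G_yx(x), x)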
Step 4: input the remote sensing image into the optimal model obtained after the training of step 3, and output the corresponding map image.
Compared with the prior art, the invention has remarkable advantages. First, the long-range dependencies between features are captured by combining a residual module with a multi-head self-attention mechanism (Multi-Headed Self-Attention), providing more accurate feature information for the upsampling operation. Second, a novel upsampling method combining conventional transposed convolution with the Carafe operator is provided; upsampling is performed through upsampling kernel prediction and feature reassembly, so the generated map is clearer and more complete.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a map generation network module of the present invention;
FIG. 3 is a block diagram of a style converter according to the present invention;
FIG. 4 is a diagram of an upsampling module according to the present invention;
FIG. 5 is a block diagram of the discriminator according to the present invention;
FIG. 6 is a diagram showing the visual-effect comparison between the map generation method based on remote sensing images of the present invention and other methods.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific implementation steps.
As shown in fig. 1, a map generating method based on remote sensing images specifically includes the following steps:
Step 1: in this embodiment, the public New York City dataset is used and divided into a training set, a validation set and a test set in the ratio 8:1:1. The training set and the validation set are subjected to data enhancement processing. The zoom level is 16, the dataset contains 2194 remote sensing images with their corresponding maps, and the image size is 256×256.
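As a purely illustrative sketch (the rs/ and map/ directory names and the helper functions are assumptions), the 8:1:1 split and the paired flip/rotate enhancement could look like:

    import random
    from pathlib import Path
    import torchvision.transforms.functional as TF

    def split_dataset(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
        """Split paired (remote sensing, map) samples into train/val/test at 8:1:1."""
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        n_train, n_val = int(len(pairs) * ratios[0]), int(len(pairs) * ratios[1])
        return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]

    def augment_pair(rs_img, map_img):
        """Apply the same flip/rotation to both images of a pair (train/val sets only)."""
        if random.random() < 0.5:
            rs_img, map_img = TF.hflip(rs_img), TF.hflip(map_img)
        angle = random.choice([0, 90, 180, 270])
        return TF.rotate(rs_img, angle), TF.rotate(map_img, angle)

    pairs = list(zip(sorted(Path("rs").glob("*.png")), sorted(Path("map").glob("*.png"))))
    train, val, test = split_dataset(pairs)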
Step 2: constructing a map generation network model, as shown in fig. 2, wherein the model comprises an encoder, a style conversion module and a decoder;
the encoder comprises 1 convolution layer with a convolution kernel size of 7×7, a batch normalization layer (BN), a ReLU activation function and 2 downsampling layers (each comprising a convolution layer with a convolution kernel size of 3×3, a batch normalization layer (BN) and a ReLU activation function). A C×H×W remote sensing image is input, and the feature-map size and redundant information are reduced by the 7×7 convolution layer and the two 3×3 downsampling layers of the encoder.
As shown in fig. 3, the style conversion module combines residual blocks with a multi-head self-attention mechanism (MHSA) as the main structure of the style converter. The feature map Feature_R obtained after encoder processing passes sequentially through 9 residual blocks, each composed of a 3×3 convolution layer, a ReLU activation function and a 3×3 convolution layer, to perform the preliminary style conversion. In the last layer of the style converter, a 1×1 convolution layer, a ReLU activation function, a 4-head self-attention module, a ReLU activation function, a 1×1 convolution layer and a ReLU activation function are used in sequence to output Feature_M. By introducing the multi-head self-attention mechanism, feature representations can be learned in parallel over multiple subspaces to capture long-range dependencies between pixels in the image, thereby increasing the nonlinear capability of the model.
The decoder comprises an upsampling module and a convolution layer with a convolution kernel size of 7×7. As shown in fig. 4, a conventional 3×3 transposed convolution taking Feature_M as input is used as the first-layer upsampling method, which is expressed as:
F_1 = TC(Feature_M);
where TC denotes the transposed-convolution operation and F_1 denotes the result after the 3×3 transposed convolution.
The Carafe upsampling operator is used at the second layer to better improve the detail and sharpness of image generation. Carafe upsampling has the advantages of a large receptive field, light weight and high computation speed. The Carafe module consists of three convolution layers and is divided into an upsampling kernel prediction module and a feature reassembly module. In the upsampling kernel prediction module, the input data is a feature map of size C×H×W, where C is the number of channels, H the number of pixels in the vertical dimension and W the number of pixels in the horizontal dimension. First, the number of channels of the input feature map is compressed from C to C_n using a 1×1 convolution layer, expressed as:
F_2 = C_1(F_1);
where C_1 denotes the channel compression operation and F_2 denotes the result of channel-compressing F_1.
Next, upsampling kernel prediction is performed with a 3×3 convolution kernel, where the input is of size H×W×C_n and the output is of size H×W×(σ²·k_up²). The parameters are set to σ = 2 and k_up = 3, where σ denotes the upsampling factor and k_up the reassembly kernel size. This operation enlarges the receptive field of the encoder so that context information can be fully exploited over a larger area. The channel dimension is then unfolded into the spatial dimension to obtain a tensor of shape σH×σW×k_up², and each reassembly kernel of size k_up×k_up is spatially normalized. This is expressed as:
F_3 = Softmax(Unfolding(KPM(F_2)));
where KPM denotes the upsampling kernel prediction operation, Unfolding denotes spatially expanding its output, Softmax denotes normalizing each kernel, and F_3 denotes the result of upsampling kernel prediction.
In the feature reassembly module, a dot product is taken between each predicted upsampling kernel and the k_up×k_up region centered on the corresponding feature point of the input feature map, realizing feature reassembly. Finally, the channels are compressed using a convolution layer with a 1×1 kernel and Feature_CM is output. This is expressed as:
Feature_CM = C_2(CARM(N(F_1i, k_up), F_3i));
where CARM denotes the feature reassembly operation, N(F_1i, k_up) denotes the k_up×k_up region centered on feature point F_1i of the input feature map, F_3i denotes the upsampling kernel predicted for that point by the kernel prediction module, C_2 denotes the channel compression operation, and Feature_CM denotes the result output by TC-Carafe.
Finally, Feature_CM is fed into a 7×7 convolution layer followed by a Tanh activation to output the generated map image.
Step 3: send the training set and the verification set obtained in step 1 into the network built in step 2, train with the adjusted parameters, and save the optimal model;
Specifically, during training, the batch size was set to 1 and the network was optimized using the Adam optimizer with β_1 = 0.5 and β_2 = 0.999. A linear learning-rate strategy is used: the initial learning rate is 0.0002, kept constant from epoch 1 to 100 and then decayed linearly to 0 from epoch 101 to 200. All models were trained for 200 epochs.
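A minimal sketch of this optimization schedule, with a placeholder module standing in for the full map generation network:

    import torch
    import torch.nn as nn

    net = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for the map generation network
    opt = torch.optim.Adam(net.parameters(), lr=2e-4, betas=(0.5, 0.999))

    def lr_factor(epoch, keep=100, total=200):
        """Factor 1.0 for epochs 1..100, then linear decay to 0 by epoch 200."""
        return 1.0 if epoch <= keep else max(0.0, (total - epoch) / (total - keep))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda e: lr_factor(e + 1))

    for epoch in range(200):
        # ... one pass over the training set with batch size 1, computing the losses above ...
        opt.step()
        sched.step()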
As shown in fig. 5, when the image generated by the generator is input to the PatchGAN discriminator, the discrimination task is subdivided into judging a plurality of local regions (patches) of the image. This decomposition helps the discriminator understand the local structure and details of the map image more finely. Specifically, after the image passes through 5 convolution layers of 4×4 with LeakyReLU and BN layers, the input is mapped to a 70×70 matrix, and true/false discrimination is performed for each patch of that matrix. The discriminator judges the authenticity of each local region one by one: a patch judged true is marked 1 and a patch judged false is marked 0, and finally the probability that the whole image is true is calculated. This discrimination mode pushes the generator to better capture the local characteristics of the map, improves the realism and detail expression of the generated map, and optimizes model training through the adversarial loss.
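Following the five-layer description, such a PatchGAN discriminator might be sketched as below (channel widths are assumed, mirroring the common 70×70 PatchGAN layout; omitting BN on the first layer is a common choice):

    import torch.nn as nn

    class PatchDiscriminator(nn.Module):
        """Five 4x4 conv layers with LeakyReLU/BN; each output logit scores one local patch."""
        def __init__(self, in_ch=3, base=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
                nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
                nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(base * 4, base * 8, 4, stride=1, padding=1),
                nn.BatchNorm2d(base * 8), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(base * 8, 1, 4, stride=1, padding=1),  # one real/fake logit per patch
            )

        def forward(self, x):
            return self.net(x)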
The loss function used in the training process is as follows:
1) Topology consistency loss (Topological Consistency Loss): used to ensure the correctness of the topology of G_X→Y (remote sensing to map). L_topo = L_graL1 + L_grastr, where L_graL1 is the image gradient L1 loss and L_grastr is the image gradient structure loss.
Here, L_graL1 = E_{x~p(x)} ||G(G_X→Y(x)) - G(y)||_1, and L_grastr = E_{x~p(x)} [1 - (1/(M+N)) (Σ_{j=1..N} (cov(G_j(y), G_j(G_X→Y(x))) + C_1) / (σ(G_j(y))·σ(G_j(G_X→Y(x))) + C_1) + Σ_{i=1..M} (cov(G_i(y), G_i(G_X→Y(x))) + C_2) / (σ(G_i(y))·σ(G_i(G_X→Y(x))) + C_2))]. In these formulas, x~p(x) denotes sampling from the remote sensing image samples; C_1 and C_2 are constant terms; M and N denote that the input image has M rows and N columns; cov(G_j(y), G_j(G_X→Y(x))) is the covariance of G_j(y) and G_j(G_X→Y(x)); σ(G_j(y)) and σ(G_j(G_X→Y(x))) are their standard deviations; the row terms cov(G_i(y), G_i(G_X→Y(x))), σ(G_i(y)) and σ(G_i(G_X→Y(x))) are defined likewise. For the 255×255 gradient images of the real map and the generated map, the pixel matrices G(y) and G(G_X→Y(x)) have 255 rows and 255 columns. For the j-th column (i-th row), the pixel values of the points on that column (row) form an M-dimensional (N-dimensional) random variable G_j (G_i). G(y) is the gradient image of the real map y, and G(G_X→Y(x)) is the gradient image of the generated fake map G_X→Y(x).
2) Content loss (Content Loss): the aim is to ensure that the generated map is similar in content to the Ground Truth. Here L_cycX and L_cycY are the cycle losses, and L_dirX and L_dirY are the direct losses. In the unsupervised phase, the cycle losses are employed; in the supervised phase, the direct losses are employed.
Here, L_cycX = λ·E_{x~p(x)} ||G_Y→X(G_X→Y(x)) - x||_1 represents the cycle loss of the remote sensing image, where λ is the fine-tuning coefficient and L1u represents the L1 loss in the unsupervised stage; the cycle loss computes the pixel-wise L1 difference so that the generated map and the remote sensing image keep cycle consistency in content. x~p(x) represents sampling from the remote sensing image samples, and G_Y→X(G_X→Y(x)) - x computes the cycle loss between the fake remote sensing image, obtained by passing the fake map image G_X→Y(x) through G_Y→X, and the real remote sensing image x.
Here, L_cycY = E_{y~p(y)} ||G_X→Y(G_Y→X(y)) - y||_1 + L_topo represents the cycle loss of the map, where y~p(y) represents sampling from the map samples. The topology consistency loss L_topo is introduced to keep the topology of the generated image cyclically consistent with that of the target image. G_X→Y(G_Y→X(y)) - y computes the cycle loss between the fake map image, obtained by passing the fake remote sensing image G_Y→X(y) through G_X→Y, and the real map image y.
The direct loss of map to remote sensing image, L_dirX = λ·E_{(x,y)} ||G_Y→X(y) - x||_1, maintains content consistency through the L1 loss function, where λ is the fine-tuning coefficient, L1 represents the L1 loss, and the term is the pixel L1 loss. G_Y→X(y) - x computes the loss between the generated fake remote sensing image and the real remote sensing image.
The direct loss of remote sensing image to map, L_dirY = λ·E_{(x,y)} ||G_X→Y(x) - y||_1 + L_topo, keeps the generated map consistent with the remote sensing image in topology through the L_topo loss. G_X→Y(x) - y computes the loss between the generated fake map image and the real map image.
3) Adversarial loss (Adversarial Loss): the discriminator distinguishes the difference between the generated image and the real image. The purpose of the generator G is to minimize the loss function value, and the purpose of the discriminator D is to maximize it. The formulas are as follows:
The adversarial loss of remote sensing image to map is L_GAN(G_X→Y, D_Y) = E_{y~p(y)}[log D_Y(y)] + E_{x~p(x)}[log(1 - D_Y(G_X→Y(x)))], where G_X→Y is the generator from remote sensing images to maps and the generated image is input to the discriminator D_Y for discrimination.
The adversarial loss of map to remote sensing image is L_GAN(G_Y→X, D_X) = E_{x~p(x)}[log D_X(x)] + E_{y~p(y)}[log(1 - D_X(G_Y→X(y)))], where G_Y→X is the generator from maps to remote sensing images and the generated image is sent to the discriminator D_X for discrimination.
4) Identity loss (Identity Loss): used to ensure consistency between the converted image and the original image. For example, a map fed into the generator G_X→Y should remain as consistent as possible with the input map, i.e. preserve the content and color of the map itself. The formula is as follows: L_identity = E_{y~p(y)} ||G_X→Y(y) - y||_1 + E_{x~p(x)} ||G_Y→X(x) - x||_1.
step 4: performing remote sensing image generation map test on the optimal model obtained in the step 3 to obtain a map image;
As shown in fig. 6, which presents the generation results of the map generation models, CycleGAN and Pix2Pix produce erroneous content, while SmapGAN suffers from blurred content and unclear edges. In the style converter, the invention adopts MHSA-ResBlock, i.e. a residual module combined with Multi-Headed Self-Attention, to capture the long-range dependencies between features and thus provide more accurate feature information for the upsampling operation. Second, the novel upsampling method TC-Carafe, combining conventional transposed convolution with the Carafe operator, performs upsampling through upsampling kernel prediction and feature reassembly, so the generated map shows a clear visual improvement in roads, buildings, edge details and the color saturation of map content. The index comparison between this map generation method and the other methods is shown in Table 1; the advantages of the method are evident, with every evaluation index (peak signal-to-noise ratio (PSNR), structural similarity (SSIM), root mean square error (RMSE)) superior to the other methods.
TABLE 1
Model PSNR SSIM RMSE
Pix2Pix 19.9255 0.6719 28.9183
CycleGAN 24.5271 0.8157 18.3162
SmapGAN 27.5014 0.8742 12.4684
Ours 28.1147 0.8784 11.7484
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. A method for generating a map based on style migration and remote sensing images is characterized by comprising the following steps:
The method specifically comprises the following steps:
Step 1: firstly, dividing an arbitrary public remote sensing image data set into a training set, a verification set and a test set according to a set proportion; the data set comprises a plurality of paired images, and each paired image comprises a remote sensing image and a corresponding map image; meanwhile, data enhancement processing is carried out on the training set and the verification set, and the data enhancement processing is not carried out on the test set;
Step 2: constructing a map generation network model, wherein the model comprises an encoder, a style conversion module and a decoder;
The encoder comprises 1 convolution layer with the convolution kernel size of 7×7, a batch normalization layer, a ReLU activation function and 2 downsampling layers, wherein the downsampling layers comprise the convolution layer with the convolution kernel size of 3×3, the batch normalization layer and the ReLU activation function;
The style conversion module combines residual blocks with a multi-head self-attention mechanism as the main structure of the style converter; the feature map Feature_R obtained after encoder processing passes sequentially through 9 residual blocks, each consisting of a 3×3 convolution layer, a ReLU activation function and a 3×3 convolution layer, to perform the preliminary style conversion; in the last layer of the style converter, a 1×1 convolution layer, a ReLU activation function, a 4-head self-attention module, a ReLU activation function, a 1×1 convolution layer and a ReLU activation function are used in sequence to output Feature_M;
the decoder comprises an upsampling module and a convolution layer with a convolution kernel size of 7×7, and a conventional 3×3 transposed convolution taking Feature_M as input is used as the upsampling method of the first layer, expressed as: F_1 = TC(Feature_M); where TC denotes the transposed-convolution operation and F_1 denotes the result after the 3×3 transposed convolution; the Carafe upsampling operator is used at the second layer, wherein the Carafe module consists of three convolution layers and is divided into an upsampling kernel prediction module and a feature reassembly module;
step 3: inputting the training set and the verification set of step 1 into the network model built in step 2 for optimization training to obtain the optimal model;
step 4: inputting the remote sensing image into the optimal model obtained by the training of step 3, and outputting the corresponding map image.
2. The method of generating a map from a remote sensing image of claim 1, wherein: the data enhancement described in step 1 includes flipping and rotating operations.
3. The method of generating a map from a remote sensing image of claim 1, wherein: the Carafe module processing procedure in step 2 is specifically as follows:
First, the number of channels of the input feature map is compressed from C to C_n using a 1×1 convolution layer; it is expressed as: F_2 = C_1(F_1); wherein C_1 denotes the channel compression operation and F_2 denotes the result obtained by channel-compressing F_1;
Next, upsampling kernel prediction is performed with a 3×3 convolution kernel, where the input is of size H×W×C_n and the output is of size H×W×(σ²·k_up²); the parameters are set to σ = 2 and k_up = 3, where σ denotes the upsampling factor and k_up denotes the reassembly kernel size; by this operation the receptive field of the encoder is enlarged and context information can be fully utilized over a larger area; the channel dimension is then unfolded into the spatial dimension to obtain a tensor of shape σH×σW×k_up², and each reassembly kernel of size k_up×k_up is spatially normalized; it is expressed as: F_3 = Softmax(Unfolding(KPM(F_2))); wherein KPM denotes performing the upsampling kernel prediction operation, Unfolding denotes spatially expanding its output, Softmax denotes normalizing each kernel, and F_3 denotes the result of upsampling kernel prediction;
in the feature reassembly module, a dot product operation is carried out between each predicted upsampling kernel and the k_up×k_up region centered on the corresponding feature point of the input feature map to realize feature reassembly; finally, the channels are compressed using a convolution layer with a 1×1 kernel and Feature_CM is output; it is expressed as: Feature_CM = C_2(CARM(N(F_1i, k_up), F_3i)); wherein CARM denotes the feature reassembly operation, N(F_1i, k_up) denotes the k_up×k_up region of the input feature map centered on feature point F_1i, F_3i denotes the upsampling kernel of that point predicted by the upsampling kernel prediction module, C_2 denotes the channel compression operation, and Feature_CM denotes the result output by TC-Carafe;
finally, Feature_CM is fed into a 7×7 convolution layer followed by a Tanh activation to output the generated map image.
4. The method of generating a map from a remote sensing image of claim 1, wherein: in step 3, the model constructed in step 2 is optimally trained by adopting a loss function and a discriminator:
The discriminator adopts PatchGAN, is formed by multi-layer convolution, and disassembles the judging task of the whole image into judging tasks of a plurality of local areas in the image;
The loss function used in the training process is as follows:
1) Topology consistency loss: L_topo = L_graL1 + L_grastr, wherein L_graL1 is the image gradient L1 loss and L_grastr is the image gradient structure loss;
in the formulas, L_graL1 = E_{x~p(x)} ||G(G_X→Y(x)) - G(y)||_1 and L_grastr = E_{x~p(x)} [1 - (1/(M+N)) (Σ_{j=1..N} (cov(G_j(y), G_j(G_X→Y(x))) + C_1) / (σ(G_j(y))·σ(G_j(G_X→Y(x))) + C_1) + Σ_{i=1..M} (cov(G_i(y), G_i(G_X→Y(x))) + C_2) / (σ(G_i(y))·σ(G_i(G_X→Y(x))) + C_2))], where x~p(x) denotes sampling from the remote sensing image samples; C_1 and C_2 are constant terms; M and N denote that the input image has M rows and N columns; cov(G_j(y), G_j(G_X→Y(x))) is the covariance of G_j(y) and G_j(G_X→Y(x)); σ(G_j(y)) and σ(G_j(G_X→Y(x))) are their standard deviations; the row terms cov(G_i(y), G_i(G_X→Y(x))), σ(G_i(y)) and σ(G_i(G_X→Y(x))) are defined likewise; for the 255×255 gradient images of the real map and the generated map, the pixel matrices G(y) and G(G_X→Y(x)) have 255 rows and 255 columns; for the j-th column (i-th row), the pixel values of the points on that column (row) form an M-dimensional (N-dimensional) random variable G_j (G_i); G(y) is the gradient image of the real map y, and G(G_X→Y(x)) is the gradient image of the generated fake map G_X→Y(x);
2) Content loss: the aim is to ensure that the generated map is similar in content to the Ground Truth, wherein L_cycX and L_cycY are the cycle losses and L_dirX and L_dirY are the direct losses; in the unsupervised stage, the cycle losses are adopted; in the supervised stage, the direct losses are adopted;
in the formulas, L_cycX = λ·E_{x~p(x)} ||G_Y→X(G_X→Y(x)) - x||_1 represents the cycle loss of the remote sensing image, λ is the fine-tuning coefficient, L1u represents the L1 loss in the unsupervised stage, and the pixel-wise L1 loss computes the difference between pixels so that the generated map and the remote sensing image keep cycle consistency in content; x~p(x) represents sampling from the remote sensing image samples, and G_Y→X(G_X→Y(x)) - x computes the cycle loss between the fake remote sensing image, obtained by passing the fake map image G_X→Y(x) through G_Y→X, and the real remote sensing image x;
L_cycY = E_{y~p(y)} ||G_X→Y(G_Y→X(y)) - y||_1 + L_topo represents the cycle loss of the map, and y~p(y) represents sampling from the map samples; the topology consistency loss L_topo is introduced to keep the topology of the generated image cyclically consistent with that of the target image; G_X→Y(G_Y→X(y)) - y computes the cycle loss between the fake map image, obtained by passing the fake remote sensing image G_Y→X(y) through G_X→Y, and the real map image y;
the direct loss of map to remote sensing image, L_dirX = λ·E_{(x,y)} ||G_Y→X(y) - x||_1, maintains content consistency through the L1 loss function, wherein λ is the fine-tuning coefficient, L1 represents the L1 loss, and the term is the pixel L1 loss; G_Y→X(y) - x computes the loss between the generated fake remote sensing image and the real remote sensing image;
the direct loss of remote sensing image to map, L_dirY = λ·E_{(x,y)} ||G_X→Y(x) - y||_1 + L_topo, keeps the generated map consistent with the remote sensing image in topology through the L_topo loss; G_X→Y(x) - y computes the loss between the generated fake map image and the real map image;
3) Adversarial loss: the discriminator distinguishes the difference between the generated image and the real image; the purpose of the generator G is to minimize the loss function value, and the purpose of the discriminator D is to maximize it; the formulas are as follows:
the adversarial loss of remote sensing image to map is L_GAN(G_X→Y, D_Y) = E_{y~p(y)}[log D_Y(y)] + E_{x~p(x)}[log(1 - D_Y(G_X→Y(x)))], wherein G_X→Y is the generator from remote sensing images to maps, and the generated image is input to the discriminator D_Y for discrimination;
the adversarial loss of map to remote sensing image is L_GAN(G_Y→X, D_X) = E_{x~p(x)}[log D_X(x)] + E_{y~p(y)}[log(1 - D_X(G_Y→X(y)))], wherein G_Y→X is the generator from maps to remote sensing images, and the generated image is sent to the discriminator D_X for discrimination;
4) Identity loss: used to ensure consistency between the converted image and the original image; for example, a map fed into the generator G_X→Y should remain as consistent as possible with the input map, i.e. preserve the content and color of the map itself; the formula is as follows: L_identity = E_{y~p(y)} ||G_X→Y(y) - y||_1 + E_{x~p(x)} ||G_Y→X(x) - x||_1.
CN202410299406.8A 2024-03-15 2024-03-15 Method for generating map based on style migration and remote sensing image Pending CN118096922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410299406.8A CN118096922A (en) 2024-03-15 2024-03-15 Method for generating map based on style migration and remote sensing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410299406.8A CN118096922A (en) 2024-03-15 2024-03-15 Method for generating map based on style migration and remote sensing image

Publications (1)

Publication Number Publication Date
CN118096922A true CN118096922A (en) 2024-05-28

Family

ID=91147648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410299406.8A Pending CN118096922A (en) 2024-03-15 2024-03-15 Method for generating map based on style migration and remote sensing image

Country Status (1)

Country Link
CN (1) CN118096922A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118521498A (en) * 2024-07-23 2024-08-20 南昌航空大学 Industrial defect image generation method, device, medium and product


Similar Documents

Publication Publication Date Title
CN111476219B (en) Image target detection method in intelligent home environment
CN113449594B (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
CN111738111A (en) Road extraction method of high-resolution remote sensing image based on multi-branch cascade void space pyramid
CN114187450B (en) Remote sensing image semantic segmentation method based on deep learning
CN110533631A (en) SAR image change detection based on the twin network of pyramid pondization
CN110555841B (en) SAR image change detection method based on self-attention image fusion and DEC
CN106295613A (en) A kind of unmanned plane target localization method and system
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN118096922A (en) Method for generating map based on style migration and remote sensing image
CN110599502B (en) Skin lesion segmentation method based on deep learning
CN117788957B (en) Deep learning-based qualification image classification method and system
CN114913379B (en) Remote sensing image small sample scene classification method based on multitasking dynamic contrast learning
CN116524419B (en) Video prediction method and system based on space-time decoupling and self-attention difference LSTM
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
CN117593666B (en) Geomagnetic station data prediction method and system for aurora image
CN115170403A (en) Font repairing method and system based on deep meta learning and generation countermeasure network
CN114764880B (en) Multi-component GAN reconstructed remote sensing image scene classification method
CN116993639A (en) Visible light and infrared image fusion method based on structural re-parameterization
CN115953330A (en) Texture optimization method, device, equipment and storage medium for virtual scene image
CN116797681A (en) Text-to-image generation method and system for progressive multi-granularity semantic information fusion
CN110866866A (en) Image color-matching processing method and device, electronic device and storage medium
CN114792349B (en) Remote sensing image conversion map migration method based on semi-supervised generation countermeasure network
Li et al. Semantic prior-driven fused contextual transformation network for image inpainting
CN116258923A (en) Image recognition model training method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination