CN116309022A

CN116309022A - Ancient architecture image self-adaptive style migration method based on visual encoder

Info

Publication number: CN116309022A
Application number: CN202310216506.5A
Authority: CN
Inventors: 王耀南; 曾凯; 王蔚; 毛建旭; 张辉; 钟杭; 吴昊天
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2023-03-08
Filing date: 2023-03-08
Publication date: 2023-06-23

Abstract

The invention discloses an ancient building image self-adaptive style migration method based on a visual encoder, which is based on a visual encoding-decoding style migration neural network, and comprises the steps of obtaining multi-stage relative position encoding characteristics of building images in a characteristic encoder stage, designing a self-adaptive color and structural characteristic migration fusion module, carrying out multi-stage multi-scale fusion on extracted building content characteristics and building style characteristics, and carrying out characteristic migration fusion by the self-adaptive color and structural characteristic migration fusion module by adopting a plurality of style characteristic migration methods; the building content features and building style features are then further migrated and fused at the feature decoder stage. The model predicts the historic building style migration test set, greatly improves the style migration estimation precision, solves the problems of building scene drawing refinement, morphological standardization, texture color standardization and the like, and realizes the reproduction of the building heritage space narrative.

Description

Ancient architecture image self-adaptive style migration method based on visual encoder

Technical Field

The invention relates to the technical field of image processing, in particular to an ancient architecture image self-adaptive style migration method based on a visual encoder.

Background

The ancient architecture is an important component of cultural heritage, and has important ancient value, scientific value and artistic value.

The digital depiction, the drawing narrative, the drawing recreating and the digital packaging of the ancient architecture are important means for transferring the ancient architecture culture. In the existing large-scale dense architecture heritage digital depiction and style narrative, the manual experience is still seriously relied on, the degree of automation is low, and the problems of messy creation scenes, irregular forms, non-uniform local textures and colors, time and labor waste and the like exist. These problems place high demands on the skill and experience of the practitioner.

Disclosure of Invention

The invention provides a visual encoder-based self-adaptive style migration method for ancient architecture images, which aims to solve the problems that in the prior art, the digital depiction of the ancient architecture depends on manual experience and the degree of automation is low.

In order to achieve the above purpose, the technical scheme of the invention specifically comprises:

a visual encoder-based self-adaptive style migration method for ancient architecture images is characterized in that: comprises the following steps:

s1, acquiring a plurality of ancient architecture images, style images, color reference images, structure reference images and style transfer learning real images, and constructing a data set by utilizing the five types of images; dividing the data set to obtain a training set and a testing set;

s2, constructing a visual coding-decoding style migration neural network; the visual coding-decoding style migration neural network comprises an encoder and a decoder;

s3, randomly extracting a group of samples in a training set, encoding the ancient architecture image and the style image in the samples by using an encoder, respectively extracting the characteristics of the ancient architecture image and the style image at different layers, outputting the architecture content characteristics and the architecture style characteristics, and fusing the architecture content characteristics and the architecture style characteristics to obtain fusion characteristics;

s4, inputting the fusion characteristics and the style image in the step S3 into a decoder for reconstruction in the process of extracting the coding characteristics to obtain a predicted style migrated picture;

s5, performing iterative training on the visual coding-decoding style migration neural network according to the picture subjected to style migration and the corresponding style migration learning real image in the data set to obtain a trained visual coding-decoding style migration neural network;

and S6, testing the trained visual coding-decoding style migration neural network by using the test set.

Preferably, the step S1 specifically includes the following steps:

step S11, collecting a plurality of ancient architecture images, style images, color reference images, structure reference images and style transfer learning real images, wherein the five types of images are equal in number and respectively correspond to each other, and are formed into data sets in a one-to-one correspondence mode, each group contains five types of different images, and the data sets can be expressed as follows:

wherein D represents a dataset, I _C Representing ancient architecture image, I _S Representing a style image, θ _C Representing a color reference image, θ _cs Representing structural reference images, I _G Representing a style transfer learning real image;

and step S12, randomly extracting images according to groups in the data set, and constructing a training set and a testing set according to the proportion of 8:2.

Preferably, the encoder in the step S2 is specifically a visual transducer encoder, and the visual transducer encoder includes two coding feature extraction sub-networks sharing weights, which are respectively an adaptive color adjustment network and an adaptive structure adjustment network;

the self-adaptive color adjustment network comprises i self-adaptive color adjustment modules which are sequentially connected and based on color reference images, and the i self-adaptive color adjustment modules sequentially perform color adjustment on the characteristics of different dimensions from low dimension to high dimension;

the self-adaptive structure adjustment network comprises i self-adaptive structure adjustment modules which are sequentially connected and based on the structure reference images, and the i self-adaptive structure adjustment modules sequentially perform structure adjustment on the characteristics of different dimensions from low dimension to high dimension.

Preferably, the decoder in the step S2 is specifically a visual transducer decoder, and the visual transducer decoder includes i feature decoding modules sequentially connected, where the i feature decoding modules sequentially decode features in different dimensions from high dimension to low dimension.

Preferably, the step S3 specifically includes the following steps:

step S31, randomly extracting a group of samples in a training set, and respectively inputting the extracted samples into a self-adaptive color adjustment network and a self-adaptive structure adjustment network;

s32, in the self-adaptive color adjustment network, i self-adaptive color adjustment modules are adopted, wherein i is more than or equal to 1 and less than or equal to 4, and the image characteristics of the ancient architecture are extracted

And (3) the image features of style>

Feature dimension is->

B is the batch of extracted features, < > and->

For extracting the channel number of the characteristic, W and H are the width and the height of the input image;

embedding codes at appointed positions of a visual coding-decoding style migration neural network to characterize ancient architecture images

And style image feature->

Input into the ith self-adaptive color adjustment module, and output the inquiry feature of the ancient architecture image by using embedded codes>

Queried feature of ancient architecture image->

And content features of the ancient building image->

Query feature of style image +.>

Queried features of style images->

And content features of the stylistic image features +.>

Fusion is carried out by utilizing a feature migration fusion AdaIN function, and the fusion process is defined as follows:

wherein μ (·) and σ (·) are the mean and variance functions, respectively;

and->

The query characteristics of the fused ancient building image and the queried characteristics of the fused ancient building image are respectively; />

And->

Query features of the fused style image and queried features of the fused style image are respectively;

next, the characteristic migration fusion MixAdaIN function is utilized to perform characteristic analysis on the content of the ancient building image

And content features of the stylistic image features +.>

Feature fusion is performed, and the fusion is defined as follows:

wherein, the liquid crystal display device comprises a liquid crystal display device,

and->

The content characteristics of the fused ancient building image and the content characteristics of the fused style image are respectively; gamma ray _C1 、γ _S1 And beta _C1 、β _S1 Parameters of the affine transformation of the self-adaptive color adjustment network are defined as follows:

wherein μ (·) and σ (·) are the mean and variance functions, respectively; lambda (lambda) ₁ Is constant and has a value range of [0,1 ]]；

Is a color reference image; />

Representing cross-attention feature extraction;

creating a first self-attention module, and outputting a reconstructed feature by the fused feature through the first self-attention module, wherein the definition is as follows:

wherein the function softmax (·) is the feature normalization function, d _k For the feature dimension, T represents the feature matrix transpose,

and->

The reconstruction characteristics of the ancient building image output by the ith self-adaptive color adjustment and the reconstruction characteristics of the style image output by the ith self-adaptive color adjustment are respectively [ B, C, W/2 ] ⁱ ,H/2 ⁱ ]；

Step S33, in the self-adaptive structure adjustment network, the characteristics of the ancient building image are extracted after i self-adaptive structure adjustment modules are adopted

And (3) the image features of style>

Feature dimension is->

Image features of ancient architecture

And (3) the image features of style>

Respectively inputting the two images into an ith self-adaptive structure adjustment module, and respectively outputting query characteristics of the ancient building images by utilizing embedded codes>

Query feature of style image->

Queried feature of ancient architecture image->

And style imageQueried feature->

Content features of ancient building images

And content characteristics of style image->

Fusion is carried out through feature migration fusion AdaIN functions, and the fusion process is defined as follows:

and->

The method comprises the steps of respectively inquiring characteristics of the fused ancient building image, inquiring characteristics of the fused style image, inquired characteristics of the fused ancient building image and inquired characteristics of the fused style image; sigma (·) and μ (·) are variance and mean functions of the feature, respectively;

next, the self-adaptive structural feature migration fusion MixAdaIN function is utilized to fuse the content features of the ancient building image

And content characteristics of style image->

Feature fusion is performed as defined below:

and->

The content characteristics of the fused ancient building image and the content characteristics of the fused style image are respectively; gamma ray _C2 、γ _S2 And beta _C2 、β _S2 Parameters of affine transformation are integrated for feature cross attention migration, and are defined as follows:

here the number of the elements is the number,

is a structural reference image; mu (·) and sigma (·) are mean and variance functions; lambda (lambda) ₂ Is constant and has a value range of [0,1 ]]；

Representing cross-attention feature extraction;

creating a second self-attention module, and outputting a reconstructed feature by the fused feature through the second self-attention module, wherein the definition is as follows:

and->

The reconstruction fusion characteristics of the ancient building image reconstructed and output by the ith self-adaptive structure adjusting module and the reconstruction fusion characteristics of the style image reconstructed and output by the ith self-adaptive structure adjusting module are respectively, and the dimensions of the reconstruction characteristics are respectively

Step S34, utilizing a pair of visual transducer encoders

And->

Fusion was performed and the pair +.>

And->

Fusion is carried out, and the fusion process is defined as follows:

wherein Conv (·) is the convolutional layer; function [. Cndot. ]]Representing feature stitching operations;

and->

The output historic building fusion characteristics and style fusion characteristics are respectively provided for the visual transducer encoder.

Preferably, the step S4 specifically includes the following steps:

s41, extracting the decoding characteristics of the ancient building image through the i-1 th characteristic decoding module

Decoding features of picture with style->

And input to the ith feature decoding module to output query features of the ancient building image by embedded coding>

Query feature of style image->

Queried feature of ancient architecture image->

Queried features of style images->

Content characteristics of ancient architecture image->

Content features of style images/>

Creating a third self-attention module by which the reconstructed feature is output, defined as follows:

and->

The reconstruction features are the reconstruction features of the decoding features of the ancient building image and the reconstruction features of the decoding features of the style image, and the dimension of the reconstruction features is +.>

Step S42: for a pair of

And->

And performing feature migration fusion, wherein the definition of the feature migration fusion process is as follows:

ancient architecture image output by ith characteristic decoding moduleIs a migration fusion feature of->

Representing migration fusion characteristics of the style image output by the ith characteristic decoding module;

step S43: using visual transducer decoder pairs

And processing and outputting the picture with the predicted style transferred, wherein the grid transfer estimation output module is defined as follows:

wherein Conv (·) is the convolutional layer;

for predicting the picture after style migration, the dimension is consistent with the input image.

Preferably, the step S5 specifically includes the following steps:

step S51, respectively calculating content characteristic loss functions L according to the generated pictures after the prediction style migration _C Style characteristic loss function L _S Semantic feature loss function L _I And style reconstruction loss function L _G ；

Step S52, respectively determining content characteristic loss functions L _C Style characteristic loss function L _S Semantic feature loss function L _I And style reconstruction loss function L _G Is a supervisory training weight lambda ₁ 、λ ₂ 、λ ₃ And lambda (lambda) ₄ ；

Step S53, establishing a total loss function L _ALL Repeating steps S3-S5, and iterating training to minimize the total loss function until training 50epoch or the loss value is less than 10 ^-3 。

Preferably, the content feature loss function L in the step S51 _C Style characteristic loss function L _S Semantic feature loss function L _I And style reconstruction loss function L _G The method comprises the following steps:

wherein the content characteristic loss function L _C The definition is as follows:

and->

The building content image features and style features extracted by the first layer encoder feature extraction module are obtained; l is more than or equal to 1 and less than or equal to 4; />

Reconstructing an image for style migration; the function II is L2 norm; function->

Normalizing the mean variance channel; n (N) _e Representing the number of layers of the feature encoder module, set to 4 in this example;

style characteristic loss function L _S The definition is as follows:

and->

The image features and style features of the building content extracted for the first layer encoder are represented by the mean and variance functions, N _d Representing the number of layers of the feature decoder module, set to 4 in this example;

semantic feature loss function L _I The definition is as follows:

style reconstruction loss function L _G The definition is as follows:

preferably, lambda in the step S52 ₁ 、λ ₂ 、λ ₃ And lambda (lambda) ₄ The following formula is satisfied:

λ ₁ + ₂ + ₃ + ₄ =1, and λ ₁ ＝0.1，λ ₂ ＝0.2，λ ₃ ＝0.2，λ ₄ ＝0.5。

Preferably, the total loss function in the step S53 is:

L _ALL ＝ ₁ L _C + ₂ L _S + ₃ L _I + ₄ L _G 。

the beneficial effects are that:

compared with the prior art, the invention has the advantages that the following aspects are mainly realized:

(1) The invention provides a visual coding-decoding style migration neural network, which can fully learn the relative position texture and structural feature information of a building image, greatly improve the feature representation capability of colors and structures during style migration, and solve the problems of irregular local structure form, non-uniform texture and color and the like during style migration of the building image.

(2) The visual coding-decoding style migration neural network adopts multi-stage extraction of shallow layer features and deep layer features of an image, gradual global context fusion of building images and style image features, and gradual coding reconstruction of the fused features.

(3) The invention designs a self-adaptive color adjustment module which is integrated into a visual coding-decoding style migration neural network and can fuse style characteristics and color reference characteristics in a self-adaptive global supervision migration mode.

(4) The invention designs a self-adaptive structure adjusting module which is integrated into a visual coding-decoding style migration neural network and can be used for fusing building structure characteristics and structure reference characteristics in a self-adaptive global supervision mode.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of a portion of a sample of an ancient architecture style migration simulation learning dataset according to an embodiment of the present invention, with FIG. (a) being an architecture image and FIG. (b) being a style image;

FIG. 3 is a block diagram of a visual coding-decoding style migrating neural network employed in the present invention;

fig. 4 is a block diagram of a visual coding-decoding style migration neural network, wherein fig. (a) to (c) correspond to an adaptive color adjustment network, an adaptive structure adjustment network, and a visual transducer decoder, respectively.

Detailed Description

The invention will be further described with reference to the drawings and examples.

Referring to fig. 1, a flow chart of the method of the present invention is shown, and the method for adaptive style migration of ancient architecture images based on a visual encoder comprises the following steps:

s5, performing iterative training on the visual coding-decoding style migration neural network according to the predicted picture transferred by the style and the corresponding style transfer learning real image in the data set to obtain a trained visual coding-decoding style migration neural network;

In this embodiment, the step S1 specifically includes the following steps:

specifically, the ancient building images are taken from different angles in the plurality of ancient building groups, and it is to be noted that the example uses the ancient buildings such as the tai-level street and the copper official kiln as the shooting scenes, but the ancient building groups used in the data set constructed in the example are not limited to this. Fig. 2 is a shooting scene diagram. And acquiring the building style image to be learned, the color reference image and the structure reference image, and inviting professionals to draw style transfer learning real images based on the images.

In this embodiment, the encoder in step S2 is specifically a visual transducer encoder, where the visual transducer encoder includes two coding feature extraction sub-networks with shared weights, which are an adaptive color adjustment network and an adaptive structure adjustment network respectively;

Specifically, the adaptive color adjustment network and the adaptive structure adjustment network are as shown in fig. 4 (a) and (b). The adaptive color adjustment module feature migration fusion is as follows: the output characteristic of the i-1 self-adaptive color adjustment module is characterized by the ancient architecture image

And (3) the image features of style>

Respectively inputting the two images into an ith self-adaptive color adjustment module, and respectively outputting query features of the ancient building images by utilizing a feature local position embedded coding layer>

Query feature of style image->

Queried feature of ancient architecture image->

And queried features of style images +.>

Content characteristics of the ancient architecture image +.>

And content characteristics of style image->

Performing attention migration fusion on the feature migration fusion AdaIN function and the MixAdaIN function to obtain the feature +.>

And (3) the image features of style>

Likewise, the adaptive structure adjustment module feature migration fusion is as follows: the output characteristic of the i-1 self-adaptive structure adjusting module is the ancient architecture image characteristic +.>

And style image features

Respectively inputting the two images into an ith self-adaptive structure adjusting module, and respectively outputting query features of the ancient building images by utilizing the feature local position embedded coding layer>

Query feature of style image->

Queried feature of ancient architecture image->

And queried features of style images +.>

Content characteristics of the ancient architecture image +.>

And content characteristics of style image->

Performing attention migration fusion on the AdaIN function and the MixAdaIN function through feature migration fusion to obtain the ancient timesBuilding image feature->

And (3) the image features of style>

In this embodiment, the decoder in step S2 is specifically a visual transducer decoder, where the visual transducer decoder includes i feature decoding modules sequentially connected to each other, and the i feature decoding modules sequentially decode features in different dimensions from high dimension to low dimension.

In this embodiment, the step S3 specifically includes the following steps:

selecting an ancient building image needing style migration in a training set, a style image serving as a style source, a color reference image and a structure reference image serving as inputs of a self-adaptive color adjustment network and a self-adaptive structure adjustment network, wherein the sizes of all input images are W.H, W is the width of the image, and H is the height of the image;

And (3) the image features of style>

Feature dimension is->

B is the batch of extracted features, < > and->

in visual encoding-decodingEmbedding codes in appointed positions of style migration neural network to characterize ancient architecture images

And style image feature->

Queried feature of ancient architecture image->

And content features of the ancient building image->

Query feature of style image +.>

Queried features of style images->

And content features of the stylistic image features +.>

wherein μ (·) and σ (·) are the mean and variance functions, respectively;

and->

And->

And content features of the stylistic image features +.>

Feature fusion is performed, and the fusion is defined as follows:

and->

Is a color reference image; />

Representing cross-attention feature extraction;

/>

and->

And (3) the image features of style>

Feature dimension is->

Image features of ancient architecture

And (3) the image features of style>

Query feature of style image->

Queried feature of ancient architecture image->

And queried features of style images +.>

Content features of ancient building images

And content characteristics of style image->

and->

And content characteristics of style image->

Feature fusion is performed as defined below:

/>

and->

here the number of the elements is the number,

Representing cross-attention feature extraction;

and->

Step S34, utilizing a pair of visual transducer encoders

And->

Fusion was performed and the pair +.>

And->

Fusion is carried out, and the fusion process is defined as follows:

and->

In this embodiment, the step S4 specifically includes the following steps:

Decoding features of picture with style->

Query feature of style image->

Queried feature of ancient architecture image->

Queried features of style images->

Content characteristics of ancient architecture image->

Content characteristics of style image->

and->

Step S42: for a pair of

And->

representing migration fusion characteristics of the ancient architecture image output by the ith characteristic decoding module, and (I)>

step S43: using visionPerceptual transform decoder pair

wherein Conv (·) is the convolutional layer;

for predicting the picture after style migration, the dimension is consistent with the input image. See fig. 4 (c) for a visual transducer decoder;

preferably, the step S5 specifically includes the following steps:

and->

style characteristic loss function L _S The definition is as follows:

and->

semantic feature loss function L _I The definition is as follows:

style reconstruction loss function L _G The definition is as follows:

λ ₁ +λ ₂ +λ ₃ +λ ₄ =1, and λ ₁ ＝0.1，λ ₂ ＝0.2，λ ₃ ＝0.2，λ ₄ ＝0.5。

Preferably, the total loss function in the step S53 is:

L _ALL ＝λ ₁ L _C +λ ₂ L _S +λ ₃ L _I +λ ₄ L _G 。

specifically, the network training employs a back propagation algorithm, and the training optimizer employs an Adam algorithm, wherein the optimizer parameter β ₁ ＝0.9，β ₂ =0.999. The initial learning rate of the network training is set to 0.001, and the learning rate is reduced to 1/2 of the original learning rate when the training is performed to 20, 30 and 40 epochs. The network model trains the software platform to be PyTorch.

Claims

1. The utility model provides a ancient building image self-adaptation style migration method based on visual encoder which is characterized in that the method comprises the following steps:

s4, inputting the fusion characteristics and the style images in the step S3 into a decoder for reconstruction in the process of extracting the characteristics of the ancient architecture images and the style images in different layers by the encoder, and obtaining a predicted style transferred picture;

2. The method for adaptive style migration of ancient architecture image according to claim 1, wherein the step S1 specifically comprises the steps of:

3. The method for adaptive style migration of ancient architecture image according to claim 2, wherein the encoder in step S2 is specifically a visual transducer encoder, and the visual transducer encoder includes two coding feature extraction sub-networks sharing weights, which are an adaptive color adjustment network and an adaptive structure adjustment network, respectively;

4. The method for adaptive style migration of ancient architecture image according to claim 3, wherein the visual transducer decoder in step S2 comprises i feature decoding modules connected in sequence, and the i feature decoding modules decode features of different dimensions from high dimension to low dimension in sequence.

5. The method for adaptive style migration of ancient architecture image according to claim 4, wherein the step S3 specifically comprises the steps of:

And (3) the image features of style>

Feature dimension is->

B is the batch of extracted features, < > and->

specifically, ancient architecture image features

And style image feature->

Inputting the query feature of the ancient building image output by the i self-adaptive color adjustment module and utilizing the feature local position embedded coding layer>

Queried feature of ancient architecture image->

And content features of the ancient building image->

Query feature of style image +.>

Queried features of style images->

And content features of the stylistic image features +.>

wherein μ (·) and σ (·) are the mean and variance functions, respectively;

and->

And->

And content features of the stylistic image features +.>

Feature fusion is carried outThe definition is as follows:

and->

Is a color reference image; />

Representing cross-attention feature extraction;

and

And (3) the image features of style>

Feature dimension is->

Image features of ancient architecture

And (3) the image features of style>

Query feature of style image->

Queried feature of ancient architecture image->

And queried features of style images +.>

Content characteristics of the ancient architecture image +.>

And content characteristics of style image->

and->

And content characteristics of style image->

Feature fusion is performed as defined below:

and->

here the number of the elements is the number,

is a structural reference image; mu (·) and sigma (·) are mean and variance functions; lambda (lambda) ₂ Is constant and has a value range of [0,1 ]]；/>

Representing cross-attention feature extraction;

and->

Step S34, utilizing a pair of visual transducer encoders

And->

Fusion was performed and the pair +.>

And->

Fusion is carried out, and the fusion process is defined as follows:

and->

6. The method for adaptive style migration of ancient architecture image according to claim 5, wherein the step S4 specifically comprises the steps of:

Decoding features of picture with style->

Query feature of style image->

Queried feature of ancient architecture image->

Queried features for style images

Content characteristics of ancient architecture image->

Content characteristics of style image->

and->

Step S42: for a pair of

And->

step S43: using visual transducer decoder pairs

And processing and outputting the predicted picture after style migration, wherein a style migration estimation output module is defined as follows:

wherein Conv (·) is the convolutional layer;

7. The method for adaptive style migration of ancient architecture image according to claim 6, wherein the step S5 specifically comprises the steps of:

step S51, respectively calculating content characteristic loss functions L according to the generated pictures after the prediction style migration _C Style characteristic loss function L _s Semantic featuresLoss function L _I And style reconstruction loss function L _G ；

8. The method for adaptive style migration of building images according to claim 7, wherein the content feature loss function L in step S51 _C Style characteristic loss function L _S Semantic feature loss function L _I And style reconstruction loss function L _G The method comprises the following steps:

and->

style characteristic loss function L _S The definition is as follows:

and->

semantic feature loss function L _I The definition is as follows:

style reconstruction loss function L _G The definition is as follows:

9. the method for adaptive style migration of ancient architecture image according to claim 8, wherein λ in step S52 is ₁ 、λ ₂ 、λ ₃ And lambda (lambda) ₄ The following formula is satisfied:

10. The method for adaptive style migration of ancient architecture images according to claim 8 or 9, wherein the total loss function in step S53 is: l (L) _ALL ＝ ₁ L _C + ₂ L _S + ₃ L _I + ₄ L _G 。