CN114119351A - Image processing method, image processing device, electronic equipment and storage medium


Info

Publication number
CN114119351A
CN114119351A (application CN202111316220.1A)
Authority
CN
China
Prior art keywords: image, features, feature, sample, target
Legal status
Pending
Application number
CN202111316220.1A
Other languages
Chinese (zh)
Inventor
舒叶芷
刘永进
李强
张国鑫
Current Assignee
Tsinghua University
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Tsinghua University
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Tsinghua University and Beijing Dajia Internet Information Technology Co Ltd

Classifications

    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]


Abstract

The disclosure provides an image processing method, an image processing device, electronic equipment and a storage medium, and belongs to the technical field of computers. The method comprises the following steps: performing feature extraction on a first image and a second image to obtain a plurality of first image features and a plurality of second image features; determining the similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities; and determining a first target image and a second target image based on the plurality of first image features, the plurality of second image features, and the plurality of feature similarities. With this scheme, the influence on style conversion caused by imbalanced information richness between the images can be avoided, so that the conversion result is more reasonable and the quality of the converted image is improved.

Description

Image processing method, image processing device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Image-to-image conversion can be applied to many computer vision tasks such as image segmentation, image modification, and image style transformation. For example, based on a generative adversarial network in deep learning, it is possible to convert the style of an image of a source domain (composed of one set of images having a similar style) into the style of a target domain (composed of another set of images having a similar style) while keeping the image content of the source domain unchanged, such as converting a real face in a photographic image into a cartoon face while keeping the content of the photographic image unchanged. Therefore, how to improve the quality of the image obtained by style conversion is an important research direction.
Currently, in the UGATIT (Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation) method, the network is made to focus on regions with a larger difference between the source domain and the target domain by means of a class activation map (CAM), and a style transformation task involving large shape changes and large spans is then completed by using an adaptive layer-instance normalization module.
In the above technical solution, the source domain and the target domain usually have different degrees of information richness, which may cause the conversion result to be unreasonable and the image quality to be poor.
Disclosure of Invention
The present disclosure provides an image processing method, an image processing apparatus, an electronic device, and a storage medium, which can avoid an influence on style conversion due to an imbalance of information abundance between images, so that a result obtained by conversion is more reasonable, and the quality of an image obtained by conversion is improved. The technical scheme of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an image processing method, the method including:
performing feature extraction on a first image and a second image to obtain a plurality of first image features and a plurality of second image features, wherein the plurality of first image features represent the features of the first image, the plurality of second image features represent the features of the second image, and the plurality of first image features and the plurality of second image features are in one-to-one correspondence;
determining the similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities;
determining a first target image and a second target image based on the plurality of first image features, the plurality of second image features and the plurality of feature similarities, the first target image having the same style as the second image, the second target image having the same style as the first image.
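For illustration only, the following Python sketch (not part of the disclosure) outlines these three steps end to end. The encoder and decoder arguments, the softmax normalization of the similarities, and the channel-wise placement of the reweighting before decoding are assumptions made to keep the sketch runnable, not elements fixed by the method as described.

```python
import torch
import torch.nn.functional as F

def translate_pair(first_image, second_image,
                   enc_first, enc_second, dec_to_second_style, dec_to_first_style):
    """Schematic outline of the three steps; the encoder/decoder arguments are
    hypothetical callables, not module names from the disclosure."""
    # Step 1: encode both images into feature maps [B, C, H, W] and pool each
    # channel to a single value, giving one feature per channel position ([B, C]).
    z_first, z_second = enc_first(first_image), enc_second(second_image)
    f_first = z_first.mean(dim=(2, 3))
    f_second = z_second.mean(dim=(2, 3))

    # Step 2: similarity of the two features at the same channel position.
    # With one value per channel, the per-channel dot product reduces to an
    # elementwise product; softmax normalisation is an assumed choice.
    sim = F.softmax(f_first * f_second, dim=1)          # [B, C]

    # Step 3: reweight each encoding channel-wise by the similarity (assumed
    # placement of the reweighting) and decode the two target images.
    w = sim.unsqueeze(-1).unsqueeze(-1)                 # [B, C, 1, 1]
    first_target = dec_to_second_style(z_first * w)     # same style as the second image
    second_target = dec_to_first_style(z_second * w)    # same style as the first image
    return first_target, second_target
```

The per-channel reweighting is what suppresses features that the two images do not share, while features common to both images are kept.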
In some embodiments, the determining the similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities includes:
determining the dot product of a first image feature and a second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature; or,
and determining the cross entropy of the first image feature and the second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature.
In some embodiments, the extracting features of the first image and the second image to obtain a plurality of first image features and a plurality of second image features includes:
coding the first image to obtain a first coded image;
performing feature extraction on the first coded image to obtain a plurality of first image features;
coding the second image to obtain a second coded image;
and performing feature extraction on the second coded image to obtain a plurality of second image features.
In some embodiments, said encoding said first image to obtain a first encoded image comprises:
and performing downsampling on the first image, and inputting the sampled image into a first residual network to obtain the first coded image.
In some embodiments, said extracting features of said first encoded image to obtain said plurality of first image features comprises:
based on an embedding function, carrying out feature extraction on the first coded image to obtain a plurality of first embedding features;
and respectively carrying out average pooling on the plurality of first embedded features to obtain the plurality of first image features.
In some embodiments, the method further comprises:
rearranging a channel relationship between the plurality of first image features based on the projection function.
In some embodiments, said encoding said second image to obtain a second encoded image comprises:
and performing downsampling on the second image, and inputting the sampled image into a second residual network to obtain the second coded image.
In some embodiments, said extracting features of said second encoded image to obtain said plurality of second image features comprises:
based on an embedding function, carrying out feature extraction on the second coded image to obtain a plurality of second embedding features;
and respectively carrying out average pooling on the plurality of second embedded features to obtain the plurality of second image features.
In some embodiments, the method further comprises:
rearranging a channel relationship between the plurality of second image features based on the projection function.
In some embodiments, the determining a first target image and a second target image based on the plurality of first image features, the plurality of second image features, and the plurality of feature similarities comprises:
determining a plurality of first target features based on the plurality of first image features and the plurality of feature similarities;
decoding the plurality of first target features to obtain a first target image;
determining a plurality of second target features based on the plurality of second image features and the plurality of feature similarities;
and decoding the plurality of second target features to obtain the second target image.
In some embodiments, the determining a plurality of first target features based on the plurality of first image features and the plurality of feature similarities comprises:
for any first image feature, determining the product of the any first image feature and the corresponding feature similarity as a first target feature corresponding to the any first image feature.
In some embodiments, the decoding the plurality of first target features to obtain the first target image includes:
and inputting the plurality of first target features into a third residual network, and performing up-sampling on a result output by the third residual network to obtain the first target image.
In some embodiments, the determining a plurality of second target features based on the plurality of second image features and the plurality of feature similarities comprises:
and for any second image feature, determining the product of the any second image feature and the corresponding feature similarity as a second target feature corresponding to the any second image feature.
In some embodiments, the decoding the plurality of second target features to obtain the second target image includes:
and inputting the plurality of second target features into a fourth residual network, and performing up-sampling on a result output by the fourth residual network to obtain the second target image.
According to a second aspect of embodiments of the present disclosure, there is provided a training method of an image processing model, the image processing model including a generator and a discriminator, the method including:
inputting a first sample image and a second sample image into an encoder and an alignment forgetting layer in the generator to obtain a plurality of sample image features, wherein the alignment forgetting layer is used for extracting the image features;
inputting the characteristics of the plurality of sample images into a decoder in the generator to obtain a first sample target image and a second sample target image, wherein the first sample target image and the second sample image have the same style, and the second sample target image and the first sample image have the same style;
inputting the first sample image, the second sample image, the first sample target image and the second sample target image into the discriminator to obtain a training loss;
and carrying out model training according to the training loss.
In some embodiments, the inputting the first sample image, the second sample image, the first sample target image, and the second sample target image into the discriminator to obtain a training loss includes:
inputting the first sample image, the second sample image, the first sample target image, and the second sample target image into the discriminator to obtain a first loss, a second loss, and a third loss, the first loss including a first adversarial loss and a second adversarial loss, the first adversarial loss representing a loss for converting the first sample image into the second sample target image, the second adversarial loss representing a loss for converting the second sample image into the first sample target image, the second loss including a first image loss and a second image loss, the first image loss representing a difference between the first sample image and the first sample target image, the second image loss representing a difference between the second sample image and the second sample target image, the third loss including a first consistency loss and a second consistency loss, the first consistency loss representing a difference between the first sample image and the second sample target image, and the second consistency loss representing a difference between the second sample image and the first sample target image;
and carrying out weighted summation on the first loss, the second loss and the third loss to obtain the training loss.
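As a hedged sketch of how the three losses described above could be combined by weighted summation, the following Python fragment uses a least-squares adversarial term and L1 distances; the specific loss forms, the weights w_adv, w_img and w_con, and the argument names are illustrative assumptions rather than values fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def generator_training_loss(d_fake_first, d_fake_second,
                            first_img, second_img, first_target, second_target,
                            w_adv=1.0, w_img=10.0, w_con=10.0):
    """d_fake_first / d_fake_second are discriminator outputs for the two
    conversion directions; all forms and weights here are assumptions."""
    # First loss: the two adversarial losses for the two conversion directions.
    first_loss = (F.mse_loss(d_fake_first, torch.ones_like(d_fake_first)) +
                  F.mse_loss(d_fake_second, torch.ones_like(d_fake_second)))

    # Second loss: image losses between each sample image and its target image.
    second_loss = (F.l1_loss(first_target, first_img) +
                   F.l1_loss(second_target, second_img))

    # Third loss: consistency losses between each sample image and the target
    # image generated from the other sample image.
    third_loss = (F.l1_loss(second_target, first_img) +
                  F.l1_loss(first_target, second_img))

    # Training loss: weighted summation of the three losses.
    return w_adv * first_loss + w_img * second_loss + w_con * third_loss
```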
According to a third aspect of the embodiments of the present disclosure, there is provided an image processing apparatus including:
the image processing device comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is configured to perform feature extraction on a first image and a second image to obtain a plurality of first image features and a plurality of second image features, the plurality of first image features represent features of the first image, the plurality of second image features represent features of the second image, and the plurality of first image features and the plurality of second image features are in one-to-one correspondence;
the similarity determining module is configured to determine similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities;
an image determination module configured to determine a first target image and a second target image based on the plurality of first image features, the plurality of second image features, and the plurality of feature similarities, the first target image having a same style as the second image, the second target image having a same style as the first image.
In some embodiments, the similarity determination module is configured to determine a dot product of a first image feature and a second image feature corresponding to a same channel position as a feature similarity of the first image feature and the second image feature; or determining the cross entropy of the first image feature and the second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature.
In some embodiments, the feature extraction module comprises:
a first encoding unit configured to encode the first image to obtain a first encoded image;
a first feature extraction unit configured to perform feature extraction on the first encoded image to obtain the plurality of first image features;
a second encoding unit configured to encode the second image to obtain a second encoded image;
a second feature extraction unit configured to perform feature extraction on the second encoded image to obtain the plurality of second image features.
In some embodiments, the first encoding unit is configured to down-sample the first image, and input the sampled image into a first residual network to obtain the first encoded image.
In some embodiments, the first feature extraction unit is configured to perform feature extraction on the first encoded image based on an embedding function, resulting in a plurality of first embedded features; and respectively carrying out average pooling on the plurality of first embedded features to obtain the plurality of first image features.
In some embodiments, the apparatus further comprises:
a first rearranging module configured to rearrange a channel relationship between the plurality of first image features based on a projection function.
In some embodiments, the second encoding unit is configured to down-sample the second image, and input the sampled image into a second residual network to obtain the second encoded image.
In some embodiments, the second feature extraction unit is configured to perform feature extraction on the second encoded image based on an embedding function, resulting in a plurality of second embedded features; and respectively carrying out average pooling on the plurality of second embedded features to obtain the plurality of second image features.
In some embodiments, the apparatus further comprises:
a second rearranging module configured to rearrange a channel relationship between the plurality of second image features based on the projection function.
In some embodiments, the image determination module comprises:
a first determination unit configured to determine a plurality of first target features based on the plurality of first image features and the plurality of feature similarities;
a first decoding unit configured to decode the plurality of first target features to obtain the first target image;
a second determination unit configured to determine a plurality of second target features based on the plurality of second image features and the plurality of feature similarities;
a second decoding unit configured to decode the plurality of second target features to obtain the second target image.
In some embodiments, the first determining unit is configured to determine, for any first image feature, a product of the any first image feature and a corresponding feature similarity as a first target feature corresponding to the any first image feature.
In some embodiments, the first decoding unit is configured to input the plurality of first target features into a third residual network, and perform upsampling on a result output by the third residual network to obtain the first target image.
In some embodiments, the second determining unit is configured to determine, for any second image feature, a product of the any second image feature and the corresponding feature similarity as a second target feature corresponding to the any second image feature.
In some embodiments, the second decoding unit is configured to input the second target feature into a fourth residual network, and perform upsampling on a result output by the fourth residual network to obtain the second target image.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an apparatus for training an image processing model, the image processing model including a generator and a discriminator, the apparatus including:
the first training module is configured to perform feature extraction on a first sample image and a second sample image based on an encoder and an alignment forgetting layer in the generator to obtain a plurality of sample image features, wherein the alignment forgetting layer is used for extracting image features;
a second training module configured to decode the sample image features and the feature similarities based on a decoder in the generator to obtain a first sample target image and a second sample target image, wherein the first sample target image and the second sample image have the same style, and the second sample target image and the first sample image have the same style;
a third training module configured to determine a training loss based on the discriminator, the first sample image, the second sample image, the first sample target image, and the second sample target image;
the third training module is further configured to perform model training based on the training loss.
In some embodiments, the third training module is configured to determine a first loss based on the discriminator, the first sample image, the second sample image, the first sample target image, and the second sample target image, the first loss including a first adversarial loss and a second adversarial loss, the first adversarial loss representing a loss of converting the first sample image into the second sample target image, the second adversarial loss representing a loss of converting the second sample image into the first sample target image; determine a second loss based on the first sample image, the second sample image, the first sample target image, and the second sample target image, the second loss including a first image loss and a second image loss, the first image loss representing a difference between the first sample image and the first sample target image, the second image loss representing a difference between the second sample target image and the second sample image; and determine a third loss based on the first sample image, the second sample image, the first sample target image, and the second sample target image, the third loss including a first consistency loss and a second consistency loss, the first consistency loss representing a difference between the first sample image and the second sample target image, and the second consistency loss representing a difference between the second sample image and the first sample target image.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the image processing method or to implement the training method of the image processing model.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon program code which, when executed by a processor of an electronic device, causes the electronic device to perform the above-described image processing method or the training method of the above-described image processing model.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-mentioned image processing method, or which, when executed by the processor, implements the above-mentioned training method of an image processing model.
The technical scheme provided by the embodiment of the disclosure at least has the following beneficial effects:
the embodiment of the disclosure provides an image processing method, which can reduce redundant and inconsistent feature influences in two images and retain common features in the two images by respectively determining a plurality of feature similarities between a plurality of first image features of a first image and a plurality of second image features of a second image based on the plurality of feature similarities, thereby avoiding the influence on style conversion caused by unbalanced information abundance between the images, ensuring that the conversion result is more reasonable, and improving the quality of the converted image.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation environment of an image processing method according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating an image processing method according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating another method of image processing according to an exemplary embodiment.
FIG. 4 is a flow diagram illustrating another method of training an image processing model in accordance with an exemplary embodiment.
Fig. 5 is a schematic diagram illustrating a structure of a generator according to an example embodiment.
Fig. 6 is a schematic structural diagram illustrating an alignment forgetting layer according to an exemplary embodiment.
FIG. 7 is a diagram illustrating a conversion result according to an exemplary embodiment.
Fig. 8 is a diagram illustrating style conversion results for a map and a satellite map according to an exemplary embodiment.
Fig. 9 is a diagram illustrating style conversion results for a street view and a segmented street view according to an exemplary embodiment.
FIG. 10 is a diagram illustrating style conversion results for a building and segmented buildings according to an exemplary embodiment.
FIG. 11 is a graph comparing results of one variation experiment shown in accordance with an exemplary embodiment.
Fig. 12 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment.
Fig. 13 is a block diagram illustrating another image processing apparatus according to an exemplary embodiment.
FIG. 14 is a block diagram illustrating an apparatus for training an image processing model according to an exemplary embodiment.
FIG. 15 is a block diagram illustrating a server in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims. The plurality in the embodiments of the present disclosure means two or more.
As used in this disclosure, the terms "at least one," "a plurality," "each," and "any" are understood as follows: "at least one" includes one, two, or more; "a plurality" includes two or more; "each" refers to each of the corresponding plurality; and "any" refers to any one of the plurality. For example, if the plurality of first image features includes 3 first image features, "each" refers to each of the 3 first image features, and "any" refers to any one of the 3 first image features, which can be the first one, the second one, or the third one.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) referred to in the present disclosure is information authorized by the user or sufficiently authorized by each party.
The following are explanations of terms relating to the embodiments of the present disclosure.
ReLU (Rectified Linear Unit), also called a rectified linear unit, is an activation function commonly used in artificial neural networks, and generally refers to the nonlinear function represented by the ramp function and its variants.
Leaky ReLU (leaky rectified linear unit) is a variant of the linear rectification function based on the ramp function: for negative inputs, its gradient is a small constant instead of 0. When the input value is positive, the leaky linear rectification function is consistent with the ordinary ramp function.
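A minimal numerical illustration of the two activation functions just defined (plain PyTorch, shown only for clarity; the 0.01 slope is one common choice, not a value from the disclosure):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

# ReLU: ramp function, zero output (and zero gradient) for negative inputs.
relu = torch.clamp(x, min=0.0)            # tensor([0.0, 0.0, 0.0, 1.5])

# Leaky ReLU: identical to ReLU for positive inputs, but the negative side
# has a small constant slope (0.01 here) instead of being flat.
leaky = torch.where(x > 0, x, 0.01 * x)   # tensor([-0.0200, -0.0050, 0.0000, 1.5000])
```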
The image processing method provided by the embodiments of the present disclosure is executed by an electronic device. In some embodiments, the electronic device is a terminal, for example, a mobile phone, a tablet computer, a computer, or another type of terminal. In some embodiments, the electronic device is a server, for example, a single server, a server cluster composed of several servers, or a cloud computing service center.
Fig. 1 is a schematic diagram illustrating an implementation environment of an image processing method according to an exemplary embodiment. Taking the electronic device as an example provided as a server, referring to fig. 1, the implementation environment specifically includes: a terminal 101 and a server 102. The terminal 101 and the server 102 are capable of interacting via a network connection.
The terminal 101 may be at least one of a smartphone, a desktop computer, a laptop computer, and the like. An application may be installed and run on the terminal 101, and a user may log in to the application through the terminal 101 to obtain a service provided by the application. The terminal 101 may be connected to the server 102 through a wireless network or a wired network, and may further transmit the first image and the second image to be processed to the server 102.
In some embodiments, the terminal 101 may refer to one of a plurality of terminals, and the embodiment is only illustrated by the terminal 101. Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminals may be only a few, or the number of the terminals may be several tens or hundreds, or more, and the number of the terminals and the type of the device are not limited in the embodiments of the present disclosure.
The server 102 may be at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The server 102 may be connected to the terminal 101 and other terminals through a wireless network or a wired network, and the server 102 may receive the first image and the second image transmitted by the terminal 101. In some embodiments, the number of the servers may be more or less, and the embodiments of the present disclosure do not limit this. Of course, the server 102 may also include other functional servers to provide more comprehensive and diverse services.
Fig. 2 is a flowchart illustrating an image processing method according to an exemplary embodiment. As illustrated in fig. 2, the method is performed by a server and includes the following steps:
in step S201, feature extraction is performed on a first image and a second image to obtain a plurality of first image features and a plurality of second image features, where the plurality of first image features represent features of the first image, the plurality of second image features represent features of the second image, and the plurality of first image features and the plurality of second image features are in one-to-one correspondence.
The first image is an image to be converted into a target style, and the second image is an image with the target style; or the second image is an image to be converted into a target style, and the first image is an image with the target style.
In step S202, the similarity between the first image feature and the second image feature corresponding to the same channel position is determined, and a plurality of feature similarities are obtained.
The method comprises the steps of determining the feature similarity between a first image feature and a second image feature corresponding to the same channel position, wherein the feature similarity can be used for judging whether the corresponding image features are common features.
In step S203, a first target image and a second target image are determined based on a plurality of first image features, a plurality of second image features and a plurality of feature similarities, the first target image and the second image having the same style, and the second target image and the first image having the same style.
The first target image is obtained by converting the first image, that is, the first image is converted to have the same style as the second image, so that the first target image is obtained. The second target image is obtained by converting the second image, that is, converting the second image to have the same style as the first image, and then obtaining the second target image.
According to the scheme provided by the embodiment of the disclosure, the multiple feature similarities between the multiple first image features of the first image and the multiple second image features of the second image are respectively determined, so that the influence of redundant and inconsistent features in the two images can be reduced based on the multiple feature similarities, and common features in the two images are kept, thereby avoiding the influence on style conversion due to the unbalanced information abundance between the images, ensuring that the conversion result is reasonable, and improving the quality of the converted image.
In some embodiments, the determining the similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities includes:
determining the dot product of the first image feature and the second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature; or,
and determining the cross entropy of the first image feature and the second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature.
In the embodiment of the disclosure, by calculating the dot product or cross entropy of the first image feature and the second image feature, the feature similarity between two image features from different images but corresponding to the same channel position can be determined, and further, whether the image features at the channel position are consistent or not can be determined according to the feature similarity.
In some embodiments, the extracting features of the first image and the second image to obtain a plurality of first image features and a plurality of second image features comprises:
coding the first image to obtain a first coded image;
performing feature extraction on the first coded image to obtain a plurality of first image features;
coding the second image to obtain a second coded image;
and performing feature extraction on the second coded image to obtain a plurality of second image features.
In the embodiment of the present disclosure, by encoding the first image and the second image respectively, feature extraction can be performed on the encoded images to obtain image features corresponding to each channel position, that is, features of a channel layer are extracted.
In some embodiments, the encoding the first image to obtain a first encoded image comprises:
and downsampling the first image, and inputting the sampled image into a first residual network to obtain the first coded image.
In the embodiment of the present disclosure, by performing downsampling on the first image and then performing residual processing, useful information can be extracted from the image to obtain a coded image, thereby facilitating subsequent feature extraction from the coded image.
In some embodiments, the extracting the features of the first encoded image to obtain the plurality of first image features comprises:
based on an embedding function, carrying out feature extraction on the first coded image to obtain a plurality of first embedding features;
and averaging and pooling the plurality of first embedded features respectively to obtain the plurality of first image features.
In the embodiment of the disclosure, by performing feature extraction and average pooling on the encoded image based on the embedding function, global information of each channel position in the encoded image can be extracted.
In some embodiments, the method further comprises:
the channel relationships between the plurality of first image features are rearranged based on the projection function.
In the embodiment of the present disclosure, since the relationship between the image features may have non-alignment and non-linear characteristics, by providing the projection function, the channel relationship between the image features can be rearranged, so that the image features are aligned.
In some embodiments, the encoding the second image to obtain a second encoded image includes:
and downsampling the second image, and inputting the sampled image into a second residual network to obtain the second coded image.
In the embodiment of the present disclosure, by performing downsampling on the second image and then performing residual processing, useful information can be extracted from the image to obtain a coded image, thereby facilitating subsequent feature extraction from the coded image.
In some embodiments, the performing feature extraction on the channels of the second encoded image to obtain the second image features includes:
based on the embedding function, carrying out feature extraction on the second coded image to obtain a plurality of second embedding features;
and averaging and pooling the plurality of second embedded features respectively to obtain a plurality of second image features.
In the embodiment of the disclosure, by performing feature extraction and average pooling on the encoded image based on the embedding function, global information of each channel position in the encoded image can be extracted.
In some embodiments, the method further comprises:
rearranging the channel relationship between the plurality of second image features based on the projection function.
In the embodiment of the present disclosure, since the relationship between the image features may have non-alignment and non-linear characteristics, by providing the projection function, the channel relationship between the image features can be rearranged, so that the image features are aligned.
In some embodiments, determining the first target image and the second target image based on the plurality of first image features, the plurality of second image features, and the plurality of feature similarities comprises:
determining a plurality of first target features based on the plurality of first image features and the plurality of feature similarities;
decoding the plurality of first target features to obtain a first target image;
determining a plurality of second target features based on the plurality of second image features and the plurality of feature similarities;
and decoding the plurality of second target characteristics to obtain the second target image.
In the embodiment of the disclosure, the target feature is determined based on the image feature and the feature similarity, so that the target image after the style conversion can be obtained by decoding the target feature, and the quality of the converted target image can be improved due to the reduction of the influence of the image feature with low similarity.
In some embodiments, the determining a plurality of first target features based on the plurality of first image features and the plurality of feature similarities comprises:
and for any first image feature, determining the product of the any first image feature and the corresponding feature similarity as a first target feature corresponding to the any first image feature.
In the embodiment of the present disclosure, the product of the image feature and the corresponding feature similarity is used as the corresponding target feature, so that the target image with the corresponding style can be obtained by decoding based on the target feature.
In some embodiments, the decoding the plurality of first target features to obtain the first target image includes:
and inputting the plurality of first target features into a third residual network, and performing up-sampling on a result output by the third residual network to obtain the first target image.
In the embodiment of the present disclosure, the decoding of the target feature is implemented by performing residual error processing on the first target feature and then performing upsampling, so as to obtain the first target image.
In some embodiments, the determining a plurality of second target features based on the plurality of second image features and the plurality of feature similarities comprises:
and for any second image feature, determining the product of the any second image feature and the corresponding feature similarity as a second target feature corresponding to the any second image feature.
In the embodiment of the present disclosure, the product of the image feature and the corresponding feature similarity is used as the corresponding target feature, so that the target image with the corresponding style can be obtained by decoding based on the target feature.
In some embodiments, the decoding the plurality of second target features to obtain the second target image includes:
and inputting the plurality of second target features into a fourth residual network, and performing up-sampling on a result output by the fourth residual network to obtain the second target image.
In the embodiment of the present disclosure, the second target feature is subjected to residual error processing and then upsampled, so that the target feature is decoded, and thus the second target image is obtained.
Fig. 2 is a basic flow of an image processing method according to an embodiment of the present disclosure, which is further described below based on a specific implementation manner, and fig. 3 is a flow chart of another image processing method according to an exemplary embodiment. Taking the electronic device as a server, the image processing method is executed by the server, and referring to fig. 3, the method includes:
in step S301, a first image is encoded to obtain a first encoded image.
The first image is an image to be processed, that is, an image to be subjected to style conversion, and for convenience of description, a style that the first image has is represented as a first style. The server is capable of encoding the first image based on the first encoder, resulting in a first encoded image.
In some embodiments, the first encoder includes a first downsampling block and a first residual network, the first residual network including at least one residual block. The server is capable of downsampling the first image based on the first downsampling block and inputting the sampled image into the first residual network, and the sampled image is processed by the first residual network to obtain the first encoded image. The first downsampling block includes at least one convolution layer for extracting useful information, i.e., extracting features, from the first image. Note that each convolution layer and the residual network are normalized using instance normalization, and the ReLU function is used as the activation function.
For example, the first encoder includes three convolutional layers connected in series and three residual blocks. The convolution kernel of the first convolutional layer is 7 × 7 with a step size of 1; the convolution kernel of the second convolutional layer is 3 × 3 with a step size of 2; the convolution kernel of the third convolutional layer is 3 × 3 with a step size of 2. Here, connected in series means that the output of the first convolutional layer is the input of the second convolutional layer, the output of the second convolutional layer is the input of the third convolutional layer, the output of the third convolutional layer is the input of the first residual block, the output of the first residual block is the input of the second residual block, and the output of the second residual block is the input of the third residual block.
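A minimal PyTorch sketch of an encoder with this layout is given below. The kernel sizes and strides follow the example above; the channel widths, padding choices, and the internal design of the residual block are assumptions, since the disclosure does not specify them.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Assumed residual-block design: two 3x3 convs with instance norm and ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class FirstEncoder(nn.Module):
    """Downsampling block (7x7 s1, 3x3 s2, 3x3 s2) followed by three residual blocks."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_channels, base, 7, stride=1, padding=3),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4), nn.ReLU(inplace=True),
        )
        self.res = nn.Sequential(*[ResidualBlock(base * 4) for _ in range(3)])

    def forward(self, x):
        return self.res(self.down(x))   # first encoded image, [B, 256, H/4, W/4]
```

The second encoder described below has the same structure as this one.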
In step S302, feature extraction is performed on the first encoded image to obtain a plurality of first image features, where the plurality of first image features represent features of the first image.
The coded first coded image comprises a plurality of channels, and the server can respectively extract the characteristics of each channel to obtain the first image characteristics corresponding to each channel.
In some embodiments, the server can map the information in the encoded image into a semantically consistent potential space based on an embedding function built from 2 residual blocks sharing weights. The server can extract features of the multiple channels of the first encoded image based on the embedding function to obtain multiple first embedded features, then perform average pooling on the multiple first embedded features in the channel direction, and extract global information of the first embedded features at the channel level to obtain the multiple first image features.
The embedding function is represented as q(·), and the first embedded feature is represented as z_x = q(E_x(x)), z_x ∈ R^(c×w×h), where x denotes the first image, E_x(x) denotes the first encoded image, c denotes the number of channels of the encoded image (c is a positive integer), and w and h denote the width and length of the first encoded image. The first image feature is denoted as e_x and is calculated according to formula (1):
e_x^c = avg_pooling(z_x^c) = (1/(h × w)) Σ_{i=1..h} Σ_{j=1..w} z_x^c(i, j)    (1);
wherein e_x^c represents the first image feature of the c-th channel, avg_pooling(·) represents average pooling over the h × w spatial positions of the channel, z_x^c represents the first embedded feature of the c-th channel, h represents the length of the first encoded image, and w represents the width of the first encoded image.
In some embodiments, the relationship between the plurality of first image features may be non-aligned and non-linear, and the server is capable of rearranging the channel relationship of the plurality of first image features based on the projection function. Accordingly, the server rearranges the channel relationship between the plurality of first image features based on the projection function.
The projection function is represented as g(·) and is built from a convolution layer with a 1 × 1 convolution kernel or from a multilayer perceptron. The way the first image features are rearranged is shown in equation (2):
f_x = g(e_x), f_x ∈ R^c    (2);
wherein f_x represents the aligned first image feature, g(·) represents the projection function, e_x represents the first image feature, and c represents the number of channels.
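The following sketch illustrates equations (1) and (2) together: a shared embedding q(·), per-channel average pooling, and a projection g(·) realized as a 1 × 1 convolution. The two weight-shared residual blocks of the embedding are simplified here to two plain convolution layers, and all channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelEmbedding(nn.Module):
    """Sketch of the shared embedding q(.), the average pooling of eq. (1),
    and the projection g(.) of eq. (2). The embedding is simplified to two
    3x3 convolutions; the projection is a 1x1 convolution (an MLP would also fit)."""
    def __init__(self, channels=256):
        super().__init__()
        self.embed = nn.Sequential(                       # q(.), shared by both images
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.project = nn.Conv2d(channels, channels, 1)   # g(.), rearranges the channel relationship

    def forward(self, encoded):                           # encoded image: [B, C, h, w]
        z = self.embed(encoded)                           # embedded features z
        e = z.mean(dim=(2, 3), keepdim=True)              # eq. (1): per-channel average pooling
        f = self.project(e).flatten(1)                    # eq. (2): f = g(e), f in R^C
        return f

# Shared usage for both encoded images (shapes are illustrative):
emb = ChannelEmbedding(256)
f_x = emb(torch.randn(1, 256, 64, 64))                    # aligned first image features
f_y = emb(torch.randn(1, 256, 64, 64))                    # aligned second image features
```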
In step S303, the second image is encoded to obtain a second encoded image.
The second image is a comparison image, that is, a comparison image with a converted style, and for convenience of description, the style of the second image is represented as a second style. The server is capable of encoding the second image based on the second encoder, resulting in a second encoded image. The second encoder has the same structure as the first encoder.
In some embodiments, the second encoder includes a second downsampling block and a second residual network, the second residual network including at least one residual block. The server is capable of downsampling the second image based on the second downsampling block and then inputting the sampled image into the second residual network, and the second residual network processes the sampled image to obtain the second encoded image. The second downsampling block includes at least one convolution layer for extracting useful information, i.e., features, from the second image. Note that each convolution layer and the residual network are normalized using instance normalization, and the ReLU function is used as the activation function.
For example, the second encoder includes three convolutional layers connected in series and three residual blocks. The convolution kernel of the first convolutional layer is 7 × 7 with a step size of 1; the convolution kernel of the second convolutional layer is 3 × 3 with a step size of 2; the convolution kernel of the third convolutional layer is 3 × 3 with a step size of 2. Here, connected in series means that the output of the first convolutional layer is the input of the second convolutional layer, the output of the second convolutional layer is the input of the third convolutional layer, the output of the third convolutional layer is the input of the first residual block, the output of the first residual block is the input of the second residual block, and the output of the second residual block is the input of the third residual block.
It should be noted that, in the embodiments of the present disclosure, the first image is exemplarily taken as an image to be processed, and the second image is taken as a comparison image for description.
In step S304, feature extraction is performed on the second encoded image to obtain a plurality of second image features, which represent features of the second image.
The coded second coded image comprises a plurality of channels, and the server can respectively extract the features of each channel to obtain the second image features corresponding to each channel.
In some embodiments, the server can map the information in the encoded image into a semantically consistent potential space based on an embedding function built from 2 residual blocks sharing weights. The server can extract features of the multiple channels of the second encoded image based on the embedding function to obtain multiple second embedded features, then perform average pooling on the multiple second embedded features in the channel direction, and extract global information of the second embedded features at the channel level to obtain the multiple second image features.
The embedding function is represented as q(·), and the second embedded feature is represented as z_y = q(E_y(y)), z_y ∈ R^(c×w×h), where y denotes the second image, E_y(y) denotes the second encoded image, c denotes the number of channels of the encoded image, and w and h denote the width and length of the second encoded image. The second image feature is denoted as e_y and is calculated according to formula (3):
e_y^c = avg_pooling(z_y^c) = (1/(h × w)) Σ_{i=1..h} Σ_{j=1..w} z_y^c(i, j)    (3);
wherein e_y^c represents the second image feature of the c-th channel, avg_pooling(·) represents average pooling, z_y^c represents the second embedded feature of the c-th channel, h represents the length of the second encoded image, and w represents the width of the second encoded image.
In some embodiments, the relationship between the plurality of second image features may be non-aligned and non-linear, and the server is capable of rearranging the channel relationship of the plurality of second image features based on the projection function. Accordingly, the server rearranges the channel relationship between the plurality of second image features based on the projection function.
The projection function is represented as g(·) and is built from a convolution layer with a 1 × 1 convolution kernel or from a multilayer perceptron. The way the second image features are rearranged is shown in equation (4):
f_y = g(e_y), f_y ∈ R^c    (4);
wherein f_y represents the aligned second image feature, g(·) represents the projection function, e_y represents the second image feature, and c represents the number of channels.
It should be noted that the server can process the first encoded image and the second encoded image based on the same embedding function and the same projection function, that is, the first encoded image and the second encoded image share one embedding function and one projection function.
It should be noted that the server can perform encoding and feature extraction on the first image and the second image simultaneously, that is, step S301 and step S303 are not required to be performed in sequence, and the step numbering is merely for convenience of description.
In step S305, the similarity between the first image feature and the second image feature corresponding to the same channel position is determined, and a plurality of feature similarities are obtained.
The server can establish a relation between the plurality of first image features and the plurality of second image features based on a mutual learning mode, and the first image features and the second image features corresponding to the same channel position are related.
In some embodiments, the server is capable of determining a feature similarity between the first image feature and the second image feature corresponding to the same channel location. The server can determine the dot product of the first image feature and the second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature; alternatively, the server can determine the cross entropy of the first image feature and the second image feature corresponding to the same channel as the feature similarity of the first image feature and the second image feature.
The calculation method of the feature similarity is shown in formula (5):
$$s = \mathrm{sim}(f_x, f_y) \qquad (5);$$

where s denotes the feature similarity, sim(·) denotes the similarity function, $f_x$ denotes the aligned first image features, and $f_y$ denotes the aligned second image features.
It should be noted that, according to the above formula (5), $s^c$ denotes the feature similarity between the first image feature $f_x^c$ and the second image feature $f_y^c$ corresponding to channel c. The higher the value of $s^c$, the higher the similarity between the first image feature $f_x^c$ and the second image feature $f_y^c$, that is, the feature on that channel is a feature shared by the first image and the second image; the lower the value of $s^c$, the lower the similarity between $f_x^c$ and $f_y^c$, that is, the feature of the first image and the feature of the second image are not consistent on that channel.

The server may further normalize the feature similarities to obtain the normalized similarities $\hat{s} = \{\hat{s}^1, \hat{s}^2, \ldots, \hat{s}^c\}$, where c is a positive integer.
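The similarity of formula (5) and the subsequent normalization can be sketched as below. Only the dot-product option is implemented; the cross-entropy option mentioned above could be substituted, and the use of a softmax for the normalization step is an assumption, since the description states only that the similarities are normalized.

```python
import torch
import torch.nn.functional as F

def channel_similarity(f_x: torch.Tensor, f_y: torch.Tensor) -> torch.Tensor:
    """Per-channel similarity s (formula (5)) followed by normalization.

    f_x, f_y: aligned features of shape (batch, c), one scalar per channel.
    """
    s = f_x * f_y                  # per-channel dot product
    s_hat = F.softmax(s, dim=1)    # normalized similarities (assumed softmax)
    return s_hat
```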
In step S306, a plurality of first target features are determined based on the plurality of first image features and the plurality of feature similarities.
For any first image feature, the server can determine the product of that first image feature and the corresponding feature similarity as the first target feature corresponding to that first image feature, so as to obtain a plurality of first target features. The first target feature of the c-th channel can accordingly be written as $\hat{s}^c \cdot f_x^c$.
In step S307, the plurality of first target features are decoded to obtain a first target image, and the first target image and the second target image have the same style.
After obtaining the plurality of first target features, the server can decode the plurality of first target features based on a first decoder to obtain a first target image.
In some embodiments, the first decoder comprises a first upsampling block and a third residual network, the third residual network comprises at least one residual block, and the server can input the plurality of first target features into the third residual network and upsample the result output by the third residual network to obtain the first target image, where the first target image has the second style, that is, the same style as the second image. The first upsampling block comprises at least one convolutional layer. Note that each convolutional layer and each residual network is normalized using instance normalization, and the ReLU function is used as the activation function.
For example, the first decoder includes three residual blocks and three convolutional layers connected in series. The convolution kernel of the first convolutional layer is 3 × 3 with a step size of 0.5; the convolution kernel of the second convolutional layer is 3 × 3 with a step size of 0.5; the convolution kernel of the third convolutional layer is 7 × 7 with a step size of 1. Here, connected in series means that the output of the first residual block is the input of the second residual block, the output of the second residual block is the input of the third residual block, the output of the third residual block is the input of the first convolutional layer, the output of the first convolutional layer is the input of the second convolutional layer, and the output of the second convolutional layer is the input of the third convolutional layer.
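A decoder of the kind exemplified above might be sketched as follows. Reading the 0.5 step size as a fractional stride, i.e. 2× upsampling via transposed convolutions, is an interpretation, and the channel widths and the final tanh are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def residual_block(channels: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.InstanceNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.InstanceNorm2d(channels),
    )

class Decoder(nn.Module):
    """Three residual blocks, two 2x upsampling layers and a final 7x7 convolution."""
    def __init__(self, channels: int = 256, out_channels: int = 3):
        super().__init__()
        self.res_blocks = nn.ModuleList([residual_block(channels) for _ in range(3)])
        self.up = nn.Sequential(
            nn.ConvTranspose2d(channels, channels // 2, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(channels // 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels // 2, channels // 4, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(channels // 4),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, out_channels, kernel_size=7, padding=3),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        h = features
        for block in self.res_blocks:
            h = torch.relu(h + block(h))   # residual connections
        return torch.tanh(self.up(h))      # tanh output range is an assumption
```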
In step S308, a plurality of second target features are determined based on the plurality of second image features and the plurality of feature similarities.
For any second image feature, the server can determine the product of that second image feature and the corresponding feature similarity as the second target feature corresponding to that second image feature, so as to obtain a plurality of second target features. The second target feature of the c-th channel can accordingly be written as $\hat{s}^c \cdot f_y^c$.
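Steps S306 and S308 both multiply the image features by the corresponding feature similarities. In the sketch below the per-channel weight is broadcast over the spatial dimensions of the embedded feature map before decoding; this broadcasting, and the function name, are assumptions of the example.

```python
import torch

def weight_features(z: torch.Tensor, s_hat: torch.Tensor) -> torch.Tensor:
    """Multiply each channel of the feature map z (batch, c, h, w) by its
    normalized similarity s_hat (batch, c) to obtain the target features."""
    return z * s_hat[:, :, None, None]

# Usage (illustrative): target_x = weight_features(z_x, s_hat); target_y = weight_features(z_y, s_hat)
```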
In step S309, the plurality of second target features are decoded to obtain a second target image, and the second target image has the same style as the first target image.
After obtaining the plurality of second target features, the server may decode the plurality of second target features based on a second decoder to obtain a second target image.
In some embodiments, the second decoder includes a second upsampling block and a fourth residual network, the fourth residual network includes at least one residual block, and the server can input the plurality of second target features into the fourth residual network and upsample the result output by the fourth residual network to obtain the second target image, where the second target image has the first style, that is, the same style as the first image. The second upsampling block comprises at least one convolutional layer. Note that each convolutional layer and each residual network is normalized using instance normalization, and the ReLU function is used as the activation function.
For example, the second decoder includes three residual blocks and three convolutional layers connected in series. The convolution kernel of the first convolutional layer is 3 × 3 with a step size of 0.5; the convolution kernel of the second convolutional layer is 3 × 3 with a step size of 0.5; the convolution kernel of the third convolutional layer is 7 × 7 with a step size of 1. Here, connected in series means that the output of the first residual block is the input of the second residual block, the output of the second residual block is the input of the third residual block, the output of the third residual block is the input of the first convolutional layer, the output of the first convolutional layer is the input of the second convolutional layer, and the output of the second convolutional layer is the input of the third convolutional layer.
It should be noted that the server can perform decoding to determine the first target image and the second target image at the same time, that is, the above steps S306 and S308 do not have a sequence, and the step numbers are for convenience of description.
According to the scheme provided by the embodiment of the disclosure, the multiple feature similarities between the multiple first image features of the first image and the multiple second image features of the second image are respectively determined, so that the influence of redundant and inconsistent features in the two images can be reduced based on the multiple feature similarities, and common features in the two images are kept, thereby avoiding the influence on style conversion due to the unbalanced information abundance between the images, ensuring that the conversion result is reasonable, and improving the quality of the converted image.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
FIG. 4 is a flow diagram illustrating another method of training an image processing model in accordance with an exemplary embodiment. Taking as an example that the electronic device is provided as a server, and the training method of the image processing model is executed by the server, referring to fig. 4, the image processing model includes a generator and a discriminator, and the method includes:
in step S401, a first sample image and a second sample image are input to an encoder and an alignment forgetting layer in the generator, resulting in a plurality of sample image features, and the alignment forgetting layer is used to extract image features.
In step S402, the plurality of sample image features are input into a decoder in the generator, resulting in a first sample target image having the same style as the second sample image and a second sample target image having the same style as the first sample image.
In step S403, the first sample image, the second sample image, the first sample target image, and the second sample target image are input to the discriminator to obtain a training loss.
In step S404, model training is performed based on the training loss.
In the implementation of the present disclosure, the image processing model is trained based on the training loss, so that the image processing model obtained by training can generate a reasonable conversion result, the attribute of the original picture is not lost during the style conversion, and the output of the model is stable and robust.
In some embodiments, the sample images used for training include a source domain data set X and a target domain data set Y, where the source domain data set includes N first sample images, denoted $\{x_i\}_{i=1}^{N}$, and the target domain data set includes M second sample images, denoted $\{y_j\}_{j=1}^{M}$, with N and M both positive integers.
Fig. 5 is a schematic diagram illustrating a structure of a generator according to an example embodiment. Referring to fig. 5, the generator includes a first encoder E_X, a second encoder E_Y, an alignment forgetting layer U, a first decoder G_X, and a second decoder G_Y. The first encoder is configured to encode a picture in the source domain data set, that is, the first sample image; the second encoder is configured to encode a picture in the target domain data set, that is, the second sample image. The alignment forgetting layer is used for performing feature extraction on the encoded images to obtain the sample image features, for determining the similarity between the sample image features, and for determining the target sample features to be decoded based on the sample image features and the similarity. The first decoder is used for decoding the sample target features determined based on the first sample image features to obtain the first sample target image. The second decoder is used for decoding the sample target features determined based on the second sample image features to obtain the second sample target image.
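As a non-limiting sketch, the generator of fig. 5 can be wired together as below. The encoder, alignment forgetting and decoder sub-modules are passed in as generic modules; their interfaces, in particular an alignment forgetting layer that takes both encoded images and returns both sets of weighted target features, are assumptions made for this example.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Two encoders, one shared alignment forgetting layer, two decoders (fig. 5)."""
    def __init__(self, enc_x: nn.Module, enc_y: nn.Module,
                 align_forget: nn.Module, dec_x: nn.Module, dec_y: nn.Module):
        super().__init__()
        self.enc_x, self.enc_y = enc_x, enc_y
        self.align_forget = align_forget              # shared by both directions
        self.dec_x, self.dec_y = dec_x, dec_y

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        zx, zy = self.enc_x(x), self.enc_y(y)
        # Similarity-weighted target features for both domains (steps S303-S308).
        tx, ty = self.align_forget(zx, zy)
        first_sample_target = self.dec_y(tx)    # style of the second sample image
        second_sample_target = self.dec_x(ty)   # style of the first sample image
        return first_sample_target, second_sample_target
```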
In some embodiments, fig. 6 is a schematic structural diagram illustrating an alignment forgetting layer according to an exemplary embodiment. Referring to fig. 6, the alignment forgetting layer includes an embedding module, a projection module, and a mutual learning module. The embedding module is used for mapping the information of different domains into the same latent space based on the embedding function q(·), the results being denoted $z_x^c$ and $z_y^c$, where $z_x^c$ denotes the embedded features of the c channels of the source domain image and $z_y^c$ denotes the embedded features of the c channels of the target domain image. The embedding module is also used for performing average pooling on the embedded features and extracting global information of the embedded features at the channel level, the results being denoted $e_x^c$ and $e_y^c$, where $e_x^c$ denotes the first image features of the c channels of the source domain image and $e_y^c$ denotes the second image features of the c channels of the target domain image, calculated according to the above formula (1) and formula (3), which are not described herein again. The projection module is used for rearranging the channel relationship of the image features based on the projection function g(·) to obtain $f_x$ and $f_y$, where $f_x$ denotes the aligned first image features and $f_y$ denotes the aligned second image features. The mutual learning module is used for determining the similarity between the image features, denoted sim($f_x$, $f_y$). The mutual learning module is also used for normalizing the feature similarities to obtain $\hat{s}^k$, where c is a positive integer and k = 1, …, c. The mutual learning module is also used for multiplying the feature similarities by the image features and outputting the multiplication result.
In some embodiments, the discriminator comprises a first discriminator D_X and a second discriminator D_Y, which have the same structure. The discriminator is used to determine whether an image is a real image or an image generated by the generator. The discriminator is obtained by connecting 4 convolutional layers in series: the first three convolutional layers have 4 × 4 convolution kernels with a step size of 0.5, and the fourth convolutional layer has a 4 × 4 convolution kernel with a step size of 1. Each convolutional layer is normalized using instance normalization, and the LeakyReLU function is used as the activation function.
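A discriminator matching this description could be sketched as follows; the channel widths, the LeakyReLU slope and the reading of the 0.5 step size as 2× downsampling are assumptions of the example.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Four 4x4 convolutions in series with instance normalization and LeakyReLU."""
    def __init__(self, in_channels: int = 3, base: int = 64):
        super().__init__()
        layers, ch = [], in_channels
        for out in (base, base * 2, base * 4):        # first three convolutions
            layers += [
                nn.Conv2d(ch, out, kernel_size=4, stride=2, padding=1),
                nn.InstanceNorm2d(out),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            ch = out
        # Fourth convolution with step size 1 produces the real/fake score map.
        layers.append(nn.Conv2d(ch, 1, kernel_size=4, stride=1, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.net(img)
```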
It should be noted that, in the image processing model provided by the embodiment of the present disclosure, by means of the first encoder E_X, the alignment forgetting layer U and the second decoder G_Y, the style of an image x in the source domain data set X can be migrated to the style corresponding to the target domain data set Y, that is, the mapping X → Y; by means of the second encoder E_Y, the alignment forgetting layer U and the first decoder G_X, the style of an image y in the target domain data set Y can be migrated to the style corresponding to the source domain data set X, that is, the mapping Y → X.
Note that the first encoder E_X, the alignment forgetting layer U and the first decoder G_X can realize reconstruction of an image x in the source domain data set X; the second encoder E_Y, the alignment forgetting layer U and the second decoder G_Y can realize reconstruction of an image y in the target domain data set Y. Here, reconstruction means that after the input image is processed, the output image keeps the original style and content unchanged, so as to improve the mapping stability; the image obtained by reconstruction is an intermediate image and is not used as an output result.
During model training, the server can convert the image x having the first style into an image $\hat{y}$ having the second style, that is, $x \rightarrow \hat{y}$: the image x is processed by the first encoder, the alignment forgetting layer and the second decoder and is converted into the image $\hat{y}$. This process is called forward conversion. The server can also convert the image $\hat{y}$ having the second style back into the first style to obtain an image $\hat{x}$, that is, $\hat{y} \rightarrow \hat{x}$: the image $\hat{y}$ is processed by the second encoder, the alignment forgetting layer and the first decoder and is converted into the image $\hat{x}$. This process is called backward conversion. Likewise, the server can convert the image y having the second style into an image having the first style (forward conversion), with the image y processed by the second encoder, the alignment forgetting layer and the first decoder, and can then convert that image back into the second style (backward conversion), with the image processed by the first encoder, the alignment forgetting layer and the second decoder. The forward conversion is used for completing the style conversion task, and the backward conversion is used for ensuring the stability of the network mapping, that is, ensuring that an input picture can still be converted back to the input picture after the two style conversions.
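One forward plus backward conversion for a source-domain image can be sketched as below. The helper assumes the component interfaces used in the generator sketch above, and the attribute `embed` for the embedding part of the alignment forgetting layer is likewise an assumption; as noted in the next paragraph, only that embedding part is applied during the backward conversion.

```python
import torch

def forward_then_back(x, y, enc_x, enc_y, align_forget, dec_x, dec_y):
    """Forward conversion x -> y_hat followed by backward conversion y_hat -> x_hat."""
    # Forward conversion: first encoder, alignment forgetting layer, second decoder.
    tx, _ = align_forget(enc_x(x), enc_y(y))
    y_hat = dec_y(tx)
    # Backward conversion: second encoder, (embedding part of the) alignment
    # forgetting layer, first decoder.
    x_hat = dec_x(align_forget.embed(enc_y(y_hat)))
    return y_hat, x_hat
```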
In some embodiments, since the output result of the forward conversion at the initial stage of training is unstable, when the backward conversion is performed, the embedded module part in the alignment forgetting module is used, and the projection module and the mutual learning module part are not used, so that the stability of training can be ensured.
In some embodiments, the training loss comprises a first loss, a second loss, and a third loss. The first loss includes a first countermeasure loss representing a loss of converting the first sample image into the second sample target image and a second countermeasure loss representing a loss of converting the second sample image into the first sample target image. The second loss includes a first image loss representing a difference between the first sample image and the first sample target image and a second image loss representing a difference between the second sample image and the second sample target image. The third loss includes a first loss of consistency representing a difference between the first sample image and the second sample target image and a second loss of consistency representing a difference between the second sample image and the first sample target image.
Correspondingly, inputting the first sample image, the second sample image, the first sample target image and the second sample target image into a discriminator to obtain a training loss, including: and inputting the first sample image, the second sample image, the first sample target image and the second sample target image into a discriminator to obtain a first loss, a second loss and a third loss, and performing weighted summation on the first loss, the second loss and the third loss to obtain the training loss.
In some embodiments, the first loss is a countermeasure (adversarial) loss, which is used to enable the model to output a reasonable conversion result. The generator processes the input image x and the input image y based on {E_X, U, G_Y} and {E_Y, U, G_X}, respectively; the goal of the generator is to generate images that are as realistic as possible, while the goal of the discriminator is to discriminate as correctly as possible whether an image is an input image or an image generated by the generator. The first loss includes the first countermeasure loss and the second countermeasure loss, calculated based on the following formulas (6) and (7):

$$l_{adv}(E_X, U, G_Y, D_Y) = \mathbb{E}_{y \sim S_{data}(y)}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim S_{data}(x)}\big[\log\big(1 - D_Y(G_Y(U(E_X(x))))\big)\big] \qquad (6);$$

where $l_{adv}(E_X, U, G_Y, D_Y)$ denotes the first countermeasure loss, $E_X$ denotes the first encoder, U denotes the alignment forgetting layer, $G_Y$ denotes the second decoder, $D_Y$ denotes the second discriminator, $S_{data}(y)$ denotes the target domain data set, y denotes an image in the target domain data set, $S_{data}(x)$ denotes the source domain data set, and x denotes an image in the source domain data set.

$$l_{adv}(E_Y, U, G_X, D_X) = \mathbb{E}_{x \sim S_{data}(x)}\big[\log D_X(x)\big] + \mathbb{E}_{y \sim S_{data}(y)}\big[\log\big(1 - D_X(G_X(U(E_Y(y))))\big)\big] \qquad (7);$$

where $l_{adv}(E_Y, U, G_X, D_X)$ denotes the second countermeasure loss, $E_Y$ denotes the second encoder, U denotes the alignment forgetting layer, $G_X$ denotes the first decoder, $D_X$ denotes the first discriminator, $S_{data}(x)$ denotes the source domain data set, x denotes an image in the source domain data set, $S_{data}(y)$ denotes the target domain data set, and y denotes an image in the target domain data set.
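Formulas (6) and (7) are standard adversarial objectives; the sketch below uses binary cross entropy with logits as a common stand-in for the log-likelihood form, which is an assumption rather than the literal formula.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real, fake):
    """Discriminator side of formulas (6)/(7): real images toward 1, generated toward 0."""
    pred_real, pred_fake = disc(real), disc(fake.detach())
    return (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) +
            F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))

def generator_adv_loss(disc, fake):
    """Generator side: the generated image should be scored as real."""
    pred_fake = disc(fake)
    return F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
```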
In some embodiments, the second loss is an image loss, which may also be referred to as an identity loss. The image loss is used to enable the model to retain the important information of the image, so as to ensure that the attributes of the original input image are not lost during style conversion, that is, the content of the original image is not lost. The generator realizes the task of reconstructing the input image x and the input image y based on {E_X, U, G_X} and {E_Y, U, G_Y}, respectively, where {E_X, U, G_X} and {E_Y, U, G_Y} may also be called auto-encoders. The second loss includes the first image loss and the second image loss, calculated based on the following formulas (8) and (9):

$$l_{identity}(E_X, U, G_X) = \mathbb{E}_{x \sim S_{data}(x)}\big[\,\| G_X(U(E_X(x))) - x \|_1\big] \qquad (8);$$

where $l_{identity}(E_X, U, G_X)$ denotes the first image loss, $S_{data}(x)$ denotes the source domain data set, x denotes an image in the source domain data set, $E_X$ denotes the first encoder, U denotes the alignment forgetting layer, $G_X$ denotes the first decoder, and $\|\cdot\|_1$ denotes the 1-norm.

$$l_{identity}(E_Y, U, G_Y) = \mathbb{E}_{y \sim S_{data}(y)}\big[\,\| G_Y(U(E_Y(y))) - y \|_1\big] \qquad (9);$$

where $l_{identity}(E_Y, U, G_Y)$ denotes the second image loss, $S_{data}(y)$ denotes the target domain data set, y denotes an image in the target domain data set, $E_Y$ denotes the second encoder, U denotes the alignment forgetting layer, $G_Y$ denotes the second decoder, and $\|\cdot\|_1$ denotes the 1-norm.
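Formulas (8) and (9) are 1-norm reconstruction terms between an auto-encoded image and its input; a one-line sketch:

```python
import torch

def identity_loss(reconstructed: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    """Formulas (8)/(9): mean absolute (1-norm) difference."""
    return torch.mean(torch.abs(reconstructed - original))
```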
In some embodiments, the third loss is a cycle consistency loss, which is used to ensure the stability of the model mapping. The third loss is calculated based on the following formula (10):

$$l_{cyc}(G, F) = \mathbb{E}_{x \sim S_{data}(x)}\big[\,\| G_X(U(E_Y(G_Y(U(E_X(x)))))) - x \|_1\big] + \mathbb{E}_{y \sim S_{data}(y)}\big[\,\| G_Y(U(E_X(G_X(U(E_Y(y)))))) - y \|_1\big] \qquad (10);$$

where $l_{cyc}(G, F)$ denotes the third loss; $S_{data}(x)$ denotes the source domain data set; x denotes an image in the source domain data set; $G_Y(U(E_X(\cdot)))$ denotes processing the image based on the first encoder $E_X$, the alignment forgetting layer U and the second decoder $G_Y$; $S_{data}(y)$ denotes the target domain data set; y denotes an image in the target domain data set; $G_X(U(E_Y(\cdot)))$ denotes processing the image based on the second encoder $E_Y$, the alignment forgetting layer U and the first decoder $G_X$; and $\|\cdot\|_1$ denotes the 1-norm.
In summary, the training loss in the embodiment of the present disclosure is calculated based on the following formula (11):

$$l(E_X, E_Y, U, G_X, G_Y, D_X, D_Y) = l_{adv}(E_X, U, G_Y, D_Y) + l_{adv}(E_Y, U, G_X, D_X) + \lambda_1\big(l_{identity}(E_X, U, G_X) + l_{identity}(E_Y, U, G_Y)\big) + \lambda_2\, l_{cyc}(E_X, E_Y, U, G_X, G_Y) \qquad (11);$$

where $l(E_X, E_Y, U, G_X, G_Y, D_X, D_Y)$ denotes the training loss, $l_{adv}(E_X, U, G_Y, D_Y)$ denotes the first countermeasure loss, $l_{adv}(E_Y, U, G_X, D_X)$ denotes the second countermeasure loss, $l_{identity}(E_X, U, G_X)$ denotes the first image loss, $l_{identity}(E_Y, U, G_Y)$ denotes the second image loss, $l_{cyc}(E_X, E_Y, U, G_X, G_Y) = l_{cyc}(G, F)$ denotes the third loss, and $\lambda_1$ and $\lambda_2$ denote the weights of the corresponding loss terms.
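Formulas (10) and (11) can be sketched together as follows; the pairing of lambda1 with the image losses and lambda2 with the cycle loss follows the reconstruction of formula (11) above and should be read as illustrative.

```python
import torch

def cycle_loss(x, x_back, y, y_back):
    """Formula (10): 1-norm between each input and its back-converted image."""
    return torch.mean(torch.abs(x_back - x)) + torch.mean(torch.abs(y_back - y))

def total_loss(l_adv_xy, l_adv_yx, l_id_x, l_id_y, l_cyc, lambda1=5.0, lambda2=10.0):
    """Formula (11): weighted sum of the countermeasure, image and cycle losses."""
    return l_adv_xy + l_adv_yx + lambda1 * (l_id_x + l_id_y) + lambda2 * l_cyc
```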
For example, the server sets the learning rate to 0.0002, the batch training size to 1, the number of iteration cycles to 200, and the loss weights to λ1 = 5 and λ2 = 10, and performs model training until the above-described image processing model is obtained.
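Those hyperparameters could be set up as in the sketch below; the choice of the Adam optimizer and its beta values is an assumption, since the description states only the learning rate, batch size, iteration cycles and loss weights.

```python
import torch

def build_optimizers(generator, d_x, d_y, lr: float = 2e-4):
    """Optimizers for the example schedule (learning rate 0.0002, batch size 1,
    200 iteration cycles); Adam with betas (0.5, 0.999) is assumed."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(
        list(d_x.parameters()) + list(d_y.parameters()), lr=lr, betas=(0.5, 0.999))
    return opt_g, opt_d
```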
It should be noted that, in order to verify the processing effect of the image processing method provided by the embodiments of the present disclosure, the image processing method provided by the present disclosure is further compared qualitatively with existing representative methods, namely CycleGAN (Cycle-Consistent Adversarial Networks), UNIT (Unsupervised Image-to-Image Translation Networks), GcGAN (Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping), DRIT++ (Diverse Image-to-Image Translation via Disentangled Representations, which realizes different image-to-image conversions by means of decoupled representations) and UGATIT (Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation). The data set adopted for the qualitative comparison is the cartoon human face data set selfie2anime. For the conversion results, refer to fig. 7; fig. 7 is a diagram illustrating conversion results according to an exemplary embodiment. The first column represents the input image, the second column represents the image output after style conversion by CycleGAN, the third column represents the image output after style conversion by UNIT, the fourth column represents the image output after style conversion by GcGAN, the fifth column represents the image output after style conversion by DRIT++, the sixth column represents the image output after style conversion by UGATIT, and the seventh column represents the image output after style conversion based on the image processing method provided by the embodiment of the present disclosure.
It should be noted that the image processing method provided by the present disclosure is also quantitatively compared with the existing representative methods CycleGAN, UNIT, GcGAN, DRIT++ and UGATIT. The comparison metric is the FID score (Fréchet Inception Distance score), which measures the distance between the feature vectors of real images and generated images; the lower the FID score, the closer the generated images are to the real images. The comparison results are shown in Table 1: the method of the present disclosure obtains the lowest score, which indicates that the conversion results of the present disclosure are closer to real cartoon faces.
TABLE 1
Method CycleGAN UNIT GcGAN DRIT++ UGATIT Present disclosure
FID score 84.63 104.86 147.30 118.48 94.59 81.39
It should be noted that the image processing method provided by the present disclosure is also compared with the existing representative methods CycleGAN, UNIT, GcGAN, DRIT++ and UGATIT in terms of user evaluation. A total of 37 users participated in the experiment; for each of 25 test questions, the users were required to select the best one among the conversion results of the representative algorithms. The evaluation results are shown in Table 2. The method of the present disclosure obtained the highest support rate, indicating that its conversion results better match users' cognition and preferences.
TABLE 2
Method CycleGAN UNIT GcGAN DRIT++ UGATIT Present disclosure
User support rate 8.43% 2.48% 7.56% 2.91% 32.64% 46.16%
It should be noted that the image processing method provided by the embodiment of the present disclosure can be applied to style conversion tasks in various scenes. Fig. 8 is a diagram illustrating style conversion results between maps and satellite images according to an example embodiment. As shown in fig. 8, 6 pairs of images are included; in each pair, the image on the left is the input image and the image on the right is the conversion result. Fig. 9 is a diagram illustrating style conversion results between street views and segmented street views according to an example embodiment. As shown in fig. 9, a total of 6 pairs of images are included; in each pair, the image on the left is the input image and the image on the right is the conversion result. Fig. 10 is a diagram illustrating style conversion results between buildings and segmented buildings according to an exemplary embodiment. As shown in fig. 10, a total of 6 pairs of images are included; in each pair, the image on the left is the input image and the image on the right is the conversion result.
It should be noted that the image processing method provided in the embodiment of the present application can further adjust the network to improve the quality of style conversion. In some embodiments, the alignment forgetting layer has multiple variants: different numbers of residual blocks can be used in the embedding module to extract the common features of the two domains, the projection module can be built from a convolutional layer with a 1 × 1 convolution kernel or from a multilayer perceptron, and the mutual learning module can measure similarity using cross entropy or dot product, so that the image processing method has scalability and flexibility.
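The variants above amount to a small configuration space; one illustrative way to expose it follows (the names are placeholders, not terms from the disclosure).

```python
from dataclasses import dataclass

@dataclass
class AlignForgetConfig:
    """Configuration knobs for the alignment forgetting layer variants."""
    num_residual_blocks: int = 2        # 1, 2 or 3 residual blocks in the embedding module
    projection: str = "conv1x1"         # "conv1x1" or "mlp"
    similarity: str = "cross_entropy"   # "cross_entropy" or "dot"
```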
FIG. 11 is a graph comparing the results of a variation experiment according to an exemplary embodiment. As shown in fig. 11, the first column is the input image; the second column is the conversion result obtained by using two residual blocks in the alignment forgetting layer, a convolutional layer with a 1 × 1 convolution kernel in the projection module, and cross entropy in the mutual learning module; the third column is the conversion result obtained by using one residual block in the alignment forgetting layer, a convolutional layer with a 1 × 1 convolution kernel in the projection module, and cross entropy in the mutual learning module; the fourth column is the conversion result obtained by using three residual blocks in the alignment forgetting layer, a convolutional layer with a 1 × 1 convolution kernel in the projection module, and cross entropy in the mutual learning module; the fifth column is the conversion result obtained by using two residual blocks in the alignment forgetting layer, a convolutional layer with a 1 × 1 convolution kernel in the projection module, and dot multiplication in the mutual learning module; the sixth column is the conversion result obtained by using two residual blocks in the alignment forgetting layer, a two-layer multilayer perceptron in the projection module, and cross entropy in the mutual learning module; the seventh column is the conversion result obtained by using two residual blocks in the alignment forgetting layer, a two-layer multilayer perceptron in the projection module, and dot multiplication in the mutual learning module.
In addition, training the image processing model requires few GPU resources: less than 3000 MB of video memory and about 40 hours of training time, so the training efficiency is greatly improved and the training cost is significantly reduced.
Fig. 12 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment. Referring to fig. 12, the apparatus includes: a feature extraction module 1201, a similarity determination module 1202, and an image determination module 1203.
A feature extraction module 1201 configured to perform feature extraction on a first image and a second image to obtain a plurality of first image features and a plurality of second image features, where the plurality of first image features represent features of the first image, the plurality of second image features represent features of the second image, and the plurality of first image features and the plurality of second image features are in one-to-one correspondence;
a similarity determination module 1202 configured to determine a similarity between a first image feature and a second image feature corresponding to the same channel position, resulting in a plurality of feature similarities;
an image determination module 1203 configured to determine a first target image and a second target image based on the plurality of first image features, the plurality of second image features and the plurality of feature similarities, the first target image having the same style as the second image, the second target image having the same style as the first image.
The device provided by the embodiment of the disclosure can reduce the influence of redundant and inconsistent features in two images and keep common features in the two images based on the multiple feature similarities by respectively determining the multiple feature similarities between the multiple first image features of the first image and the multiple second image features of the second image, thereby avoiding the influence on style conversion caused by the unbalanced information abundance between the images, ensuring that the conversion result is more reasonable, and improving the quality of the converted image.
In some embodiments, the similarity determination module 1202 is configured to determine a dot product of a first image feature and a second image feature corresponding to a same channel position as a feature similarity of the first image feature and the second image feature; or determining the cross entropy of the first image feature and the second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature.
In some embodiments, fig. 13 is a block diagram of another image processing apparatus according to an exemplary embodiment, and referring to fig. 13, the feature extraction module 1201 includes:
a first encoding unit 12011 configured to encode the first image to obtain a first encoded image;
a first feature extraction unit 12012, configured to perform feature extraction on the first encoded image, so as to obtain a plurality of first image features;
a second encoding unit 12013 configured to encode the second image, resulting in a second encoded image;
a second feature extraction unit 12014, configured to perform feature extraction on the second encoded image, resulting in the plurality of second image features.
In some embodiments, referring to fig. 13, the first encoding unit 12011 is configured to down-sample the first image, and input the sampled image into the first residual network to obtain the first encoded image.
In some embodiments, referring to fig. 13, the first feature extraction unit 12012 is configured to perform feature extraction on the first encoded image based on an embedding function, so as to obtain a plurality of first embedded features; and averaging and pooling the plurality of first embedded features respectively to obtain the plurality of first image features.
In some embodiments, referring to fig. 13, the apparatus further comprises:
a first rearranging module 1205 configured to rearrange the channel relationships between the plurality of first image features based on the projection function.
In some embodiments, the second encoding unit 12013 is configured to down-sample the second image, and input the sampled image into a second residual network to obtain the second encoded image.
In some embodiments, the second feature extraction unit 12014 is configured to perform feature extraction on the second encoded image based on an embedding function, so as to obtain a plurality of second embedded features; and averaging and pooling the plurality of second embedded features respectively to obtain a plurality of second image features.
In some embodiments, referring to fig. 13, the apparatus further comprises:
a second rearranging module 1206 configured to rearrange a channel relationship between the plurality of second image features based on the projection function.
In some embodiments, referring to fig. 13, the image determining module 1203 includes:
a first determining unit 12031 configured to determine a plurality of first target features based on the plurality of first image features and the plurality of feature similarities;
a first decoding unit 12032, configured to decode the plurality of first target features to obtain the first target image;
a second determining unit 12033 configured to determine a plurality of second target features based on the plurality of second image features and the plurality of feature similarities;
a second decoding unit 12034 configured to decode the plurality of second target features to obtain the second target image.
In some embodiments, the first determining unit 12031 is configured to, for any first image feature, determine a product of the any first image feature and a corresponding feature similarity as a first target feature corresponding to the any first image feature.
In some embodiments, the first decoding unit 12032 is configured to perform upsampling on the plurality of first target features, and input the sampled features into a third residual network to obtain the first target image.
In some embodiments, the second determining unit 12033 is configured to, for any second image feature, determine a product of the any second image feature and the corresponding feature similarity as a second target feature corresponding to the any second image feature.
In some embodiments, the second decoding unit 12034 is configured to perform upsampling on the plurality of second target features, and input the sampled features into a fourth residual network to obtain the second target image.
It should be noted that, when the image processing apparatus provided in the above embodiment processes an image, only the division of the above functional units is illustrated, and in practical applications, the above function distribution may be completed by different functional units according to needs, that is, the internal structure of the electronic device may be divided into different functional units to complete all or part of the above described functions. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
FIG. 14 is a block diagram illustrating an apparatus for training an image processing model according to an exemplary embodiment. Referring to fig. 14, the image processing model includes a generator and a discriminator, and the apparatus includes: a first training module 1401, a second training module 1402, and a third training module 1403.
A first training module 1401 configured to input a first sample image and a second sample image into an encoder and an alignment forgetting layer in the generator, resulting in a plurality of sample image features, the alignment forgetting layer being used for extracting image features;
a second training module 1402 configured to input the plurality of sample image features into a decoder in the generator, resulting in a first sample target image having the same style as the second sample image and a second sample target image having the same style as the first sample image;
a third training module 1403 configured to input the first sample image, the second sample image, the first sample target image, and the second sample target image into the discriminator, resulting in a training loss;
the third training module 1403 is further configured to perform model training according to the training loss.
In some embodiments, the third training module 1403 is configured to input the first sample image, the second sample image, the first sample target image and the second sample target image into the discriminator to obtain a first loss, a second loss and a third loss. The first loss includes a first countermeasure loss and a second countermeasure loss, the first countermeasure loss represents the loss of converting the first sample image into the second sample target image, and the second countermeasure loss represents the loss of converting the second sample image into the first sample target image. The second loss includes a first image loss and a second image loss, the first image loss represents the difference between the first sample image and the first sample target image, and the second image loss represents the difference between the second sample image and the second sample target image. The third loss includes a first consistency loss and a second consistency loss, the first consistency loss represents the difference between the first sample image and the second sample target image, and the second consistency loss represents the difference between the second sample image and the first sample target image. The third training module is further configured to perform weighted summation on the first loss, the second loss and the third loss to obtain the training loss.
It should be noted that, when training the image processing model, the training apparatus for the image processing model provided in the above embodiment only exemplifies the division of the above functional units, and in practical applications, the above functions may be distributed to different functional units according to needs, that is, the internal structure of the electronic device may be divided into different functional units to complete all or part of the above described functions. In addition, the training apparatus for an image processing model and the embodiment of the training method for an image processing model provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in the embodiment of the method for details, and are not described herein again.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
When the electronic device is provided as a server, fig. 15 is a block diagram of a server 1500 according to an exemplary embodiment. The server 1500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1501 and one or more memories 1502, where at least one program code is stored in the memory 1502 and is loaded and executed by the processor 1501 to implement the image processing methods provided by the above method embodiments, or to implement the training methods of the image processing models provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input/output, and the server 1500 may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 1502 comprising instructions, executable by the processor 1501 of the server 1500 to perform the above-described method is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A computer program product comprising a computer program which, when being executed by a processor, implements the above-mentioned image processing method or which, when being executed by a processor, implements a training method for the above-mentioned image processing model.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An image processing method, characterized in that the method comprises:
performing feature extraction on a first image and a second image to obtain a plurality of first image features and a plurality of second image features, wherein the plurality of first image features represent the features of the first image, the plurality of second image features represent the features of the second image, and the plurality of first image features and the plurality of second image features are in one-to-one correspondence;
determining the similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities;
determining a first target image and a second target image based on the plurality of first image features, the plurality of second image features and the plurality of feature similarities, the first target image having the same style as the second image, the second target image having the same style as the first image.
2. The image processing method according to claim 1, wherein the determining a similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities comprises:
determining the dot product of a first image feature and a second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature; alternatively, the first and second electrodes may be,
and determining the cross entropy of the first image feature and the second image feature corresponding to the same channel position as the feature similarity of the first image feature and the second image feature.
3. The image processing method according to claim 1, wherein the performing feature extraction on the first image and the second image to obtain a plurality of first image features and a plurality of second image features comprises:
coding the first image to obtain a first coded image;
performing feature extraction on the first coded image to obtain a plurality of first image features;
coding the second image to obtain a second coded image;
and performing feature extraction on the second coded image to obtain a plurality of second image features.
4. The image processing method according to claim 3, wherein said encoding the first image to obtain a first encoded image comprises:
and performing downsampling on the first image, and inputting the sampled image into a first residual error network to obtain the first coded image.
5. A method of training an image processing model, the image processing model comprising a generator and an arbiter, the method comprising:
inputting a first sample image and a second sample image into an encoder and an alignment forgetting layer in the generator to obtain a plurality of sample image features, wherein the alignment forgetting layer is used for extracting the image features;
inputting the characteristics of the plurality of sample images into a decoder in the generator to obtain a first sample target image and a second sample target image, wherein the first sample target image and the second sample image have the same style, and the second sample target image and the first sample image have the same style;
inputting the first sample image, the second sample image, the first sample target image and the second sample target image into the discriminator to obtain a training loss;
and carrying out model training according to the training loss.
6. An image processing apparatus, characterized in that the apparatus comprises:
the image processing device comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is configured to perform feature extraction on a first image and a second image to obtain a plurality of first image features and a plurality of second image features, the plurality of first image features represent features of the first image, the plurality of second image features represent features of the second image, and the plurality of first image features and the plurality of second image features are in one-to-one correspondence;
the similarity determining module is configured to determine similarity between the first image feature and the second image feature corresponding to the same channel position to obtain a plurality of feature similarities;
an image determination module configured to determine a first target image and a second target image based on the plurality of first image features, the plurality of second image features, and the plurality of feature similarities, the first target image having a same style as the second image, the second target image having a same style as the first image.
7. An apparatus for training an image processing model, the image processing model including a generator and an arbiter, the apparatus comprising:
a first training module configured to input a first sample image and a second sample image into an encoder and an alignment forgetting layer in the generator to obtain a plurality of sample image features, wherein the alignment forgetting layer is used for extracting image features;
a second training module configured to input the plurality of sample image features into a decoder in the generator, resulting in a first sample target image and a second sample target image, the first sample target image having a same style as the second sample image, the second sample target image having a same style as the first sample image;
a third training module configured to input the first sample image, the second sample image, the first sample target image, and the second sample target image into the discriminator to obtain a training loss;
the third training module is further configured to perform model training according to the training loss.
8. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the image processing method of any one of claims 1 to 4 or to implement the training method of the image processing model of claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image processing method of any one of claims 1 to 4 or perform the training method of the image processing model of claim 5.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the image processing method of any one of claims 1 to 4 when executed by a processor or the computer program realizes the training method of the image processing model of claim 5 when executed by the processor.
CN202111316220.1A 2021-11-08 2021-11-08 Image processing method, image processing device, electronic equipment and storage medium Pending CN114119351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111316220.1A CN114119351A (en) 2021-11-08 2021-11-08 Image processing method, image processing device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114119351A true CN114119351A (en) 2022-03-01

Family

ID=80377854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111316220.1A Pending CN114119351A (en) 2021-11-08 2021-11-08 Image processing method, image processing device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114119351A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583165A (en) * 2019-02-19 2020-08-25 京东方科技集团股份有限公司 Image processing method, device, equipment and storage medium
WO2021147257A1 (en) * 2020-01-20 2021-07-29 上海商汤智能科技有限公司 Network training method and apparatus, image processing method and apparatus, and electronic device and storage medium
CN111583100A (en) * 2020-05-12 2020-08-25 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and storage medium
US20210241498A1 (en) * 2020-06-12 2021-08-05 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for processing image, related electronic device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082299A (en) * 2022-07-21 2022-09-20 中国科学院自动化研究所 Method, system and equipment for converting different source images of small samples in non-strict alignment
CN115082299B (en) * 2022-07-21 2022-11-25 中国科学院自动化研究所 Method, system and equipment for converting different source images of small samples in non-strict alignment

Similar Documents

Publication Publication Date Title
Deshpande et al. Learning diverse image colorization
US11221990B2 (en) Ultra-high compression of images based on deep learning
KR20210074360A (en) Image processing method, device and apparatus, and storage medium
WO2022105117A1 (en) Method and device for image quality assessment, computer device, and storage medium
CN111783478B (en) Machine translation quality estimation method, device, equipment and storage medium
EP3740912A1 (en) Data compression by local entropy encoding
CN110175641A (en) Image-recognizing method, device, equipment and storage medium
CN110781413A (en) Interest point determining method and device, storage medium and electronic equipment
CN115359314A (en) Model training method, image editing method, device, medium and electronic equipment
CN114298997B (en) Fake picture detection method, fake picture detection device and storage medium
CN114119351A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114861907A (en) Data calculation method, device, storage medium and equipment
CN115131194A (en) Method for determining image synthesis model and related device
US20240078756A1 (en) Image generation method and apparatus, device, storage medium, and computer program product
Feng et al. Learning from mixed datasets: A monotonic image quality assessment model
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN116958020A (en) Abnormal image detection method, model training method, device, equipment and medium
CN112307243A (en) Method and apparatus for retrieving image
CN116977763A (en) Model training method, device, computer readable storage medium and computer equipment
Kefeng et al. CNN Based No‐Reference HDR Image Quality Assessment
Hajizadeh et al. Predictive compression of animated 3D models by optimized weighted blending of key‐frames
CN113780324A (en) Data processing method and device, electronic equipment and storage medium
Liang et al. An adaptive image compression algorithm based on joint clustering algorithm and deep learning
CN116684607B (en) Image compression and decompression method and device, electronic equipment and storage medium
CN113239917B (en) Robust face recognition method based on singular value decomposition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination