CN117315090A - Cross-modal style learning-based image generation method and device - Google Patents


Info

Publication number
CN117315090A
CN117315090A (application CN202311265075.8A)
Authority
CN
China
Prior art keywords
image
target
loss
training
generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311265075.8A
Other languages
Chinese (zh)
Inventor
董晶
王伟
彭勃
王建文
吕月明
江玥
杨嵩林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202311265075.8A
Publication of CN117315090A
Legal status: Pending (Current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; Combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/094: Adversarial learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to an image generation method and device based on cross-modal style learning. The method comprises the following steps: generating a target generated image through a pre-trained generative adversarial network, and performing image enhancement processing on a target training image; inputting the target generated image and the processed target training image into the discriminator of the adversarial network, inputting the target features output by the discriminator into a trained prototype space, and determining a swapped prediction loss according to the output of the prototype space; acquiring the original adversarial loss between the target training image and the target generated image, and the contrastive learning loss and text guidance loss between a source generated image and the target generated image, and determining a fine-tuning loss from these losses; and fine-tuning the adversarial network according to the fine-tuning loss to obtain a fine-tuned image generation model, and generating images consistent with the training image through the image generation model, thereby alleviating the problem of poor target-domain style learning under the current extremely-few-sample setting.

Description

Cross-modal style learning-based image generation method and device
Technical Field
The disclosure relates to the technical field of image processing, in particular to an image generation method and device based on cross-modal style learning.
Background
With the continued development of artificial intelligence, it has become easier to generate diverse and high-quality images, such as facial images, protein molecular structures, and paintings in the style of a given artist. This technology has strong research and application value in fields such as virtual reality. However, conventional generative models require large amounts of high-quality data to generate the images a user desires; for example, the StyleGAN model requires thousands to tens of thousands of high-quality images for training. In some cases such data sets are difficult to acquire, for instance the works of a particular artist or medical images: the Artists-Faces data set contains only about 10 works per artist. With so little training data (fewer than 10 samples), generative models with enormous numbers of parameters struggle to produce high-quality results similar to the training data.
To reduce the need for training data, many researchers have explored generating high-quality images from only a few samples. The mainstream few-shot generation approach starts from a generative model pre-trained on a large-scale data set and applies various fine-tuning techniques so that the fine-tuned model generates images consistent with the training target while retaining the strong generative capacity of the original model, thereby achieving personalized few-shot generation. The TGAN model demonstrated that fine-tuning a pre-trained generative model on a small amount of training data can effectively generate images similar to the training data, and that this process requires only the conventional adversarial loss function. From this point on, fine-tuning pre-trained models on small data sets to generate new target-domain images has been studied increasingly. In one related technique, data augmentation increases the semantic invariance of the training data without introducing data noise, thereby preventing the discriminator from biasing toward the training samples; this works well with training sets of around a thousand images. In other related techniques, the weights of the pre-trained model are mapped to an orthogonal parameter space, and the changes brought by the training samples are made in the new space; this method can still preserve the generative capacity when there are fewer than 100 training images. Some related techniques found that the lower layers of the discriminator learn general image features while the higher layers learn discriminative features, which can be applied to image classification. Another related technique proposes the MineGAN model, which first determines the parameter layers of the pre-trained model closest to the target training domain and then adjusts those parameters in the subsequent fine-tuning process to guide the generation of target-domain images. These methods laid the research foundation for few-shot generation.
However, their drawbacks are also evident: with extremely few samples, such as fewer than 10 training images, the above methods can only generate images that are essentially identical to the target training samples, and it is difficult to generate diverse data. This is because, with so few training samples, the model overfits to the training set and loses its original generative ability.
To address this problem, the CDC (cross-domain correspondence) method constrains the correspondence between the pre-training domain and the target domain with a contrastive learning method, thereby avoiding model overfitting. Related work found that the CDC method loses basic information of the source images generated by the pre-trained model during learning. To ensure that the model can generate images similar to those of the pre-trained model while carrying the same domain characteristics as the training samples, the RSSA model establishes a high-dimensional mapping from the target domain to the source domain and uses a multi-scale contrastive learning method to keep the generated images similar to the source images. Other related work addresses the bias problem in few-shot generation: previous methods pay excessive attention to generation diversity, which may harm generation quality, so the generator and the discriminator are used as two feature extractors for features at different levels and a dual contrastive learning method is proposed to maximize the mutual information between domains, thereby ensuring generation quality.
In summary, to prevent the model from generating only images identical to the target training samples, early work generally constrained the model by editing its key layers or regularizing its parameters; more recent work keeps the generated images similar to the source images, uses contrastive learning to improve the understanding of the few available samples, and preserves as much of the original pre-trained model's source information as possible.
However, these methods do not let the generative model construct an explicit representation of the target-domain information, so domain style information is learned insufficiently; they also ignore the style characteristics contained in the textual information of the target samples, so the generated images express the domain style inadequately. Furthermore, these methods preserve the content of the source images equally poorly: with few training samples the generated details remain poor, and under single-sample training they fail to generate results whose identity is similar to that of the source image.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, embodiments of the present disclosure provide an image generation method and apparatus based on cross-modal style learning.
In a first aspect, embodiments of the present disclosure provide an image generation method based on cross-modal style learning, the method comprising:
selecting a target training image from a plurality of preset training images, generating a target generated image through a pre-trained generative adversarial network, and performing image enhancement processing on the target training image, wherein the input of the pre-trained generative adversarial network is noise from a preset noise space;
inputting the target generated image and the processed target training image into the discriminator of the pre-trained generative adversarial network, inputting the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into a pre-trained prototype space, and determining a swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space;
acquiring the original adversarial loss between the target training image and the target generated image, and the contrastive learning loss and text guidance loss between a source generated image and the target generated image, and determining a fine-tuning loss according to the original adversarial loss, the swapped prediction loss, the text guidance loss and the contrastive learning loss, wherein the source generated image is generated through the pre-trained generative adversarial network, and the inputs for the source generated image and the target generated image are noise sampled from the same noise space;
and fine-tuning the pre-trained generative adversarial network according to the fine-tuning loss to obtain a fine-tuned image generation model, so as to generate images consistent with the preset training images based on the fine-tuned image generation model.
In one possible embodiment, the prototype space is trained by:
performing image enhancement on each preset training image to obtain a target training image set, selecting two images from the target training image set, and inputting the two images into the discriminator of the pre-trained generative adversarial network, wherein the discriminator does not include its last linear layer;
inputting the target features, output by the discriminator, respectively corresponding to the two selected images into a preset prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the preset prototype space;
and adjusting the parameters of the preset prototype space according to the swapped prediction loss until the swapped prediction loss is smaller than a preset threshold.
In one possible implementation, inputting the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into the pre-trained prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space, includes:
mapping the target features respectively corresponding to the target generated image and the processed target training image to the prototype-space dimension with the mapping module (two fully connected layers) of the pre-trained prototype space, obtaining the mapping vectors respectively corresponding to the two images;
embedding the mapping vectors into the prototype space through a preset linear layer;
respectively inputting the vectors taken out of the prototype space into the preset linear layer, and aggregating and normalizing the vectors output by the preset linear layer with the corresponding mapping vectors;
and calculating, from the normalization result, the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image.
In one possible implementation, the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image is calculated from the normalization result by the following expression:
$L_{swap}(z_i, z_j) = \ell(z_i, q_j) + \ell(z_j, q_i)$
wherein $L_{swap}$ is the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image, $z_i$ and $z_j$ are the mapping vectors respectively corresponding to the two images, $q_i$ and $q_j$ are the normalization results corresponding to the mapping vectors, $\tau$ is a temperature coefficient used to control the importance of the terms, and $c_k$ denotes the prototype vectors in the prototype space.
In one possible embodiment, the acquiring text guidance loss between the source-generated image and the target-generated image includes:
inputting the text labels respectively corresponding to the source generated image and the target generated image into the text encoder of a CLIP model to obtain a first output result of the text encoder;
inputting the text label corresponding to the target generated image into a large language generation model to obtain an enhanced text label corresponding to the target generated image;
inputting the enhanced text label corresponding to the target generated image into the text encoder of the CLIP model to obtain a second output result of the text encoder;
respectively inputting the source generated image and the target generated image into the image encoder of the CLIP model to obtain the output result of the image encoder;
and calculating the text guidance loss between the source generated image and the target generated image from the first output result and the second output result of the text encoder and the output result of the image encoder.
In one possible embodiment, the text guidance loss between the source generated image and the target generated image is calculated from the first output result and the second output result of the text encoder and the output result of the image encoder by the following expression:
$\Delta T_i = E_T(T_t) - E_T(T_s)$
$\Delta I = E_I(I_{gen}) - E_I(I_s)$
$L_{td} = 1 - \mathrm{CosSim}(\Delta T_i, \Delta I)$
$\Delta T_s = E_T(T_t) - E_T(T_{ts})$
$L_{tsd} = 1 - \mathrm{CosSim}(\Delta T_s, \Delta I)$
wherein $I_{gen}$ and $I_s$ are the target generated image and the source generated image respectively, $\mathrm{CosSim}(\cdot,\cdot)$ is the cosine similarity of two vectors, $L_{td}$ and $L_{tsd}$ are the text guidance losses between the target and source generated images, $T_t$ and $T_s$ are the text labels respectively corresponding to the target generated image and the source generated image, $E_T$ and $E_I$ are the text encoder and the image encoder, and $T_{ts}$ is the enhanced text label corresponding to the target generated image.
In one possible embodiment, the acquiring the contrast learning penalty between the source-generated image and the target-generated image includes:
selecting regions of different scales on the feature maps of the source generated image and the target generated image respectively, wherein in each region a specified number of anchor points are selected as region centers, and a square region is determined with a specified length as the side length;
and determining the contrastive learning loss according to the difference values between the anchor point m and the j-th boundary pixel on the feature maps of the source generated image and the target generated image.
In one possible implementation, the contrastive learning loss is determined from the difference values between the anchor point m and the j-th boundary pixel on the feature maps of the source generated image and the target generated image, wherein the difference values between the anchor point m and the j-th boundary pixel are computed separately for the source generated image and the target generated image, $b_m^j$ and $a_m$ are the points on the square boundary and the anchor point, $f_s$ and $f_t$ are the feature maps of the source generated image and the target generated image respectively, the pixel difference values at consistent positions in the source generated image and the target generated image form positive sample pairs, the pixel difference values at inconsistent positions form negative sample pairs, $L_{sda}$ is the contrastive learning loss, N is the number of patches, and M is the number of anchor points.
In one possible implementation, the fine-tuning loss is determined as a weighted combination of the original adversarial loss, the swapped prediction loss, the text guidance losses and the contrastive learning loss, wherein $L_{swap}$ is the swapped prediction loss, $L_{td}$ and $L_{tsd}$ are the text guidance losses, the adversarial term is the original adversarial loss, $L_{sda}$ is the contrastive learning loss, $\lambda_s$, $\lambda_d$, $\lambda_t$ and $\lambda_p$ are adjustment coefficients, $G_t$ and $D_t$ are the generator and the discriminator of the fine-tuned image generation model, $x$ and $I_t$ denote the target training images, and $z$ and $p_z$ are the sampled noise and the noise space.
In a second aspect, embodiments of the present disclosure provide an image generation apparatus based on cross-modal style learning, including:
an enhancement module, configured to select a target training image from a plurality of preset training images, generate a target generated image through a pre-trained generative adversarial network, and perform image enhancement processing on the target training image, wherein the input of the pre-trained generative adversarial network is noise from a preset noise space;
an input module, configured to input the target generated image and the processed target training image into the discriminator of the pre-trained generative adversarial network, input the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into a pre-trained prototype space, and determine the swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space;
a determining module, configured to acquire the original adversarial loss between the target training image and the target generated image, and the contrastive learning loss and text guidance loss between a source generated image and the target generated image, and determine the fine-tuning loss according to the original adversarial loss, the swapped prediction loss, the text guidance loss and the contrastive learning loss, wherein the source generated image is generated through the pre-trained generative adversarial network, and the inputs for the source generated image and the target generated image are noise sampled from the same noise space;
and a fine-tuning module, configured to fine-tune the pre-trained generative adversarial network according to the fine-tuning loss to obtain a fine-tuned image generation model, so as to generate images consistent with the preset training images based on the fine-tuned image generation model.
In a third aspect, embodiments of the present disclosure provide an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the image generation method based on the cross-modal style learning when executing the program stored in the memory.
In a fourth aspect, embodiments of the present disclosure provide a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the above-described image generation method based on cross-modal style learning.
Compared with the prior art, the technical solution provided by the embodiments of the disclosure has at least some or all of the following advantages:
According to the image generation method based on cross-modal style learning, the style characteristics of the target samples are learned from two modalities, image and text. On the image side, a style space is constructed from the limited training samples by way of prototype learning; no matter how the training data are augmented, they retain the target style characteristics, so consistency is preserved in the prototype space, and during subsequent training the generated images are constrained in the prototype space to be as similar to the target images as possible. On the text side, since the CLIP model has been thoroughly trained on a large number of paired image-text pairs, its latent space carries rich image-text alignment information, and these features assist the style prototype space in constraining the style similarity between the generated images and the target images, thereby alleviating the problem of poor target-domain style learning under the current extremely-few-sample setting.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the related art will be briefly described below, and it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 schematically illustrates a flow diagram of an image generation method based on cross-modal style learning in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates a schematic diagram of a swapped prediction loss acquisition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a text guidance loss acquisition method according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow diagram of a contrastive learning loss acquisition method according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a block diagram of a cross-modal style learning based image generation apparatus in accordance with an embodiment of the present disclosure; and
Fig. 6 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some, but not all, embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the disclosure, are within the scope of the disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides an image generation method based on cross-modal style learning, the method including:
s1, selecting a target training image from a plurality of preset training images, generating a target generating image through a pre-training generating countermeasure network, and carrying out image enhancement processing on the target training image, wherein the input of the pre-training generating countermeasure network is noise of a preset noise space.
In some embodiments, the image enhancement processing includes, but is not limited to, random cropping, random erasing, affine transformation, flipping, and the like.
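As a concrete illustration, a minimal augmentation pipeline of this kind could be assembled with torchvision; the specific transforms and parameter values below are illustrative assumptions and are not fixed by the disclosure.

```python
# Hypothetical augmentation pipeline for the target training image; the specific
# transforms and parameters are illustrative choices, not fixed by the disclosure.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(256, scale=(0.8, 1.0)),        # random cropping
    T.RandomAffine(degrees=10, translate=(0.1, 0.1)),   # affine transformation
    T.RandomHorizontalFlip(p=0.5),                      # flipping
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.1)),          # random erasing (on the tensor)
])
```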
S2, inputting the target generated image and the processed target training image into the discriminator of the pre-trained generative adversarial network, inputting the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into a pre-trained prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space.
S3, acquiring the original adversarial loss between the target training image and the target generated image, and the contrastive learning loss and text guidance loss between a source generated image and the target generated image, and determining the fine-tuning loss according to the original adversarial loss, the swapped prediction loss, the text guidance loss and the contrastive learning loss, wherein the source generated image is generated through the pre-trained generative adversarial network, and the inputs for the source generated image and the target generated image are noise sampled from the same noise space.
S4, fine-tuning the pre-trained generative adversarial network according to the fine-tuning loss to obtain a fine-tuned image generation model, and generating images consistent with the preset training images based on the fine-tuned image generation model.
In this embodiment, in step S2, the prototype space is trained by the following steps:
performing image enhancement on each preset training image to obtain a target training image set, selecting two images from the target training image set, and inputting the two images into the discriminator of the pre-trained generative adversarial network, wherein the discriminator does not include its last linear layer;
inputting the target features, output by the discriminator, respectively corresponding to the two selected images into a preset prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the preset prototype space;
and adjusting the parameters of the preset prototype space according to the swapped prediction loss until the swapped prediction loss is smaller than a preset threshold.
In some embodiments, referring to fig. 2, in order to learn target attribute features from limited training data, the method exploits the fact that data augmentation does not change the domain characteristics of the data (a property that also holds for single-sample generation). However, since the augmented data remain within the original data distribution only in the discriminator's feature space, a prototype space is constructed for the augmented data. To learn this prototype space, the last linear layer of the discriminator is discarded, and the target features output by the discriminator are mapped to the prototype-space dimension by a mapping module of two fully connected layers.
In some embodiments, the vectors taken out of the prototype space are respectively input into a preset linear layer, and the vectors output by the preset linear layer are aggregated and normalized with the corresponding mapping vectors, so that the prototype space can be optimized to contain as much information of the target domain as possible.
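A minimal sketch of how the prototype space and its mapping module could be realized in PyTorch is shown below. The feature dimension, projection dimension and number of prototypes are assumptions, and the exact wiring of the preset linear layers and the aggregation step described above is simplified here to a single bias-free prototype layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeSpace(nn.Module):
    """Sketch of the prototype space on top of the truncated discriminator.

    feat_dim, proj_dim and num_prototypes are illustrative values; the exact
    arrangement of the preset linear layers in the disclosure is assumed.
    """
    def __init__(self, feat_dim=512, proj_dim=128, num_prototypes=64):
        super().__init__()
        # "mapping module of two fully connected layers"
        self.mapper = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )
        # prototype vectors c_k, held in a bias-free linear layer
        self.prototypes = nn.Linear(proj_dim, num_prototypes, bias=False)

    def forward(self, disc_features):
        z = F.normalize(self.mapper(disc_features), dim=-1)  # mapping vector z
        scores = self.prototypes(z)                          # similarity to each prototype c_k
        return z, scores
```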
In this embodiment, in step S2, inputting the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into the pre-trained prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space, includes:
mapping the target features respectively corresponding to the target generated image and the processed target training image to the prototype-space dimension with the mapping module (two fully connected layers) of the pre-trained prototype space, obtaining the mapping vectors respectively corresponding to the two images;
embedding the mapping vectors into the prototype space through a preset linear layer;
respectively inputting the vectors taken out of the prototype space into the preset linear layer, and aggregating and normalizing the vectors output by the preset linear layer with the corresponding mapping vectors;
and calculating, from the normalization result, the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image.
In the present embodiment, the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image is calculated from the normalization result by the following expression:
$L_{swap}(z_i, z_j) = \ell(z_i, q_j) + \ell(z_j, q_i)$
wherein $L_{swap}$ is the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image, $z_i$ and $z_j$ are the mapping vectors respectively corresponding to the two images, $q_i$ and $q_j$ are the normalization results corresponding to the mapping vectors, $\tau$ is a temperature coefficient used to control the importance of the terms, and $c_k$ denotes the prototype vectors in the prototype space.
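The per-term loss $\ell(\cdot,\cdot)$ is not spelled out above; under the standard swapped-prediction (SwAV-style) formulation suggested by the temperature $\tau$ and the prototypes $c_k$, it would take a form such as the following, which is a reconstruction under that assumption rather than a quotation:

$\ell(z_i, q_j) = -\sum_{k} q_j^{(k)} \log \dfrac{\exp\left(z_i^{\top} c_k / \tau\right)}{\sum_{k'} \exp\left(z_i^{\top} c_{k'} / \tau\right)}$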
In some embodiments, $q_i$ and $q_j$ are calculated with the Sinkhorn-Knopp algorithm.
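Continuing the sketch above, the codes q and the swapped prediction loss could be computed as follows; the Sinkhorn-Knopp routine follows the common SwAV-style implementation, and its hyperparameters (eps, iteration count) and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn_knopp(scores, eps=0.05, n_iters=3):
    """Soft codes q from prototype scores (B x K), SwAV-style normalization."""
    Q = torch.exp(scores / eps).t()                # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K    # normalize rows (prototypes)
        Q /= Q.sum(dim=0, keepdim=True); Q /= B    # normalize columns (samples)
    return (Q * B).t()                             # B x K

def swapped_prediction_loss(scores_i, scores_j, q_i, q_j, tau=0.1):
    """L_swap(z_i, z_j) = l(z_i, q_j) + l(z_j, q_i), with l a soft cross-entropy."""
    def l(scores, q):
        return -(q * F.log_softmax(scores / tau, dim=-1)).sum(dim=-1).mean()
    return l(scores_i, q_j) + l(scores_j, q_i)
```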
In some embodiments, referring to fig. 3, the prototype space facilitates learning the overall style shared by the target samples, but images in the same data set may still have distinct individual styles, which can hinder domain style learning, whereas text can express more general style information. The CLIP model is trained on massive data, so its latent space contains rich information. In addition, a simple, limited text description carries limited information, which is unfavorable for learning diverse and detailed styles; a GPT-3 large language generation model can therefore be employed to describe the style of the target text in a manner the CLIP model can understand, and the text label corresponding to the target generated image, which typically covers aspects such as color, shape and texture, is enhanced by GPT-3. Thus, in the present embodiment, in step S3, acquiring the text guidance loss between the source generated image and the target generated image includes:
inputting the text labels respectively corresponding to the source generated image and the target generated image into the text encoder of a CLIP model to obtain a first output result of the text encoder;
inputting the text label corresponding to the target generated image into a large language generation model to obtain an enhanced text label corresponding to the target generated image;
inputting the enhanced text label corresponding to the target generated image into the text encoder of the CLIP model to obtain a second output result of the text encoder;
respectively inputting the source generated image and the target generated image into the image encoder of the CLIP model to obtain the output result of the image encoder;
and calculating the text guidance loss between the source generated image and the target generated image from the first output result and the second output result of the text encoder and the output result of the image encoder.
In this embodiment, considering that the style characteristics contained in simple text are limited, the method uses the large language model GPT-3 to generate an enhanced version of the style text description, which guides more diverse and robust style information.
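The disclosure does not reproduce the prompt used to obtain the enhanced description; a hypothetical prompt template is sketched below, with the query function left abstract since no particular LLM API is specified.

```python
# Hypothetical prompt for obtaining the enhanced style description T_ts; the
# wording and the query_llm helper are illustrative, no specific LLM API is implied.
def build_style_prompt(target_label: str) -> str:
    return (
        f"Describe the visual style of '{target_label}' for an image generation "
        "model in one sentence, covering its color palette, shapes and texture."
    )

# enhanced_label = query_llm(build_style_prompt("a watercolor portrait"))
```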
In the present embodiment, text guidance loss between a source generated image and a target generated image is calculated from the first output result and the second output result of the text encoder and the output result of the image encoder by the following expression:
$\Delta T_i = E_T(T_t) - E_T(T_s)$
$\Delta I = E_I(I_{gen}) - E_I(I_s)$
$L_{td} = 1 - \mathrm{CosSim}(\Delta T_i, \Delta I)$
$\Delta T_s = E_T(T_t) - E_T(T_{ts})$
$L_{tsd} = 1 - \mathrm{CosSim}(\Delta T_s, \Delta I)$
wherein $I_{gen}$ and $I_s$ are the target generated image and the source generated image respectively, $\mathrm{CosSim}(\cdot,\cdot)$ is the cosine similarity of two vectors, $L_{td}$ and $L_{tsd}$ are the text guidance losses between the target and source generated images, $T_t$ and $T_s$ are the text labels respectively corresponding to the target generated image and the source generated image, $E_T$ and $E_I$ are the text encoder and the image encoder, and $T_{ts}$ is the enhanced text label corresponding to the target generated image.
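A hedged sketch of the two directional losses using the open-source CLIP package is given below; the CLIP variant ("ViT-B/32"), the single-prompt tokenization and the batching are illustrative choices, not requirements of the disclosure.

```python
import torch
import torch.nn.functional as F
import clip  # openai/CLIP package, assumed available

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def text_guidance_losses(I_src, I_gen, T_s, T_t, T_ts):
    """Directional losses L_td and L_tsd between preprocessed image batches.

    T_s / T_t are the source / target text labels and T_ts the enhanced
    target description; all names and the CLIP variant are illustrative.
    """
    def encode_text(s):
        with torch.no_grad():
            return clip_model.encode_text(clip.tokenize([s]).to(device)).float()

    dT_i = encode_text(T_t) - encode_text(T_s)                       # ΔT_i
    dT_s = encode_text(T_t) - encode_text(T_ts)                      # ΔT_s
    dI = (clip_model.encode_image(I_gen).float()
          - clip_model.encode_image(I_src).float())                  # ΔI
    L_td = 1 - F.cosine_similarity(dT_i, dI).mean()
    L_tsd = 1 - F.cosine_similarity(dT_s, dI).mean()
    return L_td, L_tsd
```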
In some embodiments, referring to fig. 4, in order to keep the generated image and the source image sufficiently similar in overall appearance and detailed structure, patches of different scales may be selected on the feature maps of the target image and the source image obtained from the target generator and the source generator. In each patch, M anchor points are selected as region centers, and a square region with side length 2r is determined around each; pixel differences are computed in turn between the eight points $b_m^j$ ($j \in \{0, 1, \dots, 7\}$) on the square boundary and the anchor point $a_m$, and an InfoNCE loss is applied to the selected positive and negative sample pairs. Therefore, in this embodiment, in step S3, acquiring the contrastive learning loss between the source generated image and the target generated image includes:
selecting regions of different scales on the feature maps of the source generated image and the target generated image respectively, wherein in each region a specified number of anchor points are selected as region centers, and a square region is determined with a specified length as the side length;
and determining the contrastive learning loss according to the difference values between the anchor point m and the j-th boundary pixel on the feature maps of the source generated image and the target generated image.
In the present embodiment, the contrastive learning loss is determined from the difference values between the anchor point m and the j-th boundary pixel on the feature maps of the source generated image and the target generated image, wherein the difference values between the anchor point m and the j-th boundary pixel are computed separately for the source generated image and the target generated image, $b_m^j$ and $a_m$ are the points on the square boundary and the anchor point, $f_s$ and $f_t$ are the feature maps of the source generated image and the target generated image respectively, the pixel difference values at consistent positions in the source generated image and the target generated image form positive sample pairs, the pixel difference values at inconsistent positions form negative sample pairs, $L_{sda}$ is the contrastive learning loss, N is the number of patches, and M is the number of anchor points.
In order to generate images that are intuitively similar to the source images (that is, with similar identity information), the method focuses on the multi-scale consistency of pixel differences at the feature-map level and learns the global and local difference structures within multi-scale regions. This addresses the problem that existing few-shot generation methods do not preserve the identity and detail information of the source image well, and it provides better preservation of global identity information and local detail information.
Further, in some embodiments, in order to smooth the loss, the value of each anchor point is taken as the average of its neighborhood, wherein $\mathrm{Neighbor}(f_{\{s,t\}}(a_m))$ is the 1-pixel neighborhood of the anchor point and $\mathrm{Avg}(\cdot)$ is the average of its argument.
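The exact aggregation is not pinned down above, so the following is a hedged sketch of one plausible implementation of the anchor/boundary pixel-difference comparison with an InfoNCE objective, including the neighborhood-averaged anchor values; the random sampling strategy, the cosine similarity and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def structure_difference_loss(f_s, f_t, num_anchors=8, r=4, tau=0.07):
    """Sketch of L_sda on one pair of feature maps f_s, f_t of shape (C, H, W).

    For each sampled anchor a_m, differences to the eight boundary points b_m^j of
    the square of side 2r form positive pairs when positions match across the
    source and target maps, and negatives otherwise (InfoNCE).  Anchor values are
    smoothed by averaging a 1-pixel neighborhood.
    """
    C, H, W = f_s.shape
    offsets = [(-r, -r), (-r, 0), (-r, r), (0, -r), (0, r), (r, -r), (r, 0), (r, r)]
    losses = []
    for _ in range(num_anchors):
        y = torch.randint(r + 1, H - r - 1, (1,)).item()
        x = torch.randint(r + 1, W - r - 1, (1,)).item()
        a_s = f_s[:, y - 1:y + 2, x - 1:x + 2].mean(dim=(1, 2))   # smoothed anchor (source)
        a_t = f_t[:, y - 1:y + 2, x - 1:x + 2].mean(dim=(1, 2))   # smoothed anchor (target)
        d_s = torch.stack([f_s[:, y + dy, x + dx] - a_s for dy, dx in offsets])  # 8 x C
        d_t = torch.stack([f_t[:, y + dy, x + dx] - a_t for dy, dx in offsets])  # 8 x C
        sim = F.cosine_similarity(d_s.unsqueeze(1), d_t.unsqueeze(0), dim=-1) / tau  # 8 x 8
        losses.append(F.cross_entropy(sim, torch.arange(8)))      # matching j is the positive
    return torch.stack(losses).mean()
```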
In this embodiment, by calculating the contrastive learning loss, the disclosure captures the structural associations between pixels at different scales, ensuring better preservation of global identity and structural details.
In the present embodiment, in step S4, the fine-tuning loss is determined as a weighted combination of the original adversarial loss, the swapped prediction loss, the text guidance losses and the contrastive learning loss, wherein $L_{swap}$ is the swapped prediction loss, $L_{td}$ and $L_{tsd}$ are the text guidance losses, the adversarial term is the original adversarial loss, $L_{sda}$ is the contrastive learning loss, $\lambda_s$, $\lambda_d$, $\lambda_t$ and $\lambda_p$ are adjustment coefficients, $G_t$ and $D_t$ are the generator and the discriminator of the fine-tuned image generation model, $x$ and $I_t$ denote the target training images, and $z$ and $p_z$ are the sampled noise and the noise space.
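A sketch of one plausible combination of the five terms is shown below; which λ coefficient weights which auxiliary loss is not specified above, so the pairing here is an assumption.

```python
def fine_tuning_loss(L_adv, L_swap, L_sda, L_td, L_tsd,
                     lam_s=1.0, lam_d=1.0, lam_t=1.0, lam_p=1.0):
    """Weighted combination of the five losses; the coefficient-to-term pairing
    is illustrative, not taken from the disclosure."""
    return L_adv + lam_p * L_swap + lam_s * L_sda + lam_t * L_td + lam_d * L_tsd
```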
Referring to fig. 5, an embodiment of the present disclosure provides an image generation apparatus based on cross-modal style learning, including:
the enhancement module 11, configured to select a target training image from a plurality of preset training images, generate a target generated image through a pre-trained generative adversarial network, and perform image enhancement processing on the target training image, where the input of the pre-trained generative adversarial network is noise from a preset noise space;
the input module 12, configured to input the target generated image and the processed target training image into the discriminator of the pre-trained generative adversarial network, input the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into a pre-trained prototype space, and determine the swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space;
the determining module 13, configured to acquire the original adversarial loss between the target training image and the target generated image, and the contrastive learning loss and text guidance loss between a source generated image and the target generated image, and determine the fine-tuning loss according to the original adversarial loss, the swapped prediction loss, the text guidance loss and the contrastive learning loss, where the source generated image is generated through the pre-trained generative adversarial network, and the inputs for the source generated image and the target generated image are noise sampled from the same noise space;
and the fine-tuning module 14, configured to fine-tune the pre-trained generative adversarial network according to the fine-tuning loss to obtain a fine-tuned image generation model, so as to generate images consistent with the preset training images based on the fine-tuned image generation model.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
In the above embodiment, any of the enhancement module 11, the input module 12, the determination module 13, and the fine adjustment module 14 may be incorporated in one module to be implemented, or any of them may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. At least one of the enhancement module 11, the input module 12, the determination module 13 and the trimming module 14 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or in hardware or firmware such as any other reasonable way of integrating or packaging the circuits, or in any one of or a suitable combination of three of software, hardware and firmware. Alternatively, at least one of the enhancement module 11, the input module 12, the determination module 13 and the fine tuning module 14 may be at least partially implemented as computer program modules which, when executed, may perform the respective functions.
Referring to fig. 6, an electronic device provided by an embodiment of the present disclosure includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 perform communication with each other through the communication bus 1140;
a memory 1130 for storing a computer program;
processor 1110, when executing the program stored in memory 1130, implements the image generation method based on cross-modal style learning as follows:
selecting a target training image from a plurality of preset training images, generating a target generated image through a pre-trained generative adversarial network, and performing image enhancement processing on the target training image, wherein the input of the pre-trained generative adversarial network is noise from a preset noise space;
inputting the target generated image and the processed target training image into the discriminator of the pre-trained generative adversarial network, inputting the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into a pre-trained prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space;
acquiring the original adversarial loss between the target training image and the target generated image, and the contrastive learning loss and text guidance loss between a source generated image and the target generated image, and determining the fine-tuning loss according to the original adversarial loss, the swapped prediction loss, the text guidance loss and the contrastive learning loss, wherein the source generated image is generated through the pre-trained generative adversarial network, and the inputs for the source generated image and the target generated image are noise sampled from the same noise space;
and fine-tuning the pre-trained generative adversarial network according to the fine-tuning loss to obtain a fine-tuned image generation model, so as to generate images consistent with the preset training images based on the fine-tuned image generation model.
The communication bus 1140 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, among others. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices described above.
The memory 1130 may include random access memory (Random Access Memory, simply RAM) or may include non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. Optionally, the memory 1130 may also be at least one storage device located remotely from the processor 1110.
The processor 1110 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processing, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium stores thereon a computer program which, when executed by a processor, implements the image generation method based on cross-modal style learning as described above.
The computer-readable storage medium may be embodied in the apparatus/means described in the above embodiments; or may exist alone without being assembled into the apparatus/device. The computer-readable storage medium described above carries one or more programs that, when executed, implement the image generation method based on cross-modal style learning according to the embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. An image generation method based on cross-modal style learning, the method comprising:
selecting a target training image from a plurality of preset training images, generating a target generated image through a pre-trained generative adversarial network, and performing image enhancement processing on the target training image, wherein the input of the pre-trained generative adversarial network is noise from a preset noise space;
inputting the target generated image and the processed target training image into the discriminator of the pre-trained generative adversarial network, inputting the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into a pre-trained prototype space, and determining a swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space;
acquiring the original adversarial loss between the target training image and the target generated image, and the contrastive learning loss and text guidance loss between a source generated image and the target generated image, and determining a fine-tuning loss according to the original adversarial loss, the swapped prediction loss, the text guidance loss and the contrastive learning loss, wherein the source generated image is generated through the pre-trained generative adversarial network, and the inputs for the source generated image and the target generated image are noise sampled from the same noise space;
and fine-tuning the pre-trained generative adversarial network according to the fine-tuning loss to obtain a fine-tuned image generation model, so as to generate images consistent with the preset training images based on the fine-tuned image generation model.
2. The method of claim 1, wherein the prototype space is trained by:
performing image enhancement on each preset training image to obtain a target training image set, selecting two images from the target training image set, and inputting the two images into the discriminator of the pre-trained generative adversarial network, wherein the discriminator does not include its last linear layer;
inputting the target features, output by the discriminator, respectively corresponding to the two selected images into a preset prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the preset prototype space;
and adjusting the parameters of the preset prototype space according to the swapped prediction loss until the swapped prediction loss is smaller than a preset threshold.
3. The method according to claim 1, wherein inputting the target features, output by the discriminator, respectively corresponding to the target generated image and the processed target training image into the pre-trained prototype space, and determining the swapped prediction loss for preserving the target style at the image level according to the output of the pre-trained prototype space, comprises:
mapping the target features respectively corresponding to the target generated image and the processed target training image to the prototype-space dimension with the mapping module (two fully connected layers) of the pre-trained prototype space, obtaining the mapping vectors respectively corresponding to the two images;
embedding the mapping vectors into the prototype space through a preset linear layer;
respectively inputting the vectors taken out of the prototype space into the preset linear layer, and aggregating and normalizing the vectors output by the preset linear layer with the corresponding mapping vectors;
and calculating, from the normalization result, the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image.
4. The method according to claim 3, characterized in that the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image is calculated from the normalization result by the following expression:
$L_{swap}(z_i, z_j) = \ell(z_i, q_j) + \ell(z_j, q_i)$
wherein $L_{swap}$ is the swapped prediction loss for preserving the target style at the image level between the target generated image and the processed target training image, $z_i$ and $z_j$ are the mapping vectors respectively corresponding to the two images, $q_i$ and $q_j$ are the normalization results corresponding to the mapping vectors, $\tau$ is a temperature coefficient used to control the importance of the terms, and $c_k$ denotes the prototype vectors in the prototype space.
5. The method of claim 1, wherein the obtaining text guidance loss between the source-generated image and the target-generated image comprises:
inputting the text labels respectively corresponding to the source generated image and the target generated image into the text encoder of a CLIP model to obtain a first output result of the text encoder;
inputting the text label corresponding to the target generated image into a large language generation model to obtain an enhanced text label corresponding to the target generated image;
inputting the enhanced text label corresponding to the target generated image into the text encoder of the CLIP model to obtain a second output result of the text encoder;
respectively inputting the source generated image and the target generated image into the image encoder of the CLIP model to obtain the output result of the image encoder;
and calculating the text guidance loss between the source generated image and the target generated image from the first output result and the second output result of the text encoder and the output result of the image encoder.
6. The method of claim 5, wherein text guidance losses between the source generated image and the target generated image are calculated from the first output result and the second output result of the text encoder and the output result of the image encoder by the following expression:
$\Delta T_i = E_T(T_t) - E_T(T_s)$
$\Delta I = E_I(I_{gen}) - E_I(I_s)$
$L_{td} = 1 - \mathrm{CosSim}(\Delta T_i, \Delta I)$
$\Delta T_s = E_T(T_t) - E_T(T_{ts})$
$L_{tsd} = 1 - \mathrm{CosSim}(\Delta T_s, \Delta I)$
wherein $I_{gen}$ and $I_s$ are the target generated image and the source generated image respectively, $\mathrm{CosSim}(\cdot,\cdot)$ is the cosine similarity of two vectors, $L_{td}$ and $L_{tsd}$ are the text guidance losses between the target and source generated images, $T_t$ and $T_s$ are the text labels respectively corresponding to the target generated image and the source generated image, $E_T$ and $E_I$ are the text encoder and the image encoder, and $T_{ts}$ is the enhanced text label corresponding to the target generated image.
7. The method of claim 1, wherein obtaining the contrast learning loss between the source generated image and the target generated image comprises:
selecting regions of different scales on the feature maps of the source generated image and the target generated image respectively, wherein in each region a designated number of anchor points are selected as region centers and a square region is determined by taking a designated length as the side length;
and determining the contrast learning loss according to the difference values between the anchor point m and the j-th boundary pixel on the feature maps of the source generated image and the target generated image.
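For illustration only: one possible realization of the region selection of claim 7, sampling a designated number of anchor points on a feature map and collecting the pixel coordinates on the boundary of a square of designated side length around each anchor. The random sampling strategy and tensor layout are assumptions.

import torch

def sample_anchors_and_boundaries(feat, num_anchors=8, side=5):
    """feat: (C, H, W) feature map. Returns anchor coordinates (M, 2) and, for each
    anchor, the coordinates of the pixels on its square boundary (M, J, 2)."""
    _, h, w = feat.shape
    half = side // 2
    ys = torch.randint(half, h - half, (num_anchors,))
    xs = torch.randint(half, w - half, (num_anchors,))
    anchors = torch.stack([ys, xs], dim=1)
    offs = torch.arange(-half, half + 1)
    ring = torch.cat([
        torch.stack([torch.full_like(offs, -half), offs], dim=1),  # top edge
        torch.stack([torch.full_like(offs, half), offs], dim=1),   # bottom edge
        torch.stack([offs, torch.full_like(offs, -half)], dim=1),  # left edge
        torch.stack([offs, torch.full_like(offs, half)], dim=1),   # right edge
    ], dim=0).unique(dim=0)                                         # drop duplicated corners
    return anchors, anchors[:, None, :] + ring[None, :, :]

anchors, boundaries = sample_anchors_and_boundaries(torch.randn(64, 32, 32))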
8. The method of claim 7, wherein the contrast learning loss is determined from the difference values between the anchor point m and the j-th boundary pixel on the feature maps of the source generated image and the target generated image,
wherein Δ^s_(m,j) and Δ^t_(m,j) are the difference values between the anchor point a_m and the point b_(m,j) on the square boundary, computed on the feature maps f_s and f_t of the source generated image and the target generated image respectively; difference values at consistent positions in the source generated image and the target generated image form positive sample pairs, difference values at inconsistent positions form negative sample pairs, the contrast learning loss L_sda contrasts the positive sample pairs against the negative sample pairs, N is the number of tiles, and M is the number of anchor points.
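For illustration only: the published text does not reproduce the exact expression of claim 8, so the sketch below uses an InfoNCE-style stand-in in which difference vectors at the same (anchor, boundary-pixel) position in the source and target feature maps are positives and all other positions of the same anchor are negatives. The function name and the temperature value are assumptions.

import torch
import torch.nn.functional as F

def patch_difference_contrastive_loss(diff_src, diff_tgt, temperature=0.07):
    """diff_src, diff_tgt: (M, J, C) difference values between anchor m and its
    j-th boundary pixel on the source and target feature maps."""
    m, j, _ = diff_src.shape
    src = F.normalize(diff_src, dim=-1)
    tgt = F.normalize(diff_tgt, dim=-1)
    logits = torch.bmm(src, tgt.transpose(1, 2)) / temperature  # (M, J, J) similarities
    labels = torch.arange(j).expand(m, j)                       # positives on the diagonal
    return F.cross_entropy(logits.reshape(m * j, j), labels.reshape(m * j))

# Usage with random difference values: 8 anchors, 16 boundary pixels, 64 channels.
loss = patch_difference_contrastive_loss(torch.randn(8, 16, 64), torch.randn(8, 16, 64))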
9. The method of claim 1, wherein the fine tuning loss is determined as a weighted combination of the original adversarial loss, the exchange prediction loss, the text guidance losses and the contrast learning loss,
wherein L_swap is the exchange prediction loss, L_td and L_tsd are the text guidance losses, L_sda is the contrast learning loss, the original adversarial loss is computed between the target training image and the target generated image, λ_s, λ_d, λ_t and λ_p are adjusting coefficient parameters that weight the individual loss terms, G_t and D_t are the generator and the discriminator of the fine-tuned image generation model, x and I_t are the target training images, and z and p_z are the sampled noise and the noise space.
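For illustration only: the published text does not reproduce the exact expression of claim 9 or state which adjusting coefficient weights which term, so the sketch below shows one plausible weighted combination; the pairing of λ_s, λ_d, λ_t and λ_p with the individual losses is an assumption.

import torch

def fine_tuning_loss(l_adv, l_swap, l_td, l_tsd, l_sda,
                     lam_p=1.0, lam_t=1.0, lam_s=1.0, lam_d=1.0):
    # Adversarial loss plus weighted exchange prediction, text guidance and
    # contrast learning losses; the coefficient-to-term pairing is assumed.
    return l_adv + lam_p * l_swap + lam_t * l_td + lam_s * l_tsd + lam_d * l_sda

# Usage with placeholder scalar losses; in practice these come from the
# discriminator, the prototype head, the CLIP-based losses and the patch loss above.
total = fine_tuning_loss(torch.tensor(0.7), torch.tensor(0.2),
                         torch.tensor(0.1), torch.tensor(0.1), torch.tensor(0.3))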
10. An image generation device based on cross-modal style learning, comprising:
the enhancement module is used for selecting a target training image from a plurality of preset training images, generating a target generated image through a pre-trained generative adversarial network, and carrying out image enhancement processing on the target training image, wherein the input of the pre-trained generative adversarial network is noise from a preset noise space;
the input module is used for inputting the target generated image and the processed target training image into a discriminator of the pre-trained generative adversarial network, inputting the target features respectively corresponding to the target generated image and the processed target training image output by the discriminator into a pre-trained prototype space, and determining the exchange prediction loss that preserves the target style at the image level according to the output of the pre-trained prototype space;
the determining module is used for acquiring the original adversarial loss between the target training image and the target generated image and the contrast learning loss and text guidance loss between the source generated image and the target generated image, and determining the fine tuning loss according to the original adversarial loss, the exchange prediction loss, the text guidance loss and the contrast learning loss, wherein the source generated image is generated through the pre-trained generative adversarial network, and the source generated image and the target generated image are generated from noise sampled from the same noise space;
and the fine tuning module is used for fine tuning the pre-trained generative adversarial network according to the fine tuning loss to obtain a fine-tuned image generation model, so as to generate images consistent with the preset training images based on the fine-tuned image generation model.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing the cross-modal style learning-based image generation method of any one of claims 1 to 9 when executing the program stored on the memory.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the cross-modal style learning-based image generation method of any one of claims 1 to 9.
CN202311265075.8A 2023-09-27 2023-09-27 Cross-modal style learning-based image generation method and device Pending CN117315090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311265075.8A CN117315090A (en) 2023-09-27 2023-09-27 Cross-modal style learning-based image generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311265075.8A CN117315090A (en) 2023-09-27 2023-09-27 Cross-modal style learning-based image generation method and device

Publications (1)

Publication Number Publication Date
CN117315090A true CN117315090A (en) 2023-12-29

Family

ID=89296653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311265075.8A Pending CN117315090A (en) 2023-09-27 2023-09-27 Cross-modal style learning-based image generation method and device

Country Status (1)

Country Link
CN (1) CN117315090A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557880A (en) * 2024-01-11 2024-02-13 深圳金三立视频科技股份有限公司 Image training data set generation method and terminal based on multi-mode characterization


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination