CN117057310A - Font generation method and device based on diffusion model - Google Patents

Font generation method and device based on diffusion model

Info

Publication number
CN117057310A
Authority
CN
China
Prior art keywords
font
noise
style
picture
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310893386.2A
Other languages
Chinese (zh)
Inventor
庄佳扬
白树海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yipin Information Technology Co ltd
Original Assignee
Shenzhen Yipin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yipin Information Technology Co ltd filed Critical Shenzhen Yipin Information Technology Co ltd
Priority to CN202310893386.2A priority Critical patent/CN117057310A/en
Publication of CN117057310A publication Critical patent/CN117057310A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

The invention discloses a font generation method and device based on a diffusion model. A target font picture is blurred with Gaussian noise until it becomes Gaussian-distributed noise; this noise is then processed by a reverse diffusion process, and a font picture with the target style is finally generated by gradually eliminating the noise. The diffusion model extracts the local style features of fonts and the semantic information of pictures; because no adversarial training is required, the problems of unstable training and mode collapse are avoided, and high-quality font pictures in multiple styles can be generated.

Description

Font generation method and device based on diffusion model
Technical Field
The present invention relates to the field of machine learning, and in particular, to a font generating method and apparatus based on a diffusion model.
Background
Personalized fonts are fonts that can be customized according to the user's preferences and style. They can reflect people's individuality, emotion and creativity, and can be used in a variety of applications such as signatures, signs, posters, cards, and the like. A complete font set with a uniform style can be generated from a subset of characters, which reduces the font development cost for designers.
Font generation methods in the prior art are all based on a structure that fuses glyph features with style features, as shown in fig. 1: first, glyph features are extracted from a standard character with one network, and style features are extracted from a style character with another network. The two kinds of features are then fused together, and finally Chinese characters with the same glyph content but a different style are reconstructed.
However, a network structure based on an auto-encoder tends to generate relatively blurred pictures, so a discriminator is usually added and the auto-encoder is turned into a GAN, improving sharpness through adversarial learning. GAN training is unstable and prone to problems such as vanishing gradients and mode collapse, so the hyperparameters and the network structure must be tuned carefully. In addition, an extra discriminator has to be trained, which increases training complexity and overhead.
In existing methods, after the parameters of the font generation algorithm are initialized, the entire font generation network is trained jointly, with no dedicated treatment of the style feature extraction part. As a result, the local style features and the semantic information of the picture cannot be fully extracted.
Disclosure of Invention
To address the above technical problems, the invention provides a font generation method and device based on a diffusion model.
In a first aspect of the present invention, there is provided a font generating method based on a diffusion model, including:
acquiring font packages, and converting the fonts in the font packages into font pictures, wherein the font packages comprise a style font package and a content font package;
extracting style features from the style reference picture with a style feature extractor, extracting font features from the font picture and the noise picture with a font feature extractor, fusing the style features and the font features in a feature fusion module, and obtaining a denoised target font picture through a feature decoder;
starting from the noise, generating a font picture with the same content as the font picture and a style matching the style reference picture by gradually removing the noise;
carrying out reverse diffusion processing on the Gaussian-distributed noise x_T, and finally generating the font picture with the target style by gradually eliminating the noise.
In one embodiment, a forward diffusion process q(x_t | x_{t-1}) carries out Gaussian noise blurring on the font picture, which comprises: carrying out the Gaussian noise blurring with the following formula,

x_t = √(1-β_t)·x_{t-1} + √(β_t)·z_t, i.e. q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I),

where β_t is a hyperparameter controlling the noise variance, taking values between 0 and 1 and satisfying β_1 < β_2 < … < β_T; t is a positive integer between 0 and T, T is the total number of diffusion steps, x_t is the noise sample of step t, I denotes the identity matrix, and z_t is a random variable following the standard normal distribution N(0, I).
In one embodiment, the reverse diffusion process is applied to the Gaussian-distributed noise x_T in the following manner:

using the forward diffusion process, gradually generating the noise sequence x_1, x_2, …, x_T from x_0;

defining a reverse denoising process p_θ(x_{t-1} | x_t, (r_1, r_2, …, r_n), c) to denoise the noise x_T until a font image is generated, where the reverse diffusion process can be expressed by the following formula:

p_θ(x_{t-1} | x_t, (r_1, r_2, …, r_n), c) = N(x_{t-1}; m_t(x_t, (r_1, r_2, …, r_n), c), v_t(x_t, (r_1, r_2, …, r_n), c)),

where θ is a parameter of the neural network, (r_1, r_2, …, r_n) are the reference character pictures providing the style features, n is an arbitrary number of reference characters, c is the picture providing the font content, and m_t(x_t, (r_1, r_2, …, r_n), c) and v_t(x_t, (r_1, r_2, …, r_n), c) are the neural-network-parameterized mean and variance functions, responsible for predicting the denoised image of the previous step from the noisy image of the current step;

finally generating the font picture with the target style by gradually eliminating the noise with the reverse denoising process;

outputting the first element x_0 of the denoising sequence as the generated data sample.
In one embodiment, the style feature extractor training process includes:
preprocessing the font picture, dividing the font picture into picture blocks, and randomly masking a part of the picture blocks;
encoding the segmented unmasked picture blocks and the masked picture blocks with a style feature extractor of a Transformer structure to obtain hidden-layer representations;
inputting the hidden-layer representations into a decoder for decoding;
calculating the mean square error between the decoder output and the original font picture, and training with a stochastic gradient descent algorithm;
the style feature extractor is trained until it can reconstruct the original picture from the unmasked picture blocks.
In a second aspect of the present invention, there is provided a font generating apparatus based on a diffusion model, comprising:
an acquisition module, used for acquiring font packages and converting the fonts in the font packages into font pictures, wherein the font packages comprise a style font package and a content font package, and for sampling a noise x_T from Gaussian-distributed noise;
a style feature extractor, a font feature extractor, a feature fusion module and a feature decoder, wherein the style feature extractor extracts style features from the style reference picture, the font feature extractor extracts font features from the font picture and the noise picture, the feature fusion module fuses the style features and the font features, the feature decoder obtains a denoised target font picture, and, starting from the noise, a font picture with the same content as the font picture and a style matching the style reference picture is generated by gradually eliminating the noise;
and a diffusion module, used for carrying out reverse diffusion processing on the Gaussian-distributed noise and finally generating the font picture with the target style by gradually eliminating the noise.
In a third aspect of the present invention, there is provided an electronic apparatus comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method according to the first aspect of the embodiments of the invention.
In a fourth aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when run by a computer, performs the method according to the first aspect of the embodiment of the invention.
According to the invention, the local style features of fonts and the semantic information of pictures are extracted on the basis of the diffusion model; no adversarial training is needed, so the problems of unstable training and mode collapse are avoided, and high-quality font pictures in multiple styles can be generated.
Drawings
Fig. 1 is a schematic diagram of a font generating method in the prior art.
Fig. 2 is a flow chart of a font generating method based on a diffusion model in an embodiment of the present invention.
FIG. 3 is a schematic diagram of training a style encoder using a masking self-encoding algorithm in an embodiment of the present invention.
Fig. 4 is a schematic diagram of font generation in which a font image is subjected to a reverse diffusion process in an embodiment of the present invention.
Fig. 5 is a schematic block diagram of a font generating device based on a diffusion model in an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Referring to fig. 1, the style feature extractor extracts the style features of style characters (for example the characters for "mountain", "tun" and "write") and inputs them into the feature fusion device; the font feature extractor extracts the glyph features of the character for "milk" and inputs them into the feature fusion device; after style transfer, the feature decoder outputs the "milk" character carrying the style features.
The invention provides a font generation method based on a diffusion model, which differs from the prior art in which the output of the feature fusion device is used directly as the font. The invention uses an asymmetric encoder-decoder configuration as the style feature encoder (i.e., feature extractor): the encoder operates only on the visible image regions, and the decoder reconstructs the original image from the hidden-layer representation and the mask tokens using a lightweight network.
The invention adopts the MAE (masked autoencoder) algorithm to pre-train the style feature encoder, which specifically comprises the following steps:
A. Preprocess the font picture to obtain segmented font blocks, part of which are masked.
B. Encode the segmented unmasked font blocks and the masked font blocks with an encoder of a Transformer structure to obtain hidden-layer representations.
C. Input the hidden-layer representations and the mask tokens into a decoder of a Transformer structure for decoding.
Specifically, as shown in fig. 3, the input image is segmented into a number of small blocks, a certain proportion of the blocks is randomly masked, and a shared learnable vector replaces the masked blocks. The unmasked blocks undergo a linear transformation and position encoding and are then fed into the encoder, a pure Transformer model, to obtain the hidden-layer representation of each block. The encoder output and the mask tokens are spliced together to restore the original sequence, which, after another linear transformation and position encoding, is fed into the decoder, a lightweight vision Transformer model, to obtain the reconstruction of each block. The decoder output is linearly transformed to the same dimension as the original blocks and reshaped back into an image, yielding the reconstructed image. The mean square error between the reconstructed image and the original image over the masked regions is used as the loss function to train the model parameters.
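This pre-training procedure can be illustrated with a minimal sketch, assuming PyTorch; the class name MaskedStyleAutoencoder, the patch size, mask ratio and layer sizes are illustrative assumptions rather than the patent's configuration. It follows the variant in which masked blocks are replaced by a shared learnable vector and the encoder processes both visible and masked blocks.

```python
# Minimal sketch of masked self-encoding pre-training for the style encoder,
# assuming PyTorch. Sizes and the class name are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedStyleAutoencoder(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=256, mask_ratio=0.5):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.num_patches = (img_size // patch) ** 2
        self.embed = nn.Linear(patch * patch, dim)                   # linear transformation of each block
        self.pos = nn.Parameter(torch.zeros(self.num_patches, dim))  # position encoding
        self.mask_token = nn.Parameter(torch.zeros(dim))             # shared learnable vector for masked blocks
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 4)
        self.decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)  # lightweight
        self.head = nn.Linear(dim, patch * patch)                    # map back to the original block dimension

    def forward(self, img):                                          # img: (B, 1, H, W) font picture
        b = img.size(0)
        blocks = img.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        blocks = blocks.reshape(b, self.num_patches, -1)             # (B, N, patch*patch)
        tokens = self.embed(blocks) + self.pos
        mask = torch.rand(b, self.num_patches, device=img.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        hidden = self.encoder(tokens)                                # hidden-layer representation
        recon = self.head(self.decoder(hidden))                      # reconstruct every block
        return ((recon - blocks) ** 2)[mask].mean()                  # MSE over the masked regions only

model = MaskedStyleAutoencoder()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)                  # stochastic gradient descent
loss = model(torch.rand(4, 1, 64, 64))
loss.backward()
opt.step()
```

After this pre-training, only the encoder is carried over to the font generation network, as described below.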
After the style feature encoder is trained as above, its parameters are loaded into the font generation network, and the font style is generated through the reverse diffusion process. This method can not only extract the local style features of the fonts but also acquire the semantic information of the pictures: since a style feature encoder trained with MAE can restore the masked portions of a font picture from its unmasked portions, the encoder has evidently acquired the semantic information of the picture.
GANs and diffusion models are two different generative models; both can be used for tasks such as image generation and have similar network structures. A GAN is an implicit density model: it does not model the data's probability distribution directly, but instead trains a generator and a discriminator adversarially so that the generator can draw samples from a simple noise distribution that are close to the true data distribution. The invention takes a diffusion model as the core algorithmic framework. A diffusion model is an explicit density model: it gradually adds noise to the data through a diffusion process until the data becomes Gaussian noise, and then gradually removes the noise through a reverse diffusion process to recover the original data. The invention therefore needs no adversarial training and avoids the problems of unstable training and mode collapse.
The goal of the MAE algorithm is to train an efficient feature extractor for downstream tasks; its decoder is only used during pre-training and does not participate in downstream tasks. The standard MAE encoder is a pure Transformer model and only processes the unmasked portion, whereas the invention sets the masked portion to a learnable noise parameter so that the style feature extractor processes both the visible and the masked parts. Furthermore, models that require the image structure to be preserved, such as the Swin Transformer and CNN models, can then be used, and the extracted style features still retain certain image-structure characteristics.
Further, after the style feature encoder training is completed, it is transplanted into the font generation network. Then, given the real font picture data distribution, a forward diffusion process is performed. The forward diffusion process starts from the real data distribution x_0 ~ q(x) and gradually adds a small amount of Gaussian noise at each step, i.e. the data sample and z_t are combined step by step in a certain proportion,

x_t = √(1-β_t)·x_{t-1} + √(β_t)·z_t,

until x_T becomes an isotropic Gaussian noise image, where t = 1, 2, …, T, T is the total number of diffusion steps, and x_t is the noise sample of step t. Here β_t is the diffusion coefficient of step t, satisfying 0 < β_1 < β_2 < … < β_T < 1, and z_t is a random variable following the standard normal distribution N(0, I). By the reparameterization trick, the above linear procedure can also be seen as sampling from a Gaussian distribution, q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I). The complete forward process, which is a posterior estimate, can be written as q(x_{1:T} | x_0) = ∏_{t=1}^{T} q(x_t | x_{t-1}).
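The forward diffusion process just described can be sketched as follows, assuming PyTorch; the linear β schedule, the value of T and the closed-form sampling via the cumulative product of (1 - β_t) are standard diffusion-model conventions assumed here, not values taken from the patent.

```python
# Hedged sketch of the forward (noising) diffusion process; 0-based indexing is
# used for the beta schedule, i.e. betas[t-1] corresponds to beta_t in the text.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # 0 < beta_1 < beta_2 < ... < beta_T < 1

def q_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1-beta_t) x_{t-1}, beta_t I)."""
    z = torch.randn_like(x_prev)                  # z_t ~ N(0, I)
    return torch.sqrt(1 - betas[t - 1]) * x_prev + torch.sqrt(betas[t - 1]) * z

def q_sample(x0, t):
    """Closed-form sample of x_t given x_0 (reparameterization trick)."""
    alphas_bar = torch.cumprod(1 - betas, dim=0)  # cumulative product of (1 - beta)
    z = torch.randn_like(x0)
    return torch.sqrt(alphas_bar[t - 1]) * x0 + torch.sqrt(1 - alphas_bar[t - 1]) * z

x_t = x0 = torch.rand(1, 1, 64, 64) * 2 - 1       # a font picture scaled to [-1, 1]
for t in range(1, T + 1):                         # x_1, ..., x_T: gradually becomes Gaussian noise
    x_t = q_step(x_t, t)
```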
from x using forward diffusion process 0 Start to gradually generate noise sequence x 1 ,x 2 ,…,x T The method comprises the steps of carrying out a first treatment on the surface of the Then the reverse diffusion process is carried outI.e. starting from the pure noise distribution, the noise is gradually removed and the original data is restored. The invention adds the reference character picture providing style characteristics and the picture providing font contents as conditions to guide the algorithm model to generate the picture with specific style and specific font.
The goal of the reverse diffusion process is to maximize the posterior probability of the data, p_θ(x_{t-1} | x_t, (r_1, r_2, …, r_n), c), i.e. given the pure noise data x_T of the last step, to find the original data x_0 matching it. To calculate p_θ(x_{t-1} | x_t, (r_1, r_2, …, r_n), c), the properties of a Markov chain can be exploited to decompose it into a product of a series of conditional probabilities, p(x_T | x_0) = p(x_1 | x_0)·p(x_2 | x_1)·…·p(x_T | x_{T-1}), and each conditional term can be written as

p_θ(x_{t-1} | x_t, (r_1, r_2, …, r_n), c) = N(x_{t-1}; m_t(x_t, (r_1, r_2, …, r_n), c), v_t(x_t, (r_1, r_2, …, r_n), c)),

where θ denotes the parameters of the neural network, (r_1, r_2, …, r_n) are the reference character pictures providing the style features, n is an arbitrary number of reference characters, c is the picture providing the font content, and m_t(x_t, (r_1, r_2, …, r_n), c) and v_t(x_t, (r_1, r_2, …, r_n), c) are the neural-network-parameterized mean and variance functions, responsible for predicting the denoised image of the previous step from the noisy image of the current step.

The font picture with the target style is finally generated by gradually eliminating the noise with the reverse denoising process; the first element x_0 of the denoising sequence is output as the generated data sample.
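A single step of this conditional reverse denoising process can be sketched as follows, assuming PyTorch; mean_net and var_net are hypothetical placeholders for the neural-network-parameterized mean and variance functions m_t and v_t.

```python
# Hedged sketch of one step of p_theta(x_{t-1} | x_t, (r_1, ..., r_n), c).
import torch

def p_theta_step(x_t, t, refs, content, mean_net, var_net):
    """Sample x_{t-1} ~ N(m_t(x_t, refs, c), v_t(x_t, refs, c))."""
    mean = mean_net(x_t, t, refs, content)          # predicted mean of the less-noisy image
    var = var_net(x_t, t, refs, content)            # predicted (diagonal) variance
    z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)  # no extra noise at the last step
    return mean + torch.sqrt(var) * z
```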
As shown in fig. 2, the present invention is specifically implemented by the following steps:
step 110: and acquiring a font package, and converting the fonts into font pictures, wherein the font package comprises a style font package and a font style package. Standard font library font packages, typically songzhi or bold containing all simplified and traditional characters, are selected as inputs for providing the content of the characters. The bold is selected as a font packet for providing font contents, and then the font packet is arbitrarily selected to generate font pictures of the same style as a style reference picture.
Step 120: As shown in fig. 3, using the masked self-encoding algorithm, the font picture is divided into small blocks and some of the blocks are masked; the style feature encoder and decoder are trained by computing the mean square error and applying gradient descent until they can restore the entire font picture from the unmasked portions.
In the invention, the masked self-encoding algorithm is adopted to train the style encoder, so the local style features and the semantic information of the pictures can be extracted; no adversarial training is needed, so the problems of unstable training and mode collapse are avoided; high-quality font pictures in multiple styles can be generated; and stochastic gradients are used to sample the data distribution, improving sampling efficiency and accuracy.
Step 130: Given the real font picture data distribution, the forward diffusion process is performed: small amounts of Gaussian noise are gradually added to the real data distribution until it becomes an isotropic Gaussian noise image.
Step 140: The reverse diffusion process is shown in fig. 4: style features are extracted from the style reference pictures with the style feature extractor, font features are extracted from the font content and the noisy picture with the font feature extractor, the style features and the font features are fused in the feature fusion module, and the decoder then generates a denoised picture.
Starting from the pure noise distribution and gradually eliminating the noise, a font picture is generated whose content is the same as the Hei (bold) font picture and whose style matches the style reference pictures.
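The denoising network of Step 140 can be sketched at a high level as follows, assuming PyTorch. The internal layer choices (multi-head attention as the feature fusion module, a small MLP as the feature decoder), the fixed 64x64 resolution and the class name FontDenoiser are illustrative assumptions; the patent does not specify these internals.

```python
# Hedged architectural sketch of Fig. 4: style feature extractor + font feature
# extractor -> feature fusion module -> feature decoder -> denoised font picture.
import torch
import torch.nn as nn

class FontDenoiser(nn.Module):
    def __init__(self, style_encoder, font_encoder, dim=256):
        super().__init__()
        self.style_encoder = style_encoder        # pre-trained with masked self-encoding
        self.font_encoder = font_encoder          # sees the content picture and the noisy picture
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)   # feature fusion module
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 64 * 64))  # feature decoder

    def forward(self, x_t, t, style_refs, content_pic):
        style_feats = torch.cat([self.style_encoder(r) for r in style_refs], dim=1)  # (B, n*N, dim)
        font_feats = self.font_encoder(torch.cat([content_pic, x_t], dim=1), t)      # (B, N, dim)
        fused, _ = self.fuse(font_feats, style_feats, style_feats)   # content tokens attend to style tokens
        return self.decoder(fused.mean(dim=1)).view(-1, 1, 64, 64)   # denoised picture for this step
```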
The reverse diffusion model requires a certain amount of training data, which comes from already existing font packages. The algorithm model is trained on the existing data until it is able to generate a complete font package from a portion of the font pictures.
The network parameters of the reverse diffusion model are trained using variational inference, maximum likelihood estimation, or other optimization methods. This is achieved by minimizing a loss function such as L1 Loss, L2 Loss, or Smooth L1 Loss; L1 Loss is used here. The loss function measures the difference between the data generated by the reverse diffusion process and the data produced by the forward diffusion process.
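A training step with the L1 loss can be sketched as follows, reusing the hypothetical q_sample and FontDenoiser helpers above and assuming PyTorch; the random-timestep sampling and the choice of the clean picture as the regression target are illustrative assumptions.

```python
# Hedged sketch of one optimization step with L1 Loss.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, style_refs, content_pic, T=1000):
    t = torch.randint(1, T + 1, (1,)).item()           # random diffusion step
    x_t = q_sample(x0, t)                               # forward-diffused (noised) font picture
    x0_pred = model(x_t, t, style_refs, content_pic)    # reverse-process prediction (denoised picture)
    loss = F.l1_loss(x0_pred, x0)                       # L1 difference to the real font picture
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```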
A font picture is then generated with the trained network: noise data are sampled from the noise distribution, and the transition kernel of the reverse diffusion process is applied step by step (each intermediate result being an incompletely denoised noisy picture) until a font picture is obtained whose content matches the Hei (bold) font picture and whose style matches the style reference pictures.
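The generation procedure of this paragraph corresponds to the following sketch, which reuses the hypothetical p_theta_step helper from the reverse-diffusion sketch above; the shapes and T are illustrative assumptions.

```python
# Hedged sketch of font picture generation with a trained network.
import torch

@torch.no_grad()
def generate(mean_net, var_net, style_refs, content_pic, T=1000, shape=(1, 1, 64, 64)):
    x_t = torch.randn(shape)                            # sample x_T from the Gaussian noise distribution
    for t in range(T, 0, -1):                           # apply the transition kernel step by step
        x_t = p_theta_step(x_t, t, style_refs, content_pic, mean_net, var_net)
    return x_t                                          # x_0: content of the Hei font picture, style of the references
```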
As shown in fig. 5, a font generating device based on a diffusion model includes:
an acquisition module, used for acquiring font packages and converting the fonts in the font packages into font pictures, wherein the font packages comprise a style font package and a content font package, and for sampling a noise x_T from Gaussian-distributed noise;
a style feature extractor, a font feature extractor, a feature fusion module and a feature decoder, wherein the style feature extractor extracts style features from the style reference picture, the font feature extractor extracts font features from the font picture and the noise picture, the feature fusion module fuses the style features and the font features, the feature decoder obtains a denoised target font picture, and, starting from the noise, a font picture with the same content as the font picture and a style matching the style reference picture is generated by gradually eliminating the noise;
and a diffusion module, used for carrying out reverse diffusion processing on the Gaussian-distributed noise and finally generating the font picture with the target style by gradually eliminating the noise.
The present invention also provides an electronic device including:
at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the diffusion model-based font generation method described above.
The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the above-described diffusion model-based font generation method.
It is understood that the computer-readable storage medium may include: any entity or device capable of carrying a computer program, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), a software distribution medium, and so forth. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form, among others.
In some embodiments of the present invention, the apparatus may include a controller, which is a single-chip microcomputer integrating a processor, a memory, a communication module, etc. The processor may refer to the processor comprised in the controller. The processor may be a central processing unit (CPU), but may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention also includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A font generation method based on a diffusion model, comprising:
acquiring font packages, and converting the fonts in the font packages into font pictures, wherein the font packages comprise a style font package and a content font package;
extracting style features from the style reference picture with a style feature extractor, extracting font features from the font picture and the noise picture with a font feature extractor, fusing the style features and the font features in a feature fusion module, and obtaining a denoised target font picture through a feature decoder;
starting from the noise, generating a font picture with the same content as the font picture and a style matching the style reference picture by gradually eliminating the noise;
carrying out reverse diffusion processing on the Gaussian-distributed noise x_T, and finally generating the font picture with the target style by gradually eliminating the noise.
2. The diffusion model-based font generation method according to claim 1, characterized in that a forward diffusion process q(x_t | x_{t-1}) carries out Gaussian noise blurring on the font picture, comprising: carrying out the Gaussian noise blurring with the following formula,

x_t = √(1-β_t)·x_{t-1} + √(β_t)·z_t, i.e. q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I),

where β_t is a hyperparameter controlling the noise variance, taking values between 0 and 1 and satisfying β_1 < β_2 < … < β_T; t is a positive integer between 0 and T, T is the total number of diffusion steps, x_t is the noise sample of step t, I denotes the identity matrix, and z_t is a random variable following the standard normal distribution N(0, I).
3. The diffusion model-based font generation method according to claim 1, wherein the reverse diffusion process is applied to the Gaussian-distributed noise x_T in the following manner:
using the forward diffusion process, gradually generating the noise sequence x_1, x_2, …, x_T from x_0;
defining a reverse denoising process p_θ(x_{t-1} | x_t, (r_1, r_2, …, r_n), c) to denoise the noise x_T until the target font image is generated, where the reverse diffusion process can be expressed by the following formula:
p_θ(x_{t-1} | x_t, (r_1, r_2, …, r_n), c) = N(x_{t-1}; m_t(x_t, (r_1, r_2, …, r_n), c), v_t(x_t, (r_1, r_2, …, r_n), c)),
where θ is a parameter of the neural network, (r_1, r_2, …, r_n) are the reference character pictures providing the style features, n is an arbitrary number of reference characters, c is the picture providing the font content, and m_t(x_t, (r_1, r_2, …, r_n), c) and v_t(x_t, (r_1, r_2, …, r_n), c) are the neural-network-parameterized mean and variance functions, responsible for predicting the denoised image of the previous step from the noisy image of the current step;
finally generating the font picture with the target style by gradually eliminating the noise with the reverse denoising process;
outputting the first element x_0 of the denoising sequence as the generated data sample.
4. The diffusion model-based font generation method of claim 1, wherein the style feature extractor training process comprises:
preprocessing the font picture, dividing the font picture into picture blocks, and randomly masking a part of the picture blocks;
encoding the segmented unmasked picture blocks and the masked picture blocks with a style feature extractor of a Transformer structure to obtain hidden-layer representations;
inputting the hidden-layer representations into a decoder for decoding;
calculating the mean square error between the decoder output and the original font picture, and training with a stochastic gradient descent algorithm;
the style feature extractor is trained until it can reconstruct the original picture from the unmasked picture blocks.
5. A diffusion model-based font generating apparatus, comprising:
an acquisition module, used for acquiring font packages and converting the fonts in the font packages into font pictures, wherein the font packages comprise a style font package and a content font package, and for sampling a noise x_T from Gaussian-distributed noise;
a style feature extractor, a font feature extractor, a feature fusion module and a feature decoder, wherein the style feature extractor extracts style features from the style reference picture, the font feature extractor extracts font features from the font picture and the noise picture, the feature fusion module fuses the style features and the font features, the feature decoder obtains a denoised target font picture, and, starting from the noise, a font picture with the same content as the font picture and a style matching the style reference picture is generated by gradually eliminating the noise;
and a diffusion module, used for carrying out reverse diffusion processing on the Gaussian-distributed noise and finally generating the font picture with the target style by gradually eliminating the noise.
6. An electronic device, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-4.
7. A computer-readable storage medium, on which a computer program is stored, which, when being run by a computer, performs the method according to any one of claims 1 to 4.
CN202310893386.2A 2023-07-20 2023-07-20 Font generation method and device based on diffusion model Pending CN117057310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310893386.2A CN117057310A (en) 2023-07-20 2023-07-20 Font generation method and device based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310893386.2A CN117057310A (en) 2023-07-20 2023-07-20 Font generation method and device based on diffusion model

Publications (1)

Publication Number Publication Date
CN117057310A true CN117057310A (en) 2023-11-14

Family

ID=88663567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310893386.2A Pending CN117057310A (en) 2023-07-20 2023-07-20 Font generation method and device based on diffusion model

Country Status (1)

Country Link
CN (1) CN117057310A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351227A (en) * 2023-11-27 2024-01-05 西交利物浦大学 Training of alpha-bone character picture generation model, and alpha-bone character picture generation method and device
CN117351227B (en) * 2023-11-27 2024-03-08 西交利物浦大学 Training of alpha-bone character picture generation model, and alpha-bone character picture generation method and device

Similar Documents

Publication Publication Date Title
CN107644006B (en) Automatic generation method of handwritten Chinese character library based on deep neural network
Jo et al. Sc-fegan: Face editing generative adversarial network with user's sketch and color
CN111079532B (en) Video content description method based on text self-encoder
US20220121932A1 (en) Supervised learning techniques for encoder training
CN110222588A (en) A kind of human face sketch image aging synthetic method, device and storage medium
US11288851B2 (en) Signal change apparatus, method, and program
CN111241789A (en) Text generation method and device
CN117057310A (en) Font generation method and device based on diffusion model
US20220148188A1 (en) System and method for automated simulation of teeth transformation
CN116363261A (en) Training method of image editing model, image editing method and device
CN112183492A (en) Face model precision correction method, device and storage medium
Uddin et al. A perceptually inspired new blind image denoising method using L1 and perceptual loss
CN116188912A (en) Training method, device, medium and equipment for image synthesis model of theme image
CN117522697A (en) Face image generation method, face image generation system and model training method
CN105069767A (en) Image super-resolution reconstruction method based on representational learning and neighbor constraint embedding
CN113065561A (en) Scene text recognition method based on fine character segmentation
CN111161266A (en) Multi-style font generation method of variational self-coding machine based on vector quantization
Moeller et al. Image denoising—old and new
CN115810215A (en) Face image generation method, device, equipment and storage medium
CN111325068B (en) Video description method and device based on convolutional neural network
CN117474796B (en) Image generation method, device, equipment and computer readable storage medium
Kong et al. Extracting generic features of artistic style via deep convolutional neural network
CN117788629B (en) Image generation method, device and storage medium with style personalization
US20240127510A1 (en) Stylized glyphs using generative ai
US20240135610A1 (en) Image generation using a diffusion model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination