CN118014044B - Diffusion model quantization method and device based on multi-base binarization - Google Patents


Info

Publication number
CN118014044B
CN118014044B (application CN202410411724.9A; earlier publication CN118014044A)
Authority
CN
China
Prior art keywords
diffusion model
binarization
diffusion
base
quantization method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410411724.9A
Other languages
Chinese (zh)
Other versions
CN118014044A
Inventor
Xianglong Liu (刘祥龙)
Xingyu Zheng (郑星宇)
Xudong Ma (马旭栋)
Haojie Hao (郝昊杰)
Jinyang Guo (郭晋阳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202410411724.9A priority Critical patent/CN118014044B/en
Publication of CN118014044A publication Critical patent/CN118014044A/en
Application granted granted Critical
Publication of CN118014044B publication Critical patent/CN118014044B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a diffusion model quantization method and device based on multi-base binarization. The quantization method comprises the following steps: in forward propagation, a learnable multi-base binarizer is used to enhance the binarized weights of the diffusion model; in back propagation, the computation of the multiple bases in the multi-base binarizer is performed in parallel; and in the training process, the binarized diffusion model is made to mimic the representation results of the full-precision diffusion model by mimicking the full-precision representation in a low-rank space. Compared with the prior art, the invention provides significant performance advantages and demonstrates the great potential of deploying diffusion models on edge hardware.

Description

Diffusion model quantization method and device based on multi-base binarization
Technical Field
The invention relates to a diffusion model quantization method based on multi-base binarization, and also relates to a corresponding diffusion model quantization device, belonging to the technical field of neural network quantization.
Background
Diffusion models (DM) exhibit excellent capabilities in generation tasks across various fields, and have become one of the most popular generative model paradigms by virtue of their superior generation quality and diversity. A diffusion model includes two processes: a forward process, also known as the diffusion process; and a reverse process, which generates data samples and is therefore also known as the generation process.
However, the generation process of diffusion models is slow: iterative sampling over as many as thousands of steps slows inference and relies on expensive hardware resources, which constitutes a significant challenge to widespread deployment. Quantization and binarization are popular neural network compression methods that can quantize the full-precision parameters of a neural network to a lower number of bits (e.g., 1-8 bits). By converting floating-point weights and activations into quantized values, the model size of the neural network can be reduced and its computational complexity lowered, thereby significantly improving inference speed, saving memory, and effectively reducing energy consumption. In the prior art, quantization methods for diffusion models generally fall into two categories: post-training quantization and quantization-aware training. Post-training quantization, as a training-free approach, is considered the more practical solution; it obtains a quantized model at lower cost by searching for optimal scaling-factor candidates and optimizing calibration strategies. However, the generation quality of a diffusion model quantized post-training can degrade significantly. Quantization-aware training therefore emerged to improve the accuracy of the diffusion model after quantization. Diffusion models obtained by quantization-aware training are typically more accurate than those obtained by post-training quantization, thanks to a training/fine-tuning process with sufficient data and training resources. However, 1-bit quantization (i.e., binarization) of diffusion model weights remains far from being achieved.
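To make the bit-width discussion concrete, a generic symmetric uniform quantizer (not the specific scheme of the invention) can be sketched in a few lines; the helper name and the per-tensor scale are illustrative assumptions:

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization: map floats onto 2^bits integer levels,
    then de-quantize back to simulate the low-bit model."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = np.abs(x).max() / qmax             # one scale per tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                           # de-quantized approximation

w = np.array([0.50, -1.00, 0.25, 0.75])
w8 = quantize_symmetric(w, 8)                  # 8-bit: near-lossless
w1 = np.sign(w) * np.abs(w).mean()             # 1-bit (binarization): coarse
```

At 8 bits the round-trip error is negligible, while collapsing each weight to 1 bit leaves only a sign and a shared scale, which is the regime the invention targets.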
The Chinese invention application with application number 202311540202.0 discloses a lightweighting method for diffusion models, comprising the following steps: obtaining target data and a pre-trained teacher model; inputting the target data into the teacher model to obtain first results respectively output by a plurality of intermediate blocks; inputting the first result of each intermediate block into the candidate block corresponding to that intermediate block, to obtain a plurality of second results output by each candidate block under different path operations; performing a block search on each candidate block according to the first result output by each intermediate block and the second results output by its corresponding candidate blocks, to obtain a target block corresponding to each candidate block; generating a subnet from the target blocks; retraining the subnet to obtain a trained target subnet; and establishing a target diffusion model according to the target subnet.
Disclosure of Invention
The primary technical problem to be solved by the invention is to provide a diffusion model quantization method based on multi-base binarization.
Another technical problem to be solved by the present invention is to provide a diffusion model quantization device based on multi-base binarization.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
according to a first aspect of an embodiment of the present invention, there is provided a diffusion model quantization method based on multi-base binarization, including the steps of:
In forward propagation, a learnable multi-base binarizer is used to enhance the binarized weights of the diffusion model; in back propagation, the computation of the multiple bases in the multi-base binarizer is implemented in parallel; wherein the learnable multi-base binarizer is represented by:

ŵ = α₁·sign(w) + α₂·sign(w − α₁·sign(w))

where ŵ represents the approximate value after binary quantization, w represents the original weight before quantization, α₁ and α₂ are learnable scalars whose initial values are respectively set to ‖w‖₁/n and ‖w − α₁·sign(w)‖₁/n (n being the number of weight elements), and ‖·‖₁ denotes the ℓ₁ norm;
In the training process, the binary diffusion model is made to simulate the representation result of the full-precision diffusion model by simulating the full-precision representation in a low-rank space.
Wherein preferably in the noise estimation network of the diffusion model a learnable multi-base binarizer is used only when the feature scale is greater than or equal to half the input scale.
Wherein preferably, in said back propagation, a straight-through estimator is adopted to approximate the sign function.
Wherein preferably the full precision representation is simulated in a low rank space comprising the sub-steps of:
the full-precision diffusion model is grouped by adopting a time step embedding module, and the intermediate representation result is projected to a low-rank space by using principal component analysis.
Wherein preferably said time step embedding module consists of a residual convolution and a transformer block.
Preferably, in the early stage of the training process, a gradual binarization strategy is adopted to enable the binarization diffusion model to stably converge.
Wherein preferably, the progressive binarization strategy comprises the following sub-steps:
in the initial iteration, the first ⌊N/2⌋ time step embedding modules of the diffusion model are quantized; in each following iteration, the number of remaining un-binarized time step embedding modules is halved, binarizing the corresponding modules, until all time step embedding modules are binarized, where ⌊·⌋ represents the floor (round-down) function and N represents the number of time step embedding modules in the noise estimation network of the diffusion model.
According to a second aspect of an embodiment of the present invention, there is provided a diffusion model quantization apparatus based on multi-base binarization, including a processor and a memory; wherein,
The memory is coupled to the processor for storing a computer program that, when executed by the processor, causes the processor to implement the diffusion model quantization method based on multi-base binarization described above.
Compared with the prior art, the diffusion model quantization method based on multi-base binarization (BinaryDM) provides significant performance advantages across various models and datasets. Under the same 1-bit weight, and especially under ultra-low-bit activation, the present diffusion model quantization method consistently outperforms the baselines of the pixel-space diffusion model (DDIM) and the latent-space diffusion model (LDM). For example, on DDIM on CIFAR-10 32×32, the accuracy index of the present diffusion model quantization method is higher than the baseline by 49.04, preventing the binarized diffusion model from collapsing. As a leading binarization method for diffusion models, the present diffusion model quantization method achieves dramatic 16.0-fold and 27.1-fold savings in FLOPs (floating-point operations) and model size, demonstrating the great advantages and potential of deploying diffusion models on edge hardware (e.g., cell phones, smartwatches, etc.).
Drawings
FIG. 1 is a logic frame diagram of a diffusion model quantization method according to an embodiment of the present invention;
FIG. 2 is a filter-by-filter (channel) comparison of binarized weights in a convolutional layer in accordance with an embodiment of the present invention;
FIG. 3 is a diagram showing a comparison of BinaryDM and baseline details in generating an image, in accordance with an embodiment of the present invention;
FIG. 4 is a graph showing, for different distillation loss functions, the ℓ₂ distance between the output features of each block of the full-precision diffusion model and the binarized diffusion model, in accordance with an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a diffusion model quantization apparatus according to an embodiment of the present invention.
Detailed Description
The technical contents of the present invention will be described in detail with reference to the accompanying drawings and specific examples.
As shown in fig. 1, the diffusion model quantization method provided by the embodiment of the present invention mainly includes two technical contents: (1) A learnable multi-base binarizer (Learnable Multi-basis Binarizer, abbreviated LMB) is introduced in the forward and backward propagation to recover the representation results generated by the binarized diffusion model. The LMB applies at least two sets of binary bases with learnable scalars to enhance the feature-extraction capability of the weights. (2) A low-rank representation mimicking (Low-rank Representation Mimicking, abbreviated LRM) solution is introduced in the optimization process to enhance the binarization-aware optimization of the diffusion model. The LRM projects the binarized and full-precision representations into a low-rank space, enabling the optimization of the binarized diffusion model to focus on the principal directions and to proceed stably under fine-grained supervision. On this basis, the present diffusion model quantization method further applies a progressive initialization strategy in the early training stage so that optimization starts from a position that readily converges.
In the following, a baseline of a binarized diffusion model is first introduced to facilitate an understanding of the improvements of the present invention over the prior art. During the forward process of the diffusion model, T steps of Gaussian noise controlled by a variance schedule β₁, …, β_T are superimposed on the data x₀ ~ q(x₀); this process can be expressed as:

q(x_{1:T} | x₀) = ∏_{t=1}^{T} q(x_t | x_{t−1}),  q(x_t | x_{t−1}) = N(x_t; √(1−β_t)·x_{t−1}, β_t·I)   (1)

where x_t represents the t-th step noise sample. The reverse process aims at generating samples by removing noise, using a learned distribution p_θ(x_{t−1} | x_t) to approximate the unavailable conditional distribution q(x_{t−1} | x_t), which can be expressed as:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))   (2)
The mean μ_θ(x_t, t) and variance Σ_θ(x_t, t) can be derived using the existing reparameterization technique:

μ_θ(x_t, t) = (1/√α_t)·(x_t − (β_t/√(1−ᾱ_t))·ε_θ(x_t, t)),  Σ_θ(x_t, t) = σ_t²·I,  where α_t = 1−β_t and ᾱ_t = ∏_{s=1}^{t} α_s   (3)

where ε_θ denotes the noise estimation network with learnable parameters θ, which predicts the noise ε from x_t. For the training of diffusion models, a simplified variational lower bound is typically employed as the loss function to achieve high-quality samples, which can be expressed as:

L_simple = E_{t, x₀, ε} [ ‖ε − ε_θ(x_t, t)‖₂² ]   (4)
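The noise-superposition step and the simplified noise-prediction loss described above can be sketched as follows (the shapes, the β schedule, and the trivial zero predictor are illustrative assumptions, not the patent's code):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample the t-th noisy version of x0:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar = np.cumprod(1.0 - betas)[t]          # cumulative product of alphas
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

def simple_loss(eps, eps_pred):
    """Simplified variational-bound loss: mean squared noise-prediction error."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))               # toy "image"
betas = np.linspace(1e-4, 0.02, 1000)          # a common linear schedule
xt, eps = forward_diffuse(x0, t=500, betas=betas, rng=rng)
loss = simple_loss(eps, np.zeros_like(eps))    # trivial predictor for illustration
```

The noise estimation network would replace the zero predictor; training minimizes this loss over random t, x₀, and ε.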
Quantization compresses and accelerates the noise estimation model by discretizing the weights and activation values into low bit-widths. In the baseline of the binarized diffusion model, the weight w is binarized to 1 bit:

ŵ = σ·sign(w),  sign(w) = +1 if w ≥ 0, −1 otherwise   (5)

where the sign function limits w to +1 and −1 with a threshold of 0, ŵ represents the binarized weight, ε_θ̂ denotes the noise estimation network of the diffusion model with binarized parameters, and σ is a floating-point scalar initialized to ‖w‖₁/n (n representing the number of weight elements).
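A minimal sketch of this baseline binarizer, with the scalar σ initialized to ‖w‖₁/n (the helper name is assumed):

```python
import numpy as np

def binarize_baseline(w: np.ndarray) -> np.ndarray:
    """Baseline binarizer: w_hat = sigma * sign(w), threshold 0,
    sigma initialized to ||w||_1 / n."""
    sigma = np.abs(w).mean()                   # ||w||_1 / n
    b = np.where(w >= 0.0, 1.0, -1.0)          # sign with threshold 0
    return sigma * b

w = np.array([0.5, -1.0, 2.0, -0.5])
w_hat = binarize_baseline(w)                   # sigma = 1.0 for this w
```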
In the current basic binarization process, weights are quantized to binary values to save storage and computational resources at inference time; activations may also be quantized to save further resources. However, the extreme discretization of weights to 1 bit in a diffusion model leads to significant deterioration of the generated results. Limiting the bit width of each weight element to a fraction of the original significantly reduces the richness of the filters or linear projections formed by these weights, so the ability of the diffusion model to extract features from the input is greatly compromised. This compromised extraction capability, coupled with the numerical discretization of the weights, makes it challenging to accurately preserve details in the representation results, which is critical for retaining the complex details and textures of synthesized data in the diffusion model. Thus, the current basic binarization process results in significant degradation of the quality of the diffusion model representation.
To solve this problem, in the embodiment of the present invention, a learnable multi-base binarizer (abbreviated LMB) is first proposed to enhance the binarized weights of the diffusion model and synthesize high-quality representation results. In forward propagation, the LMB is defined as follows:

ŵ = LMB(w) = α₁·sign(w) + α₂·sign(w − α₁·sign(w))   (6)

where ŵ represents the approximate value after binary quantization, w represents the original full-precision weight before quantization, and α₁ and α₂ are learnable scalars whose initial values are respectively set to ‖w‖₁/n and ‖w − α₁·sign(w)‖₁/n, ‖·‖₁ denoting the ℓ₁ norm.
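Under these definitions, a two-base LMB forward pass with residual initialization can be sketched as follows; the second basis acting on the residual w − α₁·sign(w) is inferred from the initialization described above, so treat the structure as an illustrative reading rather than the patent's exact formula:

```python
import numpy as np

def sign(x):
    return np.where(x >= 0.0, 1.0, -1.0)

def lmb_init(w):
    """Residual initialization: alpha1 = ||w||_1/n for the first basis,
    alpha2 = ||w - alpha1*sign(w)||_1/n for the residual basis."""
    alpha1 = np.abs(w).mean()
    alpha2 = np.abs(w - alpha1 * sign(w)).mean()
    return alpha1, alpha2

def lmb_forward(w, alpha1, alpha2):
    b1 = sign(w)
    b2 = sign(w - alpha1 * b1)                 # second basis on the residual
    return alpha1 * b1 + alpha2 * b2

w = np.array([0.8, -0.2, 0.4, -1.0])
a1, a2 = lmb_init(w)                           # a1 = 0.6, a2 = 0.3
w_hat = lmb_forward(w, a1, a2)                 # [0.9, -0.3, 0.3, -0.9]
```

On this toy tensor the two-base approximation cuts the total error relative to the single-base baseline α₁·sign(w), illustrating why the extra basis restores representation richness.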
The inference of a layer binarized by the LMB involves the computation of multiple bases. For example, in a binarized diffusion model, the inference process of a convolution is:

o = a ⊛ ŵ = α₁·(a ⊛ b₁) + α₂·(a ⊛ b₂),  b₁ = sign(w),  b₂ = sign(w − α₁·b₁)   (7)

where a represents the activation and ⊛ represents a convolution composed of multiply and add instructions. In the inference process, since the multiple bases of a given LMB are independent of one another, their computation can be parallelized, thereby accelerating the inference of the binarized diffusion model.
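Because the basis terms share no intermediate results, inference decomposes into independent binary products. An illustrative numpy sketch for a linear layer (real deployments would use bitwise XNOR/popcount kernels):

```python
import numpy as np

def sign(x):
    return np.where(x >= 0.0, 1.0, -1.0)

def lmb_linear(a, w, alpha1, alpha2):
    """y = alpha1 * (a @ b1) + alpha2 * (a @ b2); the two binary products
    are independent and can be dispatched in parallel."""
    b1 = sign(w)
    b2 = sign(w - alpha1 * b1)
    return alpha1 * (a @ b1) + alpha2 * (a @ b2)

rng = np.random.default_rng(1)
a = rng.standard_normal((2, 8))                # activations
w = rng.standard_normal((8, 3))                # full-precision weights
alpha1 = np.abs(w).mean()
alpha2 = np.abs(w - alpha1 * sign(w)).mean()
y = lmb_linear(a, w, alpha1, alpha2)
```

By linearity, the per-base decomposition equals a single product against the reconstructed ŵ, so parallelizing the bases changes the schedule, not the result.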
In the back propagation of the LMB, the gradients of the learnable scalars are calculated as follows:

∂L/∂α₁ = Σ (∂L/∂ŵ)·sign(w)   (8)
∂L/∂α₂ = Σ (∂L/∂ŵ)·sign(w − α₁·sign(w))   (9)

where ∂ denotes the partial derivative. In back propagation, embodiments of the present invention employ a straight-through estimator (Straight Through Estimator) to approximate the sign function. By having binary bases with multiple learnable scalars, the representation ability of the quantized weights is significantly enhanced, and residual initialization allows optimization of the binarized diffusion model to begin from a state with minimal error. Compared with the baseline of the binarized diffusion model, the representation of the weights is significantly diversified by the LMB; a filter-by-filter (channel) comparison of binarized weights in a convolutional layer is shown in FIG. 2. Owing to the rich feature representation of the LMB, the diffusion model quantization method provided by the embodiment of the invention achieves more complete expression on the basis of overall denoising, as shown by the various details in FIG. 3.
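The straight-through estimator mentioned above can be sketched as follows: the forward pass uses sign, while the backward pass lets gradients through unchanged, commonly clipped to |w| ≤ 1 (the clipping window is an assumption here):

```python
import numpy as np

def sign_forward(w):
    """Forward: hard sign with threshold 0."""
    return np.where(w >= 0.0, 1.0, -1.0)

def sign_backward_ste(grad_out, w, clip=1.0):
    """Straight-through estimator: treat d sign(w)/dw as 1 for |w| <= clip,
    and 0 outside, so gradients pass through the non-differentiable sign."""
    return grad_out * (np.abs(w) <= clip)

w = np.array([0.5, -2.0, 0.9, -0.1])
g = sign_backward_ste(np.ones_like(w), w)      # gradient blocked only at -2.0
```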
It should be noted that the inventors apply LMBs only at critical locations of the diffusion model, while maintaining compact basic binarization at other locations, avoiding unnecessary overhead and achieving a balance between accuracy and efficiency. For example, in one embodiment of the invention, in the noise estimation network of the diffusion model, the inventors apply the LMB only when the feature scale is greater than or equal to half the input scale, i.e., the first six layers and the last six layers of the U-Net. At other locations, the quantization scheme remains consistent with the baseline, as shown in equation (5). Compared with directly applying LMBs at all locations, this configuration significantly reduces weight storage, by 41.5%. Thus, the LMB significantly improves the binarized diffusion model with little additional burden.
On the other hand, direct training of the binarized diffusion model presents convergence difficulties due to the extreme weight binarization function (and the activation quantization function). Thus, the inventors applied representation mimicking to assist the training of the binarized diffusion model: during training, the intermediate and output representations of the full-precision diffusion model and the binarized diffusion model are aligned.
However, there is a problem in directly aligning the intermediate representation results of the binarized and full-precision diffusion models during the optimization process. First, fine-grained alignment of high-dimensional representations can lead to blurring of the direction of optimization of the diffusion model, especially when introducing simulations of intermediate features. Second, intermediate features in the binarized diffusion model are derived from discrete potential space compared to the full-precision diffusion model, as discretization of weights (and activations) makes it difficult to directly mimic the full-precision diffusion model.
Thus, in one embodiment of the invention, a low-rank representation mimicking solution is presented: the binarized diffusion model is effectively optimized by mimicking the full-precision representation in a low-rank space. Specifically, the inventors grouped the full-precision diffusion model ε_θ by time step embedding modules (each consisting of a residual convolution and a transformer block); the intermediate representation result of the l-th module can be recorded as R_l ∈ R^{h×w×c}, where h, w, c represent the height, width, and channel dimensions, respectively. Next, the intermediate representation results are projected into a low-rank space using principal component analysis (PCA).
In one embodiment of the invention, the covariance matrix of the full-precision diffusion model representation is:

C_l = (1/(hw))·X_lᵀ X_l   (10)

where X_l ∈ R^{hw×c} is the mean-centered matrix obtained by flattening the spatial dimensions of R_l = F_l(x_t, t), and F_l represents the composition of the first l modules.
Then, the eigenvector matrix can be obtained by the eigendecomposition

C_l = U_l Λ_l U_lᵀ   (11)

where Λ_l = diag(λ₁, …, λ_c) contains the eigenvalues of C_l arranged in descending order. The matrix formed by the first k columns of eigenvectors of U_l is taken as the transformation matrix, denoted T_l ∈ R^{c×k}, with k = ⌈c/γ⌉, where ⌈·⌉ represents the ceiling (round-up) function and γ indicates the dimension-reduction factor.
The inventors use T_l to project the intermediate representation results of the full-precision diffusion model and the binarized diffusion model:

Z_l = X_l T_l,  Ẑ_l = X̂_l T_l   (12)

where X̂_l denotes the flattened l-th intermediate representation result of the diffusion model with binarized parameters θ̂. The full-precision diffusion model and the binarized diffusion model have exactly the same shape, so the corresponding representations are directly comparable.
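The PCA-based transformation matrix and the low-rank projection can be sketched as follows (an illustrative numpy version; per the text, the matrix would be computed once from the first input and frozen):

```python
import numpy as np

def pca_transform(feats, k):
    """feats: (num_vectors, c) flattened representation.
    Returns the c x k matrix of eigenvectors of the covariance
    corresponding to the k largest eigenvalues."""
    x = feats - feats.mean(axis=0, keepdims=True)
    cov = x.T @ x / (len(x) - 1)
    vals, vecs = np.linalg.eigh(cov)           # eigh returns ascending order
    return vecs[:, np.argsort(vals)[::-1][:k]] # reorder to descending, keep top k

rng = np.random.default_rng(2)
base = rng.standard_normal((256, 4))
base[:, 0] *= 10.0                             # dominant variance on axis 0
T = pca_transform(base, k=1)                   # principal direction ~ axis 0
z = base @ T                                   # low-rank projection
```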
Then, the obtained low-rank representations are utilized to encourage the binarized diffusion model to learn from its full-precision counterpart. Specifically, the inventors constructed a mean square error (MSE) loss between the low-rank representations of the l-th module of the full-precision diffusion model and the binarized diffusion model:

L_LRM^(l) = MSE(Z_l, Ẑ_l) = ‖Z_l − Ẑ_l‖₂² / (hw·k)   (13)
The total loss consists of equation (4) together with equation (13):

L_total = L_simple + λ·Σ_{l=1}^{N} L_LRM^(l)   (14)

where N represents the number of time step embedding modules in the noise estimation network of the diffusion model, and λ is a hyper-parameter coefficient used to balance the loss terms. Since computing the transformation matrix T_l is quite expensive, in the low-rank representation mimicking solution it is obtained from the first input and kept unchanged during the training process; this fixed mapping between representations also helps to optimize the binarized diffusion model from a stability perspective.
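Putting the pieces together, the mimicking loss and total objective can be sketched as follows (λ, the toy projection, and the feature shapes are illustrative):

```python
import numpy as np

def lrm_loss(fp_feats, bi_feats, transforms):
    """Sum of MSEs between full-precision and binarized representations,
    each pair projected into its frozen low-rank space."""
    total = 0.0
    for f_fp, f_bi, T in zip(fp_feats, bi_feats, transforms):
        z_fp, z_bi = f_fp @ T, f_bi @ T        # project both to low rank
        total += np.mean((z_fp - z_bi) ** 2)   # MSE in the low-rank space
    return total

def total_loss(task_loss, fp_feats, bi_feats, transforms, lam=0.1):
    """Task loss plus lambda-weighted sum of per-module mimicking losses."""
    return task_loss + lam * lrm_loss(fp_feats, bi_feats, transforms)

T = np.eye(4)[:, :2]                           # toy 4 -> 2 projection
f_fp = [np.ones((3, 4))]                       # one module's FP features
f_bi = [np.zeros((3, 4))]                      # binarized counterpart
loss = total_loss(0.5, f_fp, f_bi, [T])        # 0.5 + 0.1 * 1.0 = 0.6
```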
The low-rank representation mimicking solution enables the binarized diffusion model to mimic the representation of the full-precision diffusion model, improving optimization of the binarized diffusion model by introducing additional supervision. As shown in fig. 4, it effectively brings each local module close to its full-precision counterpart. Furthermore, by applying a low-rank projection based on the principal components of the full-precision representation before mimicking, the binarized diffusion model can be optimized in a clear and stable direction, accelerating its convergence. Moreover, the binarized diffusion model and the full-precision diffusion model have completely identical architectures, so representation mimicking between them is very natural.
Next, the training strategy of the diffusion model quantization method provided by the embodiment of the present invention is further described. As described above, the method binarizes the noise estimation network of the diffusion model with the learnable multi-base binarizer and optimizes the objective function with the low-rank representation mimicking solution. However, despite these significant improvements in architecture and optimization, convergence of the binarized diffusion model during training is still slow and difficult to achieve.
Therefore, in the early stage of the training process, the embodiment of the invention adopts a progressive binarization strategy, so that the binarized diffusion model converges quickly and stably without extra cost. Specifically, in the initial iteration, the first ⌊N/2⌋ time step embedding modules of the diffusion model are quantized; in each following iteration, the number of remaining un-binarized time step embedding modules is halved, binarizing the corresponding modules, until all time step embedding modules are binarized, where ⌊·⌋ represents the floor (round-down) function. This process tunes the binarized diffusion model to a favorable starting position for optimization. Over the same training duration, this progressive binarization strategy exhibits significant performance advantages over binarizing the entire diffusion model directly at the beginning of training. Furthermore, it is worth noting that the entire progressive quantization process typically occupies only about 0.002% (5 iterations) of the total 200K iterations required for training, and can be considered an extremely efficient initialization strategy for the binarized diffusion model.
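One possible reading of this schedule, with the un-binarized remainder halving each warm-up iteration (the exact indexing in the patent may differ), can be sketched as:

```python
def progressive_schedule(num_modules: int):
    """Return, per warm-up iteration, how many leading time-step embedding
    modules are binarized; the un-binarized remainder halves each step."""
    remaining = num_modules - num_modules // 2  # start: first N//2 binarized
    counts = [num_modules - remaining]
    while remaining > 0:
        remaining //= 2                         # halve the un-binarized tail
        counts.append(num_modules - remaining)
    return counts

sched = progressive_schedule(16)               # [8, 12, 14, 15, 16]
```

For 16 modules this completes in 5 iterations, consistent with the order of magnitude quoted above.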
Specific steps of the diffusion model quantization method provided by the embodiment of the invention are described in detail above. To verify its practical performance, the inventors conducted experiments on multiple datasets, including CIFAR-10 32×32, LSUN-Bedrooms 256×256, LSUN-Churches 256×256, FFHQ 256×256, and ImageNet 256×256, for unconditional and conditional image generation tasks, covering the pixel-space diffusion model (DDIM) and the latent-space diffusion model (LDM). The specific experimental results are described below:
1. unconditional image generation tasks
The inventors first performed experiments on CIFAR-10 32×32. As shown in Table 1, at this low resolution, the baseline of the binarized diffusion model suffered severe performance decay, while the present diffusion model quantization method significantly restored performance. On DDIM on CIFAR-10 32×32, the accuracy index of the method is higher than the baseline by 49.04 (51.22 − 2.18), preventing the binarized diffusion model from collapsing. Under the W1A4 bit width, the IS index of the present diffusion model quantization method on CIFAR-10 32×32 clearly exceeds the binarized baseline, and under W1A32 it greatly exceeds the binarized diffusion model baseline on all evaluation indices.
Table 1: Comparison results of unconditional image generation tasks on CIFAR-10
2. Conditional image generation task
For the conditional image generation task, the inventors focused on the performance of LDM-4 on the ImageNet dataset at 256×256 resolution. Images were generated using three different samplers: DDIM, PLMS, and DPM-Solver. The results in Table 2 highlight the significant effectiveness of the present diffusion model quantization method, which continues to exceed the baseline across nearly all evaluation metrics, and in some cases even outperforms the full-precision diffusion model. The baseline of the binarized diffusion model appears relatively stable under the W1A32 and W1A8 configurations, but drops significantly under W1A4. Specifically, when using the DPM-Solver sampler, the baseline IS drops to 85.99 and its FID increases dramatically to 25.85. In sharp contrast, the binarized diffusion model in the embodiment of the present invention maintains sustained high performance, achieving an IS of 156.19 and an FID of 11.15, and is superior to the baseline in most scenarios.
Table 2: Quantization results of conditional image generation tasks on ImageNet 256×256
3. Ablation experiment results
Next, the inventors conducted a comprehensive ablation study of LDM-4 on the LSUN-Bedrooms 256×256 dataset to evaluate the effectiveness of the present diffusion model quantization method.
Table 3: Ablation results on the LSUN-Bedrooms 256×256 dataset
Here, the effectiveness of the LMB and LRM proposed by the inventors was evaluated, with results shown in Table 3. When only the LMB provided by the embodiment of the present invention is applied to the binarized diffusion model, performance is significantly recovered. On the other hand, applying only the LRM in optimization does not completely resolve the performance bottleneck of the binarized diffusion model, resulting in significant accuracy degradation. Combining the two techniques, as in the diffusion model quantization method provided by the embodiment of the invention, markedly improves performance.
4. Inference efficiency analysis
For reasoning efficiency, the inventors demonstrate the parameter size and FLOPs of the diffusion model quantization method for different active bit widths. The results in table 4 show that the diffusion model quantization method provided by the embodiment of the invention can achieve space saving of up to 27.1 times in the reasoning process, and simultaneously achieves acceleration of up to 16.0 times in the reasoning process, thereby fully embodying the advantages of binarization calculation.
Table 4: Inference efficiency of the diffusion model quantization method
5. Training efficiency analysis
For training efficiency, although the diffusion model quantization method provided by the embodiments of the present invention generally incurs higher overhead during training than post-training quantization methods, practical observations indicate that it offers performance advantages across various models and datasets. Notably, as shown in Table 5, the training time of the present diffusion model quantization method is even shorter than the calibration time required by the existing Q-Diffusion quantization method, while achieving significantly superior generation quality at low bit widths.
Table 5: Comparison of the present diffusion model quantization method and the Q-Diffusion quantization method in training time cost
Based on the diffusion model quantization method based on multi-base binarization, the embodiment of the invention further provides a diffusion model quantization device based on multi-base binarization. As shown in fig. 5, the diffusion model quantization means comprises one or more processors and a memory. Wherein the memory is coupled to the processor for storing one or more computer programs that, when executed by the one or more processors, cause the one or more processors to implement the diffusion model quantization method based on multi-base binarization as in the above embodiments.
The processor is used for controlling the overall operation of the diffusion model quantifying device to complete all or part of the steps of the diffusion model quantifying method based on multi-base binarization. The processor may be a Central Processing Unit (CPU), a Graphics Processor (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing (DSP) chip, or the like. The memory is used to store various types of data to support operations on the diffusion model quantizing device, which may include, for example, instructions for any application or method for diffusion model quantizing device operation, as well as application-related data. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically Erasable Programmable Read Only Memory (EEPROM), erasable Programmable Read Only Memory (EPROM), programmable Read Only Memory (PROM), read Only Memory (ROM), magnetic memory, flash memory, and the like.
In an exemplary embodiment, the diffusion model quantization apparatus may be implemented by a computer chip or an entity, or by a product having a certain function, for performing the above diffusion model quantization method based on multi-base binarization, and achieving technical effects consistent with the above method. A typical embodiment is a computer, which may be, for example, a personal computer, a laptop computer, a tablet computer, an in-vehicle human-machine interaction device, or a wearable device, as well as a cellular telephone, a camera phone, a smart phone, a personal digital assistant, etc., or any combination of these devices.
Compared with the prior art, the diffusion model quantization method provided by the invention offers significant performance advantages across various models and datasets. Under the same 1-bit weight, and especially under ultra-low-bit activation, the present diffusion model quantization method consistently outperforms the baselines of the pixel-space diffusion model (DDIM) and the latent-space diffusion model (LDM). For example, on DDIM on CIFAR-10 32×32, the accuracy index of the present diffusion model quantization method is higher than the baseline by 49.04, preventing the binarized diffusion model from collapsing. As a leading binarization method for diffusion models, the present diffusion model quantization method achieves dramatic 16.0-fold and 27.1-fold savings in FLOPs (floating-point operations) and model size, demonstrating the great advantages and potential of deploying diffusion models on edge hardware (e.g., cell phones, smartwatches, etc.).
The diffusion model quantization method and the diffusion model quantization apparatus based on multi-base binarization provided by the invention have been described in detail above. Any obvious modification to the present invention that does not depart from its essential spirit would constitute an infringement of the patent rights of the invention, and the infringer would bear corresponding legal liability.

Claims (8)

1. A diffusion model quantization method based on multi-base binarization is characterized by comprising the following steps:
In forward propagation, a learnable multi-base binarizer is used to enhance the binarized weights of the diffusion model; in back propagation, the computation of the multiple bases in the multi-base binarizer is carried out in parallel; wherein the learnable multi-base binarizer is represented by:

$$\hat{w} = \sigma_1 \cdot \mathrm{sign}(w) + \sigma_2 \cdot \mathrm{sign}\big(w - \sigma_1 \cdot \mathrm{sign}(w)\big)$$

wherein $\hat{w}$ represents the approximate value after binary quantization, $w$ represents the original weight before quantization, and $\sigma_1$ and $\sigma_2$ are learnable scalars whose initial values are respectively set as $\frac{\|w\|_1}{n}$ and $\frac{\|w - \sigma_1 \cdot \mathrm{sign}(w)\|_1}{n}$, wherein $\|\cdot\|_1$ represents the $L_1$ norm and $n$ represents the number of elements in $w$;
In the training process, the binarized diffusion model is made to mimic the representation results of the full-precision diffusion model by mimicking the full-precision representation in a low-rank space.
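The two-base binarizer recited in claim 1 can be sketched numerically as follows. This is a minimal NumPy illustration, not the patented implementation; the function and variable names are chosen for exposition only.

```python
import numpy as np

def lmb_binarize(w, sigma1, sigma2):
    """Two-base binarizer: approximate w by sigma1 * sign(w) plus a
    second binary base fitted to the residual, scaled by sigma2."""
    b1 = np.sign(w)                  # first binary base
    b2 = np.sign(w - sigma1 * b1)    # second binary base on the residual
    return sigma1 * b1 + sigma2 * b2

# Initial scalar values as in claim 1: the L1 norm of the target
# divided by the number of elements n.
w = np.array([0.8, -0.3, 1.2, -1.0])
n = w.size
sigma1 = np.abs(w).sum() / n
sigma2 = np.abs(w - sigma1 * np.sign(w)).sum() / n
w_hat = lmb_binarize(w, sigma1, sigma2)
```

In training, sigma1 and sigma2 would be registered as learnable parameters and updated by gradient descent; here they are held at their initial values only to show that the second base tightens the approximation relative to a single scaled sign base.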
2. The diffusion model quantization method according to claim 1, wherein:
In the noise estimation network of the diffusion model, a learnable multi-base binarizer is used only when the feature scale is greater than or equal to half the input scale.
3. The diffusion model quantization method according to claim 1, wherein:
In the back propagation, a straight-through estimator is adopted to approximate the gradient of the $\mathrm{sign}$ function.
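Since the sign function has zero gradient almost everywhere, the straight-through estimator of claim 3 passes the incoming gradient through the binarizer as-is within a clipping range. A minimal sketch follows; the clipping threshold of 1 is a common convention assumed here, not a value taken from the patent.

```python
import numpy as np

def sign_forward(w):
    # Forward pass uses the hard sign.
    return np.sign(w)

def sign_backward_ste(w, grad_out, clip=1.0):
    # Straight-through estimator: pass the incoming gradient through
    # unchanged where |w| <= clip, and zero it elsewhere.
    return grad_out * (np.abs(w) <= clip)
```

In a deep-learning framework this pair would be registered as a custom autograd function, so that the non-differentiable forward pass still yields a usable training signal.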
4. The diffusion model quantization method according to claim 1, characterized by mimicking a full-precision representation in a low-rank space, comprising the sub-steps of:
the full-precision diffusion model is grouped by time-step embedding modules, and the intermediate representation results are projected into a low-rank space using principal component analysis.
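The low-rank mimicking of claim 4 can be illustrated as follows. This is a hedged NumPy sketch: the feature shapes, the rank k, and the mean-squared loss are illustrative assumptions, not the patented formulation.

```python
import numpy as np

def low_rank_mimic_loss(fp_feats, bin_feats, k=8):
    """Project full-precision and binarized intermediate features onto
    the top-k principal components of the full-precision features, then
    penalize their distance in that low-rank space.
    fp_feats, bin_feats: (num_samples, feat_dim) arrays."""
    mean = fp_feats.mean(axis=0, keepdims=True)
    x = fp_feats - mean
    y = bin_feats - mean
    # Right singular vectors of the centered features are the principal axes.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    p = vt[:k].T                      # (feat_dim, k) projection basis
    return float(np.mean((x @ p - y @ p) ** 2))
```

Computing the mimicking loss in the k-dimensional principal subspace, rather than over the full feature map, focuses the binarized model on reproducing the dominant directions of the full-precision representation.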
5. The diffusion model quantization method according to claim 4, wherein:
the time step embedding module consists of a residual convolution and a transformer block.
6. The diffusion model quantization method according to claim 5, wherein:
In the early stage of the training process, a progressive binarization strategy is adopted to enable the binarized diffusion model to converge stably.
7. The diffusion model quantization method according to claim 6, characterized in that said progressive binarization strategy comprises the sub-steps of:
in the initial iteration, the first ⌊N/2⌋ time-step embedding modules of the diffusion model are quantized; in each of the following iterations, two further time-step embedding modules are binarized, until all the time-step embedding modules are binarized; wherein ⌊·⌋ represents the downward rounding (floor) function, and N represents the number of time-step embedding modules in the noise estimation network of the diffusion model.
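One plausible reading of the progressive strategy in claim 7 can be sketched as a schedule generator. The assumption here (first half of the modules up front, then two more per stage) is illustrative; the patent's exact per-iteration indices are defined by its own formulas.

```python
def progressive_schedule(num_modules):
    """Return, per training stage, the 0-based indices of the time-step
    embedding modules binarized at that stage: the first floor(N/2)
    modules at the initial stage, then two more per subsequent stage
    until all N modules are binarized."""
    half = num_modules // 2           # floor(N / 2)
    schedule = [list(range(half))]
    i = half
    while i < num_modules:
        schedule.append(list(range(i, min(i + 2, num_modules))))
        i += 2
    return schedule
```

Staggering binarization this way keeps part of the network at full precision during early training, which is what allows the binarized diffusion model to converge stably.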
8. A diffusion model quantization device based on multi-base binarization, characterized by comprising a processor and a memory; wherein,
The memory is coupled to the processor for storing a computer program which, when executed by the processor, causes the processor to implement the multi-base binarization-based diffusion model quantization method of any one of claims 1-7.
CN202410411724.9A 2024-04-08 2024-04-08 Diffusion model quantization method and device based on multi-base binarization Active CN118014044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410411724.9A CN118014044B (en) 2024-04-08 2024-04-08 Diffusion model quantization method and device based on multi-base binarization


Publications (2)

Publication Number Publication Date
CN118014044A CN118014044A (en) 2024-05-10
CN118014044B true CN118014044B (en) 2024-06-14

Family

ID=90952791





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant