Disclosure of Invention
Aiming at the defects in the prior art, the present invention provides a mixed-bit quantization method, system, terminal and medium for a Transformer, which combine hardware characteristics to provide more convenient support for quantized deployment of Transformer models, saving computing resources and improving model performance while ensuring optimal model accuracy.
In a first aspect, a mixed-bit quantization method for a Transformer according to an embodiment of the present invention includes the following steps:
acquiring a trained Transformer model, a calibration data set, hardware platform parameters, and model operation quality requirements;
calculating the sensitivity of each neural network layer of the model according to the Transformer model and the calibration data set;
performing simulation calculation according to the hardware platform parameters, comparing the costs of data transfer and inference at different bit widths to obtain hardware cost information, and generating hardware impact information according to the hardware cost information and the model operation quality requirements;
and optimizing the sensitivity and hardware impact information by using an optimization algorithm or an AutoML technique, and outputting a mixed-bit quantization scheme.
Optionally, the specific method for calculating the sensitivity of each neural network layer of the model from the Transformer model and the calibration data set comprises:
calculating the Hessian spectrum of each neural network layer, and determining the sensitivity of each layer to quantization according to its Hessian spectrum;
calculating the MSE Loss of each neural network layer, configuring different bit widths layer by layer according to the calibration data set and performing one pass of forward inference to observe the model loss, so as to determine the sensitivity of each neural network layer to quantization;
and calculating the cosine distance of each neural network layer, determining the sensitivity of each layer to quantization according to the distance between the tensors before and after model quantization on the calibration data set.
Optionally, the model operation quality requirements include accuracy, performance, and power consumption requirements.
Optionally, the optimization algorithm comprises: a genetic algorithm, a linear programming algorithm, the Pareto optimal method, and a dynamic programming method.
Optionally, the specific method for optimizing the sensitivity and hardware impact information using an AutoML technique includes: establishing a super network and searching it to find the best-performing sub-network as the final result.
In a second aspect, an embodiment of the present invention provides a mixed-bit quantization system for a Transformer, including: a data acquisition module, a sensitivity calculation module, a hardware simulation module, and an optimization module, wherein the data acquisition module is used to acquire a trained Transformer model, a calibration data set, hardware platform parameters, and model operation quality requirements;
the sensitivity calculation module is used to calculate the sensitivity of each neural network layer of the model according to the Transformer model and the calibration data set;
the hardware simulation module is used to perform simulation calculation according to the hardware platform parameters, compare the costs of data transfer and inference at different bit widths to obtain hardware cost information, and generate hardware impact information according to the hardware cost information and the model operation quality requirements;
the optimization module is used to optimize the sensitivity and hardware impact information using an optimization algorithm or an AutoML technique and to output a mixed-bit quantization scheme.
Optionally, the sensitivity calculation module includes a Hessian spectrum calculation unit, an MSE Loss calculation unit, and a cosine distance calculation unit, where the Hessian spectrum calculation unit is configured to calculate the Hessian spectrum of each neural network layer and determine the sensitivity of each layer to quantization according to its Hessian spectrum;
the MSE Loss calculation unit is configured to calculate the MSE Loss of each neural network layer, configure different bit widths layer by layer according to the calibration data set, and perform forward inference to observe the model loss so as to determine the sensitivity of each neural network layer to quantization;
the cosine distance calculation unit is configured to calculate the cosine distance of each neural network layer and determine the sensitivity of each layer to quantization according to the distance between the tensors before and after model quantization on the calibration data set.
Optionally, the model operation quality requirements include accuracy, performance, and power consumption requirements.
In a third aspect, an embodiment of the present invention provides an intelligent terminal, including a processor, an input device, an output device, and a memory that are connected to each other, where the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to execute the method described in the foregoing embodiment.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method described in the above embodiments.
The invention has the following beneficial effects:
the mixed-bit quantization method, system, terminal and medium for Transformers disclosed by the invention combine hardware characteristics to provide more convenient support for quantized deployment of Transformer models; while ensuring optimal model accuracy, they save computing resources and improve model performance.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when", "once", "in response to a determination", or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting the [described condition or event]", or "in response to detecting the [described condition or event]".
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention pertains.
As shown in fig. 1, a flowchart of a mixed-bit quantization method for a Transformer according to a first embodiment of the present invention is shown; the method includes the following steps:
acquiring a trained Transformer model, a calibration data set, hardware platform parameters, and model operation quality requirements;
calculating the sensitivity of each neural network layer of the model according to the Transformer model and the calibration data set;
performing simulation calculation according to the hardware platform parameters, comparing the costs of data transfer and inference at different bit widths to obtain hardware cost information, and generating hardware impact information according to the hardware cost information and the model operation quality requirements;
and optimizing the sensitivity and hardware impact information by using an optimization algorithm or an AutoML technique, and outputting a mixed-bit quantization scheme.
The Transformer is a deep learning model based on the attention mechanism. Model quantization is the process of converting the weights, activation values, and the like of a trained deep neural network from high precision to low precision, for example converting 32-bit floating point numbers (float32) into 8-bit integers (int8), while expecting the accuracy of the converted model to remain close to that before conversion.
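As a concrete illustration of the float32-to-int8 conversion mentioned above, the following is a minimal sketch of symmetric per-tensor quantization with a single scale; it is illustrative only and not necessarily the exact scheme used by the invention.

```python
# Minimal sketch of symmetric int8 quantization (illustrative): map float
# values to integers in [-127, 127] with one scale, then map back.
def quantize_int8(values):
    """Symmetrically quantize a list of floats; return (int list, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03]
q, s = quantize_int8(weights)
restored = dequantize(q, s)  # close to the original weights
```

The quantization error here is at most half a quantization step; the sensitivity analysis below estimates how much such errors actually hurt each layer.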
The user submits the trained deep learning model, target hardware platform parameters, calibration data set, and a configuration file for the quantization task via a web page. The fields in the configuration file include: similarity, loss, cost_time, optimization, mem_rate, bandwidth, and npu_rate. Here, similarity is the similarity between the quantized result and the original result, in the range 0-1, with larger values being better; loss is the loss between the quantized result and the original result, in the range 0 to infinity, with smaller values being better; cost_time is the time required to process the calibration data set, in seconds; optimization selects the method, where "automl" uses an AutoML technique and "math" uses an optimization-solver algorithm; mem_rate is the memory utilization, in the range 0-1; bandwidth is the bandwidth in bps; and npu_rate is the utilization of the dedicated processor, in the range 0-1. The configuration file mainly configures the model quality requirements. The quantization task package directory includes: a model directory for storing large files such as the model, and a data directory for storing large files such as the calibration data set.
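The configuration fields above could, for example, be represented as follows; the field names follow the text, while the concrete values and the `validate_config` helper are hypothetical illustrations.

```python
# Hypothetical configuration mirroring the fields described above.
quant_config = {
    "similarity": 0.98,      # quantized vs. original result, 0-1, larger is better
    "loss": 0.05,            # quantized vs. original loss, 0-inf, smaller is better
    "cost_time": 600,        # time allowed for the calibration data set, seconds
    "optimization": "math",  # "automl": AutoML search; "math": optimization solver
    "mem_rate": 0.8,         # memory utilization, 0-1
    "bandwidth": 1_000_000_000,  # bandwidth in bps
    "npu_rate": 0.9,         # dedicated-processor utilization, 0-1
}

def validate_config(cfg):
    """Basic range checks on the quality-requirement fields (illustrative)."""
    return (0.0 <= cfg["similarity"] <= 1.0
            and cfg["loss"] >= 0
            and cfg["cost_time"] > 0
            and cfg["optimization"] in ("automl", "math")
            and 0.0 <= cfg["mem_rate"] <= 1.0
            and cfg["bandwidth"] > 0
            and 0.0 <= cfg["npu_rate"] <= 1.0)
```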
A specific method of calculating the sensitivity of each neural network layer of the model from the Transformer model and the calibration data set comprises:
calculating the Hessian spectrum of each neural network layer, and determining the sensitivity of each layer to quantization according to its Hessian spectrum;
calculating the MSE Loss of each neural network layer, configuring different bit widths layer by layer according to the calibration data set and performing one pass of forward inference to observe the model loss (the deviation from the real labels or from the float model), so as to determine the sensitivity of each neural network layer to quantization;
and calculating the cosine distance of each neural network layer, determining the sensitivity of each layer to quantization according to the distance between the tensors before and after model quantization on the calibration data set.
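The MSE and cosine-distance criteria can be sketched as follows (the Hessian spectrum is typically estimated separately, e.g. by power iteration, and is omitted here). Both functions compare one layer's output, flattened to a list, before and after quantization; a larger value indicates a more sensitive layer.

```python
import math

# Illustrative sketches of the MSE-Loss and cosine-distance sensitivity criteria.
def mse_sensitivity(fp_out, q_out):
    """Mean squared error between float and quantized outputs; larger -> more sensitive."""
    return sum((a - b) ** 2 for a, b in zip(fp_out, q_out)) / len(fp_out)

def cosine_distance(fp_out, q_out):
    """1 - cosine similarity between the two outputs; larger -> more sensitive."""
    dot = sum(a * b for a, b in zip(fp_out, q_out))
    norms = math.sqrt(sum(a * a for a in fp_out)) * math.sqrt(sum(b * b for b in q_out))
    return 1.0 - dot / norms
```

Identical outputs give zero under both measures; opposite-direction outputs give a cosine distance near 2.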
Different hardware simulation calculation processes are executed according to the configured deployment hardware platform parameters, and the costs of data transfer and inference at different bit widths are compared to obtain hardware cost information. Hardware impact information is then generated according to the user's quality red-line requirements (minimum requirements in terms of accuracy, performance, power consumption, and the like).
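A toy cost model along these lines might look as follows; the formula and all constants are hypothetical, serving only to show how lower bit widths reduce both data-transfer and compute cost.

```python
# Toy hardware cost model (all numbers hypothetical): for each candidate bit
# width, estimate the cost of moving one layer's weights plus running
# inference at that width on the target platform.
def layer_cost(num_params, bits, bandwidth_bps, ops_per_sec):
    """Rough time cost: data-transfer time plus compute time at a given bit width."""
    transfer = (num_params * bits) / bandwidth_bps        # seconds to move the weights
    compute = num_params / (ops_per_sec * (32.0 / bits))  # lower precision -> higher throughput
    return transfer + compute

# Compare candidate bit widths for a 1M-parameter layer on a hypothetical platform.
costs = {b: layer_cost(1_000_000, b, bandwidth_bps=1e9, ops_per_sec=1e10)
         for b in (4, 8, 16, 32)}
```

In a real simulator the platform parameters (bandwidth, npu_rate, etc.) would come from the configuration file described earlier.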
The sensitivity and hardware impact information are optimized using an optimization algorithm or an AutoML technique to obtain the mixed-bit quantization scheme. The optimization algorithms include: a genetic algorithm, a linear programming algorithm, the Pareto optimal method, and a dynamic programming method.
Linear programming algorithm: a mathematical theory and method for studying the extremum problem of a linear objective function under linear constraints.
Genetic algorithm: a randomized search method modeled on the evolutionary rules of the biological world (the survival-of-the-fittest genetic mechanism).
Pareto optimal method: an ideal state of resource allocation in which, for a fixed group of people and allocatable resources, it is impossible to move from one allocation to another that makes at least one person better off without making anyone worse off.
Dynamic programming method: a mathematical method that transforms a multi-stage decision problem into a series of interrelated single-stage problems and then solves them one by one.
AutoML: the process of automating the end-to-end application of machine learning to real problems; AutoML provides automation in three aspects: feature engineering, model construction, and hyperparameter optimization.
Reinforcement learning: learning a mapping from environment states to actions so as to maximize the cumulative reward that the actions obtain from the environment, discovering the optimal behavior policy through trial and error.
SuperNet: the super network underlies an efficient NAS search process; in short, a large overall search space is established, and then the best-performing sub-network is searched out of it as the final result. A weight-sharing model (the SuperNet) is built that contains all operators of the entire network; the goal is to search and optimize operator channels and kernel sizes while keeping the operators the same as in the original model.
MSE Loss: mean squared error, a measure of the degree of difference between an estimator and the quantity being estimated.
Cosine distance: closely related to cosine similarity. In geometry, the cosine of the angle between two vectors measures the difference between their directions; machine learning borrows this concept to measure the difference between sample vectors.
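To make the optimization step concrete, the following is a toy dynamic-programming bit allocation in the spirit of the methods listed above; the error model `sens * 2**(-bits)` and all numbers are hypothetical stand-ins for real sensitivity and cost measurements.

```python
# Toy dynamic-programming bit allocation (illustrative only): choose one bit
# width per layer so that total sensitivity-weighted quantization error is
# minimal under a total bit budget.
def allocate_bits(sens, sizes, budget_bits, choices=(4, 8, 16)):
    """sens[i]: sensitivity of layer i; sizes[i]: its parameter count."""
    dp = {0: (0.0, [])}  # maps total bits used -> (best cost, bit assignment)
    for s, n in zip(sens, sizes):
        nxt = {}
        for used, (cost, plan) in dp.items():
            for b in choices:
                u = used + n * b
                if u > budget_bits:
                    continue
                c = cost + s * 2.0 ** (-b)  # hypothetical error model
                if u not in nxt or c < nxt[u][0]:
                    nxt[u] = (c, plan + [b])
        dp = nxt
    if not dp:
        return None  # budget too small for any assignment
    return min(dp.values(), key=lambda t: t[0])[1]

# The more sensitive second layer receives the wider bit width: plan == [8, 16].
plan = allocate_bits(sens=[1.0, 10.0], sizes=[100, 100], budget_bits=2400)
```

A linear-programming or genetic-algorithm formulation would optimize the same objective; the DP form is just the easiest to show in a few lines.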
The mixed-bit quantization method for Transformers provided by the embodiment of the invention, by combining different hardware characteristics, can be applied to dedicated processors, such as the processor of a stylus; it provides more convenient support for quantized deployment of Transformer models, ensures optimal model accuracy, and makes inference on the designated hardware as efficient as possible.
An optimization stage is added, and various optimization methods such as optimization-solver algorithms and AutoML techniques are used, so that the quantization effect of the Transformer model is improved; while accuracy is ensured, computing resources are saved and model performance is improved.
As shown in fig. 2, a schematic structural diagram of a mixed-bit quantization system for a Transformer according to an embodiment of the present invention is shown, where the system includes: a data acquisition module, a sensitivity calculation module, a hardware simulation module, and an optimization module, wherein the data acquisition module is used to acquire a trained Transformer model, a calibration data set, hardware platform parameters, and model operation quality requirements;
the sensitivity calculation module is used to calculate the sensitivity of each neural network layer of the model according to the Transformer model and the calibration data set;
the hardware simulation module is used to perform simulation calculation according to the hardware platform parameters, compare the costs of data transfer and inference at different bit widths to obtain hardware cost information, and generate hardware impact information according to the hardware cost information and the model operation quality requirements;
the optimization module is used to optimize the sensitivity and hardware impact information using an optimization algorithm or an AutoML technique and to output a mixed-bit quantization scheme.
The sensitivity calculation module comprises a Hessian spectrum calculation unit, an MSE Loss calculation unit, and a cosine distance calculation unit, wherein the Hessian spectrum calculation unit is used to calculate the Hessian spectrum of each neural network layer and determine the sensitivity of each layer to quantization according to its Hessian spectrum; the MSE Loss calculation unit is used to calculate the MSE Loss of each neural network layer, configure different bit widths layer by layer according to the calibration data set, and perform forward inference to observe the model loss so as to determine the sensitivity of each neural network layer to quantization; the cosine distance calculation unit is used to calculate the cosine distance of each neural network layer and determine the sensitivity of each layer to quantization according to the distance between the tensors before and after model quantization on the calibration data set. The model operation quality requirements include accuracy, performance, and power consumption requirements.
The mixed-bit quantization system for Transformers provided by the embodiment of the invention, by combining different hardware characteristics, can be applied to dedicated processors; it provides more convenient support for quantized deployment of Transformer models, ensures optimal model accuracy, and makes inference on the designated hardware as efficient as possible.
An optimization module is added, and various optimization methods such as optimization-solver algorithms and AutoML techniques are used, so that the quantization effect of the Transformer model is improved; while accuracy is ensured, computing resources are saved and model performance is improved.
As shown in fig. 3, a schematic structural diagram of an intelligent terminal according to a third embodiment of the present invention is shown; the terminal includes a processor, an input device, an output device, and a memory that are connected to each other, where the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method described in the foregoing embodiments.
It should be appreciated that in embodiments of the present invention, the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The input devices may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., and the output devices may include a display (LCD, etc.), a speaker, etc.
The memory may include read only memory and random access memory and provide instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In a specific implementation, the processor, the input device, and the output device described in the embodiments of the present invention may execute the implementation described in the method embodiment provided in the embodiments of the present invention, or may execute the implementation of the system embodiment described in the embodiments of the present invention, which is not described herein again.
In a further embodiment of the invention, a computer-readable storage medium is provided, which stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method described in the above embodiment.
The computer readable storage medium may be an internal storage unit of the terminal according to the foregoing embodiment, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used to store the computer program and other programs and data required by the terminal. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the terminal and the unit described above may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In several embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.