CN116227332A - Method and system for quantizing mixed bits of transformers - Google Patents


Info

Publication number
CN116227332A
CN116227332A
Authority
CN
China
Prior art keywords
model
sensitivity
hardware
neural network
network layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211647164.4A
Other languages
Chinese (zh)
Inventor
赵武金
宋莉莉
张祥建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yixin Yiyu Microelectronics Technology Co ltd
Original Assignee
Beijing Shihai Xintu Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shihai Xintu Microelectronics Co ltd filed Critical Beijing Shihai Xintu Microelectronics Co ltd
Priority to CN202211647164.4A priority Critical patent/CN116227332A/en
Publication of CN116227332A publication Critical patent/CN116227332A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/06 Multi-objective optimisation, e.g. Pareto optimisation using simulated annealing [SA], ant colony algorithms or genetic algorithms [GA]


Abstract

The invention discloses a mixed-bit quantization method for transformers, comprising the following steps: acquiring a trained transformer model, a calibration data set, hardware platform parameters and model operation quality requirements; calculating the sensitivity of each neural network layer of the model according to the transformer model and the calibration data set; performing simulation calculation according to the hardware platform parameters, comparing the costs of moving data and running inference at different bit widths to obtain hardware cost information, and generating hardware influence information according to the hardware cost information and the model operation quality requirements; and optimizing the sensitivity and hardware influence information using an optimization algorithm or AutoML technology, and outputting a mixed-bit quantization scheme. By combining hardware characteristics, the method provides more convenient support for quantized deployment of transformer models and, on the premise of keeping model accuracy optimal, saves computing resources and improves model performance.

Description

Method and system for quantizing mixed bits of transformers
Technical Field
The invention relates to the technical field of deep learning, and in particular to a mixed-bit quantization method and system for transformers.
Background
Compared with mainstream convolutional neural networks, transformers have more complex network structures and are harder to deploy on mobile devices, edge devices, embedded wearable devices, and the like. In general, mixed-bit quantization must be performed to compress the model at the cost of a small loss of accuracy, making it possible to apply such complex models to embedded terminals such as mobile phones and robots. However, most current mixed-bit quantization schemes do not take hardware characteristics into account, especially the characteristics of dedicated AI processors. Beyond natural language processing (NLP), the Vision Transformer is becoming increasingly popular in computer vision, and existing error analysis and optimization methods work less well on these new models.
Disclosure of Invention
To address the defects of the prior art, the invention provides a mixed-bit quantization method, system, terminal and medium for transformers, which combine hardware characteristics to provide more convenient support for the quantized deployment of transformer models and, on the premise of keeping model accuracy optimal, save computing resources and improve model performance.
In a first aspect, a method for quantizing mixed bits of a transformer according to an embodiment of the present invention includes the following steps:
acquiring a trained transformer model, a calibration data set, hardware platform parameters and model operation quality requirements;
calculating the sensitivity of each neural network layer of the model according to the transformer model and the calibration data set;
performing simulation calculation according to the hardware platform parameters, comparing the costs of moving data and running inference at different bit widths to obtain hardware cost information, and generating hardware influence information according to the hardware cost information and the model operation quality requirements;
and optimizing the sensitivity and hardware influence information by using an optimization algorithm or AutoML technology, and outputting a mixed-bit quantization scheme.
Optionally, the specific method for calculating the sensitivity of each neural network layer of the model from the transformer model and the calibration data set comprises:
calculating the Hessian spectrum of each neural network layer, and determining each layer's sensitivity to quantization according to its Hessian spectrum;
calculating the MSE loss of each neural network layer by configuring different bit widths layer by layer according to the calibration data set and performing one pass of forward inference to observe the model loss, thereby determining each layer's sensitivity to quantization;
and calculating the cosine distance of each neural network layer, determining each layer's sensitivity to quantization from the distance between the tensors before and after model quantization on the calibration data set.
Optionally, the model run quality requirements include accuracy, performance, and power consumption requirements.
Optionally, the optimization algorithm comprises: a genetic algorithm, a linear programming algorithm, the Pareto optimal method and a dynamic programming method.
Optionally, the specific method for optimizing the sensitivity and hardware influence information using AutoML technology includes: establishing a super network, and searching it to find the best-performing sub-network as the final result.
In a second aspect, an embodiment of the present invention provides a mixed-bit quantization system for transformers, including a data acquisition module, a sensitivity calculation module, a hardware simulation module and an optimization module, wherein the data acquisition module is used for acquiring a trained transformer model, a calibration data set, hardware platform parameters and model operation quality requirements;
the sensitivity calculation module is used for calculating the sensitivity of each neural network layer of the model according to the transformer model and the calibration data set;
the hardware simulation module is used for performing simulation calculation according to the hardware platform parameters, comparing the costs of moving data and running inference at different bit widths to obtain hardware cost information, and generating hardware influence information according to the hardware cost information and the model operation quality requirements;
the optimization module is used for optimizing the sensitivity and hardware influence information by using an optimization algorithm or AutoML technology, and outputting a mixed-bit quantization scheme.
Optionally, the sensitivity calculation module includes a Hessian spectrum calculation unit, an MSE loss calculation unit, and a cosine distance calculation unit, where the Hessian spectrum calculation unit is configured to calculate the Hessian spectrum of each neural network layer and determine each layer's sensitivity to quantization according to its Hessian spectrum;
the MSE loss calculation unit is configured to calculate the MSE loss of each neural network layer, configuring different bit widths layer by layer according to the calibration data set and performing forward inference to observe the model loss, thereby determining each layer's sensitivity to quantization;
the cosine distance calculation unit is configured to calculate the cosine distance of each neural network layer, determining each layer's sensitivity to quantization from the distance between the tensors before and after model quantization on the calibration data set.
Optionally, the model run quality requirements include accuracy, performance, and power consumption requirements.
In a third aspect, an embodiment of the present invention provides an intelligent terminal, including a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, and the memory is configured to store a computer program, where the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method described in the foregoing embodiment.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method described in the above embodiments.
The invention has the beneficial effects that:
according to the method, the system, the terminal and the medium for quantizing mixed bits of the transformers, disclosed by the invention, by combining hardware characteristics, more convenient support is provided for quantizing and deploying the transformers, and on the premise of ensuring optimal precision of the models, computing resources are saved, and the performance of the models is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale.
Fig. 1 is a flowchart of a method for quantizing mixed bits of a transformer according to a first embodiment of the present invention;
fig. 2 is a schematic structural diagram of a hybrid bit quantization system of a transformer according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an intelligent terminal according to a third embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when", "upon", "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention pertains.
As shown in fig. 1, a flowchart of a method for quantizing mixed bits of a transformer according to a first embodiment of the present invention, the method includes the following steps:
acquiring a trained transformer model, a calibration data set, hardware platform parameters and model operation quality requirements;
calculating the sensitivity of each neural network layer of the model according to the transformer model and the calibration data set;
performing simulation calculation according to the hardware platform parameters, comparing the costs of moving data and running inference at different bit widths to obtain hardware cost information, and generating hardware influence information according to the hardware cost information and the model operation quality requirements;
and optimizing the sensitivity and hardware influence information by using an optimization algorithm or AutoML technology, and outputting a mixed-bit quantization scheme.
The transformer is a deep learning model based on the attention mechanism. Model quantization is the process of converting the weights, activation values and the like of a trained deep neural network from high precision to low precision, for example converting 32-bit floating-point numbers (float32) into 8-bit integers (int8), while expecting the accuracy of the converted model to remain close to that before conversion.
The user submits the trained deep learning model, the target hardware platform parameters, the calibration data set and a configuration file to the quantization task via a web page. The fields in the configuration file are:
  • similarity: similarity between the quantized result and the original result, in the range 0-1 (larger is better);
  • loss: loss between the quantized result and the original result, in the range 0 to infinity (smaller is better);
  • cost_time: time required to process the calibration data set, in seconds;
  • optimization: "AutoML" to optimize with the AutoML method, "Math" to optimize with an optimization-solving algorithm;
  • mem_rate: memory utilization, in the range 0-1;
  • bandwidth: bandwidth, in bps;
  • npu_rate: utilization of the dedicated processor, in the range 0-1.
The configuration file mainly captures the model operation quality requirements. The quantization task package directory includes a model directory, used for storing large files such as the model, and a data directory, used for storing large files such as the calibration data set.
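The configuration contract above can be sketched as a small data class. The field names follow the configuration file described in this embodiment; the class name `QuantConfig` and the validation logic are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass


@dataclass
class QuantConfig:
    """Model-quality requirements submitted with a quantization task.

    Field names mirror the configuration file described above; the
    checks in validate() are an illustrative sketch only.
    """
    similarity: float   # quantized vs. original output similarity, 0-1 (higher is better)
    loss: float         # quantized vs. original loss, >= 0 (lower is better)
    cost_time: float    # seconds allowed for processing the calibration data set
    optimization: str   # "AutoML" or "Math" (optimization-solver path)
    mem_rate: float     # memory utilization, 0-1
    bandwidth: float    # bandwidth in bps
    npu_rate: float     # dedicated-processor utilization, 0-1

    def validate(self) -> None:
        # Enforce the value ranges stated in the configuration description.
        if not 0.0 <= self.similarity <= 1.0:
            raise ValueError("similarity must be in [0, 1]")
        if self.loss < 0.0:
            raise ValueError("loss must be non-negative")
        if not 0.0 <= self.mem_rate <= 1.0:
            raise ValueError("mem_rate must be in [0, 1]")
        if not 0.0 <= self.npu_rate <= 1.0:
            raise ValueError("npu_rate must be in [0, 1]")
        if self.optimization not in ("AutoML", "Math"):
            raise ValueError("optimization must be 'AutoML' or 'Math'")
```
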
A specific method of calculating the sensitivity of each neural network layer of the model from the transformer model and the calibration data set comprises:
calculating the Hessian spectrum of each neural network layer, and determining each layer's sensitivity to quantization according to its Hessian spectrum;
calculating the MSE loss of each neural network layer by configuring different bit widths layer by layer according to the calibration data set and performing one pass of forward inference to observe the model loss (the difference from the ground-truth label or from the float model), thereby determining each layer's sensitivity to quantization;
and calculating the cosine distance of each neural network layer, determining each layer's sensitivity to quantization from the distance between the tensors before and after model quantization on the calibration data set.
Different hardware simulation calculation processes are executed according to the configured deployment hardware platform parameters, and the costs of moving data and running inference at different bit widths are compared to obtain hardware cost information. Hardware influence information is then generated according to the user's quality red-line requirements (minimum requirements for accuracy, performance, power consumption, etc.).
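A minimal sketch of that cost comparison might look as follows, assuming a simple linear cost model in which data-transfer time scales with bit width and bandwidth, and compute time scales with bit width relative to the 8-bit baseline. The formulas and function names are illustrative, not taken from the patent.

```python
def transfer_cost(num_params, bits, bandwidth_bps):
    """Seconds to move one layer's weights at the given bit width."""
    return num_params * bits / bandwidth_bps


def inference_cost(num_macs, bits, npu_ops_per_s):
    """Illustrative compute-time model: lower-bit MACs run proportionally
    faster than the 8-bit baseline on the dedicated processor."""
    return num_macs * (bits / 8.0) / npu_ops_per_s


def hardware_cost_table(num_params, num_macs, bandwidth_bps, npu_ops_per_s,
                        candidate_bits=(2, 4, 8)):
    """Compare total (transfer + inference) cost for each candidate bit width."""
    return {b: transfer_cost(num_params, b, bandwidth_bps)
              + inference_cost(num_macs, b, npu_ops_per_s)
            for b in candidate_bits}
```

A table like this, computed per layer, is the "hardware cost information" the optimizer trades off against sensitivity.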
The sensitivity and hardware influence information are optimized using an optimization algorithm or AutoML technology to obtain the mixed-bit quantization scheme. The optimization algorithms include: genetic algorithms, linear programming algorithms, the Pareto optimal method and dynamic programming.
Linear programming algorithm: the mathematical theory and methods for studying extremum problems of a linear objective function under linear constraints.
Genetic algorithm: a randomized search method modeled on the evolutionary rules of the biological world (survival of the fittest).
Pareto optimal method: an ideal state of resource allocation in which, for a given set of individuals and allocatable resources, it is impossible to move to another allocation that makes at least one individual better off without making any other individual worse off.
Dynamic programming method: a mathematical method that transforms a multi-stage decision problem into a series of interrelated single-stage problems, which are then solved one by one.
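Among the optimization algorithms listed above, the dynamic programming route can be sketched as a knapsack-style search: choose one bit width per layer so that the summed hardware cost stays within a budget while the total sensitivity is minimized. The state representation, input tables and function name are illustrative assumptions, not the patent's algorithm.

```python
def allocate_bits(sensitivities, costs, budget):
    """Dynamic-programming sketch of mixed-bit allocation.

    sensitivities[i][b] -- layer i's quantization sensitivity at bit width b
    costs[i][b]         -- layer i's hardware cost at bit width b
    budget              -- maximum total hardware cost allowed
    Returns (bit choices per layer, total sensitivity of the best scheme).
    """
    # DP state: used cost -> (best total sensitivity so far, chosen bit widths)
    states = {0: (0.0, [])}
    for layer_sens, layer_cost in zip(sensitivities, costs):
        nxt = {}
        for used, (sens, picks) in states.items():
            for b in layer_sens:
                c = used + layer_cost[b]
                if c > budget:
                    continue  # prune schemes that exceed the cost budget
                cand = (sens + layer_sens[b], picks + [b])
                if c not in nxt or cand[0] < nxt[c][0]:
                    nxt[c] = cand  # keep the lowest-sensitivity scheme per cost
        states = nxt
    if not states:
        raise ValueError("no bit assignment fits the budget")
    best_sens, best_bits = min(states.values(), key=lambda t: t[0])
    return best_bits, best_sens
```

With two layers, the scheme spends the extra budget on the layer whose sensitivity drops the most when given more bits.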
AutoML: the process of automating the end-to-end application of machine learning to real problems; AutoML automates three aspects: feature engineering, model construction and hyperparameter optimization.
Reinforcement learning: learning a mapping from environment states to actions so as to maximize the cumulative reward an action obtains from the environment; the optimal behavior policy is discovered by trial and error.
SuperNet: an efficient NAS search process; in short, the full search space is built as one large network, which is then searched for the best-performing sub-network as the final result. A weight-sharing model (the SuperNet) is built that contains all operators of the entire network; the goal is to search and optimize operator channels and kernel sizes while keeping the operators consistent with the original model.
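The supernet search described above can be sketched, in a drastically simplified form, as random sampling of sub-networks from the space of per-layer choices. A real SuperNet shares weights and trains sampled paths, which this sketch omits; the function names and the fitness callback are illustrative assumptions.

```python
import random


def search_supernet(choices_per_layer, evaluate, num_samples=50, seed=0):
    """Minimal NAS-style search sketch.

    choices_per_layer -- for each layer, the options (e.g. candidate bit widths)
                         that together define the 'supernet' search space
    evaluate          -- user-supplied fitness function; lower score is better
    Returns the best sampled sub-network configuration and its score.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(num_samples):
        # Sample one sub-network: pick one option per layer.
        cfg = [rng.choice(opts) for opts in choices_per_layer]
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In the mixed-bit setting, `evaluate` would combine the sensitivity and hardware influence information computed in the earlier steps.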
MSE loss: mean square error, a measure of the degree of difference between an estimator and the quantity being estimated.
Cosine distance: also called cosine similarity; in geometry, the cosine of the angle between two vectors measures the difference between their directions, and machine learning borrows this concept to measure the difference between sample vectors.
The mixed-bit quantization method for transformers provided by the embodiment of the invention, by combining different hardware characteristics, can be applied to dedicated processors, such as the processor of a stylus, provides more convenient support for the quantized deployment of transformer models, and makes inference on the designated hardware as efficient as possible while keeping model accuracy optimal.
An optimization stage is added, using various optimization methods such as optimization-solving algorithms and AutoML technology, which improves the quantization effect of the transformer model and, while guaranteeing accuracy, saves computing resources and improves model performance.
As shown in fig. 2, a schematic structural diagram of a mixed-bit quantization system for transformers according to an embodiment of the present invention, the system includes a data acquisition module, a sensitivity calculation module, a hardware simulation module and an optimization module, wherein the data acquisition module is used for acquiring a trained transformer model, a calibration data set, hardware platform parameters and model operation quality requirements;
the sensitivity calculation module is used for calculating the sensitivity of each neural network layer of the model according to the transformer model and the calibration data set;
the hardware simulation module is used for performing simulation calculation according to the hardware platform parameters, comparing the costs of moving data and running inference at different bit widths to obtain hardware cost information, and generating hardware influence information according to the hardware cost information and the model operation quality requirements;
the optimization module is used for optimizing the sensitivity and hardware influence information by using an optimization algorithm or AutoML technology, and outputting a mixed-bit quantization scheme.
The sensitivity calculation module comprises a Hessian spectrum calculation unit, an MSE loss calculation unit and a cosine distance calculation unit. The Hessian spectrum calculation unit is used for calculating the Hessian spectrum of each neural network layer and determining each layer's sensitivity to quantization according to its Hessian spectrum; the MSE loss calculation unit is used for calculating the MSE loss of each neural network layer, configuring different bit widths layer by layer according to the calibration data set and performing forward inference to observe the model loss, thereby determining each layer's sensitivity to quantization; the cosine distance calculation unit is used for calculating the cosine distance of each neural network layer, determining each layer's sensitivity to quantization from the distance between the tensors before and after model quantization on the calibration data set. The model operation quality requirements include accuracy, performance and power consumption requirements.
The mixed-bit quantization system for transformers provided by the embodiment of the invention, by combining different hardware characteristics, can be applied to dedicated processors, provides more convenient support for the quantized deployment of transformer models, and makes inference on the designated hardware as efficient as possible while keeping model accuracy optimal.
An optimization module is added, using various optimization methods such as optimization-solving algorithms and AutoML technology, which improves the quantization effect of the transformer model and, while guaranteeing accuracy, saves computing resources and improves model performance.
As shown in fig. 3, a third embodiment of the present invention further provides a schematic structural diagram of an intelligent terminal, where the terminal includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, and the memory is configured to store a computer program, where the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method described in the foregoing embodiments.
It should be appreciated that in embodiments of the present invention, the processor may be a central processing unit (Central Processing Unit, CPU), or another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input devices may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., and the output devices may include a display (LCD, etc.), a speaker, etc.
The memory may include read only memory and random access memory and provide instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In a specific implementation, the processor, the input device, and the output device described in the embodiments of the present invention may execute the implementation described in the method embodiment provided in the embodiments of the present invention, or may execute the implementation of the system embodiment described in the embodiments of the present invention, which is not described herein again.
In a further embodiment of the invention, a computer-readable storage medium is provided, which stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method described in the above embodiment.
The computer readable storage medium may be an internal storage unit of the terminal according to the foregoing embodiment, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used to store the computer program and other programs and data required by the terminal. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the terminal and the unit described above may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In several embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.

Claims (10)

1. A method for mixed-bit quantization of a Transformer, comprising the steps of:
acquiring a trained Transformer model, a calibration data set, hardware platform parameters, and model operation quality requirements;
calculating the sensitivity of each neural network layer of the model according to the Transformer model and the calibration data set;
performing simulation calculation according to the hardware platform parameters, comparing the costs of data transfer and inference at different bit widths to obtain hardware cost information, and generating hardware impact information according to the hardware cost information and the model operation quality requirements;
and optimizing the sensitivity and hardware impact information using an optimization algorithm or an AutoML technique, and outputting a mixed-bit quantization scheme.
2. The method of claim 1, wherein the specific method of calculating the sensitivity of each neural network layer of the model according to the Transformer model and the calibration data set comprises:
calculating the Hessian spectrum of each neural network layer, and determining the sensitivity of each layer to quantization according to the Hessian spectrum of each neural network layer;
calculating the MSE loss of each neural network layer, configuring different bit widths layer by layer according to the calibration data set, and performing one pass of forward inference to observe the model loss, so as to determine the sensitivity of each neural network layer to quantization;
and calculating the cosine distance of each neural network layer, and determining the sensitivity of each layer to quantization according to the distance between the tensors before and after model quantization on the calibration data set.
3. The method of claim 1, wherein the model operation quality requirements include accuracy, performance, and power consumption requirements.
4. The method of claim 1, wherein the optimization algorithm comprises: a genetic algorithm, a linear programming algorithm, a Pareto optimality method, and a dynamic programming method.
5. The method of claim 1, wherein the specific method for optimizing the sensitivity and hardware impact information using an AutoML technique comprises: establishing a supernetwork, and searching the supernetwork to find the best-performing subnetwork as the final result.
6. A mixed-bit quantization system for a Transformer, comprising: a data acquisition module, a sensitivity calculation module, a hardware simulation module, and an optimization module, wherein the data acquisition module is configured to acquire a trained Transformer model, a calibration data set, hardware platform parameters, and model operation quality requirements;
the sensitivity calculation module is configured to calculate the sensitivity of each neural network layer of the model according to the Transformer model and the calibration data set;
the hardware simulation module is configured to perform simulation calculation according to the hardware platform parameters, compare the costs of data transfer and inference at different bit widths to obtain hardware cost information, and generate hardware impact information according to the hardware cost information and the model operation quality requirements;
the optimization module is configured to optimize the sensitivity and hardware impact information using an optimization algorithm or an AutoML technique and output a mixed-bit quantization scheme.
7. The system of claim 6, wherein the sensitivity calculation module comprises a Hessian spectrum calculation unit, an MSE loss calculation unit, and a cosine distance calculation unit, the Hessian spectrum calculation unit being configured to calculate the Hessian spectrum of each neural network layer and determine the sensitivity of each layer to quantization according to the Hessian spectrum of each neural network layer;
the MSE loss calculation unit is configured to calculate the MSE loss of each neural network layer, configure different bit widths layer by layer according to the calibration data set, and perform forward inference to observe the model loss, so as to determine the sensitivity of each neural network layer to quantization;
the cosine distance calculation unit is configured to calculate the cosine distance of each neural network layer, and determine the sensitivity of each layer to quantization according to the distance between the tensors before and after model quantization on the calibration data set.
8. The system of claim 6, wherein the model operation quality requirements include accuracy, performance, and power consumption requirements.
9. A smart terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, the memory being for storing a computer program, the computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method of any of claims 1-5.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-5.
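The layer-wise sensitivity metrics recited in claim 2 can be illustrated with a minimal sketch. The fake-quantization routine, layer names, and bit widths below are illustrative assumptions, not the patent's implementation; the sketch shows only the MSE and cosine-distance variants on weight tensors (the Hessian-spectrum variant requires second-order information and is omitted):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric fake-quantization of a tensor (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def mse_sensitivity(w: np.ndarray, bits: int) -> float:
    """MSE between a layer's tensor before and after quantization."""
    return float(np.mean((w - fake_quantize(w, bits)) ** 2))

def cosine_sensitivity(w: np.ndarray, bits: int) -> float:
    """Cosine distance between the flattened tensors before/after quantization."""
    a, b = w.ravel(), fake_quantize(w, bits).ravel()
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos)

# Hypothetical Transformer layers with random weights standing in for a real model.
rng = np.random.default_rng(0)
layers = {"attn.qkv": rng.normal(size=(64, 64)), "mlp.fc1": rng.normal(size=(64, 256))}
for name, w in layers.items():
    for bits in (8, 4):
        print(name, bits, mse_sensitivity(w, bits), cosine_sensitivity(w, bits))
```

Both metrics grow as the bit width shrinks, so layers whose scores rise steeply are candidates for higher-precision assignment.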
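Claims 1 and 4 describe selecting per-layer bit widths by trading layer sensitivity against hardware cost. The following sketch assumes precomputed sensitivity and cost tables (all numbers and layer names hypothetical) and uses a brute-force search, which is feasible for small layer counts; the patent's optimization algorithms (genetic, linear programming, Pareto, dynamic programming) would replace this for real models:

```python
from itertools import product

# Hypothetical per-layer sensitivity scores and hardware costs for each bit width.
SENS = {"attn.qkv": {8: 0.01, 4: 0.20}, "mlp.fc1": {8: 0.02, 4: 0.15}, "mlp.fc2": {8: 0.01, 4: 0.30}}
COST = {"attn.qkv": {8: 8.0, 4: 4.0}, "mlp.fc1": {8: 16.0, 4: 8.0}, "mlp.fc2": {8: 16.0, 4: 8.0}}

def best_scheme(budget: float) -> dict:
    """Enumerate bit assignments; minimize total sensitivity under the cost budget."""
    layers = list(SENS)
    best, best_sens = None, float("inf")
    for bits in product((8, 4), repeat=len(layers)):
        scheme = dict(zip(layers, bits))
        cost = sum(COST[layer][b] for layer, b in scheme.items())
        sens = sum(SENS[layer][b] for layer, b in scheme.items())
        if cost <= budget and sens < best_sens:
            best, best_sens = scheme, sens
    return best

# With this budget, quantizing mlp.fc1 to 4 bits sacrifices the least accuracy.
print(best_scheme(budget=32.0))  # → {'attn.qkv': 8, 'mlp.fc1': 4, 'mlp.fc2': 8}
```

The output is the mixed-bit quantization scheme of claim 1: a per-layer bit-width map that satisfies the hardware budget while keeping total sensitivity minimal.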
CN202211647164.4A 2022-12-21 2022-12-21 Method and system for quantizing mixed bits of transformers Pending CN116227332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211647164.4A CN116227332A (en) 2022-12-21 2022-12-21 Method and system for quantizing mixed bits of transformers


Publications (1)

Publication Number Publication Date
CN116227332A true CN116227332A (en) 2023-06-06

Family

ID=86573908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211647164.4A Pending CN116227332A (en) 2022-12-21 2022-12-21 Method and system for quantizing mixed bits of transformers

Country Status (1)

Country Link
CN (1) CN116227332A (en)

Similar Documents

Publication Publication Date Title
Li et al. Auto-tuning neural network quantization framework for collaborative inference between the cloud and edge
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN107480789B (en) Efficient conversion method and device of deep learning model
CN106485316A (en) Neural network model compression method and device
US20210350233A1 (en) System and Method for Automated Precision Configuration for Deep Neural Networks
CN112287986B (en) Image processing method, device, equipment and readable storage medium
CN110175641B (en) Image recognition method, device, equipment and storage medium
CN109961147B (en) Automatic model compression method based on Q-Learning algorithm
CN110929862B (en) Fixed-point neural network model quantification device and method
CN113343427B (en) Structural topology configuration prediction method based on convolutional neural network
CN113392973B (en) AI chip neural network acceleration method based on FPGA
US20210350230A1 (en) Data dividing method and processor for convolution operation
CN115510795A (en) Data processing method and related device
CN116450486B (en) Modeling method, device, equipment and medium for nodes in multi-element heterogeneous computing system
CN111831355A (en) Weight precision configuration method, device, equipment and storage medium
CN111831359A (en) Weight precision configuration method, device, equipment and storage medium
CN114819168B (en) Quantum comparison method and device for matrix eigenvalues
Guan et al. Using data compression for optimizing FPGA-based convolutional neural network accelerators
CN116227332A (en) Method and system for quantizing mixed bits of transformers
CN116011682A (en) Meteorological data prediction method and device, storage medium and electronic device
Chen et al. A DNN optimization framework with unlabeled data for efficient and accurate reconfigurable hardware inference
CN114219091A (en) Network model reasoning acceleration method, device, equipment and storage medium
CN113222121A (en) Data processing method, device and equipment
CN113760407A (en) Information processing method, device, equipment and storage medium
Wu et al. Accelerating deep convolutional neural network inference based on OpenCL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240121

Address after: Room 202B East 965, 2nd Floor, Building 1, No.1 Courtyard, Lize Middle Road, Chaoyang District, Beijing, 100102

Applicant after: Beijing Yixin Yiyu Microelectronics Technology Co.,Ltd.

Country or region after: China

Address before: 519, Floor 5, No. 68, North Fourth Ring West Road, Haidian District, Beijing, 100089

Applicant before: Beijing Shihai Xintu Microelectronics Co.,Ltd.

Country or region before: China

CB02 Change of applicant information

Country or region after: China

Address after: Room 965 East, Building 1, 2nd Floor, 202B, No.1 Courtyard, Lize Middle Road, Chaoyang District, Beijing, 100020

Applicant after: Beijing Qianhe Yibang Cloud Information Technology Co.,Ltd.

Address before: Room 202B East 965, 2nd Floor, Building 1, No.1 Courtyard, Lize Middle Road, Chaoyang District, Beijing, 100102

Applicant before: Beijing Yixin Yiyu Microelectronics Technology Co.,Ltd.

Country or region before: China
