WO2024067563A1 - Task processing method, apparatus, device and storage medium based on model quantization - Google Patents

Task processing method, apparatus, device and storage medium based on model quantization

Info

Publication number: WO2024067563A1
Authority: WIPO (PCT)
Prior art keywords: optimization unit, quantization, output, weight, floating
Application number: PCT/CN2023/121487
Other languages: English (en), French (fr)
Inventors: 蓝朝祥, 李哲暘, 张凯
Applicant and original assignee: 杭州海康威视数字技术股份有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00: Computing arrangements using knowledge-based models
    • G06N 5/04: Inference or reasoning models

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, and in particular to a task processing method, device, equipment and storage medium based on model quantization.
  • MACs: multiply-accumulate operations.
  • using lower-precision data formats (such as the low-precision floating-point numbers float16 and float8, or the low-precision integers int8 and int4) instead of 32-bit floating-point operations can reduce resource consumption and improve model training and/or inference speed.
  • the present disclosure provides a task processing method, apparatus, device and storage medium based on model quantization.
  • a task processing method based on model quantization comprising:
  • the weight quantization coefficient and the activation quantization coefficient of the optimization unit are updated to obtain the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit; wherein the first quantized output includes the result of the optimization unit output when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient, and the input and output of the optimization unit are quantized according to the activation quantization coefficient, and the first floating-point output includes the result of the optimization unit output when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
  • under the condition that the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit are fixed, the weight quantization increment of the optimization unit is updated to obtain the target weight quantization increment of the optimization unit; wherein the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input and output of the optimization unit are quantized according to the target activation quantization coefficient, and the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is the floating-point weight parameter and the input is a floating-point input;
  • a task processing device based on model quantization comprising:
  • a first determination unit is used for updating the weight quantization coefficient and activation quantization coefficient of any optimization unit in the Transformer model according to a first difference between the first quantized output and the first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain a target weight quantization coefficient and a target activation quantization coefficient of the optimization unit; wherein the first quantized output includes a result of the optimization unit output when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient, and the input and output of the optimization unit are quantized according to the activation quantization coefficient, and the first floating-point output includes a result of the optimization unit output when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
  • a second determining unit is used to update the weight quantization increment of the optimization unit according to a second difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing the second difference, while fixing the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit, so as to obtain the target weight quantization increment of the optimization unit; wherein the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input and output of the optimization unit are quantized according to the target activation quantization coefficient, and the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is the floating-point weight parameter and the input is a floating-point input;
  • a quantization unit used to determine the weight quantization rounding direction of the optimization unit according to the target weight quantization increment; quantize the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantization weight parameter of the optimization unit;
  • the task processing unit is used to perform forward reasoning calculations on the input data of any optimization unit according to the target quantization weight parameters corresponding to the optimization unit when the Transformer model is used for task processing, and to quantize the input/output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
  • an electronic device comprising a processor and a memory, wherein the memory stores machine executable instructions that can be executed by the processor, and the processor is used to execute the machine executable instructions to implement the method provided in the first aspect.
  • a storage medium wherein machine executable instructions are stored in the storage medium, and when the machine executable instructions are executed by a processor, the method provided in the first aspect is implemented.
  • FIG. 1 is a schematic flowchart of a task processing method based on model quantization provided by an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present disclosure;
  • FIG. 3 is a schematic flowchart of optimizing quantization coefficients provided by an embodiment of the present disclosure;
  • FIG. 4 is a schematic structural diagram of a task processing device based on model quantization provided by an embodiment of the present disclosure;
  • FIG. 5 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a task processing method based on model quantization provided in an embodiment of the present disclosure.
  • the task processing method based on model quantization may include the following steps:
  • Step S100 For any optimization unit in the Transformer model, based on the first difference between the first quantized output and the first floating-point output of the optimization unit, the weight quantization coefficient and the activation quantization coefficient of the optimization unit are updated with the goal of minimizing the first difference, so as to obtain the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit; wherein, the first quantized output includes the result of the optimization unit output when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input/output of the optimization unit is quantized according to the activation quantization coefficient, and the first floating-point output includes the result of the optimization unit output when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input.
  • the input/output of the optimization unit may include input/output data, which is calculated in a quantized state.
  • Step S110 under the condition that the floating-point weight parameters, target weight quantization coefficient and target activation quantization coefficient of the optimization unit are fixed, based on the second difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing the second difference, the weight quantization increment of the optimization unit is updated to obtain the target weight quantization increment of the optimization unit; wherein the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input/output of the optimization unit is quantized according to the target activation quantization coefficient, and the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input.
  • quantization means that during the model calculation process, the parameters and features represented by floating-point numbers are approximated with fixed-point values to improve the operation speed.
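As a concrete illustration of this definition (a standard uniform-quantization convention, not a formula quoted from the disclosure): a floating-point value x with quantization coefficient (scale) s and signed bit width b would be mapped to a fixed-point code x_q and approximately recovered as

\[ x_q = \mathrm{clip}\!\left(\mathrm{Round}\!\left(\frac{x}{s}\right),\, -2^{b-1},\, 2^{b-1}-1\right), \qquad \hat{x} = s \cdot x_q . \]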
  • the Transformer model is a new type of neural network that can learn context and thus meaning by tracking relationships in a sequence.
  • the Transformer model can be trained in parallel relatively well and uses a self-attention mechanism.
  • RNN: recurrent neural network.
  • compared with traditional neural network models such as RNNs, the greater complexity and multiplied number of network parameters of the Transformer model dramatically increase its demand for computing power and data storage. Quantizing the model can reduce its demand for computing power and storage, but at the cost of reducing the model's final performance.
  • different Transformer models use different precisions; in general, the floating-point values used can be float32 or float16.
  • Floating point weight parameters, floating point outputs, and floating point inputs are numerical expressions before quantization and compression.
  • the quantization coefficient finally used will affect the performance of the quantization model
  • the round-to-nearest method is generally used for quantization rounding.
  • however, not every quantization process is best served by round-to-nearest as the quantization rounding method. Therefore, an adaptive quantization rounding direction (such as rounding up or rounding down) can reduce the difference in output characteristics between the quantized model and the floating-point model, thereby improving the performance of the quantized model.
  • the performance of the Transformer quantization model is improved by optimizing the quantization coefficients and the quantization rounding direction.
  • the Transformer model can be viewed as a plurality of modules (which can be referred to as Blocks) stacked in a certain order.
  • a Block may include a functional unit consisting of a single layer or multiple layers.
  • the optimization units in a Transformer model may include: Transformer Stage, Transformer Block, or a single linear layer.
  • a layer is a unit composed of a single operation. Layers stacked together in a specific order can be considered as a whole called a block. Blocks of the same type are stacked together to form a stage. Transformer blocks include but are not limited to encoder blocks and decoder blocks.
  • considering that multiple layers in the Transformer model influence one another, if the quantization coefficients and quantization rounding directions of these layers are optimized separately without taking the mutual influence between the layers into account, the optimization performance may be poor.
  • the Transformer Block is generally used as the basic optimization unit.
  • a single linear layer can also be used as an optimization unit.
  • in the process of quantizing the Transformer model, a Block can be used as the basic unit (which can be called an optimization unit), and the quantization coefficient and quantization rounding direction of each optimization unit can be optimized in sequence, with the goal of minimizing the output difference of the optimization unit before and after quantization.
  • the Transformer model may include multiple optimization units.
  • the task processing method provided in the embodiments of the present disclosure may be performed on all or part of the multiple optimization units.
  • the quantization coefficients may include weight quantization coefficients (used to quantize weight parameters) and activation quantization coefficients.
  • the activation quantization coefficient may include an input/output quantization coefficient used to quantize input data/output data.
  • the output result of the optimization unit (referred to herein as the first quantized output) can be obtained when the weight parameter of the optimization unit is quantized according to the current weight quantization coefficient and the input/output of the optimization unit is quantized according to the current activation quantization coefficient.
  • the floating-point output result of the optimization unit (referred to as the first floating-point output herein) is obtained when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input, and according to the difference between the first quantized output and the first floating-point output, with the goal of minimizing the difference, the current weight quantization coefficient and the current activation quantization coefficient of the optimization unit are updated to obtain the weight quantization coefficient (referred to as the target weight quantization coefficient herein) and the activation quantization coefficient (referred to as the target activation quantization coefficient herein) used by the optimization unit in the actual quantization process.
  • before the first update is completed, the values of the current weight quantization coefficient and the current activation quantization coefficient can be the initial values; after the first update is completed, they can be the values of the weight quantization coefficient and the activation quantization coefficient when the last update was completed.
  • when obtaining the first floating-point output of an optimization unit, all optimization units in the Transformer model can be in floating-point mode; when obtaining the first quantized output of the optimization unit, the optimization units preceding it can be in quantized mode.
  • the prepared data (such as multiple unlabeled images) can be input into the Transformer model in batches.
  • the weight quantization coefficient and activation quantization coefficient of the optimization unit can be updated based on the difference between the first floating-point output and the first quantized output corresponding to the batch of data.
  • multiple epochs (one epoch means inputting all prepared data into the Transformer model for updating) can also be used to iterate the update of the weight quantization coefficient and activation quantization coefficient of the optimization unit.
  • the activated quantization coefficient of the optimization unit may include a quantization coefficient for input data and/or a quantization coefficient for output data.
  • in the case where the activation quantization coefficient of an optimization unit is the quantization coefficient for the input data, the activation quantization coefficient of the last optimization unit may also include the quantization coefficient for the output data (the activation quantization coefficient of the first optimization unit is also the quantization coefficient for the input data).
  • in the case where the activation quantization coefficient of an optimization unit is the quantization coefficient for the output data, the activation quantization coefficient of the first optimization unit may also include the quantization coefficient for the input data (the activation quantization coefficient of the last optimization unit is also the quantization coefficient for the output data).
  • the quantization rounding direction can also be optimized to determine the optimal quantization rounding direction, rather than performing quantization rounding in a fixed rounding manner.
  • the activation quantization coefficient depends on the quantization accuracy (such as the number of bits after quantization) and the quantization range.
  • the quantization range of the input of the Transformer model can be obtained through statistics, and the quantization accuracy can be 8bit, 4bit, etc.
  • the weights remain fixed during inference, so the optimization of the quantization rounding direction only needs to be performed once and will not bring new inference computing consumption.
  • the floating-point weight parameters, target weight quantization coefficients, and target activation quantization coefficients of the optimization unit can be fixed, and the quantized output and floating-point output of the optimization unit can be obtained respectively when the input data of the Transformer model is the same.
  • when obtaining the quantized output of the optimization unit, the weight parameter of the optimization unit can be quantized according to the target weight quantization coefficient, the quantized weight parameter can be adjusted according to the current weight quantization increment, and the input/output of the optimization unit can be quantized according to the target activation quantization coefficient, so as to obtain the output result of the optimization unit in this case (referred to herein as the second quantized output).
  • the weight quantization increment is used to adjust the quantization weight parameter of the optimization unit (the quantization weight parameter is obtained by quantizing the weight parameter according to the target weight quantization coefficient), giving the quantization weight parameter an increment in the range of 0 to 1, so as to determine the optimal quantization rounding direction (rounding up or rounding down) when quantizing the weight parameter.
  • the inventors of the present disclosure have found that round-to-nearest has the smallest error locally, but in actual measurements, especially in low-precision quantization scenarios, round-to-nearest is not optimal for the overall performance of the quantized model, because quantization errors continuously accumulate and amplify during forward inference. Based on this, in the embodiments of the present disclosure, in order to keep the difference between the output of the quantized model and the output of the floating-point model as small as possible (in line with expectations), an adaptive rounding method of rounding up or rounding down is adopted.
  • the difference between the second quantized output and the second floating-point output is smaller when the increment of the quantized weight parameter according to the weight quantization increment is closer to 1, it indicates that when the weight parameter is quantized, the performance of rounding up is better; if the difference between the second quantized output and the second floating-point output is smaller when the increment of the quantized weight parameter according to the weight quantization increment is closer to 0, it indicates that when the weight parameter is quantized, the performance of rounding down is better.
  • the output of the optimization unit (referred to herein as the second floating-point output) can be obtained when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input.
  • the current weight quantization increment can be the initial value; after the first update is completed, the current weight quantization increment value can be the value of the weight quantization increment when the last update was completed.
  • the weight quantization increment of the optimization unit can be iteratively updated with the goal of minimizing the difference to obtain a final weight quantization increment of the optimization unit (referred to herein as a target weight quantization increment).
  • the target weight quantization increment can be used to determine the rounding direction of the weight quantization of the optimization unit.
  • Step S120 determine the weight quantization rounding direction of the optimization unit according to the target weight quantization increment, and quantize the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantization weight parameter of the optimization unit.
  • the weight parameter of the optimization unit can be quantized according to the target weight quantization coefficient and the determined weight quantization rounding direction of the optimization unit, to obtain the final quantized weight parameter of the optimization unit (which can be called the target quantization weight parameter).
  • Step S130 When using the Transformer model for task processing, for any optimization unit, forward inference calculation is performed on the input data of the optimization unit according to the target quantization weight parameter corresponding to the optimization unit, and the input/output of the optimization unit is quantized according to the target activation quantization coefficient of the optimization unit.
  • forward inference calculations can be performed on the input data of each optimization unit based on the target quantization weight parameters of the optimization unit, and the input/output of the optimization unit can be quantized based on the target activation quantization coefficient of the optimization unit.
  • if the target activation quantization coefficient of an optimization unit includes a quantization coefficient for the input data, the optimization unit can quantize its input data according to that coefficient and perform forward inference calculations on the quantized input data according to its target quantization weight parameters, thereby improving the model inference speed while ensuring the model performance, and thus improving the task processing efficiency.
  • the input data of the optimization unit can be quantized according to the target activation quantization coefficient, the quantized input data is then used to perform the forward inference calculation of the optimization unit to obtain the initial output data of the optimization unit, and the initial output data is quantized according to the target activation quantization coefficient.
  • during the forward inference calculation, the parameters involved are quantized according to the target quantization weight parameters of the optimization unit (a code sketch of this data flow follows below).
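To make the inference-time data flow concrete, here is a minimal PyTorch-style sketch; the helper names (fake_quant, unit_forward) and the simulated-quantization approach are illustrative assumptions, not code from the disclosure:

```python
import torch

def fake_quant(x, s, b=8):
    # Uniform quantization: scale by s, round, clamp to the signed b-bit
    # range, then rescale back to the floating-point domain for simulation.
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    return s * torch.clamp(torch.round(x / s), qmin, qmax)

def unit_forward(x, weight_q, s_in, s_out):
    # weight_q: target quantized weight parameters (already rounded with the
    # learned rounding direction); s_in / s_out: the unit's target activation
    # quantization coefficients for input and output data.
    x = fake_quant(x, s_in)                          # quantize the unit's input
    y = torch.nn.functional.linear(x, weight_q)      # forward inference calculation
    return fake_quant(y, s_out)                      # quantize the unit's output
```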
  • the above-mentioned task processing includes but is not limited to NLP (Natural Language Processing) tasks (such as machine translation), or Speech (voice) tasks (such as speech recognition), or CV (Computer Vision) tasks (such as classification, target detection or target tracking tasks, etc.), etc.
  • NLP: Natural Language Processing
  • Speech: voice
  • CV: Computer Vision
  • the above-mentioned updating the weight quantization coefficient and the activation quantization coefficient of the optimization unit based on the first difference between the first quantized output and the first floating-point output of the optimization unit with the goal of minimizing the first difference to obtain the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit may include:
  • the weight quantization coefficient and activation quantization coefficient of the optimization unit are iteratively updated using the gradient descent algorithm until the first preset end condition is reached, thereby obtaining the target weight quantization coefficient and target activation quantization coefficient of the optimization unit.
  • the quantization loss (which may be referred to as the first quantization loss) of the optimization unit can be determined based on the first quantization output and the first floating-point output of the optimization unit using a preset loss function (referred to as the first preset loss function in this article), such as a mean square error loss function, an absolute value loss function, etc., and based on the first quantization loss of the optimization unit, with the goal of minimizing the first quantization loss, the weight quantization coefficient and the activation quantization coefficient of the optimization unit are iteratively updated using a gradient descent algorithm until a first preset end condition is reached, such as the number of iterations reaching a preset maximum number of iterations, and/or the loss function converges, etc., to obtain the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit.
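A minimal sketch of this first optimization stage, assuming a PyTorch-style API and the mean square error loss; unit.quantized_forward is a hypothetical helper that runs the unit with weights and activations fake-quantized by the given coefficients (with a straight-through estimator so that rounding is differentiable):

```python
import torch

def optimize_quant_coeffs(unit, calib_batches, s_w_init, s_act_init,
                          lr=1e-3, max_iters=1000):
    # Learnable weight / activation quantization coefficients, initialized
    # e.g. from the statistically observed quantization range.
    s_w = torch.nn.Parameter(torch.tensor(s_w_init))
    s_act = torch.nn.Parameter(torch.tensor(s_act_init))
    opt = torch.optim.Adam([s_w, s_act], lr=lr)
    for it in range(max_iters):  # first preset end condition: max iterations
        x = calib_batches[it % len(calib_batches)]
        with torch.no_grad():
            fp_out = unit(x)                           # first floating-point output
        q_out = unit.quantized_forward(x, s_w, s_act)  # first quantized output
        loss = torch.mean((q_out - fp_out) ** 2)       # first quantization loss (MSE)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return s_w.detach(), s_act.detach()  # target coefficients
```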
  • the updating of the weight quantization increment of the optimization unit based on the second difference between the second quantized output and the second floating-point output of the optimization unit with the goal of minimizing the second difference to obtain the target weight quantization increment of the optimization unit may include:
  • the weight quantization increment of the optimization unit is iteratively updated using a gradient descent algorithm until a second preset end condition is reached, thereby obtaining the target weight quantization increment of the optimization unit.
  • the second quantization output and the second floating-point output of the optimization unit can be determined in the above manner, and based on the second quantization output and the second floating-point output of the optimization unit, a preset loss function (referred to as the second preset loss function herein), such as a mean square error loss function, an absolute value loss function, etc., is used to determine the quantization loss of the optimization unit (referred to as the second quantization loss herein), and based on the second quantization loss of the optimization unit, with the goal of minimizing the second quantization loss, the weight quantization increment of the optimization unit is iteratively updated using a gradient descent algorithm until a second preset end condition is reached, such as the number of iterations reaching a preset maximum number of iterations, and/or the loss function converges, etc., to obtain the target weight quantization increment of the optimization unit.
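A matching sketch of this second stage, under the same assumptions; V is the learnable weight quantization increment, t the annealed temperature, and unit.forward_with_weight a hypothetical helper that runs the unit with the supplied (soft-rounded) weights while activations are quantized with the fixed target coefficients:

```python
import torch

def optimize_rounding(unit, calib_batches, s_w, s_act, lr=1e-3, max_iters=1000):
    # Weight quantization increment, one entry per weight; after optimization
    # its sign determines the rounding direction (up vs. down).
    V = torch.nn.Parameter(torch.zeros_like(unit.weight))
    opt = torch.optim.Adam([V], lr=lr)
    for it in range(max_iters):  # second preset end condition: max iterations
        t = 10.0 * (1.0 - it / max_iters) + 0.1  # assumed annealing schedule for t
        x = calib_batches[it % len(calib_batches)]
        with torch.no_grad():
            fp_out = unit(x)  # second floating-point output
        # Soft-rounded weights: Floor plus a sigmoid-shaped increment in (0, 1).
        w_soft = s_w * (torch.floor(unit.weight / s_w) + torch.sigmoid(V / t))
        q_out = unit.forward_with_weight(x, w_soft, s_act)  # second quantized output
        loss = torch.mean((q_out - fp_out) ** 2)  # second quantization loss (MSE)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return V.detach()  # target weight quantization increment
```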
  • the first preset end condition and the second preset end condition may be the same or different and may be set according to actual conditions, and the present disclosure does not limit this.
  • the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient, and adjusted according to the weight quantization increment, which can be achieved by the following formula:
  • W′′ is the weight parameter of the optimization unit after quantization adjustment
  • W is the floating-point weight parameter of the optimization unit
  • s is the target weight quantization coefficient
  • V is the weight quantization increment, which is used to adjust the quantization rounding direction
  • t is a hyperparameter (temperature coefficient) that controls the steepness of the sigmoid function.
  • in the process of iteratively updating the weight quantization increment of the optimization unit, t gradually decreases, and the shape of the sigmoid function gradually approaches a step function.
  • Floor() is a rounding-down operation.
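The formula itself does not survive in this text-only rendering; based on the variable definitions above and the stated behavior (an increment between 0 and 1 that hardens into a step as t decreases), a plausible reconstruction is

\[ W'' = s \cdot \left( \mathrm{Floor}\!\left(\frac{W}{s}\right) + \mathrm{sigmoid}\!\left(\frac{V}{t}\right) \right), \]

so that as t shrinks, sigmoid(V/t) approaches 0 for V < 0 and 1 for V > 0.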
  • that is, the Round function (i.e., round-to-nearest) in the quantization rounding operation can be replaced with the Floor function (i.e., the rounding-down function); the weight parameters of the optimization unit are quantized according to the target weight quantization coefficient of the optimization unit, and the increment added to the quantized weight parameters is determined from the current weight quantization increment using the sigmoid function.
  • the sigmoid function is a common curve function with a monotonic property, which can map variables to between 0 and 1.
  • when the variable is less than 0, the value of the sigmoid function is less than 0.5; when the variable is greater than 0, the value of the sigmoid function is greater than 0.5.
  • the optimal weight quantization increment i.e., the target weight quantization increment
  • the optimal weight quantization rounding direction of the optimization unit can be determined based on the target weight quantization increment.
  • Wq is the final (target) quantization weight parameter of the optimization unit
  • W is the floating-point weight parameter of the optimization unit
  • s is the target weight quantization coefficient
  • V is the target weight quantization increment
  • Floor() is the rounding down operation
  • (V>0?1:0) means that when V>0 is true, the value is 1, and when V>0 is not true, the value is 0.
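The corresponding formula is likewise missing from this rendering; from the variable definitions above, a consistent reconstruction of the final quantization is

\[ W_q = s \cdot \left( \mathrm{Floor}\!\left(\frac{W}{s}\right) + (V>0\,?\,1:0) \right). \]

Whether the factor s is kept here, or only the integer code Floor(W/s) + (V>0?1:0) is stored with s applied at dequantization, is a deployment convention rather than something this text specifies.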
  • the optimal weight quantization rounding direction can be determined according to the target weight quantization increment.
  • when the target weight quantization increment is greater than 0, the optimal weight quantization rounding direction can be determined to be rounding up; when the target weight quantization increment is less than 0, the optimal weight quantization rounding direction can be determined to be rounding down.
  • determining the quantization loss of the optimization unit using a preset loss function includes:
  • determining the standard deviation of the floating-point output of the optimization unit along a certain dimension; dividing the quantized output of the optimization unit by the standard deviation to obtain the processed quantized output; and dividing the floating-point output of the optimization unit by the standard deviation to obtain the processed floating-point output;
  • determining the quantization loss of the optimization unit based on the processed quantized output and the processed floating-point output using a mean square error loss function.
  • the value of the first dimension may be 0-1
  • the value of the second dimension may be 10-100
  • the value of the third dimension may be 100-1000.
  • if the quantization loss is calculated directly from the floating-point output and quantized output of the optimization unit using the mean square error loss function, the calculation result will focus too much on the values of certain dimensions, affecting the optimization effect.
  • the standard deviation of the floating-point output of the optimization unit along a certain dimension can be first calculated based on the floating-point output of the optimization unit.
  • the standard deviation of the first floating-point output of the optimization unit along a certain dimension can be determined based on the first floating-point output of the optimization unit, and the first quantized output of the optimization unit can be divided by the standard deviation to obtain the processed first quantized output, and the first floating-point output of the optimization unit can be divided by the standard deviation to obtain the processed first floating-point output. Then, based on the processed first quantized output and the processed first floating-point output, the first quantization loss of the optimization unit can be determined using the mean square error loss function.
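A compact sketch of this normalized loss, assuming the standard deviation is taken along the last (feature) dimension of the floating-point output; the small epsilon guarding against division by zero is an added safeguard, not part of the disclosure:

```python
import torch

def normalized_mse(fp_out, q_out, dim=-1, eps=1e-8):
    # Divide both outputs by the per-slice standard deviation of the
    # floating-point output so that dimensions with very different value
    # ranges contribute comparably to the mean square error.
    std = fp_out.std(dim=dim, keepdim=True) + eps
    return torch.mean((q_out / std - fp_out / std) ** 2)
```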
  • the main framework of artificial intelligence can include five levels: the application layer, algorithm layer, system layer, dependency layer, and device layer. Adjacent levels depend on each other from top to bottom; the framework faces practical applications at the top and the underlying hardware at the bottom.
  • Application layer: by analyzing the needs, locate the problem in the corresponding branch of artificial intelligence.
  • Algorithm layer: design the model training strategy, loss function, and subsequent compression algorithm based on the application scenario.
  • System layer: use the deep learning framework to build the model, train the built model, and perform computational graph analysis and model compression.
  • Dependency layer: the language or deep learning framework that implements the algorithm uses the device's external interfaces and protocols to call the corresponding device.
  • Device layer: composed of computing units; provides computing power support for artificial intelligence systems.
  • the difference in output characteristics between the quantization model and the floating-point model can be reduced by optimizing the weight quantization rounding direction and the quantization coefficient, thereby improving the performance of the quantization model.
  • Model to be quantized: swin-tiny ImageNet classification model
  • Time required for quantization: about 30 minutes
  • the time required for quantization is independent of the quantization precision and depends on the number of algorithm iterations and the model size.
  • each Transformer Block layer includes a Shortcut addition layer, and the corresponding Shortcut layer can be replaced by an Eltwise layer.
  • the Eltwise layer is a general term for a functional layer in a neural network, which is characterized by performing element-by-element (same position) addition calculations on two or more data blocks of the same size.
  • a corresponding compensation coefficient can be set for each channel of each data block, that is, compensation coefficients refined to the input channel level are proposed.
  • these compensation coefficients can compensate for the data range differences on each channel of each data block, and thus for the data range differences among the multiple data blocks, so that the data precision ranges of the multiple data blocks are also compensated. This converts the multiple data blocks into data blocks with the same data precision and aligns the data precision across different channels, so that the overall distribution variance of the element-level operation results obtained after element-level operations on the compensated data is reduced and the data precision is improved (a code sketch of this compensated addition follows the bias description below).
  • the summed result and the bias coefficient (which can be represented by bias) can be added.
  • the bias coefficient refers to a compensation coefficient used to correct the data zero-point drift.
  • the operation results are added to the bias coefficient. This can correct the zero-point drift after the element-level operation, reduce the zero-point drift that may occur in each data channel, and further reduce the data error of the element-level operation.
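A minimal sketch of such a compensated element-wise addition, assuming (N, C, ...) tensor layout and per-input-channel coefficient vectors; the names alpha, beta and bias are illustrative, not taken from the disclosure:

```python
import torch

def compensated_eltwise_add(a, b, alpha, beta, bias):
    # a, b: same-sized data blocks of shape (N, C, ...); alpha, beta:
    # per-channel compensation coefficients aligning the data ranges of the
    # two inputs; bias: per-channel compensation coefficient correcting
    # zero-point drift after the element-wise addition.
    shape = (1, -1) + (1,) * (a.dim() - 2)  # broadcast over the channel axis
    return a * alpha.view(shape) + b * beta.view(shape) + bias.view(shape)
```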
  • the precision of the quantization nodes is selectable.
  • xmax is the quantization boundary and nlevel is the number of quantization levels.
  • the quantization boundary depends on a boundary truncation method, such as max, percentile, OMSE, etc.
  • the number of quantization levels depends on the number of quantization bits (i.e., the precision) b.
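As an illustration of these dependencies (a common symmetric-quantizer convention, not a formula quoted from the disclosure): with bit width b and quantization boundary x_max, one would have

\[ n_{\mathrm{level}} = 2^{b}, \qquad s = \frac{x_{\max}}{2^{b-1}-1}, \]

where s is the resulting quantization coefficient for signed data truncated at x_max.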
  • the model can be viewed as a stack of modules in a certain order.
  • a module can be a single layer or a combination of multiple layers.
  • the module can be called a Block.
  • the optimization algorithm uses the Block as the basic unit (ie, the optimization unit), and reduces the difference in Block output characteristics before and after quantization by optimizing the quantization coefficient and the weight quantization rounding direction, thereby improving the performance of the quantization model.
  • the method for selecting the optimization unit in this embodiment is as follows:
  • a single Transformer Block is used as a basic optimization unit.
  • the Round function is used as the quantization rounding function
  • the quantization coefficient is regarded as a learnable parameter
  • the difference between the floating-point output and the quantization output of the same optimization unit is minimized to adjust the quantization coefficient (including the weight quantization coefficient and the activation quantization coefficient).
  • the floating-point weight parameters and the optimized quantization coefficients (i.e., the above-mentioned target weight quantization coefficients and target activation quantization coefficients) are then fixed.
  • the weight quantization rounding direction can be optimized by optimizing the weight quantization increment.
  • W′′ is the weight parameter after quantization adjustment of the optimization unit
  • W is the floating-point weight parameter of the optimization unit
  • s is the target weight quantization coefficient
  • V is the weight quantization increment
  • t is the hyperparameter (temperature coefficient) for controlling the sigmoid function, and in the process of iteratively updating the weight quantization increment of the optimization unit, t gradually decreases
  • Floor() is a rounding-down operation.
  • the iterative update process of V can be decomposed into execution steps that mirror the update described above: compute the second quantized output with the soft-rounded weights, compute the loss against the second floating-point output, update V by gradient descent, and gradually decrease the temperature t.
  • Wq is the target quantization weight parameter of the optimization unit
  • W is the floating-point weight parameter of the optimization unit
  • s is the target weight quantization coefficient
  • V is the target weight quantization increment
  • Floor() is the rounding down operation
  • (V>0?1:0) means that when V>0 is true, the value is 1, and when V>0 is not true, the value is 0.
  • the output feature of the optimization unit is usually multi-dimensional data, and the difference in the values of different dimensions may be relatively large.
  • the value of the first dimension may be 0-1
  • the value of the second dimension may be 10-100
  • the value of the third dimension may be 100-1000.
  • if the quantization loss is calculated directly from the floating-point output and quantized output of the optimization unit using the mean square error loss function, the calculation result will focus too much on the values of certain dimensions, affecting the optimization effect.
  • the standard deviation of the floating-point output features on each Token can be used to balance the Token differences between the floating-point output features and the quantization output features, and then calculate the mean square error.
  • the standard deviation of the floating-point output of the optimization unit along a certain dimension can be determined based on the floating-point output of the optimization unit. Then, the quantized output and the floating-point output of the optimization unit are divided by the standard deviation to obtain the processed quantized output and the processed floating-point output, and the mean square error is calculated based on the processed quantized output and the processed floating-point output.
  • the Transformer model quantized in the above manner can be used for natural language processing tasks, such as text similarity, text classification, machine translation, etc.; it can also be applied to speech tasks, such as speech recognition; it can also be used for visual tasks, such as image classification, target detection and target tracking.
  • License plate recognition refers to the technology that can detect vehicles on the monitored road and automatically extract vehicle license plate information (including Chinese characters, English letters, Arabic numerals and license plate color) for processing. License plate recognition is one of the important components of modern intelligent transportation systems and is widely used. It is based on digital image processing, pattern recognition, computer vision and other technologies to analyze the vehicle images or video sequences taken by the camera to obtain the unique license plate number of each car, thereby completing the recognition process. Through some subsequent processing methods, parking fee management, traffic flow control index measurement, vehicle positioning, car anti-theft, highway speeding automatic supervision, red light electronic police, highway toll stations and other functions can be realized. The steps of license plate recognition are mainly to locate the license plate position in the picture first, then segment the characters in the license plate, and finally recognize the segmented characters to form the license plate number.
  • license plate character recognition is implemented based on a neural network.
  • the Recognition task can be completed quickly and efficiently.
  • OCR: Optical Character Recognition
  • text recognition refers to the process of an electronic device (such as a scanner or digital camera) checking the characters printed on paper and then translating the shapes into computer text using character recognition methods; that is, scanning text data and then analyzing and processing the image file to obtain text and layout information.
  • Recognition speed is one of the main indicators for measuring the performance of an OCR system. By extracting text features using the quantized Transformer model provided by the method disclosed herein, the recognition speed can be improved, thereby improving the practicality of the OCR product.
  • Pedestrian retrieval is a technology that uses computer vision technology to determine whether there are specific pedestrians in an image or video sequence, which belongs to the problem of image retrieval.
  • FIG4 is a schematic diagram of a structure of a task processing device based on model quantization provided by an embodiment of the present disclosure.
  • the task processing device based on model quantization may include:
  • a first determining unit 410 is used for updating the weight quantization coefficient and activation quantization coefficient of any optimization unit in the Transformer model according to a first difference between the first quantized output and the first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain a target weight quantization coefficient and a target activation quantization coefficient of the optimization unit; wherein the first quantized output includes a result of the optimization unit output when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input/output of the optimization unit is quantized according to the activation quantization coefficient, and the first floating-point output is a result of the optimization unit output when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
  • a second determining unit 420 is used to update the weight quantization increment of the optimization unit according to the second difference between the second quantized output and the second floating-point output of the optimization unit with the goal of minimizing the second difference, while fixing the floating-point weight parameter of the optimization unit and the target weight quantization coefficient and the target activation quantization coefficient, so as to obtain the target weight quantization increment of the optimization unit;
  • the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input/output of the optimization unit is quantized according to the target activation quantization coefficient;
  • the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
  • a quantization unit 430 is used to determine the weight quantization rounding direction of the optimization unit according to the target weight quantization increment, and quantize the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantization weight parameter of the optimization unit;
  • the task processing unit 440 is used to perform forward inference calculations on the input data of any optimization unit according to the target quantization weight parameters corresponding to the optimization unit when the Transformer model is used for task processing, and to quantize the input/output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
  • the first determining unit 410 updates the weight quantization coefficient and the activation quantization coefficient of the optimization unit according to the first difference between the first quantized output and the first floating-point output of the optimization unit with the goal of minimizing the first difference, and obtains the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit, including:
  • the weight quantization coefficient and activation quantization coefficient of the optimization unit are iteratively updated using a gradient descent algorithm until the first preset end condition is reached, thereby obtaining the target weight quantization coefficient and target activation quantization coefficient of the optimization unit.
  • the second determining unit 420 updates the weight quantization increment of the optimization unit according to the second difference between the second quantized output and the second floating-point output of the optimization unit with the goal of minimizing the second difference to obtain the target weight quantization increment of the optimization unit, including:
  • the weight quantization increment of the optimization unit is iteratively updated using a gradient descent algorithm until a second preset end condition is reached, thereby obtaining the target weight quantization increment of the optimization unit.
  • the second determining unit 420 quantizes the weight parameter of the optimization unit according to the target weight quantization coefficient, and performs quantization adjustment according to the weight quantization increment, which is implemented by the following formula:
  • W′′ is the weight parameter after quantization adjustment of the optimization unit
  • W is the floating-point weight parameter of the optimization unit
  • s is the target weight quantization coefficient
  • V is the weight quantization increment
  • t is the hyperparameter for controlling the sigmoid function, and in the process of iteratively updating the weight quantization increment of the optimization unit, t gradually decreases
  • Floor() is a rounding-down operation.
  • Wq is the target quantization weight parameter of the optimization unit
  • W is the floating-point weight parameter of the optimization unit
  • s is the target weight quantization coefficient
  • V is the target weight quantization increment
  • Floor() is a rounding-down operation
  • (V>0?1:0) means that when V>0 is true, the value is 1, and when V>0 is not true, the value is 0.
  • the first determining unit 410 determines the first quantization loss of the optimization unit using a first preset loss function according to the first quantization output and the first floating-point output of the optimization unit, including:
  • a first quantized loss of the optimization unit is determined using a mean square error loss function.
  • the second determining unit 420 determines the second quantization loss of the optimization unit using a second preset loss function according to the second quantization output and the second floating-point output of the optimization unit, including:
  • a second quantization loss of the optimization unit is determined using a mean square error loss function.
  • the optimization unit in the Transformer model includes: Transformer Stage, Transformer Block, or a single linear layer.
  • An embodiment of the present disclosure provides an electronic device, including a processor and a memory, wherein the memory stores machine executable instructions that can be executed by the processor, and the processor is used to execute the machine executable instructions to implement the task processing method based on model quantization described above.
  • FIG. 5 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present disclosure.
  • the electronic device may include a processor 501 and a memory 502 storing machine executable instructions.
  • the processor 501 and the memory 502 may communicate via a system bus 503.
  • the processor 501 may execute the task processing method based on model quantization described above.
  • the memory 502 mentioned in this article can be any electronic, magnetic, optical or other physical storage device that can contain or store information, such as executable instructions, data, etc.
  • the machine-readable storage medium can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid state drive, any type of storage disk (such as a CD or DVD), or a similar storage medium, or a combination thereof.
  • a storage medium is also provided, such as the memory 502 in FIG. 5, which is a machine-readable storage medium; the storage medium stores machine-executable instructions, and when the machine-executable instructions are executed by the processor, the task processing method based on model quantization described above is implemented.
  • the storage medium can be ROM (Read-Only Memory), RAM, CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, etc.


Abstract

The present disclosure provides a task processing method, apparatus, device and storage medium based on model quantization. The task processing method includes: updating the weight quantization coefficient and activation quantization coefficient of an optimization unit in a Transformer model according to a first difference between a first quantized output and a first floating-point output of the optimization unit; updating the weight quantization increment of the optimization unit according to a second difference between a second quantized output and a second floating-point output of the optimization unit; determining the weight quantization rounding direction of the optimization unit according to the target weight quantization increment, and quantizing the weight parameters of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction; performing forward inference calculation on the input data of the optimization unit according to the target quantized weight parameters corresponding to the optimization unit, and quantizing the input and output of the optimization unit according to the target activation quantization coefficient of the optimization unit. This method can improve the accuracy of task processing.

Description

Task processing method, apparatus, device and storage medium based on model quantization
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to a task processing method, apparatus, device and storage medium based on model quantization.
Background
When a neural network (NN) is trained, or performs forward inference on a device, a large number of float32 (32-bit floating-point) multiply-accumulate operations (MACs) are required, which consumes storage and computing resources.
Using lower-precision data formats (such as the low-precision floating-point numbers float16 and float8, or the low-precision integers int8 and int4) instead of 32-bit floating-point operations can reduce resource consumption while improving model training and/or inference speed.
In traditional model quantization schemes, only the quantization coefficients are optimized, and the resulting quantized model usually suffers a large performance loss, which in turn affects the accuracy of task processing performed with the quantized model.
Summary
In view of this, the present disclosure provides a task processing method, apparatus, device and storage medium based on model quantization.
Specifically, the present disclosure is implemented through the following technical solutions:
According to a first aspect of the embodiments of the present disclosure, there is provided a task processing method based on model quantization, comprising:
for any optimization unit in a Transformer model, updating the weight quantization coefficient and the activation quantization coefficient of the optimization unit according to a first difference between a first quantized output and a first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain a target weight quantization coefficient and a target activation quantization coefficient of the optimization unit; wherein the first quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input and output of the optimization unit are quantized according to the activation quantization coefficient, and the first floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
in the case where the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit are fixed, updating the weight quantization increment of the optimization unit according to a second difference between a second quantized output and a second floating-point output of the optimization unit, with the goal of minimizing the second difference, to obtain a target weight quantization increment of the optimization unit; wherein the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input and output of the optimization unit are quantized according to the target activation quantization coefficient, and the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is the floating-point weight parameter and the input is a floating-point input;
determining the weight quantization rounding direction of the optimization unit according to the target weight quantization increment;
quantizing the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain a target quantized weight parameter of the optimization unit;
in the case where the Transformer model is used for task processing, for any optimization unit, performing forward inference calculation on the input data of the optimization unit according to the target quantized weight parameter corresponding to the optimization unit, and quantizing the input and output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
According to a second aspect of the embodiments of the present disclosure, there is provided a task processing apparatus based on model quantization, comprising:
a first determination unit, configured to, for any optimization unit in a Transformer model, update the weight quantization coefficient and the activation quantization coefficient of the optimization unit according to a first difference between a first quantized output and a first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain a target weight quantization coefficient and a target activation quantization coefficient of the optimization unit; wherein the first quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input and output of the optimization unit are quantized according to the activation quantization coefficient, and the first floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
a second determination unit, configured to, in the case where the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit are fixed, update the weight quantization increment of the optimization unit according to a second difference between a second quantized output and a second floating-point output of the optimization unit, with the goal of minimizing the second difference, to obtain a target weight quantization increment of the optimization unit; wherein the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input and output of the optimization unit are quantized according to the target activation quantization coefficient, and the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is the floating-point weight parameter and the input is a floating-point input;
a quantization unit, configured to determine the weight quantization rounding direction of the optimization unit according to the target weight quantization increment, and quantize the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain a target quantized weight parameter of the optimization unit;
a task processing unit, configured to, in the case where the Transformer model is used for task processing, for any optimization unit, perform forward inference calculation on the input data of the optimization unit according to the target quantized weight parameter corresponding to the optimization unit, and quantize the input/output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including a processor and a memory, wherein the memory stores machine-executable instructions that can be executed by the processor, and the processor is configured to execute the machine-executable instructions to implement the method provided in the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a storage medium, wherein machine-executable instructions are stored in the storage medium, and when the machine-executable instructions are executed by a processor, the method provided in the first aspect is implemented.
The technical solutions provided by the present disclosure can bring at least the following beneficial effects:
In the process of quantizing the Transformer model, not only the quantization coefficients (including the weight quantization coefficients and the activation quantization coefficients) but also the weight quantization rounding direction are optimized, which effectively improves the performance of the quantized model and, in turn, improves the accuracy of task processing performed with the quantized model.
Brief Description of Drawings
FIG. 1 is a schematic flowchart of a task processing method based on model quantization provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of optimizing quantization coefficients provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a task processing apparatus based on model quantization provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terms used in the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present disclosure, and to make the above objects, features and advantages of the embodiments of the present disclosure more obvious and understandable, the technical solutions in the embodiments of the present disclosure are further described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, which is a schematic flowchart of a task processing method based on model quantization provided by an embodiment of the present disclosure, as shown in FIG. 1, the task processing method based on model quantization may include the following steps:
Step S100: for any optimization unit in the Transformer model, according to the first difference between the first quantized output and the first floating-point output of the optimization unit, with the goal of minimizing the first difference, update the weight quantization coefficient and the activation quantization coefficient of the optimization unit to obtain the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit; wherein the first quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input/output of the optimization unit is quantized according to the activation quantization coefficient, and the first floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input. The input/output of the optimization unit may include input/output data, which is calculated in the quantized state.
Step S110: in the case where the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit are fixed, according to the second difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing the second difference, update the weight quantization increment of the optimization unit to obtain the target weight quantization increment of the optimization unit; wherein the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input/output of the optimization unit is quantized according to the target activation quantization coefficient, and the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input.
It should be noted that quantization means that, during model calculation, the parameters and features represented by floating-point numbers are approximated with fixed-point values to improve the operation speed. The Transformer model is a new type of neural network that can learn context, and thus meaning, by tracking relationships in a sequence. The Transformer model lends itself well to parallel training and uses a self-attention mechanism. However, compared with traditional neural network models such as RNNs (recurrent neural networks), the greater complexity and multiplied number of network parameters of the Transformer model dramatically increase its demand for computing power and data storage. Quantizing the model can reduce its demand for computing power and storage, but at the cost of reducing the model's final performance.
Different Transformer models use different precisions; in general, the floating-point format used can be float32 or float16. Floating-point weight parameters, floating-point outputs and floating-point inputs are the numerical representations before quantization and compression.
In the embodiments of the present disclosure, considering that in the process of quantizing the Transformer model the quantization coefficients finally used will affect the performance of the quantized model, it is necessary to determine suitable quantization coefficients to reduce the difference in output characteristics between the quantized model (i.e., the model obtained by quantizing the Transformer model) and the floating-point model (i.e., the unquantized Transformer model), thereby improving the performance of the quantized model.
In addition, in traditional model quantization schemes, round-to-nearest is generally used uniformly for quantization rounding, but round-to-nearest is not the optimal quantization rounding method for every quantization process. Therefore, an adaptive quantization rounding direction (such as rounding up or rounding down) can reduce the difference in output characteristics between the quantized model and the floating-point model, thereby improving the performance of the quantized model.
That is, in the embodiments of the present disclosure, the performance of the quantized Transformer model is improved by optimizing the quantization coefficients and the quantization rounding direction.
In the embodiments of the present disclosure, the Transformer model can be viewed as multiple modules (which may be called Blocks) stacked in a certain order.
Exemplarily, a Block may include a functional unit consisting of a single layer or multiple layers.
In one example, the optimization units in the Transformer model may include: a Transformer Stage, a Transformer Block, or a single linear layer.
Exemplarily, a layer is a unit composed of a single operation. Layers stacked together in a specific order can be regarded as a whole called a module (Block). Modules (Blocks) of the same type stacked together form a Stage. Transformer Blocks include, but are not limited to, encoder blocks and decoder blocks.
Exemplarily, considering that multiple layers in the Transformer model influence one another, if the quantization coefficients and quantization rounding directions of these layers are optimized separately without taking the mutual influence between the layers into account, the optimization performance may be poor.
Accordingly, in order to take the inter-layer influence into account during the optimization of the quantization coefficients and the quantization rounding direction, when determining the optimization units in the Transformer model, the Transformer Block is generally used as the basic optimization unit; in addition, a single linear layer can also be used as an optimization unit.
It should be noted that a single non-linear layer has no weight parameters and does not involve the quantization of weight parameters; therefore, there is no need to optimize the quantization rounding direction for a single non-linear layer.
In the embodiments of the present disclosure, in the process of quantizing the Transformer model, a Block can be used as the basic unit (which may be called an optimization unit), and with the goal of minimizing the output difference of each optimization unit before and after quantization, the quantization coefficients and quantization rounding directions of the optimization units can be optimized one by one in order.
The Transformer model may include multiple optimization units. In practical applications, the task processing method provided by the embodiments of the present disclosure can be performed on all or some of the multiple optimization units.
示例性的,量化系数可以包括权重量化系数(用于对权重参数进行量化)以及激活量化系数。
其中,激活量化系数可以包括输入/输出量化系数,用于对输入数据/输出数据进行量化。
In the embodiments of the present disclosure, for any optimization unit in the Transformer model, the output result of the optimization unit when its weight parameter is quantized according to the current weight quantization coefficient and its input/output is quantized according to the current activation quantization coefficient (referred to herein as the first quantized output) can be obtained. In addition, the floating-point output result of the optimization unit when its weight parameter is the floating-point weight parameter and its input is a floating-point input (referred to herein as the first floating-point output) can be obtained. Then, according to the difference between the first quantized output and the first floating-point output, with the goal of minimizing that difference, the current weight quantization coefficient and the current activation quantization coefficient of the optimization unit are updated to obtain the weight quantization coefficient (referred to herein as the target weight quantization coefficient) and the activation quantization coefficient (referred to herein as the target activation quantization coefficient) used by the optimization unit in the actual quantization process.
Exemplarily, for any optimization unit, in the process of updating its weight quantization coefficient and activation quantization coefficient, before the first update is completed, the current weight quantization coefficient and current activation quantization coefficient may take their initial values; after the first update is completed, they may take the values obtained when the previous update was completed.
Exemplarily, for any optimization unit, when obtaining the first floating-point output of the optimization unit, all optimization units in the Transformer model may be in floating-point mode; when obtaining the first quantized output of the optimization unit, all optimization units preceding this optimization unit may be in quantized mode.
It should be noted that, in the embodiments of the present disclosure, the difference between the first quantized output and the first floating-point output is determined with the same data input to the Transformer model.
For example, prepared data (such as multiple unlabeled images) may be input to the Transformer model batch by batch. When a batch of data is input to the Transformer model, for any optimization unit, the weight quantization coefficient and activation quantization coefficient of the optimization unit may be updated according to the difference between the first floating-point output and the first quantized output corresponding to that batch. Exemplarily, multiple epochs (one epoch means inputting all the prepared data to the Transformer model for updating) may also be used to iteratively update the weight quantization coefficient and activation quantization coefficient of the optimization unit.
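By way of non-limiting illustration, the per-unit update loop described above can be sketched in Python (PyTorch) as follows. The calibrate_unit function, the quant_unit wrapper and its weight_scale and act_scale parameters are names assumed for illustration only; the sketch also simplifies by feeding the same raw batch to both branches, whereas in the scheme above the quantized branch receives its input from the preceding quantized units:

    import torch

    def calibrate_unit(float_unit, quant_unit, batches, epochs=3, lr=1e-4):
        # quant_unit mirrors float_unit but fake-quantizes its weights with the
        # learnable scale weight_scale and its input/output with the learnable
        # scale act_scale (both assumed to be leaf tensors requiring gradients).
        optimizer = torch.optim.Adam(
            [quant_unit.weight_scale, quant_unit.act_scale], lr=lr)
        for _ in range(epochs):              # one epoch = one pass over all data
            for x in batches:                # batches of unlabeled calibration data
                with torch.no_grad():
                    y_float = float_unit(x)  # first floating-point output
                y_quant = quant_unit(x)      # first quantized output
                loss = ((y_quant - y_float) ** 2).mean()  # first difference
                optimizer.zero_grad()
                loss.backward()              # straight-through gradients assumed
                optimizer.step()
        return quant_unit.weight_scale.detach(), quant_unit.act_scale.detach()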
Exemplarily, for any optimization unit, when optimizing the quantization coefficients of the unit, the quantization coefficients of the input/output activation features of the unit need to be considered at the same time.
Specifically, for an optimization unit that is neither the first nor the last, the activation quantization coefficient of the optimization unit may include a quantization coefficient for the input data and/or a quantization coefficient for the output data.
In the case where the activation quantization coefficient of an optimization unit is the quantization coefficient for the input data, for the last optimization unit, the activation quantization coefficient of that unit may further include a quantization coefficient for the output data (the activation quantization coefficient of the first optimization unit is also the quantization coefficient for the input data).
In the case where the activation quantization coefficient of an optimization unit is the quantization coefficient for the output data, for the first optimization unit, the activation quantization coefficient of that unit may further include a quantization coefficient for the input data (the activation quantization coefficient of the last optimization unit is also the quantization coefficient for the output data).
In the embodiments of the present disclosure, to further optimize the performance of the quantized model, once the target weight quantization coefficient and target activation quantization coefficient of an optimization unit have been determined, the quantization rounding direction may also be optimized to determine the optimal rounding direction, rather than rounding according to the fixed round-to-nearest scheme.
Considering that, for any optimization unit, its output changes with the input of the Transformer model, and the input of the Transformer model cannot be predicted during actual task processing, the optimal quantization rounding direction of the input/output of an optimization unit cannot be determined or estimated through testing.
The activation quantization coefficient depends on the quantization precision (such as the number of bits after quantization) and the quantization range. The quantization range of the Transformer model's input can be obtained through statistics, and the quantization precision may be 8-bit, 4-bit, and so on.
Accordingly, for any optimization unit, the weights are fixed during inference, so the optimization of the quantization rounding direction only needs to be performed once and introduces no additional inference computation cost.
Exemplarily, for any optimization unit, the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit may be fixed, and the quantized output and the floating-point output of the optimization unit may be obtained separately with the same input data to the Transformer model.
Exemplarily, when obtaining the quantized output of the optimization unit, the weight parameter of the optimization unit may be quantized according to the above target weight quantization coefficient and the quantized weight parameter may be adjusted according to the current weight quantization increment, while the input/output of the optimization unit is quantized according to the above target activation quantization coefficient; the output result of the optimization unit in this case (referred to herein as the second quantized output) is then obtained.
Exemplarily, for any optimization unit, the weight quantization increment is used to adjust the quantized weight parameter of the optimization unit (the quantized weight parameter is obtained by quantizing the weight parameter according to the target weight quantization coefficient), giving the quantized weight parameter an increment in the range 0 to 1, so as to determine the optimal quantization rounding direction (rounding up or rounding down) when quantizing the weight parameter.
The inventors of the present disclosure found that round-to-nearest rounding is locally optimal in terms of error, but in actual measurements, especially in low-precision quantization scenarios, round-to-nearest rounding is not optimal for the overall performance of the quantized model, because quantization errors accumulate and amplify continuously during forward inference. Based on this, in the embodiments of the present disclosure, to keep the difference between the output of the quantized model and that of the floating-point model as small as possible (matching expectations), an adaptive rounding-up or rounding-down scheme is adopted.
For example, if the closer the increment given to the quantized weight parameter by the weight quantization increment is to 1, the smaller the difference between the second quantized output and the second floating-point output, this indicates that rounding up performs better when quantizing the weight parameter; if the closer the increment is to 0, the smaller the difference, this indicates that rounding down performs better when quantizing the weight parameter.
Exemplarily, when obtaining the floating-point output of the optimization unit, the output of the optimization unit when its weight parameter is the floating-point weight parameter and its input is a floating-point input (referred to herein as the second floating-point output) can be obtained.
Exemplarily, for any optimization unit, in the process of updating its weight quantization increment, before the first update is completed, the current weight quantization increment may take its initial value; after the first update is completed, it may take the value obtained when the previous update was completed.
It should be noted that, for any optimization unit, when the input of the Transformer model is the same, the above first floating-point output and second floating-point output are identical.
Exemplarily, according to the difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing that difference, the weight quantization increment of the optimization unit may be iteratively updated to obtain the final weight quantization increment of the optimization unit (referred to herein as the target weight quantization increment), which may be used to determine the weight quantization rounding direction of the optimization unit.
Step S120: determine the weight quantization rounding direction of the optimization unit according to the target weight quantization increment, and quantize the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantized weight parameter of the optimization unit.
In the embodiments of the present disclosure, for any optimization unit, once the target weight quantization coefficient and target weight quantization increment of the optimization unit have been determined in the above manner, the weight parameter of the optimization unit may be quantized according to the target weight quantization coefficient and the determined weight quantization rounding direction of the optimization unit, to obtain the final quantized weight parameter of the optimization unit (which may be called the target quantized weight parameter).
Step S130: when the Transformer model is used for task processing, for any optimization unit, perform forward inference computation on the input data of the optimization unit according to the target quantized weight parameter corresponding to the optimization unit, and quantize the input/output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
In the embodiments of the present disclosure, when the Transformer model is used for task processing, forward inference computation may be performed on the input data of each optimization unit according to the unit's target quantized weight parameter, and the input/output of the unit may be quantized according to the unit's target activation quantization coefficient.
Exemplarily, for any optimization unit, if the target activation quantization coefficient of the optimization unit includes a quantization coefficient for the input data, the optimization unit may quantize its input data according to that coefficient and perform forward inference computation on the quantized input data according to its target quantized weight parameter, which improves the model inference speed, and hence the task processing efficiency, while maintaining model performance.
Exemplarily, for any optimization unit, the input data of the optimization unit may be quantized according to the target activation quantization coefficient, the quantized input data may then be used in the forward inference computation of the optimization unit to obtain the initial output data of the optimization unit, and the initial output data may be quantized according to the target activation quantization coefficient. The parameters involved in the forward inference computation are quantized according to the target quantized weight parameter of the optimization unit.
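As a non-limiting sketch of this per-unit inference step, assuming uniform symmetric quantization and a single linear layer as the optimization unit (quantize_activation and unit_forward are illustrative names, and w_q is assumed to be the weight already dequantized back to real scale, i.e., Wq multiplied by s):

    import torch
    import torch.nn.functional as F

    def quantize_activation(x, scale, bits=8):
        # Uniform symmetric fake quantization of an activation tensor.
        qmax = 2 ** (bits - 1) - 1
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    def unit_forward(x, w_q, bias, in_scale, out_scale):
        x_q = quantize_activation(x, in_scale)    # quantize the input data
        y = F.linear(x_q, w_q, bias)              # forward inference computation
        return quantize_activation(y, out_scale)  # quantize the initial output data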
Exemplarily, the above task processing includes, but is not limited to, NLP (Natural Language Processing) tasks (such as machine translation), Speech tasks (such as speech recognition), or CV (Computer Vision) tasks (such as classification, object detection or object tracking tasks), etc.
It can be seen that, in the method flow shown in FIG. 1, in the process of quantizing the Transformer model, not only the quantization coefficients (including the weight quantization coefficient and the activation quantization coefficient) but also the weight quantization rounding direction are optimized, which effectively improves the performance of the quantized model and, in turn, the accuracy of task processing when the quantized model is used for task processing.
In some embodiments, updating the weight quantization coefficient and activation quantization coefficient of the optimization unit according to the first difference between the first quantized output and the first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain the target weight quantization coefficient and target activation quantization coefficient of the optimization unit, may include:
determining a first quantization loss of the optimization unit according to the first quantized output and the first floating-point output of the optimization unit, using a first preset loss function;
iteratively updating the weight quantization coefficient and activation quantization coefficient of the optimization unit using a gradient descent algorithm according to the first quantization loss of the optimization unit, with the goal of minimizing the first quantization loss, until a first preset end condition is reached, to obtain the target weight quantization coefficient and target activation quantization coefficient of the optimization unit.
Exemplarily, for any optimization unit, in the process of determining its target weight quantization coefficient and target activation quantization coefficient, the quantization loss of the optimization unit (which may be called the first quantization loss) may be determined according to the first quantized output and the first floating-point output of the optimization unit using a preset loss function (referred to herein as the first preset loss function), such as a mean squared error loss function or an absolute-value loss function. Then, according to the first quantization loss, with the goal of minimizing it, the weight quantization coefficient and activation quantization coefficient of the optimization unit are iteratively updated using a gradient descent algorithm until a first preset end condition is reached, such as the number of iterations reaching a preset maximum and/or the loss function converging, to obtain the target weight quantization coefficient and target activation quantization coefficient of the optimization unit.
In some embodiments, updating the weight quantization increment of the optimization unit according to the second difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing the second difference, to obtain the target weight quantization increment of the optimization unit, may include:
determining a second quantization loss of the optimization unit according to the second quantized output and the second floating-point output of the optimization unit, using a second preset loss function;
iteratively updating the weight quantization increment of the optimization unit using a gradient descent algorithm according to the second quantization loss of the optimization unit, with the goal of minimizing the second quantization loss, until a second preset end condition is reached, to obtain the target weight quantization increment of the optimization unit.
Exemplarily, for any optimization unit, once the target weight quantization coefficient of the optimization unit has been determined, the second quantized output and second floating-point output of the optimization unit may be determined in the above manner; the quantization loss of the optimization unit (referred to herein as the second quantization loss) is determined from the second quantized output and second floating-point output using a preset loss function (referred to herein as the second preset loss function), such as a mean squared error loss function or an absolute-value loss function; then, according to the second quantization loss, with the goal of minimizing it, the weight quantization increment of the optimization unit is iteratively updated using a gradient descent algorithm until a second preset end condition is reached, such as the number of iterations reaching a preset maximum and/or the loss function converging, to obtain the target weight quantization increment of the optimization unit.
The first preset end condition and the second preset end condition may be the same or different; they may be set according to the actual situation, and the present disclosure imposes no limitation on this.
In an example, quantizing the weight parameter of the optimization unit according to the target weight quantization coefficient and adjusting the quantization according to the weight quantization increment may be implemented by the following formula:
W″ = Floor(W/s) + sigmoid(V/t)
where W″ is the quantization-adjusted weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the weight quantization increment, used to adjust the quantization rounding direction, and t is a hyper-parameter (temperature coefficient) controlling the steepness of the sigmoid function; in the process of iteratively updating the weight quantization increment of the optimization unit, t gradually decreases and the shape of the sigmoid function gradually degenerates into a step function; Floor() is the floor (round-down) operation.
Exemplarily, for any optimization unit, in the process of determining the optimal weight quantization rounding direction of the optimization unit, the quantization rounding function may be replaced from the Round function (i.e., the round-to-nearest function) with the Floor function (i.e., the round-down function); the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient of the optimization unit, and the increment applied to the quantized weight parameter is determined from the current weight quantization increment using the sigmoid function.
The sigmoid function is a common monotonically increasing curve function that maps a variable into the range 0 to 1. When the variable is less than 0, the value of the sigmoid function is less than 0.5; when the variable is greater than 0, the value of the sigmoid function is greater than 0.5.
Through the above formula, when V is greater than 0, an increment greater than 0.5 is added to the quantized weight parameter, and the larger V/t is, the closer the increment is to 1; when V is less than 0, an increment less than 0.5 is added, and the smaller V/t is, the closer the increment is to 0. Therefore, by iteratively updating V, the optimal weight quantization increment (i.e., the target weight quantization increment) can be determined, and the optimal weight quantization rounding direction of the optimization unit can be determined according to the target weight quantization increment.
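A minimal sketch of this soft quantization, following the formula W″ = Floor(W/s) + sigmoid(V/t) above (PyTorch; the concrete values below are illustrative only):

    import torch

    def soft_quantize_weight(W, s, V, t):
        # W'' = Floor(W / s) + sigmoid(V / t): the sigmoid term is an increment
        # in (0, 1) that drifts toward 0 or 1 as the temperature t decreases.
        return torch.floor(W / s) + torch.sigmoid(V / t)

    W = torch.tensor([0.37])    # floating-point weight parameter
    s = torch.tensor(0.10)      # target weight quantization coefficient
    V = torch.tensor([2.0])     # positive increment, favoring rounding up
    for t in (1.0, 0.1, 0.01):
        print(soft_quantize_weight(W, s, V, t))
    # Floor(0.37 / 0.1) = 3; the printed values move from about 3.88 toward 4.0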
In some embodiments, quantizing the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantized weight parameter of the optimization unit may be implemented by the following formula:
Wq=Floor(W/s)+(V>0?1:0)
where Wq is the final (target) quantized weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the target weight quantization increment, Floor() is the floor (round-down) operation, and (V>0?1:0) evaluates to 1 when V>0 holds and to 0 otherwise.
Exemplarily, once the target weight quantization increment has been determined in the above manner, the optimal weight quantization rounding direction may be determined according to the target weight quantization increment.
Exemplarily, when the target weight quantization increment is greater than 0, the optimal weight quantization rounding direction may be determined as rounding up; when the target weight quantization increment is less than 0, the optimal weight quantization rounding direction may be determined as rounding down.
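Correspondingly, once the target weight quantization increment has been obtained, the deployed rounding can be sketched as follows (illustrative only); in a typical uniform quantization scheme, multiplying Wq by s then yields the dequantized approximation of W:

    import torch

    def hard_quantize_weight(W, s, V):
        # Wq = Floor(W / s) + (V > 0 ? 1 : 0): round up where the learned
        # increment settled above zero, round down everywhere else.
        return torch.floor(W / s) + (V > 0).to(W.dtype)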
In some embodiments, determining the quantization loss of the optimization unit according to the quantized output and floating-point output of the optimization unit, using a preset loss function, includes:
determining a standard deviation of the floating-point output of the optimization unit according to the floating-point output of the optimization unit;
dividing the quantized output of the optimization unit by the standard deviation to obtain a processed quantized output; and dividing the floating-point output of the optimization unit by the standard deviation to obtain a processed floating-point output;
determining the quantization loss of the optimization unit according to the processed quantized output and the processed floating-point output, using a mean squared error loss function.
Exemplarily, the output features of an optimization unit are usually multi-dimensional data, and the value ranges of different dimensions may differ considerably. For example, taking 3-dimensional data as an example, the values of the 1st dimension may be 0 to 1, those of the 2nd dimension 10 to 100, and those of the 3rd dimension 100 to 1000. In this case, if the quantization loss is computed with the mean squared error loss function directly from the floating-point output and quantized output of the optimization unit, the result will focus excessively on the values of one particular dimension, degrading the optimization effect.
To avoid this problem and improve the optimization effect, in the process of determining the quantization loss of an optimization unit from its quantized output and floating-point output (for example, determining the first quantization loss from the first quantized output and first floating-point output, or determining the second quantization loss from the second quantized output and second floating-point output), the standard deviation of the floating-point output of the optimization unit along a certain dimension may first be computed from that floating-point output.
Take determining the first quantization loss of the optimization unit from its first quantized output and first floating-point output as an example (determining the second quantization loss from the second quantized output and second floating-point output is implemented analogously).
For any optimization unit, the standard deviation of the first floating-point output of the optimization unit along a certain dimension may be determined from the first floating-point output; the first quantized output of the optimization unit is divided by this standard deviation to obtain a processed first quantized output, and the first floating-point output is divided by this standard deviation to obtain a processed first floating-point output; then the first quantization loss of the optimization unit is determined from the processed first quantized output and the processed first floating-point output, using a mean squared error loss function.
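A sketch of this standard-deviation-normalized loss (the dimension along which the standard deviation is taken and the small epsilon guarding against division by zero are illustrative assumptions):

    import torch

    def normalized_mse(y_quant, y_float, dim=-1, eps=1e-8):
        # Divide both outputs by the standard deviation of the floating-point
        # output along one dimension, so no single dimension dominates the loss.
        std = y_float.std(dim=dim, keepdim=True) + eps
        return (((y_quant - y_float) / std) ** 2).mean()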
To enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present disclosure, these technical solutions are described below with reference to specific embodiments.
1. The artificial intelligence framework is first briefly described.
As shown in FIG. 2, the artificial intelligence framework may include five levels: the application layer, the algorithm layer, the system layer, the dependency layer and the device layer. There are mutual dependencies from top to bottom; the upper levels lean toward practical applications, while the lower levels lean toward the underlying hardware.
1.1. Application layer: by analyzing the requirements, the problem is mapped to the corresponding branch of artificial intelligence.
1.2. Algorithm layer: the strategies used for model training, the loss function, and the subsequent compression algorithms are designed according to the application scenario.
1.3. System layer: a model is built using a deep learning framework, the built model is trained, and computation-graph parsing and model compression are performed.
1.4. Dependency layer: the language or deep learning framework in which the algorithm is implemented, which invokes the corresponding device through the device's external interfaces and protocols.
1.5. Device layer: composed of computing units, providing computing power for the artificial intelligence system.
2. The implementation of the training-free quantization scheme for Transformer models in this embodiment (the data used in the Transformer model quantization process is unlabeled data) is described below.
In this embodiment, considering that fixed round-to-nearest rounding in the quantization process is not optimal, the difference between the output features of the quantized model and the floating-point model can be reduced by optimizing the weight quantization rounding direction and the quantization coefficients during quantization of the Transformer model, thereby improving the performance of the quantized model.
2.1. Quantization algorithm performance
Model to be quantized: swin-tiny ImageNet classification model;
Precision setting: 8-4-8, i.e., 8-bit quantization for input and output, 4-bit quantization for weights (the floating-point model uses 32 bits throughout);
Data volume: 1024 unlabeled images;
Time required for quantization: about 30 min;
Exemplarily, the time required for quantization is independent of the quantization precision and depends on the number of algorithm iterations and the model size.
Performance: floating-point performance acc@top1 = 81.18, quantized performance acc@top1 = 80.57.
2.2. The basic flow of the quantization scheme of this embodiment is as follows:
2.2.1. The inter-channel numerical differences of the shortcut addition are balanced by the Eltwise (element-wise operation layer)-With-Bias (ELT for short) method;
Exemplarily, each Transformer Block contains a shortcut addition layer, and the corresponding shortcut layer can be replaced with an Eltwise layer. "Eltwise layer" is a general term for a class of functional layers in neural networks, characterized by element-wise (same-position) addition of two or more data blocks of equal size.
In the data processing of an Eltwise layer, the data precisions of the data blocks input to the Eltwise layer may differ considerably, and hence the data ranges of the data blocks may also differ considerably; the overall distribution variance of the element-wise operation result obtained from these data blocks is then large, lowering the data precision of the element-wise operation.
To solve this problem, for each data block input to the Eltwise layer, a corresponding compensation coefficient can be set for each channel of each data block, i.e., compensation coefficients refined to the input-channel level are proposed. These compensation coefficients can compensate for the data-range differences across the channels of each data block, and thus for the data-range differences across the multiple data blocks, so that the data-precision ranges of the multiple data blocks are also compensated. The multiple data blocks are thereby converted into data blocks of the same data precision, aligning the data precision of the data in different channels, so that the overall distribution variance of the element-wise operation result computed from the compensated data is reduced and the data precision is improved.
Exemplarily, after each channel of each data block is multiplied by its corresponding compensation coefficient and the products are summed, the summed result may further be added to a bias coefficient (which may be denoted bias).
The bias coefficient is a compensation coefficient used to correct zero-point drift in the data. By adding the bias coefficient to the result of the element-wise operation on the n compensated data blocks, the zero-point drift after the element-wise operation can be corrected, reducing the zero-point drift that may occur on each data channel and further reducing the data error of the element-wise operation.
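A non-limiting sketch of the Eltwise-With-Bias computation described above, with one per-channel compensation coefficient tensor per input data block and a bias correcting the zero-point drift (shapes and names are illustrative assumptions):

    import torch

    def eltwise_with_bias(blocks, coeffs, bias):
        # blocks: same-shaped tensors entering the shortcut addition.
        # coeffs: one per-channel compensation tensor per block, shaped to
        # broadcast over the channel axis; bias corrects zero-point drift.
        out = torch.zeros_like(blocks[0])
        for x, c in zip(blocks, coeffs):
            out = out + x * c    # per-channel data-range compensation
        return out + bias        # zero-point drift correction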
2.2.2. Quantization nodes are inserted to quantize the weights or quantize the activations (i.e., the input/output).
Exemplarily, the precision of a quantization node is selectable.
2.2.3. The quantization coefficients of the quantization nodes are initialized with image data (which may include some or all of the above 1024 images).
Exemplarily, the initial value of a quantization coefficient may be computed as:
alpha=xmax/nlevel
where xmax is the quantization boundary and nlevel is the number of quantization levels.
Exemplarily, the quantization boundary depends on the boundary truncation method. For example, boundary truncation methods may include max, percentile, OMSE, etc.
Exemplarily, the number of quantization levels depends on the number of quantization bits (i.e., the precision) b.
For example, the number of quantization levels may be computed as:
nlevel = 2^(b-1) - 1
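Putting the two formulas together, the initialization can be sketched as follows (illustrative only; xmax is assumed to come from the chosen boundary truncation method):

    def init_quant_scale(xmax, bits):
        # alpha = xmax / nlevel, with nlevel = 2**(bits - 1) - 1.
        nlevel = 2 ** (bits - 1) - 1
        return xmax / nlevel

    print(init_quant_scale(xmax=6.0, bits=8))   # 6.0 / 127, about 0.0472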
2.2.4. Selecting the optimization units
Exemplarily, a model can be regarded as modules stacked in a certain order. A module may be a single layer or a combination of multiple layers.
Exemplarily, a module may be called a Block. The optimization algorithm takes a Block as the basic unit (i.e., the optimization unit) and reduces the difference between the output features of the Block before and after quantization by optimizing the quantization coefficients and the weight quantization rounding direction, thereby improving the performance of the quantized model.
Exemplarily, the optimization units in this embodiment are selected as follows:
2.2.4.1. A single Transformer Block serves as a basic optimization unit;
2.2.4.2. Besides Transformer Blocks, a single linear layer serves as an optimization unit.
2.2.5. Optimizing the quantization coefficients
Exemplarily, with the Round function as the quantization rounding function, the quantization coefficients are treated as learnable parameters, and the quantization coefficients (including the weight quantization coefficient and the activation quantization coefficient) are adjusted by minimizing the difference between the floating-point output and the quantized output of the same optimization unit. The implementation flow, which may be as shown in FIG. 3, includes the following steps:
2.2.5.1. In floating-point mode, input the image data (such as the above 1024 images) (corresponding to Input: Float in FIG. 3) and obtain the floating-point output Yf1 of the optimization unit (i.e., the above first floating-point output) (corresponding to Output: Float in FIG. 3);
2.2.5.2. In quantized mode, input the image data (such as the above 1024 images) (corresponding to Input: Quant in FIG. 3) and obtain the quantized output Yq1 of the optimization unit (i.e., the above first quantized output) (corresponding to Output: Quant in FIG. 3);
2.2.5.3. Compute the optimization-unit difference L(Yf1, Yq1) (i.e., the above first quantization loss) (corresponding to Loss in FIG. 3), and iteratively update the quantization coefficients via the LSQ (Learned Step Size Quantization) algorithm to obtain the final quantization coefficients (i.e., the above target weight quantization coefficient and target activation quantization coefficient).
2.2.6. Optimizing the weight quantization rounding direction
Exemplarily, the floating-point weight parameters and the quantization coefficients (i.e., the above target weight quantization coefficient and target activation quantization coefficient) may be fixed, and the weight quantization rounding direction is optimized by optimizing the weight quantization increment. The specific implementation flow is as follows:
2.2.6.1. The quantization rounding function of the weight quantization node is replaced from the Round function with the Floor function;
2.2.6.2. The weight quantization increment is defined as a parameter V, and this parameter is initialized;
2.2.6.3. During the iterative update of V, the quantization process is:
W″ = Floor(W/s) + sigmoid(V/t)
where W″ is the quantization-adjusted weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the weight quantization increment, t is a hyper-parameter (temperature coefficient) controlling the sigmoid function that gradually decreases during the iterative update of the weight quantization increment of the optimization unit, and Floor() is the floor (round-down) operation.
Exemplarily, the iterative update of V can be decomposed into the following steps:
1) In floating-point mode, input the image data (such as the above 1024 images) and obtain the floating-point output Yf2 of the optimization unit (i.e., the above second floating-point output).
2) In quantized mode, input the image data (such as the above 1024 images) and obtain the quantized output Yq2 of the optimization unit (i.e., the above second quantized output).
3) Compute the optimization-unit difference L(Yf2, Yq2) (i.e., the above second quantization loss), and iteratively update V via the gradient descent algorithm.
4) During the iteration, the temperature coefficient t keeps decreasing, so that the sigmoid function gradually degenerates into a step function.
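A non-limiting sketch of steps 1) to 4) in PyTorch; V is assumed to be a tensor created with requires_grad=True, quant_unit is assumed to quantize its weights as Floor(W/s) + sigmoid(V/quant_unit.temperature), and the geometric temperature schedule is an illustrative choice:

    import torch

    def optimize_rounding(float_unit, quant_unit, batches, V,
                          t_start=20.0, t_end=2.0, steps=1000, lr=1e-3):
        optimizer = torch.optim.Adam([V], lr=lr)
        for step in range(steps):
            # the temperature coefficient t keeps decreasing (geometric decay)
            frac = step / max(steps - 1, 1)
            quant_unit.temperature = t_start * (t_end / t_start) ** frac
            x = batches[step % len(batches)]
            with torch.no_grad():
                y_float = float_unit(x)   # second floating-point output Yf2
            y_quant = quant_unit(x)       # second quantized output Yq2
            loss = ((y_quant - y_float) ** 2).mean()   # L(Yf2, Yq2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return V.detach()                 # target weight quantization increment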
In the process of quantizing the weight parameters according to the target weight quantization coefficient and the target weight quantization increment, the quantization process is:
Wq=Floor(W/s)+(V>0?1:0)
where Wq is the target quantized weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the target weight quantization increment, Floor() is the floor (round-down) operation, and (V>0?1:0) evaluates to 1 when V>0 holds and to 0 otherwise.
In this embodiment, considering that the output features of an optimization unit are usually multi-dimensional data and the value ranges of different dimensions may differ considerably (for example, for 3-dimensional data, the values of the 1st dimension may be 0 to 1, those of the 2nd dimension 10 to 100, and those of the 3rd dimension 100 to 1000), computing the quantization loss with the mean squared error loss function directly from the floating-point output and quantized output of the optimization unit would make the result focus excessively on the values of one particular dimension, degrading the optimization effect.
To avoid this problem and improve the optimization effect, when determining the quantization loss of an optimization unit from its quantized output and floating-point output, e.g., the above L(Yf1, Yq1) or L(Yf2, Yq2), the standard deviation of the floating-point output features over the Tokens can be used to balance the Token differences between the floating-point output features and the quantized output features before computing the mean squared error.
Exemplarily, the standard deviation of the floating-point output of the optimization unit along a certain dimension may first be determined from the floating-point output. Then the quantized output and the floating-point output of the optimization unit are each divided by this standard deviation to obtain the processed quantized output and the processed floating-point output, and the mean squared error is computed from the processed quantized output and the processed floating-point output.
In this embodiment, a Transformer model quantized in the above manner can be applied to natural language processing tasks, such as text similarity, text classification and machine translation; to speech tasks, such as speech recognition; and to vision tasks, such as image classification, object detection and object tracking.
The effects of the embodiments of the present disclosure are described below with reference to specific application examples.
I. License plate recognition scenario
License plate recognition refers to the technology that detects vehicles on a monitored road surface and automatically extracts and processes the vehicle license plate information (including Chinese characters, English letters, Arabic numerals and plate color). License plate recognition is one of the important components of modern intelligent transportation systems and is very widely used. Based on technologies such as digital image processing, pattern recognition and computer vision, it analyzes vehicle images or video sequences captured by cameras to obtain the unique license plate number of each vehicle, thereby completing the recognition process. Through subsequent processing, functions such as parking lot charging management, traffic flow control metric measurement, vehicle positioning, vehicle anti-theft, automated highway speeding supervision, red-light enforcement cameras and highway toll stations can be implemented. The main steps of license plate recognition are: first locating the plate position in the image, then segmenting the characters on the plate, and finally recognizing the segmented characters to form the plate number.
Among these, plate character recognition is implemented based on neural networks; a Transformer model quantized by the method provided by the present disclosure, used for plate character recognition, can complete the recognition task quickly and efficiently.
II. OCR text recognition scenario
OCR (Optical Character Recognition) text recognition refers to the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper and translates their shapes into computer text using character recognition methods; that is, the process of scanning text material and then analyzing and processing the image file to obtain text and layout information.
Recognition speed is one of the main indicators for measuring the performance of an OCR system. Using a Transformer model quantized by the method provided by the present disclosure to extract text features can increase the recognition speed, thereby improving the practicality of OCR products.
III. Pedestrian retrieval scenario
Pedestrian retrieval is a technology that uses computer vision to determine whether a specific pedestrian is present in an image or video sequence; it belongs to the problem of image retrieval.
Given a surveillance image of a pedestrian, the goal is to retrieve images of that pedestrian across devices. The core of pedestrian retrieval lies in finding a discriminative pedestrian representation. Many recent methods use deep learning models to extract visual features. A Transformer model quantized by the method of the present disclosure enables fast feature extraction and reduces time overhead.
The method provided by the present disclosure has been described above. The apparatus provided by the present disclosure is described below:
Referring to FIG. 4, which is a schematic structural diagram of a task processing apparatus based on model quantization provided by an embodiment of the present disclosure. As shown in FIG. 4, the task processing apparatus based on model quantization may include:
a first determination unit 410, configured to, for any optimization unit in a Transformer model, according to a first difference between a first quantized output and a first floating-point output of the optimization unit, with the goal of minimizing the first difference, update the weight quantization coefficient and activation quantization coefficient of the optimization unit to obtain a target weight quantization coefficient and a target activation quantization coefficient of the optimization unit; wherein the first quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input/output of the optimization unit is quantized according to the activation quantization coefficient, and the first floating-point output is the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
a second determination unit 420, configured to, with the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit fixed, according to a second difference between a second quantized output and a second floating-point output of the optimization unit, with the goal of minimizing the second difference, update the weight quantization increment of the optimization unit to obtain a target weight quantization increment of the optimization unit; wherein the second quantized output includes the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input/output of the optimization unit is quantized according to the target activation quantization coefficient, and the second floating-point output includes the result output by the optimization unit when the weight parameter of the optimization unit is the floating-point weight parameter and the input is a floating-point input;
a quantization unit 430, configured to determine the weight quantization rounding direction of the optimization unit according to the target weight quantization increment, and quantize the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantized weight parameter of the optimization unit;
a task processing unit 440, configured to, when the Transformer model is used for task processing, for any optimization unit, perform forward inference computation on the input data of the optimization unit according to the target quantized weight parameter corresponding to the optimization unit, and quantize the input/output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
In some embodiments, the first determination unit 410 updating the weight quantization coefficient and activation quantization coefficient of the optimization unit according to the first difference between the first quantized output and the first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain the target weight quantization coefficient and target activation quantization coefficient of the optimization unit, includes:
determining the first quantization loss of the optimization unit according to the first quantized output and the first floating-point output of the optimization unit, using the first preset loss function;
according to the first quantization loss of the optimization unit, with the goal of minimizing the first quantization loss, iteratively updating the weight quantization coefficient and activation quantization coefficient of the optimization unit using a gradient descent algorithm until a first preset end condition is reached, to obtain the target weight quantization coefficient and target activation quantization coefficient of the optimization unit.
In some embodiments, the second determination unit 420 updating the weight quantization increment of the optimization unit according to the second difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing the second difference, to obtain the target weight quantization increment of the optimization unit, includes:
determining the second quantization loss of the optimization unit according to the second quantized output and the second floating-point output of the optimization unit, using the second preset loss function;
according to the second quantization loss of the optimization unit, with the goal of minimizing the second quantization loss, iteratively updating the weight quantization increment of the optimization unit using a gradient descent algorithm until a second preset end condition is reached, to obtain the target weight quantization increment of the optimization unit.
In some embodiments, the second determination unit 420 quantizes the weight parameter of the optimization unit according to the target weight quantization coefficient and adjusts the quantization according to the weight quantization increment via the following formula:
W″ = Floor(W/s) + sigmoid(V/t)
where W″ is the quantization-adjusted weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the weight quantization increment, t is a hyper-parameter controlling the sigmoid function that gradually decreases during the iterative update of the weight quantization increment of the optimization unit, and Floor() is the floor (round-down) operation.
In some embodiments, the quantization unit 430 quantizes the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantized weight parameter of the optimization unit via the following formula:
Wq=Floor(W/s)+(V>0?1:0)
where Wq is the target quantized weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the target weight quantization increment, Floor() is the floor (round-down) operation, and (V>0?1:0) evaluates to 1 when V>0 holds and to 0 otherwise.
In some embodiments, the first determination unit 410 determining the first quantization loss of the optimization unit according to the first quantized output and the first floating-point output of the optimization unit, using the first preset loss function, includes:
determining a standard deviation of the first floating-point output of the optimization unit according to the first floating-point output of the optimization unit;
dividing the first quantized output of the optimization unit by the standard deviation to obtain a processed first quantized output; and dividing the first floating-point output of the optimization unit by the standard deviation to obtain a processed first floating-point output;
determining the first quantization loss of the optimization unit according to the processed first quantized output and the processed first floating-point output, using a mean squared error loss function.
In some embodiments, the second determination unit 420 determining the second quantization loss of the optimization unit according to the second quantized output and the second floating-point output of the optimization unit, using the second preset loss function, includes:
determining a standard deviation of the second floating-point output of the optimization unit according to the second floating-point output of the optimization unit;
dividing the second quantized output of the optimization unit by the standard deviation to obtain a processed second quantized output; and dividing the second floating-point output of the optimization unit by the standard deviation to obtain a processed second floating-point output;
determining the second quantization loss of the optimization unit according to the processed second quantized output and the processed second floating-point output, using a mean squared error loss function.
In some embodiments, the optimization units in the Transformer model include: a Transformer Stage, a Transformer Block, or a single linear layer.
An embodiment of the present disclosure provides an electronic device, including a processor and a memory, wherein the memory stores machine-executable instructions executable by the processor, and the processor is configured to execute the machine-executable instructions to implement the task processing method based on model quantization described above.
Referring to FIG. 5, which is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present disclosure. The electronic device may include a processor 501 and a memory 502 storing machine-executable instructions. The processor 501 and the memory 502 may communicate via a system bus 503. By reading and executing the machine-executable instructions in the memory 502 corresponding to the task processing logic based on model quantization, the processor 501 can execute the task processing method based on model quantization described above.
The memory 502 mentioned herein may be any electronic, magnetic, optical or other physical storage device and may contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disk (such as an optical disc or DVD), a similar storage medium, or a combination thereof.
In some embodiments, a storage medium is further provided, such as the memory 502 in FIG. 5. The storage medium is a machine-readable storage medium storing machine-executable instructions that, when executed by a processor, implement the task processing method based on model quantization described above. For example, the storage medium may be ROM (Read-Only Memory), RAM, CD-ROM (Compact Disc Read-Only Memory), magnetic tape, a floppy disk, an optical data storage device, etc.
It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (18)

  1. A task processing method based on model quantization, comprising:
    for any optimization unit in a Transformer model, according to a first difference between a first quantized output and a first floating-point output of the optimization unit, with the goal of minimizing the first difference, updating a weight quantization coefficient and an activation quantization coefficient of the optimization unit to obtain a target weight quantization coefficient and a target activation quantization coefficient of the optimization unit; wherein the first quantized output comprises the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input and output of the optimization unit are quantized according to the activation quantization coefficient, and the first floating-point output comprises the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
    with the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit fixed, according to a second difference between a second quantized output and a second floating-point output of the optimization unit, with the goal of minimizing the second difference, updating a weight quantization increment of the optimization unit to obtain a target weight quantization increment of the optimization unit; wherein the second quantized output comprises the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input and output of the optimization unit are quantized according to the target activation quantization coefficient, and the second floating-point output comprises the result output by the optimization unit when the weight parameter of the optimization unit is the floating-point weight parameter and the input is a floating-point input;
    determining a weight quantization rounding direction of the optimization unit according to the target weight quantization increment;
    quantizing the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain a target quantized weight parameter of the optimization unit;
    when the Transformer model is used for task processing, for any optimization unit, performing forward inference computation on the input data of the optimization unit according to the target quantized weight parameter corresponding to the optimization unit, and quantizing the input and output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
  2. The method according to claim 1, wherein updating the weight quantization coefficient and activation quantization coefficient of the optimization unit according to the first difference between the first quantized output and the first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain the target weight quantization coefficient and target activation quantization coefficient of the optimization unit, comprises:
    determining a first quantization loss of the optimization unit according to the first quantized output and the first floating-point output of the optimization unit, using a first preset loss function;
    according to the first quantization loss of the optimization unit, with the goal of minimizing the first quantization loss, iteratively updating the weight quantization coefficient and the activation quantization coefficient of the optimization unit using a gradient descent algorithm until a first preset end condition is reached, to obtain the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit.
  3. The method according to claim 1 or 2, wherein updating the weight quantization increment of the optimization unit according to the second difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing the second difference, to obtain the target weight quantization increment of the optimization unit, comprises:
    determining a second quantization loss of the optimization unit according to the second quantized output and the second floating-point output of the optimization unit, using a second preset loss function;
    according to the second quantization loss of the optimization unit, with the goal of minimizing the second quantization loss, iteratively updating the weight quantization increment of the optimization unit using a gradient descent algorithm until a second preset end condition is reached, to obtain the target weight quantization increment of the optimization unit.
  4. The method according to claim 3, wherein quantizing the weight parameter of the optimization unit according to the target weight quantization coefficient and adjusting the quantization according to the weight quantization increment is implemented by the following formula:
    W″ = Floor(W/s) + sigmoid(V/t)
    wherein W″ is the quantization-adjusted weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the weight quantization increment, t is a hyper-parameter controlling the sigmoid function that gradually decreases during the iterative update of the weight quantization increment of the optimization unit, and Floor() is the floor (round-down) operation.
  5. The method according to any one of claims 1-4, wherein quantizing the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantized weight parameter of the optimization unit is implemented by the following formula:
    Wq=Floor(W/s)+(V>0?1:0)
    wherein Wq is the target quantized weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the target weight quantization increment, Floor() is the floor (round-down) operation, and (V>0?1:0) evaluates to 1 when V>0 holds and to 0 otherwise.
  6. The method according to claim 2, wherein determining the first quantization loss of the optimization unit according to the first quantized output and the first floating-point output of the optimization unit, using the first preset loss function, comprises:
    determining a standard deviation of the first floating-point output of the optimization unit according to the first floating-point output of the optimization unit;
    dividing the first quantized output of the optimization unit by the standard deviation to obtain a processed first quantized output; and dividing the first floating-point output of the optimization unit by the standard deviation to obtain a processed first floating-point output;
    determining the first quantization loss of the optimization unit according to the processed first quantized output and the processed first floating-point output, using a mean squared error loss function.
  7. The method according to claim 3, wherein determining the second quantization loss of the optimization unit according to the second quantized output and the second floating-point output of the optimization unit, using the second preset loss function, comprises:
    determining a standard deviation of the second floating-point output of the optimization unit according to the second floating-point output of the optimization unit;
    dividing the second quantized output of the optimization unit by the standard deviation to obtain a processed second quantized output; and dividing the second floating-point output of the optimization unit by the standard deviation to obtain a processed second floating-point output;
    determining the second quantization loss of the optimization unit according to the processed second quantized output and the processed second floating-point output, using a mean squared error loss function.
  8. The method according to any one of claims 1-7, wherein the optimization units in the Transformer model comprise: a Transformer Stage, a Transformer Block, or a single linear layer.
  9. A task processing apparatus based on model quantization, comprising:
    a first determination unit, configured to, for any optimization unit in a Transformer model, according to a first difference between a first quantized output and a first floating-point output of the optimization unit, with the goal of minimizing the first difference, update a weight quantization coefficient and an activation quantization coefficient of the optimization unit to obtain a target weight quantization coefficient and a target activation quantization coefficient of the optimization unit; wherein the first quantized output comprises the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the weight quantization coefficient and the input and output of the optimization unit are quantized according to the activation quantization coefficient, and the first floating-point output comprises the result output by the optimization unit when the weight parameter of the optimization unit is a floating-point weight parameter and the input is a floating-point input;
    a second determination unit, configured to, with the floating-point weight parameter, the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit fixed, according to a second difference between a second quantized output and a second floating-point output of the optimization unit, with the goal of minimizing the second difference, update a weight quantization increment of the optimization unit to obtain a target weight quantization increment of the optimization unit; wherein the second quantized output comprises the result output by the optimization unit when the weight parameter of the optimization unit is quantized according to the target weight quantization coefficient and adjusted according to the weight quantization increment, and the input and output of the optimization unit are quantized according to the target activation quantization coefficient, and the second floating-point output comprises the result output by the optimization unit when the weight parameter of the optimization unit is the floating-point weight parameter and the input is a floating-point input;
    a quantization unit, configured to determine a weight quantization rounding direction of the optimization unit according to the target weight quantization increment, and quantize the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain a target quantized weight parameter of the optimization unit;
    a task processing unit, configured to, when the Transformer model is used for task processing, for any optimization unit, perform forward inference computation on the input data of the optimization unit according to the target quantized weight parameter corresponding to the optimization unit, and quantize the input/output of the optimization unit according to the target activation quantization coefficient of the optimization unit.
  10. The apparatus according to claim 9, wherein the first determination unit updating the weight quantization coefficient and activation quantization coefficient of the optimization unit according to the first difference between the first quantized output and the first floating-point output of the optimization unit, with the goal of minimizing the first difference, to obtain the target weight quantization coefficient and target activation quantization coefficient of the optimization unit, comprises:
    determining a first quantization loss of the optimization unit according to the first quantized output and the first floating-point output of the optimization unit, using a first preset loss function;
    according to the first quantization loss of the optimization unit, with the goal of minimizing the first quantization loss, iteratively updating the weight quantization coefficient and the activation quantization coefficient of the optimization unit using a gradient descent algorithm until a first preset end condition is reached, to obtain the target weight quantization coefficient and the target activation quantization coefficient of the optimization unit.
  11. The apparatus according to claim 9 or 10, wherein the second determination unit updating the weight quantization increment of the optimization unit according to the second difference between the second quantized output and the second floating-point output of the optimization unit, with the goal of minimizing the second difference, to obtain the target weight quantization increment of the optimization unit, comprises:
    determining a second quantization loss of the optimization unit according to the second quantized output and the second floating-point output of the optimization unit, using a second preset loss function;
    according to the second quantization loss of the optimization unit, with the goal of minimizing the second quantization loss, iteratively updating the weight quantization increment of the optimization unit using a gradient descent algorithm until a second preset end condition is reached, to obtain the target weight quantization increment of the optimization unit.
  12. The apparatus according to claim 11, wherein the second determination unit quantizes the weight parameter of the optimization unit according to the target weight quantization coefficient and adjusts the quantization according to the weight quantization increment via the following formula:
    W″ = Floor(W/s) + sigmoid(V/t)
    wherein W″ is the quantization-adjusted weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the weight quantization increment, t is a hyper-parameter controlling the sigmoid function that gradually decreases during the iterative update of the weight quantization increment of the optimization unit, and Floor() is the floor (round-down) operation.
  13. The apparatus according to any one of claims 9-12, wherein the quantization unit quantizes the weight parameter of the optimization unit according to the target weight quantization coefficient and the weight quantization rounding direction to obtain the target quantized weight parameter of the optimization unit via the following formula:
    Wq=Floor(W/s)+(V>0?1:0)
    wherein Wq is the target quantized weight parameter of the optimization unit, W is the floating-point weight parameter of the optimization unit, s is the target weight quantization coefficient, V is the target weight quantization increment, Floor() is the floor (round-down) operation, and (V>0?1:0) evaluates to 1 when V>0 holds and to 0 otherwise.
  14. The apparatus according to claim 9, wherein the first determination unit determining the first quantization loss of the optimization unit according to the first quantized output and the first floating-point output of the optimization unit, using the first preset loss function, comprises:
    determining a standard deviation of the first floating-point output of the optimization unit according to the first floating-point output of the optimization unit;
    dividing the first quantized output of the optimization unit by the standard deviation to obtain a processed first quantized output; and dividing the first floating-point output of the optimization unit by the standard deviation to obtain a processed first floating-point output;
    determining the first quantization loss of the optimization unit according to the processed first quantized output and the processed first floating-point output, using a mean squared error loss function.
  15. The apparatus according to claim 10, wherein the second determination unit determining the second quantization loss of the optimization unit according to the second quantized output and the second floating-point output of the optimization unit, using the second preset loss function, comprises:
    determining a standard deviation of the second floating-point output of the optimization unit according to the second floating-point output of the optimization unit;
    dividing the second quantized output of the optimization unit by the standard deviation to obtain a processed second quantized output; and dividing the second floating-point output of the optimization unit by the standard deviation to obtain a processed second floating-point output;
    determining the second quantization loss of the optimization unit according to the processed second quantized output and the processed second floating-point output, using a mean squared error loss function.
  16. The apparatus according to any one of claims 9-15, wherein the optimization units in the Transformer model comprise: a Transformer Stage, a Transformer Block, or a single linear layer.
  17. An electronic device, comprising a processor and a memory, wherein the memory stores machine-executable instructions executable by the processor, and the processor is configured to execute the machine-executable instructions to implement the method according to any one of claims 1-8.
  18. A storage medium, wherein machine-executable instructions are stored in the storage medium, and when executed by a processor, the machine-executable instructions implement the method according to any one of claims 1-8.
PCT/CN2023/121487 2022-09-27 2023-09-26 Task processing method and apparatus based on model quantization, device and storage medium WO2024067563A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211186183.1 2022-09-27
CN202211186183.1A CN115860068A (zh) 2022-09-27 2022-09-27 Task processing method and apparatus based on model quantization, device and storage medium

Publications (1)

Publication Number Publication Date
WO2024067563A1 true WO2024067563A1 (zh) 2024-04-04

Family

ID=85661212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/121487 WO2024067563A1 (zh) 2022-09-27 2023-09-26 基于模型量化的任务处理方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN115860068A (zh)
WO (1) WO2024067563A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860068A (zh) * 2022-09-27 2023-03-28 杭州海康威视数字技术股份有限公司 基于模型量化的任务处理方法、装置、设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200218982A1 (en) * 2019-01-04 2020-07-09 Microsoft Technology Licensing, Llc Dithered quantization of parameters during training with a machine learning tool
CN113011571A (zh) 2021-03-03 2021-06-22 华南理工大学 INT8 offline quantization and integer inference method based on Transformer model
CN114861907A (zh) 2022-04-22 2022-08-05 网易(杭州)网络有限公司 Data computation method, apparatus, storage medium and device
CN115860068A (zh) 2022-09-27 2023-03-28 杭州海康威视数字技术股份有限公司 Task processing method and apparatus based on model quantization, device and storage medium

Also Published As

Publication number Publication date
CN115860068A (zh) 2023-03-28


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23870792

Country of ref document: EP

Kind code of ref document: A1