CN117036860B

CN117036860B - Model quantization method, electronic device and medium

Info

Publication number: CN117036860B
Application number: CN202311031534.6A
Authority: CN
Inventors: 徐裕民; 陶林; 林峰
Original assignee: Yixing Intelligent Technology Guangzhou Co ltd
Current assignee: Yixing Intelligent Technology Guangzhou Co ltd
Priority date: 2023-08-15
Filing date: 2023-08-15
Publication date: 2024-02-09
Anticipated expiration: 2043-08-15
Also published as: CN117036860A

Abstract

The invention discloses a model quantization method. The method first scores the full set of training data pictures and sorts them by score from high to low, selects the top-ranked pictures, removes M pictures from them, and takes the remaining pictures as calibration data. The model is then quantized with different quantization strategies based on the calibration data, an optimal quantization strategy is selected, and the full data is quantized with the optimal quantization strategy to obtain the final quantized model. The method can effectively improve the stability of model accuracy, shorten quantization time, and thereby improve efficiency and save bandwidth resources.

Description

Model quantization method, electronic device and medium

Technical Field

The present invention relates to the field of artificial intelligence, and in particular, to a model quantization method, an electronic device, and a medium.

Background

Model quantization is a model compression technique that converts floating-point storage or operations into integer storage or operations. With a suitable bit width, the accuracy loss of the quantized model is modest, while the quantized model occupies less storage and bandwidth, so inference can run faster. Post-training quantization (PTQ) directly converts a trained single-precision floating-point (FP32) network into a fixed-point network; the quantization process needs only a small amount of calibration data and no training of the original model.

Currently, PTQ methods typically select some unlabeled data at random from the training set as calibration data. However, random selection may make model accuracy unstable because the activation distribution of the selected data does not match that of the model. Alternatively, the full training data may be used directly as calibration data, but the quantization process then takes considerably longer.

Disclosure of Invention

In view of some or all of the problems in the prior art, a first aspect of the present invention provides a model quantization method, including:

scoring the full training data pictures, and sorting the full training data pictures according to the score from high to low;

selecting the L top-ranked pictures by score, wherein L = M + N, N and M are natural numbers, N is the number of pictures expected to serve as calibration data, and M is the number of redundant pictures;

removing M pictures from the L pictures, and using the remaining pictures as calibration data;

selecting different quantization strategies, and quantizing the model based on the calibration data; and

selecting an optimal quantization strategy, and quantizing the full data with the optimal quantization strategy to obtain a quantized model.

Further, scoring the full-scale training data picture includes:

determining a clustering centroid of each layer of activation values in the model based on the full training data picture;

and calculating the shortest distance between the activation value of each picture in the model and the cluster centroid, and taking the shortest distance as the score of the picture.
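The scoring step can be illustrated with a minimal sketch. It assumes each picture has already been reduced to a single activation vector and that the cluster centroids have already been computed (for example by k-means); how per-layer activations are combined into one vector is not specified in the text, so that part is an assumption. The names `score_pictures` and `rank_top` are hypothetical.

```python
import math

def score_pictures(activations, centroids):
    """Score each picture by its shortest Euclidean distance to any centroid.

    activations: dict mapping picture id -> activation vector (list of floats)
    centroids:   list of centroid vectors with the same length (assumed given)
    """
    return {
        pic_id: min(math.dist(vec, c) for c in centroids)
        for pic_id, vec in activations.items()
    }

def rank_top(scores, top_l):
    """Sort picture ids by score from high to low and keep the first top_l."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_l]

# Toy data: 2-D "activations" and two hypothetical centroids.
activations = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [10.0, 0.0], "d": [1.0, 0.0]}
centroids = [[0.0, 0.0], [10.0, 0.0]]
scores = score_pictures(activations, centroids)
top = rank_top(scores, top_l=2)
print(scores["b"], top)  # 5.0 ['b', 'd']
```

Pictures "a" and "c" sit exactly on a centroid and score 0, so they rank last; "b" is farthest from both centroids and ranks first.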

Further, the number of pictures expected to serve as calibration data is determined according to an expected quantization duration, including:

calculating the time required to quantize the model to be quantized using the full training data pictures as calibration data; and

determining the number of pictures expected to serve as calibration data based on the expected quantization duration, according to the proportional relationship between that time and the total number of full training data pictures.
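As a worked illustration of this proportional relationship, the following sketch computes the calibration picture count from the full-data quantization time and the acceptable quantization time. The function name `calibration_count` is hypothetical, and rounding down and clamping the result to the range from 1 to the total number of pictures are assumptions not stated in the text.

```python
def calibration_count(total_pictures, full_duration, expected_duration):
    """Return the calibration picture count, proportional to the time budget.

    total_pictures:    total number of full training data pictures
    full_duration:     time to quantize with all pictures as calibration data
    expected_duration: acceptable quantization time (same unit as full_duration)
    """
    if total_pictures <= 0 or full_duration <= 0 or expected_duration <= 0:
        raise ValueError("all inputs must be positive")
    n = int(total_pictures * expected_duration / full_duration)
    # Assumption: keep at least one picture and never more than the full set.
    return max(1, min(n, total_pictures))

# 50,000 pictures take 3,600 s; a 72 s budget allows 1,000 pictures.
print(calibration_count(50000, 3600.0, 72.0))  # 1000
```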

Further, removing M pictures from the L pictures includes:

rejecting pictures with scores higher than a preset value; and

deleting M-K pictures, starting from the pictures with the lowest scores, wherein K is the number of pictures with scores higher than the preset value.
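The two removal steps can be sketched as follows. The function name `select_calibration` and the toy scores are hypothetical. The text does not say what happens when K exceeds M; this sketch simply deletes nothing further in that case, which is an assumption.

```python
def select_calibration(ranked, scores, n_keep, max_score):
    """Remove M = L - N pictures from a ranked list of L pictures.

    ranked:    picture ids sorted by score from high to low (length L)
    scores:    dict mapping picture id -> score
    n_keep:    N, the number of pictures expected as calibration data
    max_score: preset value above which a score counts as abnormal
    """
    m_remove = len(ranked) - n_keep
    # Step 1: reject the K pictures whose score exceeds the preset value.
    kept = [p for p in ranked if scores[p] <= max_score]
    k_removed = len(ranked) - len(kept)
    # Step 2: delete the remaining M - K pictures from the low-score end.
    extra = max(0, m_remove - k_removed)  # assumption: nothing more if K >= M
    if extra:
        kept = kept[:len(kept) - extra]
    return kept

scores = {"p1": 9.5, "p2": 4.0, "p3": 3.0, "p4": 2.0, "p5": 1.0, "p6": 0.5}
ranked = sorted(scores, key=scores.get, reverse=True)
calibration = select_calibration(ranked, scores, n_keep=4, max_score=8.0)
print(calibration)  # ['p2', 'p3', 'p4', 'p5']
```

Here L = 6 and N = 4, so M = 2: one picture ("p1") is rejected for its high score (K = 1), and one more ("p6") is deleted from the low-score end.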

Further, removing M pictures from the L pictures includes:

removing M pictures from the L pictures by using an outlier screening method.

Further, the quantization strategies include: per-tensor quantization, per-channel quantization, relative-entropy-based (KL divergence, Kullback-Leibler divergence) quantization, and hybrid quantization.

Based on the model quantization method as described above, a second aspect of the invention provides an electronic device of model quantization, comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run by the processor, performs the model quantization method as described above.

The third aspect of the present invention also provides a computer readable storage medium storing a computer program which, when run on a processor, performs a model quantization method as described above.

According to the model quantization method, electronic device, and medium of the invention, calibration data is selected from the full training data by a clustering method, so that the activation value distribution of the calibration data matches the model, which improves the stability of model accuracy. In addition, the quantization strategy is selected based on the calibration data, which effectively shortens the time of each quantization run, improves efficiency, and saves bandwidth resources. Finally, quantizing the full data with the selected quantization strategy yields an optimal quantized model.

Drawings

To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, for clarity, the same or corresponding parts will be designated by the same or similar reference numerals.

Fig. 1 shows a flow chart of a model quantization method according to an embodiment of the present invention.

Detailed Description

In the following description, the present invention is described with reference to various embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details, or with other alternative and/or additional methods or components. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the invention. Similarly, for purposes of explanation, specific numbers and configurations are set forth in order to provide a thorough understanding of embodiments of the present invention. However, the invention is not limited to these specific details.

Reference throughout this specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.

It should be noted that the embodiments of the present invention describe the steps of the method in a specific order; however, this is merely for the purpose of illustrating the specific embodiments and does not limit the order of the steps. Rather, in different embodiments of the present invention, the order of the steps may be adjusted according to actual requirements.

The invention addresses two problems in the PTQ process: unstable model accuracy caused by randomly selecting calibration pictures, and the long time required when the full data is used as calibration pictures. The invention provides a model quantization method based on optimized calibration data, which uses a clustering (Cluster) method to select pictures from the full training data, then eliminates abnormal data to obtain the final calibration data, and then selects an optimal quantization strategy based on the calibration data. This method can effectively improve the stability of model accuracy and improve system efficiency.

The technical scheme of the invention is further described below with reference to the accompanying drawings of the embodiments.

Fig. 1 shows a flow chart of a model quantization method according to an embodiment of the present invention. As shown in fig. 1, a model quantization method includes:

first, at step 101, the pictures are scored. The full training data pictures are scored and sorted by score from high to low. In one embodiment of the invention, the calibration data is selected from the full training data pictures by a clustering method. To avoid a mismatch in the activation value distribution of the calibration data, in one embodiment of the invention, pictures are chosen as calibration data according to their activation values in the model. Specifically, a cluster centroid of each layer's activation values in the model is first determined based on the full training data pictures; then the shortest distance between each picture's activation values in the model and the cluster centroid is calculated and taken as the score of the picture. Consequently, the farther a picture is from the cluster centroid, the higher its score;

next, at step 102, the pictures are sorted. The full training data pictures are sorted from high to low by the scores calculated in step 101, and the top L pictures are selected, wherein L is a natural number;

next, at step 103, pictures with abnormal scores are culled. In embodiments of the present invention, excessively high scores are regarded as abnormal scores that affect quantization accuracy, so the corresponding pictures need to be removed. For this reason, in one embodiment of the present invention, the value of L is generally greater than the number N of pictures expected as calibration data; for example, when 100 pictures are expected as calibration data, L may be 110.

In one embodiment of the invention, pictures are rejected based on the definition of an abnormal score, yielding N final calibration pictures. Specifically, pictures with scores higher than a preset value are removed. Because the number K of such pictures may be smaller than L-N, a further L-N-K pictures are deleted, starting from the pictures with the lowest scores, so that only N pictures are retained as calibration data.

In another embodiment of the present invention, a common outlier screening method may be used to reject L-N pictures from the L pictures to obtain the final N calibration pictures. Outlier screening methods include methods based on statistics, clustering, classification, information theory, distance, density, and the like. A statistics-based method presets a model for the scores of the L pictures, judges which scores are abnormal according to how well each score fits the preset model, and rejects them. A clustering-based method applies a clustering algorithm to the scores of the L pictures, identifies points that do not belong to any cluster as abnormal scores, and rejects them. A classification-based method applies a classification algorithm to determine the abnormal scores. An information-theory-based method applies information theory to the detection of abnormal scores. A distance-based method regards a score as abnormal when its distance to more than a certain proportion of the other scores exceeds a preset value, and then removes it. A density-based method calculates a local outlier factor for each score according to the density of the scores of the L pictures, uses the factor to measure the degree of outlierness, and finally removes the L-N scores with the highest degree of outlierness as abnormal scores.
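One very simple statistics-style variant of this screening can be sketched as follows. It is not the local outlier factor method named above: as a stand-in, it measures each score's degree of outlierness by its absolute deviation from the median and removes the L-N scores with the largest deviation. The function name `drop_outliers` and the toy scores are hypothetical.

```python
import statistics

def drop_outliers(scores, n_keep):
    """Keep the n_keep pictures whose scores deviate least from the median.

    scores: dict mapping picture id -> score for the L top-ranked pictures
    n_keep: N, the number of pictures to retain as calibration data
    """
    median = statistics.median(scores.values())
    # Assumption: outlier degree = absolute deviation from the median score.
    by_degree = sorted(scores, key=lambda p: abs(scores[p] - median))
    return by_degree[:n_keep]

scores = {"a": 5.0, "b": 5.2, "c": 4.9, "d": 12.0, "e": 5.1}
kept = drop_outliers(scores, n_keep=4)
print(sorted(kept))  # ['a', 'b', 'c', 'e']
```

A density-based screen would replace the deviation measure with a local outlier factor while keeping the same "remove the L-N largest" selection.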

Further, since the time required for quantization is proportional to the number of calibration pictures, in one embodiment of the present invention, the number N of pictures expected as calibration data is determined according to the expected quantization duration. Specifically, the time T₁ needed to quantize the model using the full training data pictures as calibration data is first estimated from the full training data pictures specified by the user and the model to be quantized. Then, according to the quantization duration T₂ acceptable to the user, the number N of pictures expected as calibration data is determined:

N = N₁T₂/T₁,

wherein N₁ is the total number of full training data pictures;

next, in step 104, an optimal quantization strategy is selected. Based on the calibration data obtained in step 103, the model is quantized with different quantization strategies, and the optimal one is selected. Since the calibration data contains few pictures, each quantization run takes relatively little time. In one embodiment of the present invention, the quantization strategies include a per-tensor quantization strategy, a per-channel quantization strategy, a relative-entropy-based (KL divergence, Kullback-Leibler divergence) quantization strategy, a hybrid quantization strategy, and the like. The per-tensor quantization strategy uses a single set of quantization parameters (scale, zero-point) for an entire tensor. The per-channel quantization strategy uses a separate set of quantization parameters (scale, zero-point) for each channel. Compared with per-tensor quantization, per-channel quantization requires more quantization parameters to be stored but offers finer granularity.
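The difference between the per-tensor and per-channel strategies can be made concrete with a small sketch. It assumes asymmetric 8-bit affine quantization with a min/max range estimate, which is one common choice and not something the text prescribes; the function names are hypothetical.

```python
def affine_params(values, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping the value range onto [qmin, qmax]."""
    # The represented range is widened to include 0 so that 0.0 is exact.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def per_tensor_params(weight):
    """One (scale, zero_point) pair shared by the whole tensor."""
    flat = [v for channel in weight for v in channel]
    return affine_params(flat)

def per_channel_params(weight):
    """One (scale, zero_point) pair for each channel of the tensor."""
    return [affine_params(channel) for channel in weight]

# Toy weight tensor with two channels of very different magnitude.
weight = [[-1.0, 0.5, 1.0], [-0.01, 0.0, 0.02]]
tensor_scale, tensor_zp = per_tensor_params(weight)
channel_params = per_channel_params(weight)
print(tensor_scale, [scale for scale, _ in channel_params])
```

With a shared scale, the small-magnitude second channel is quantized with the coarse step size dictated by the first channel; the per-channel variant stores two parameter pairs but gives the second channel a much smaller step.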

In one embodiment of the present invention, the quantization strategy is evaluated by comparing, for each layer of the model, the floating-point output vector before quantization with the fixed-point output vector after quantization, using cosine similarity and/or Euclidean distance, etc. The closer the cosine similarity is to 1, and/or the closer the Euclidean distance is to 0, the better the corresponding quantization strategy. It should be appreciated that in other embodiments of the present invention, other evaluation metrics and/or methods may be employed to evaluate the quantization strategy; and

finally, at step 105, the model is quantized. The full data is quantized using the optimal quantization strategy to obtain the quantized model.
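The strategy evaluation of step 104 can be sketched as follows, assuming the per-layer outputs of the floating-point model and of each quantized candidate have already been collected on the calibration data. Averaging the per-layer cosine similarities into a single figure per strategy is an assumption, since the text does not state how layers are aggregated; the function names and toy outputs are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pick_best_strategy(float_outputs, quantized_outputs):
    """Return the strategy whose layer outputs best match the float model.

    float_outputs:     list of per-layer output vectors of the float model
    quantized_outputs: dict mapping strategy name -> list of per-layer vectors
    """
    def mean_similarity(name):
        layers = quantized_outputs[name]
        total = sum(cosine_similarity(f, q) for f, q in zip(float_outputs, layers))
        return total / len(layers)
    return max(quantized_outputs, key=mean_similarity)

float_outputs = [[1.0, 2.0, 3.0], [0.5, -0.5, 1.0]]
quantized_outputs = {
    "per_tensor": [[1.0, 2.0, 2.0], [0.5, -0.25, 1.0]],
    "per_channel": [[1.0, 2.0, 3.0], [0.5, -0.5, 1.0]],
}
best = pick_best_strategy(float_outputs, quantized_outputs)
print(best)  # per_channel
```

The selected strategy would then be applied in step 105; Euclidean distance could be substituted by ranking with `min` on `math.dist` instead of `max` on cosine similarity.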

Based on the model quantization method as described above, the invention further provides an electronic device for model quantization, comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run by the processor, performs the model quantization method as described above.

Furthermore, the present invention provides a computer readable storage medium storing a computer program which, when run on a processor, performs a model quantization method as described above.

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to those skilled in the relevant art that various combinations, modifications, and variations can be made therein without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention as disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

1. A method for model quantization, comprising the steps of:

selecting the L top-ranked pictures by score, wherein L = M + N, N and M are natural numbers, N is the number of pictures expected to serve as calibration data, M is the number of redundant pictures, and the value of N is determined according to the expected quantization duration by:

calculating a time length T₁ required for quantizing a model to be quantized by using the full training data pictures as calibration data; and

determining, according to the proportional relationship between the time length T₁ and the total number N₁ of full training data pictures, and based on the expected quantization duration T₂, the number N of pictures expected as calibration data:

N = N₁T₂/T₁;

removing M pictures from the L pictures, and using the remaining N pictures as calibration data;

quantizing a model based on the calibration data using a plurality of quantization strategies; and

selecting an optimal quantization strategy from the plurality of quantization strategies, and quantizing the full data with the optimal quantization strategy to obtain a quantized model.

2. The model quantization method of claim 1, wherein scoring the full-scale training data picture comprises the steps of:

3. The model quantization method according to claim 1, wherein the step of removing M pictures from the L pictures includes the steps of:

rejecting pictures with scores higher than a preset value; and

4. The model quantization method of claim 1, wherein removing M pictures from the L pictures comprises:

5. The model quantization method of claim 1, wherein the quantization strategy comprises: per-tensor quantization, per-channel quantization, relative-entropy-based quantization, and hybrid quantization.

6. An electronic device for model quantization, comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run by the processor, performs the model quantization method according to any of claims 1 to 5.

7. A computer readable storage medium storing a computer program which, when run on a processor, performs the model quantization method according to any one of claims 1 to 5.