CN114580280A - Model quantization method, device, apparatus, computer program and storage medium - Google Patents


Info

Publication number: CN114580280A
Authority: CN (China)
Prior art keywords: quantization, error, model, initial, value
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202210199396.1A
Other languages: Chinese (zh)
Inventor: 张琦
Current Assignee: Beijing Sensetime Technology Development Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202210199396.1A
Publication of CN114580280A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00: Computer-aided design [CAD]
    • G06F 30/20: Design optimisation, verification or simulation
    • G06F 30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F 2119/00: Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/02: Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiments of the present disclosure disclose a model quantization method, apparatus, device, computer program, and storage medium that can improve the precision of model quantization and the universality of quantized model deployment. The method comprises the following steps: performing model inference with an initial model to be quantized to obtain an initial data set; performing simulated quantization on the initial model based on the quantization mode of a target deployment platform to obtain a quantized data set, where simulated quantization quantizes and restores the activation value output by each network layer in the initial model before inference continues in the next network layer; calculating a quantization error set based on the initial data set and the quantized data set; adjusting the initial model based on the quantization error set to obtain a quantized model; and generating quantization parameters based on the truncation values of the activation values after simulated quantization of each network layer, the quantization parameters being used to deploy the quantized model to the target deployment platform.

Description

Model quantization method, device, apparatus, computer program and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, a computer program, and a storage medium for model quantization.
Background
At present, when deploying and applying a neural network model such as a deep learning model, model quantization is commonly used to significantly reduce the size of the model, shorten its running time, and improve algorithmic efficiency. However, quantization raises several challenges. First, quantizing a model, for example from floating point to 8-bit integer, usually causes a loss of precision; the related art typically mitigates this loss with mixed-precision methods, whose improvement in precision is very limited. Second, current model quantization tools perform quantization according to the software/hardware configuration of a single target deployment platform, which reduces the universality of applying the quantized model on other platforms.
Disclosure of Invention
Embodiments of the present disclosure are intended to provide a model quantization method, apparatus, device, computer program, and storage medium, which can improve accuracy of model quantization and universality of quantized model deployment.
The technical scheme of the disclosure is realized as follows:
the embodiment of the disclosure provides a model quantization method, which includes:
performing model inference with an initial model to be quantized to obtain an initial data set; the initial data set comprises a truncation value of the activation value output by each network layer in the initial model;
performing simulated quantization on the initial model based on the quantization mode of a target deployment platform to obtain a quantized data set; the simulated quantization quantizes and restores the activation value output by each network layer in the initial model before inference continues in the next network layer, and the quantized data set comprises a truncation value of the activation value after simulated quantization of each network layer;
calculating a quantization error set based on the initial data set and the quantized data set, and adjusting the initial model based on the quantization error set to obtain a quantized model;
generating quantization parameters based on the truncation value of the activation value after simulated quantization of each network layer; the quantization parameters are used to deploy the quantized model to the target deployment platform.
The embodiment of the present disclosure provides a model quantization apparatus, including:
an inference module, configured to perform model inference with an initial model to be quantized to obtain an initial data set; the initial data set comprises a truncation value of the activation value output by each network layer in the initial model;
a simulated quantization module, configured to perform simulated quantization on the initial model based on the quantization mode of a target deployment platform to obtain a quantized data set; the simulated quantization quantizes and restores the activation value output by each network layer in the initial model before inference continues in the next network layer, and the quantized data set comprises a truncation value of the activation value after simulated quantization of each network layer;
an error adjustment module, configured to calculate a quantization error set based on the initial data set and the quantized data set, and adjust the initial model based on the quantization error set to obtain a quantized model;
a parameter generation module, configured to generate quantization parameters based on the truncation value of the activation value after simulated quantization of each network layer; the quantization parameters are used to deploy the quantized model to the target deployment platform.
In the above apparatus, the inference module is further configured to obtain quantization calibration data; perform model inference on the quantization calibration data through each network layer in the initial model to obtain at least one activation value output by each network layer; and perform statistical truncation on the at least one activation value of each network layer using at least one preset statistical algorithm, obtaining the truncation value of the activation value corresponding to each network layer as the initial data set.
In the above apparatus, the truncation value of the activation value corresponding to each network layer comprises any one of:
the maximum value of the at least one activation value, the minimum value of the at least one activation value, or a variance value computed from the median of the at least one activation value.
In the above apparatus, the simulated quantization module is further configured to, for each network layer in the initial model, obtain the quantization scale corresponding to that network layer from the truncation value of its activation value; perform inference on the (i-1)-th quantized data output by the (i-1)-th layer with the i-th network layer of the initial model to obtain the i-th initial activation value, where i is a positive integer greater than or equal to 2; the 1st quantized data is obtained by performing inference on the quantization calibration data with the first network layer of the initial model and applying quantization and scale restoration to the resulting first initial activation value; and, according to the quantization scale of the i-th network layer, combined with the quantization mode and a preset quantization precision, apply quantization and scale restoration to the i-th initial activation value to obtain the i-th quantized data, achieving simulated quantization of the i-th network layer, until every network layer in the initial model has been simulation-quantized, obtaining the quantized data set.
In the above apparatus, the quantization mode includes the following operations, which the simulated quantization module is further configured to perform: round the ratio of the i-th initial activation value to the quantization scale to obtain the i-th initial quantized value; adjust the i-th initial quantized value with a zero-point coefficient to obtain the i-th intermediate quantized value; truncate the i-th intermediate quantized value to the preset quantization precision to obtain the i-th quantized value, completing the quantization process; and take the product of the i-th quantized value and the quantization scale as the i-th quantized data, completing the scale restoration process.
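The quantize-and-restore steps above (round the ratio to the scale, shift by a zero-point coefficient, truncate to the target precision, multiply back by the scale) can be sketched as follows. This is a minimal illustration assuming symmetric signed int8 quantization with a default zero point of 0; the function name, integer range, and example values are not from the patent.

```python
import numpy as np

def simulated_quantize(x, scale, zero_point=0, bits=8):
    """Quantize-and-restore an activation tensor, mirroring the steps above.

    x: float activations; scale: quantization scale derived from the
    truncation value; zero_point and bits are illustrative assumptions.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # [-128, 127] for int8
    q = np.round(x / scale)           # i-th initial quantized value
    q = q + zero_point                # adjust with the zero(-point) coefficient
    q = np.clip(q, qmin, qmax)        # truncate to the preset quantization precision
    return (q - zero_point) * scale   # scale restoration via the quantization scale

# Example: a scale chosen so a truncation value of 6.0 spans the int8 range
x = np.array([-7.0, -1.05, 0.0, 0.5, 5.9])
scale = 6.0 / 127
x_restored = simulated_quantize(x, scale)
```

Values inside the representable range are restored to within half a quantization step; the out-of-range value (-7.0) is clipped to the grid's minimum.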
In the above apparatus, the error adjustment module is further configured to calculate the cosine distance between each initial datum in the initial data set and the corresponding quantized datum in the quantized data set as the quantization error of each network layer, thereby obtaining the quantization error set.
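A hedged sketch of this cosine-distance comparison, assuming each data set is a list of per-layer activation tensors; the function names and toy tensors are illustrative, not part of the patent:

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """1 minus cosine similarity between two flattened activation tensors."""
    a, b = a.ravel(), b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps
    return 1.0 - float(np.dot(a, b) / denom)

def quantization_error_set(initial_data, quantized_data):
    """Per-layer quantization errors: one cosine distance per network layer."""
    return [cosine_distance(x, q) for x, q in zip(initial_data, quantized_data)]

# Identical tensors give a distance near 0; a perturbed tensor gives a larger one
layer_fp = [np.array([1.0, 2.0, 3.0]), np.array([0.5, -0.5])]
layer_q = [np.array([1.0, 2.0, 3.0]), np.array([0.6, -0.4])]
errors = quantization_error_set(layer_fp, layer_q)
```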
In the above apparatus, the error adjustment module is further configured to determine an error adjustment mode according to the platform type of the target deployment platform when error-value analysis and/or error-distribution analysis of the quantization error set determines that a preset adjustment condition is reached, and to adjust the network parameters of the initial model in that error adjustment mode to obtain the quantized model.
In the above apparatus, the error adjustment module is further configured to, when error-value analysis and/or error-distribution analysis of the quantization error set determines that a preset adjustment condition is reached, adjust the network parameters of the initial model based on the result of the error-value and/or error-distribution analysis to obtain the quantized model.
In the above apparatus, the error adjustment module is further configured to, when error-value analysis and/or error-distribution analysis of the quantization error set determines that a preset adjustment condition is reached, perform a tensor comparison of the initial data set and the quantized data set and adjust the network parameters of the initial model to obtain the quantized model.
In the above apparatus, the error adjustment module is further configured to obtain tensor distribution information by comparing the tensor value distributions of the initial data set and the quantized data set, and/or obtain quantization grouping information by comparing the tensor scale information of the initial data set and the quantized data set, where the tensor scale characterizes the grouping form used for quantization; take the tensor distribution information and/or the quantization grouping information as error performance information and determine an error adjustment mode based on the error performance information alone or combined with the platform type of the target deployment platform; and adjust the network parameters of the initial model in that error adjustment mode to obtain the quantized model.
In the above apparatus, the error adjustment module is further configured to obtain, from the tensor distribution information in the error performance information, the difference in distribution characteristics between the tensor values of the initial data set and of the quantized data set in each channel; determine a model cross-layer averaging algorithm as the error adjustment mode when the difference in distribution characteristics satisfies a preset tensor-distribution adjustment condition; and/or determine an offset correction algorithm as the error adjustment mode when the quantization grouping information in the error performance information indicates at least two quantization groups; and/or determine the preset quantization precision corresponding to the platform type when the difference in distribution characteristics satisfies the preset tensor-distribution adjustment condition or the quantization grouping information indicates at least two quantization groups, and determine an adaptive rounding algorithm as the error adjustment mode when that preset quantization precision is higher than a preset precision threshold.
In the above apparatus, the error adjustment module is further configured to determine that the preset adjustment condition is reached when the maximum quantization error in the quantization error set is greater than a first preset error threshold, and/or when the number of quantization errors in the set that are greater than a second preset error threshold exceeds an error-count threshold, where the second preset error threshold is smaller than the first preset error threshold.
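A minimal sketch of these two trigger conditions (maximum error above one threshold, or too many errors above a lower threshold); the threshold values are illustrative assumptions, not from the patent:

```python
def needs_adjustment(errors, first_threshold=0.1,
                     second_threshold=0.02, count_threshold=3):
    """Preset adjustment condition: the max error is too large, or too many
    layers have a moderately large error. Threshold values are illustrative."""
    assert second_threshold < first_threshold  # as required by the claim
    if max(errors) > first_threshold:
        return True
    return sum(1 for e in errors if e > second_threshold) > count_threshold

errors = [0.01, 0.005, 0.15, 0.02]  # one layer with a large quantization error
trigger = needs_adjustment(errors)
```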
In the above apparatus, the parameter generation module is configured to perform at least one of parameter screening, serialization, and packing on the truncation value of the activation value after simulated quantization of each network layer, according to a preset quantization index of the target deployment platform, to generate the quantization parameters.
An embodiment of the present disclosure provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the model quantization method provided by the embodiment of the disclosure when executing the executable instructions stored in the memory.
The embodiment of the present disclosure provides a computer-readable storage medium, which stores executable instructions for causing a processor to implement the model quantization method provided by the embodiment of the present disclosure when executed.
The disclosed embodiments provide a computer program product comprising a computer program or instructions, which when executed by a processor, implement the model quantization method provided by the disclosed embodiments.
The embodiment of the disclosure has the following beneficial effects:
A quantized data set is obtained through simulated quantization, and an error comparison between the initial data set and the quantized data set yields a quantization error set. Error analysis can therefore be performed on the quantization error set and the initial model adjusted accordingly, effectively improving the precision of the quantized model. Moreover, since the initial model is simulation-quantized according to the quantization mode of the target deployment platform and the generated quantization parameters are used for deployment on that platform, multi-platform deployment is supported: model quantization can be performed in different quantization modes according to the different software and hardware configurations of each target deployment platform to generate different quantization parameters, and the quantized model can then be deployed on each platform with its corresponding quantization parameters. This achieves high-precision multi-platform model quantization and improves the universality of applying the quantized model.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an alternative model quantization method provided by an embodiment of the present disclosure;
FIG. 2 is an alternative flow chart diagram of a model quantization method provided by an embodiment of the present disclosure;
FIG. 3 is an alternative flow chart diagram of a model quantization method provided by the embodiments of the present disclosure;
FIG. 4 is an alternative flow chart diagram of a model quantization method provided by the embodiments of the present disclosure;
FIG. 5 is an alternative flow chart diagram of a model quantization method provided by an embodiment of the present disclosure;
FIG. 6 is an alternative flow chart diagram of a model quantization method provided by the embodiments of the present disclosure;
fig. 7 is an alternative flow chart illustrating the application of the model quantization method provided by the embodiment of the present disclosure to an actual scene;
fig. 8 is a schematic structural diagram of a model quantization apparatus provided in an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first \ second \ third" merely distinguish similar objects and do not denote a particular order; where permissible, the specific order or sequence may be interchanged, so that the embodiments of the disclosure described herein can be practiced in an order other than that shown or described herein.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, "including at least one of A, B, C" may mean including any one or more elements selected from the set consisting of A, B, and C.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the disclosure only and is not intended to be limiting of the disclosure.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure.
The embodiments of the present disclosure provide a model quantization method, apparatus, device, computer program, and storage medium, which can improve the precision of model quantization and the universality of quantized model deployment. An exemplary application of the electronic device provided by the embodiments of the present disclosure is described below. The electronic device may be implemented as various types of terminals, such as a smartphone, a smart watch, a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), an intelligent voice interaction device, an intelligent appliance, or a vehicle-mounted terminal, and may also be implemented as a server. As another implementable manner of the embodiments of the present disclosure, the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms, but is not limited thereto.
The model quantization method according to the embodiments of the present disclosure is described below with an electronic device as the execution subject. Fig. 1 is an alternative flowchart of a model quantization method provided by an embodiment of the present disclosure, which will be described with reference to the steps shown in Fig. 1.
S101, performing model inference with the initial model to be quantized to obtain an initial data set; the initial data set contains the truncation value of the activation value output by each network layer in the initial model.
The embodiments of the present disclosure are suited to scenarios in which a trained neural network model is quantized to obtain a quantized model of reduced size, so that the quantized model can be deployed on an application platform, where it realizes the relevant functions of the neural network model.
In the embodiment of the present disclosure, the electronic device acquires the neural network model of the original precision that has completed training as an initial model to quantify the initial model.
In the embodiment of the disclosure, the electronic device performs model inference with the initial model and collects, per network layer, the intermediate data output by each network layer in the initial model. The electronic device then computes, from the intermediate data output by each network layer, the truncation value of the activation value output by that layer.
In some embodiments, the initial model may be a floating-point type network model, which is selected according to actual situations, and the embodiments of the present disclosure are not limited thereto.
S102, performing simulated quantization on the initial model based on the quantization mode of a target deployment platform to obtain a quantized data set; the simulated quantization quantizes and restores the activation value output by each network layer in the initial model before inference continues in the next network layer, and the quantized data set comprises the truncation value of the activation value after simulated quantization of each network layer.
In the embodiment of the disclosure, the electronic device may perform simulated quantization on the initial model based on the quantization mode of the target deployment platform, so as to simulate how the initial model would behave after being quantized in that quantization mode, thereby obtaining the quantized data set.
During simulated quantization, the electronic device quantizes the activation value output by each network layer of the initial model, then restores (dequantizes) the quantized activation value and uses it as the input of the next network layer, so that each network layer runs at the numerical precision it would have after quantization.
In the embodiment of the present disclosure, the electronic device computes the truncation value of the activation value from each network layer's simulation-quantized, i.e. quantized and restored, activation value, obtaining the truncation value of the activation value after simulated quantization of each network layer as quantized data, and takes the set of quantized data of at least one network layer as the quantized data set.
In some embodiments, the electronic device may quantize a floating-point activation value to an 8-bit integer and then dequantize it back to floating point, thereby completing the simulated quantization process.
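A minimal sketch of this simulated-quantization flow, where each layer's output is quantized to an 8-bit grid and restored to floating point before feeding the next layer; the toy layers, scales, and the use of max |activation| as the per-layer truncation value are assumptions for illustration:

```python
import numpy as np

def fake_quant(x, scale, bits=8):
    """Quantize a float tensor to a signed integer grid, then restore it
    to float (dequantize); a common way to simulate quantization."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def simulated_inference(calib_batch, layers, scales):
    """Run each (illustrative) layer, fake-quantizing its activations
    before they feed the next layer, and collect a per-layer truncation
    value (here: max |activation|) of the simulated-quantized outputs."""
    x = calib_batch
    truncation_values = []
    for layer, scale in zip(layers, scales):
        x = fake_quant(layer(x), scale)  # quantize and restore, then pass on
        truncation_values.append(float(np.abs(x).max()))
    return x, truncation_values

# Two toy "network layers"; the layer functions and scales are assumptions.
layers = [lambda x: 2.0 * x, lambda x: x - 1.0]
scales = [4.0 / 127, 5.0 / 127]
out, cutoffs = simulated_inference(np.array([0.1, 0.7, -0.3]), layers, scales)
```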
S103, calculating a quantization error set based on the initial data set and the quantization data set; and adjusting the initial model based on the quantization error set to obtain a quantization model.
In the embodiment of the disclosure, the electronic device compares the quantized data set obtained by simulated quantization against the initial data set obtained by inference with the original-precision initial model, yielding the quantization error set. The electronic device can then analyze the errors, e.g. their distribution and magnitude, so as to locate from the quantization error set the network layers that lose the most precision after quantization, or the distribution characteristics of the error in each layer, and select a corresponding error adjustment mode based on the analysis result to adjust the network parameters of the initial model and obtain the quantized model.
It can be understood that the quantized model is obtained through error adjustment; therefore, compared with a directly quantized model or a mixed-precision model of the related art, it can achieve higher precision.
S104, generating quantization parameters based on the truncation value of the activation value after simulated quantization of each network layer; the quantization parameters are used to deploy the quantized model to the target deployment platform.
In the embodiment of the disclosure, the electronic device generates the quantization parameter based on the cutoff value of the activation value after each network layer simulation quantization, so as to issue the quantization parameter and the quantization model to the target deployment platform together, so that the target deployment platform can deploy and apply the quantization model according to the quantization parameter.
In some embodiments, the electronic device may perform at least one of parameter screening, serialization, and packing on the truncated value of the activation value after each network layer simulation quantization according to a preset quantization index of the target deployment platform, so as to generate a quantization parameter.
For example, the electronic device may screen the truncation values of the activation values after simulated quantization of each network layer according to the target deployment platform's handling of truncation values, its requirements on data format, and its requirements on the network layers; serialize or pack the screened truncation values; and generate the quantization parameters in the form of a file. In this way, when the target deployment platform deploys the quantized model, the concrete deployment process can be carried out according to the quantization parameters, completing the application of the quantized model in an actual scenario.
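As a hedged illustration of screening, serializing, and packing truncation values into a parameter file (the JSON layout, file name, and screening rule are assumptions, not part of the patent):

```python
import json

def pack_quantization_params(truncation_values, layer_names, path):
    """Screen, serialize, and pack per-layer truncation values into a
    parameter file; the JSON layout is an illustrative assumption."""
    params = {
        name: {"truncation_value": t}
        for name, t in zip(layer_names, truncation_values)
        if t > 0.0  # parameter screening: drop degenerate layers
    }
    with open(path, "w") as f:
        json.dump(params, f, indent=2)  # serialization + packing into one file
    return params

params = pack_quantization_params([6.0, 0.0, 3.5],
                                  ["conv1", "unused", "fc1"],
                                  "quant_params.json")
```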
It can be understood that, in the embodiment of the present disclosure, a quantized data set is obtained through simulated quantization, and an error comparison between the initial data set and the quantized data set yields a quantization error set; error analysis can therefore be performed on the quantization error set and the initial model adjusted, effectively improving the precision of the quantized model. Furthermore, simulation-quantizing the initial model according to the quantization mode of the target deployment platform and generating quantization parameters for deployment on that platform supports multi-platform deployment: quantization can be performed in different quantization modes for the different software and hardware configurations of each target platform, producing different quantization parameters, and the quantized model can then be deployed on each platform with its corresponding parameters, achieving high-precision multi-platform model quantization and improving the universality of applying the quantized model.
In some embodiments, based on fig. 1, as shown in fig. 2, S101 in fig. 1 may be implemented by performing the processes of S1011-S1013, which will be described in conjunction with the steps.
And S1011, acquiring quantitative calibration data.
In the embodiment of the present disclosure, the quantized calibration data is a preset sample data set, and is used for verifying or calibrating the inference result of model quantization. For example, the quantitative calibration data may be a test set or a verification set containing a large number of picture samples, and the like, which is selected according to actual situations, and the embodiments of the present disclosure are not limited thereto.
And S1012, performing model reasoning on the quantitative calibration data through each network layer in the initial model to obtain at least one activation value output by each network layer.
In the embodiment of the disclosure, the electronic device obtains the quantized calibration data, performs forward propagation model inference and prediction on the quantized calibration data through each network layer in the initial model, and counts intermediate data output by each network layer as at least one activation value.
In some embodiments, the electronic device may perform model inference on the quantization calibration data based on any one of min-max, KL (Kullback-Leibler) divergence quantization, EQ (EasyQuant) quantization, and the like, to obtain activation values corresponding to the different algorithms. The specific choice depends on the actual situation, and the embodiments of the present disclosure are not limited thereto.
And S1013, performing statistical truncation processing on at least one activation value of each network layer by using at least one preset statistical algorithm to obtain a truncation value of the activation value corresponding to each network layer, wherein the truncation value is used as an initial data set.
In the embodiment of the disclosure, at least one preset statistical algorithm is deployed in advance on the electronic device. For each network layer, the electronic device may select any one statistical algorithm from the at least one preset statistical algorithm and perform statistical truncation processing on the at least one activation value of that network layer, so that different network layers may use different statistical algorithms to calculate the cutoff values of their activation values; the cutoff value of the activation value corresponding to each network layer is thus obtained and used as the initial data set.
In some embodiments, S1013 may include any of the following:
for each network layer, counting the maximum value among the at least one activation value as the cutoff value of the activation value; or, counting the minimum value among the at least one activation value as the cutoff value of the activation value; or, calculating the median of the at least one activation value and performing a variance calculation based on the median, for example, multiple rounds of variance calculation, to obtain the cutoff value of the activation value.
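The three statistics above can be sketched in a few lines. This is an illustrative reading only; the function name, the `method` labels, and the variance multiplier are assumptions, not taken from the disclosure:

```python
import numpy as np

def activation_cutoff(activations, method="max"):
    """Cutoff (truncation) value for one layer's activations.

    The method names and the factor 3.0 below are illustrative assumptions;
    the disclosure only requires that each layer may use a different statistic.
    """
    a = np.abs(np.asarray(activations, dtype=float).ravel())
    if method == "max":
        return float(a.max())          # maximum as the cutoff value
    if method == "min":
        return float(a.min())          # minimum as the cutoff value
    if method == "median_var":
        med = float(np.median(a))      # median of the activations
        spread = float(np.sqrt(np.mean((a - med) ** 2)))  # variance around the median
        return med + 3.0 * spread      # cutoff a few spreads above the median
    raise ValueError(f"unknown method: {method}")
```

For instance, `activation_cutoff([1.0, -2.0, 3.0], "max")` yields 3.0, the largest magnitude seen for the layer.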
It can be understood that, by using any one of at least one preset statistical algorithm to calculate the cutoff values of the activation values of different network layers, a more suitable statistical algorithm can be selected according to different activation values output by different network layers to obtain the cutoff values of the activation values with higher precision.
In some embodiments, based on fig. 1 or fig. 2, as shown in fig. 3, S102 may be implemented by S1021-S1023, which will be described in conjunction with the steps.
And S1021, for each network layer in the initial model, obtaining a quantization scale corresponding to each network layer according to the cutoff value of the activation value corresponding to each network layer.
In the embodiment of the present disclosure, in the case of performing analog quantization, the electronic device calculates the quantization scale, that is, the scaling factor used in the quantization process, based on the cutoff value of the activation value corresponding to each network layer obtained by the statistical algorithm, so as to obtain the quantization scale corresponding to each network layer.
In some embodiments, taking a signed 8-bit integer as the quantization target as an example, the electronic device may divide the cutoff value of the activation value corresponding to each network layer by 128 to obtain the quantization scale corresponding to each network layer.
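As a one-line sketch of this step (the function name is illustrative; the divisor 128 for a signed 8-bit target follows the example above):

```python
def quantization_scale(cutoff, num_levels=128):
    # Signed 8-bit example from the disclosure: scale = cutoff / 128.
    return cutoff / num_levels
```

For example, a layer whose activations truncate at 6.4 would get a quantization scale of 0.05.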
S1022, reasoning on the (i-1)-th quantized data output by the (i-1)-th layer by using the i-th layer network in the initial model to obtain the i-th initial activation value; wherein i is a positive integer greater than or equal to 2; the 1st quantized data is obtained by reasoning on the quantized calibration data by the first layer network in the initial model, and performing quantization processing and scale reduction on the first initial activation value obtained by reasoning.
And S1023, according to the quantization scale corresponding to the ith network, combining a quantization mode and preset quantization precision, performing quantization processing and scale reduction on the ith initial activation value to obtain ith quantization data, and realizing analog quantization on the ith network layer until each network layer in the initial model is subjected to analog quantization to obtain a quantization data set.
In the embodiment of the disclosure, for a first network layer in an initial model, the electronic device uses the quantized calibration data as input data of the first network layer, and performs quantized reasoning on the quantized calibration data by using the first network layer to obtain an initial activation value output by the first network layer.
In the embodiment of the disclosure, the electronic device performs quantization processing and scale reduction on the activation value output by the first network layer according to the quantization scale corresponding to the first network layer in combination with the quantization mode and the preset quantization precision to obtain first quantization data. And taking the first quantized data as input data of a second network layer, and continuing the subsequent analog quantization process.
In the embodiment of the disclosure, for each network layer after the first network layer, the quantized data output after analog quantization of the previous network layer is used as the input data of the current network layer to obtain the initial activation value corresponding to the current network layer; then, according to the quantization scale corresponding to the current network layer, in combination with the quantization mode and the preset quantization precision, quantization processing and scale reduction are performed on the initial activation value output by the current network layer to obtain the quantized data corresponding to the current network layer. The electronic device repeats this analog quantization process until the analog quantization of each network layer in the initial model is completed, and the quantized data corresponding to each network layer is obtained and used as the quantized data set.
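The layer-by-layer flow described above, in which each layer consumes the previous layer's quantized output, might be sketched as follows. The layer callables and the `fake_quant` hook are illustrative assumptions, not the disclosure's API:

```python
import numpy as np

def simulate_quantization(layers, calib_data, scales, fake_quant):
    """Run analog quantization layer by layer.

    `layers` are the per-layer forward passes, `scales` the per-layer
    quantization scales, and `fake_quant(x, scale)` performs quantization
    processing plus scale reduction; all names are illustrative.
    """
    quantized_outputs = []
    x = calib_data                         # the first layer consumes the calibration data
    for layer, scale in zip(layers, scales):
        activation = layer(x)              # i-th initial activation value
        x = fake_quant(activation, scale)  # quantize, then restore the scale
        quantized_outputs.append(x)        # becomes the input of layer i+1
    return quantized_outputs               # the quantized data set
```

A toy `fake_quant` such as `lambda a, s: np.round(a / s) * s` is enough to exercise the loop.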
In some embodiments, the quantization mode includes a zero-point coefficient, and on the basis of fig. 3, the quantization processing and scale reduction can be implemented by S301 to S304, which will be described with reference to the respective steps.
S301, rounding the ratio of the ith initial activation value to the quantization scale to obtain the ith initial quantization value.
S302, the ith initial quantization value is adjusted by utilizing the zero coefficient, and the ith intermediate quantization value is obtained.
And S303, utilizing preset quantization precision to truncate the ith intermediate quantization value to obtain an ith quantization value, and finishing the quantization processing process.
S304, taking the product of the ith quantization value and the quantization scale as ith quantization data to finish the scale reduction process.
In the embodiment of the present disclosure, the electronic device may calculate the quantized data corresponding to each network layer by using formula (1), as follows:
Dq = clip(round(FP32 / scale) + zero_point, min, max) × scale (1)
In formula (1), Dq is the quantized data corresponding to each network layer; FP32 is the input data corresponding to each network layer: for the first network layer, FP32 may be the quantized calibration data, and for each network layer after the first network layer, FP32 may be the quantized data corresponding to the previous network layer; scale is the quantization scale calculated in S1021; round denotes a rounding operation, such as rounding to the nearest integer; zero_point is the zero-point coefficient, and the zero_point of target deployment platforms with different software and hardware configurations may differ; min and max represent the bounds determined by the quantization bit number, i.e. the quantization precision, for example, for 8-bit quantization, min and max are -128 and 127; clip denotes truncation with the max and min values. The electronic device performs analog quantization on each network layer through formula (1) to obtain the quantized data corresponding to each network layer, used as the quantized data set.
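Formula (1) and steps S301-S304 can be transcribed directly; the default zero point and the 8-bit bounds below are the example values from the text:

```python
import numpy as np

def fake_quantize(fp32, scale, zero_point=0, qmin=-128, qmax=127):
    """Analog quantization following formula (1):
    Dq = clip(round(FP32 / scale) + zero_point, min, max) * scale
    """
    q = np.round(np.asarray(fp32) / scale) + zero_point  # S301 rounding, S302 zero point
    q = np.clip(q, qmin, qmax)                           # S303 truncate to the quantized range
    return q * scale                                     # S304 scale reduction
```

For example, `fake_quantize(0.26, 0.1)` rounds 2.6 up to the quantized value 3 and restores it to 0.3, while an out-of-range input such as 100.0 is clipped to 127 before rescaling.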
It can be understood that, by performing the simulation quantization process based on the zero coefficient of the target deployment platform, the electronic device can obtain the quantization effect of the initial model on the target deployment platform without actually quantizing the initial model, so that different simulation quantization processes can be performed for different target deployment platforms to obtain a plurality of quantized data sets corresponding to multiple platforms, and the quantization of the multiple platforms by using a set of model quantization tools can be simulated.
In some embodiments, based on fig. 1 or fig. 3, as shown in fig. 4, S103 may be implemented by performing S1031-S1033, which will be described in conjunction with the steps.
And S1031, calculating a cosine distance between each initial data in the initial data set and each quantized data in the quantized data set, and taking the cosine distance as a quantization error corresponding to each network layer, thereby obtaining a quantization error set.
In the embodiment of the disclosure, the electronic device may calculate a cosine distance between each initial data in the initial data set and each quantized data in the quantized data set, and use the cosine distance as an error between each initial data and each quantized data, thereby obtaining a quantized error set.
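The cosine-distance error can be computed as in this minimal sketch; flattening the tensors and the "1 minus similarity" convention are assumptions:

```python
import numpy as np

def quantization_error(initial, quantized):
    """Cosine distance between a layer's initial and quantized outputs."""
    a = np.asarray(initial, dtype=float).ravel()
    b = np.asarray(quantized, dtype=float).ravel()
    cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos_sim   # 0 when the two outputs point the same way
```

Identical (or proportional) outputs give an error of 0; orthogonal outputs give an error of 1.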
In some embodiments, the electronic device may also calculate and evaluate an error between the initial data and the quantized data by using other error calculation methods, which are specifically selected according to actual situations, and the embodiments of the present disclosure are not limited.
S1032, determining an error adjusting mode according to the platform type of the target deployment platform under the condition that the preset adjusting condition is determined to be achieved by carrying out error numerical analysis and/or error distribution analysis on the quantization error set.
In the embodiment of the disclosure, when obtaining the quantization error set, the electronic device may perform at least one of error numerical analysis and error distribution analysis based on the quantization error set, so as to estimate the quantization precision of the initial model through the quantization error set and locate the network layer position that most affects the model quantization precision. Exemplarily, when the quantization error set contains quantization errors with excessively large values and/or contains too many large quantization errors, it indicates that the precision of the analog quantization cannot meet the requirements of actual deployment and application; the electronic device may determine that the quantization error set meets a preset adjustment condition, and then determine an error adjustment mode according to the platform type of the target deployment platform, so as to adjust and improve the precision of the initial model.
In some embodiments, the electronic device may determine the maximum value of the quantization errors in the set of quantization errors, i.e., the largest quantization error among all network layers. When the maximum value of the quantization error is larger than a first preset error threshold, indicating that the maximum error of the initial model is too large to meet the quantization precision requirement, the electronic device determines that the preset adjustment condition is met.
In some implementations, the electronic device may count the number of quantization errors in the set of quantization errors that are greater than a second preset error threshold; here, the second preset error threshold is smaller than the first preset error threshold. When the number of quantization errors greater than the second preset error threshold is larger than an error quantity threshold, indicating that the quantization precision of many network layers in the initial model is unsatisfactory, the electronic device determines that the preset adjustment condition is reached.
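The two trigger conditions above (the max-error test and the count test) can be combined as in this sketch; the argument names and threshold values are illustrative:

```python
def needs_adjustment(errors, first_threshold, second_threshold, count_threshold):
    """Preset adjustment condition: trip on one very large per-layer error,
    or on too many moderately large errors (second_threshold < first_threshold).
    """
    if max(errors) > first_threshold:            # one layer is far too inaccurate
        return True
    n_large = sum(1 for e in errors if e > second_threshold)
    return n_large > count_threshold             # too many mediocre layers
```

Either condition alone is enough to request an error adjustment; the text also allows using only one of the two tests.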
In some embodiments, the error adjustment at least comprises: at least one of a model cross-layer averaging algorithm, a bias modification algorithm, and an adaptive rounding algorithm.
The model cross-layer averaging algorithm is used to achieve precision similar to channel-by-channel quantization by averaging the numerical distribution range of each channel when the deployment back end uses a non-channel-by-channel quantization mode, and can solve the precision problem caused by excessively large differences in the numerical distribution ranges of the channels on non-channel-by-channel quantization hardware; the offset correction algorithm is a lightweight algorithm with a wide application range, can be used to reduce the quantization error of the model, and runs quickly; the adaptive rounding algorithm can fit the errors between the quantized model and the full-precision model to the maximum extent.
In some embodiments, the electronic device may evaluate the quantization precision of the initial model by any one of the quantization error maximum values and the quantization error quantity statistics, and determine whether a preset adjustment condition is reached, or evaluate and determine by a combination of the two methods, specifically select according to an actual situation, which is not limited in the embodiments of the present disclosure.
In some embodiments, the electronic device may obtain, when it is determined that the preset adjustment condition is reached, a type of a quantization mode supported by a platform according to a platform type of a target deployment platform, and select and determine, as an error adjustment mode, a suitable error adjustment algorithm from at least one error adjustment algorithm pre-integrated in the electronic device according to the type of the quantization mode supported by the platform.
And S1033, adjusting the network parameters of the initial model by adopting an error adjusting mode to obtain a quantitative model.
In the embodiment of the disclosure, the electronic device may adjust the network parameters, such as network weights, of the initial model in an error adjustment manner to obtain new model parameters, store the new model parameters, and obtain the quantization model, thereby significantly reducing the accuracy error of the quantization model.
It is understood that the embodiments of the present disclosure may determine an appropriate error adjustment manner by providing a selection of a plurality of error adjustment manners to reduce the quantization error of the model. In practical use, one or more algorithms can be selected for use in a targeted manner, for example, a model with high accuracy loss can use a higher-level algorithm, a model with low accuracy loss can use a lower-level algorithm, and the accuracy of the quantized model can be remarkably improved by reasonably using an error adjustment algorithm.
In some embodiments, when it is determined that the preset adjustment condition is reached by performing error numerical analysis and/or error distribution analysis on the quantization error set, the electronic device may determine, based on the results of the error numerical analysis and/or the error distribution analysis, a target network layer in which the quantization errors are concentrated and/or whose quantization error values have a large influence on the model precision, and may then select an error adjustment algorithm for the target network layer to adjust the network parameters of the initial model to obtain the quantized model. In this way, the target network layer is located through error analysis and then adjusted, which can reduce the amount of computation for adjusting the whole initial network and improve the efficiency of model quantization and adjustment.
In some embodiments, after S1031, as shown in fig. 5, the electronic device may further perform S1034 to adjust the network parameters of the initial model to obtain a quantized model, which will be described with reference to the steps.
S1034, under the condition that a preset adjusting condition is determined to be reached by carrying out error numerical analysis and/or error distribution analysis on the quantization error set, carrying out tensor comparison on the initial data set and the quantization data set, and adjusting the network parameters of the initial model to obtain the quantization model.
In the embodiment of the present disclosure, the electronic device determines that the quantization error set reaches the preset adjustment condition, which is consistent with the description in S1032, and is not described here again. Under the condition that the quantization error set reaches the preset adjustment condition, the electronic device can perform comparative analysis on the initial data set and the quantization data set from the tensor dimension, and then adjust the network parameters of the initial model to obtain the quantization model.
In the embodiment of the disclosure, a tensor represents the data processed by the network model and has the attributes of numerical value, dimension and size. The electronic device can compare, in the tensor dimension, the difference between the tensors output by the initial model and the tensors output by the quantized model, locate the network position and cause that affect the model quantization precision, and select the corresponding error adjustment algorithm to adjust the network parameters of the initial model according to the located cause.
In some embodiments, based on fig. 5, as shown in fig. 6, S1034 may be implemented by performing S401-S403, which will be described in conjunction with the steps.
S401, obtaining tensor distribution information by comparing tensor numerical distribution of the initial data set and the quantized data set; and/or obtaining quantized grouping information by comparing tensor scale information of the initial data set and the quantized data set; the tensor scale characterizes the grouping form used for quantization.
In the embodiment of the disclosure, the electronic device may obtain the numerical distribution of the tensors corresponding to the initial data set and the numerical distribution of the tensors corresponding to the quantized data set, and compare them in the numerical distribution dimension to obtain the distribution characteristics of the tensors output by the initial model and the quantized model respectively; the comparison yields the tensor distribution information, based on which the network layers whose output tensor value distributions differ greatly between the initial model and the quantized model can be located.
In the embodiment of the present disclosure, the electronic device may also compare tensor scale information of the initial data set with tensor scale information of the quantized data set, obtain a grouping form for quantizing each network layer in the quantization process, and use the grouping form as quantization grouping information, thereby determining whether a quantization error is caused by quantization grouping, and thus locating an error cause.
S402, tensor distribution information and/or quantized grouping information are used as error expression information, and an error adjusting mode is determined based on the error expression information or the combination of the error expression information and the platform type of the target deployment platform.
In the embodiment of the disclosure, the electronic device uses tensor distribution information and/or quantized grouping information as error performance information, and further determines an error adjustment mode by performing comprehensive analysis of error positioning and error reasons based on the error performance information itself or by combining the error performance information with a platform type of a target deployment platform.
In some embodiments, determining the error adjustment manner based on the error performance information or based on a combination of the error performance information and the platform type of the target deployment platform in S402 may include: determining an error adjustment algorithm acting on quantization error distribution dimensions and/or quantization grouping dimensions as an error adjustment mode based on numerical distribution information and/or quantization grouping information of quantization errors contained in the error expression information; or determining an error adjustment algorithm as an error adjustment mode based on the error expression information and in combination with the network level scale represented by the platform type.
In some embodiments, determining the error adjustment manner in S402 based on the error performance information or based on the combination of the error performance information and the platform type of the target deployment platform may be implemented by at least one of the methods of S402-11 to S402-12, S402-21, and S402-31 to S402-33, which will be described in conjunction with the steps.
S402-11, obtaining distribution characteristic difference of tensor values of the initial data set and the quantized data set in each channel according to tensor distribution information in the error expression information.
S402-12, under the condition that the distribution characteristic difference meets the preset tensor distribution adjustment condition, determining a model cross-layer averaging algorithm as an error adjustment mode.
In an embodiment of the present disclosure, the preset tensor distribution adjustment condition may be that a difference between tensors output by the initial data set and the quantized data set in the numerical distribution dimension is greater than or equal to a preset tensor distribution difference.
In some embodiments, each network layer in the initial model and the quantized model corresponds to at least one channel. The electronic device can obtain, according to the tensor distribution information, the numerical distribution range of the tensor output by each channel in each network layer of the initial model and the numerical distribution range of the tensor output by each channel in each network layer of the quantized model, and obtain the distribution characteristic difference of the tensors by comparing the numerical distribution ranges of the tensors over the channels. When the distribution characteristic difference satisfies the preset tensor distribution adjustment condition, the model cross-layer averaging algorithm is determined as the error adjustment mode; the model cross-layer averaging algorithm averages the numerical distribution range of the tensor output by each channel of the initial model, reducing the precision problem caused by excessively large differences in the numerical distribution ranges of the channels.
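Checking whether per-channel ranges differ enough to trigger cross-layer averaging might look like the sketch below; the channel-axis convention and the use of a max/min ratio as the difference measure are assumptions:

```python
import numpy as np

def channel_range_spread(tensor, channel_axis=0):
    """Per-channel value ranges and their max/min ratio; a large ratio is one
    possible 'distribution characteristic difference' signal for applying the
    model cross-layer averaging algorithm.
    """
    t = np.moveaxis(np.asarray(tensor, dtype=float), channel_axis, 0)
    t = t.reshape(t.shape[0], -1)
    ranges = t.max(axis=1) - t.min(axis=1)   # numerical range per channel
    return ranges, float(ranges.max() / max(ranges.min(), 1e-12))
```

A ratio near 1 means the channels share a similar range and per-tensor quantization is likely adequate; a large ratio suggests equalizing the channels first.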
S402-21, under the condition that the quantized grouping information in the error performance information represents that at least two quantized groupings exist, determining an offset correction algorithm as an error adjustment mode.
In the embodiment of the disclosure, when the quantized grouping information in the error performance information indicates that at least two quantization groups exist, the electronic device may determine that the error is caused by grouped quantization, since the grouping granularity of the weights in the quantization process can affect the quantization accuracy. The electronic device then uses the offset correction algorithm as the error adjustment mode, so as to reduce the effect of the grouping granularity on the quantization accuracy through the offset correction algorithm, thereby improving the quantization accuracy.
S402-31, determining the preset quantization precision corresponding to the platform type under the condition that the distribution characteristic difference meets the preset tensor distribution adjustment condition or the quantized grouping information represents that at least two quantized groupings exist.
In the embodiment of the present disclosure, when the distribution characteristic difference satisfies the preset tensor distribution adjustment condition, or when the quantized grouping information represents that at least two quantized groups exist, the electronic device may further determine, according to the platform type of the target deployment platform, a requirement of the target deployment platform for quantization accuracy, that is, preset quantization accuracy.
S402-32, under the condition that the preset quantization precision is higher than the preset precision threshold value, determining an adaptive rounding algorithm as an error adjusting mode.
In the embodiment of the disclosure, when the preset quantization precision is higher than the preset precision threshold, it indicates that the target deployment platform has higher precision requirements, and the electronic device may determine the adaptive rounding algorithm as the error adjustment mode, so as to use the training adjustment process included in the adaptive rounding algorithm to fit the quantization errors between the quantized model and the initial model to the maximum extent and meet the requirements of the high-precision target deployment platform.
And S403, adjusting the network parameters of the initial model by adopting an error adjustment mode to obtain a quantitative model.
Here, the process of S403 is consistent with the description of S1033, and is not described here again.
It can be understood that, by supporting multiple network hierarchical scales and multiple algorithms, the model quantization precision loss conditions of different degrees can be dealt with, the algorithms are selected in a targeted manner according to the characteristics of different platforms, and the model quantization precision is improved.
Next, with reference to fig. 7, an application of the embodiment of the present disclosure in a practical scenario is described. The model quantization method in the embodiment of the disclosure can be applied to a quantization tool, so that the trained original full-precision model is subjected to model quantization through the quantization tool, and meanwhile, quantization parameters required by deployment to a target deployment platform are generated.
S501, selecting an activation value algorithm, and performing activation value statistics and calculation of calculation values of the activation values.
In S501, the electronic device obtains the trained initial model to be quantized and the quantized calibration data, performs multiple rounds of model inference according to different activation value calibration algorithms (Minmax/Hist/KL, etc.), and stores the intermediate data output by each network layer as the statistics required by the algorithms. After the inference is finished, the overall statistical result of the quantized data for each layer of the model is obtained, and the cutoff value of the activation value corresponding to each network layer is generated according to the activation value calibration algorithm.
S502, performing analog quantization according to a quantization mode of the rear end to be deployed, generating error analysis in the model quantization mode, and obtaining an accuracy analysis result.
In S502, after obtaining the cutoff value of the activation value of each network layer, the electronic device performs analog quantization on the initial model in combination with the quantization mode information, such as the zero-point coefficient, determined by the hardware and chip of the target deployment platform to be deployed. Illustratively, the analog quantization adopts a linear quantization mode, and the quantized data is calculated by formula (1).
In S502, during the analog quantization process, the error comparison between the network output of the quantized and deployed network model and the network output during training can be given to measure the precision loss of model quantization. Meanwhile, the tool gives the quantized output tensors obtained by layer-by-layer forward computation of the model and compares them with the output tensors of the original full-precision model, i.e. the initial model, through cosine error analysis, the numerical distribution of the tensors, the size of the tensors and other dimensional information, so that the position and cause that most affect the model quantization precision can be quickly found and used as the precision analysis result.
And S503, adopting an error adjusting algorithm for adjusting the model weight according to the precision analysis result.
In S503, when the precision analysis result indicates that the preset error adjustment condition is reached, the electronic device selects a suitable algorithm from a plurality of pre-integrated error adjustment algorithms to adjust the weight parameters of the initial model, so as to obtain the quantized model.
S504, according to the platform type of the target deployment platform, a quantitative parameter corresponding to the target deployment platform is generated.
In S504, the electronic device performs operations such as parameter screening, packing, serialization and the like on the model parameters of the quantization model according to the platform type of the target deployment platform, and generates quantization parameters corresponding to the target deployment platform.
It is appreciated that the quantization tool provided by the embodiments of the present disclosure can support multi-platform deployment, i.e., the same model and data can be flexibly deployed on multiple platforms without additional cost. In addition, the embodiment of the disclosure supports various quantization calibration algorithms, which can be flexibly provided for users to experiment with, greatly improving the accuracy of the model as well as the robustness and convenience of tool use. Moreover, the embodiment of the disclosure supports error analysis and numerical analysis of model quantization precision, so that when a user encounters a model precision problem, the network layer with a large error can be accurately located and targeted precision adjustment can be performed to improve the precision, realizing a high-precision, multi-platform, automated offline quantization model production tool.
The present disclosure further provides a model quantization apparatus, and fig. 8 is a schematic structural diagram of the model quantization apparatus provided in the embodiment of the present disclosure; as shown in fig. 8, the model quantizing device 1 includes:
the inference module 11 is configured to perform model inference by using an initial model to be quantized to obtain an initial data set; the initial data set comprises a cutoff value of an activation value output by each network layer in the initial model;
the simulation quantization module 12 is configured to perform simulation quantization on the initial model based on a quantization mode of a target deployment platform to obtain a quantization data set; the simulation quantization is used for quantizing and restoring the activation value output by each network layer in the initial model and then reasoning in the next network layer, and the quantization data set comprises a cutoff value of the activation value after simulation quantization of each network layer;
an error adjustment module 13, configured to calculate a quantization error set based on the initial data set and the quantization data set; adjusting the initial model based on the quantization error set to obtain a quantization model;
a parameter generating module 14, configured to generate a quantization parameter based on the truncated value of the activation value after each network layer simulation quantization; the quantitative parameters are used for deploying the quantitative model to the target deployment platform.
In some embodiments, the inference module 11 is further configured to obtain quantization calibration data; perform model inference on the quantization calibration data through each network layer in the initial model to obtain at least one activation value output by each network layer; and perform statistical truncation processing on the at least one activation value of each network layer by using at least one preset statistical algorithm to obtain the cutoff value of the activation value corresponding to each network layer, as the initial data set.
In some embodiments, the cutoff value of the activation value corresponding to each network layer includes any one of:
a maximum value of the at least one activation value, a minimum value of the at least one activation value, or a variance value of a median of the at least one activation value.
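The statistical truncation step described above can be sketched in plain Python as a minimal calibration pass that reduces a layer's collected activation values to a single cutoff value. This is an illustrative sketch, not the patent's implementation: the function name is invented, `"absmax"` is a common symmetric variant added for context, and `"median_var"` is one hypothetical reading of the median/variance-based statistic.

```python
import statistics

def calibrate_cutoff(layer_activations, method="absmax"):
    """Reduce a layer's collected activation values to one cutoff value.

    `method` picks one of the statistics named in the disclosure:
    "max", "min", or a median/variance-based statistic ("median_var",
    a hypothetical reading). "absmax" is a common symmetric variant.
    """
    values = [v for batch in layer_activations for v in batch]
    if method == "absmax":
        return max(abs(v) for v in values)  # largest magnitude seen
    if method == "max":
        return max(values)
    if method == "min":
        return min(values)
    if method == "median_var":
        return statistics.median(values) + statistics.pvariance(values) ** 0.5
    raise ValueError(f"unknown method: {method}")
```

In practice, one such cutoff is computed per network layer during calibration inference, and the collection of per-layer cutoffs forms the initial data set.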
In some embodiments, the analog quantization module 12 is further configured to, for each network layer in the initial model, obtain a quantization scale corresponding to the network layer according to the cutoff value of the activation value corresponding to that layer; perform inference on the (i-1)-th quantized data output by the (i-1)-th layer by using the i-th network layer in the initial model to obtain an i-th initial activation value, wherein i is a positive integer greater than or equal to 2, and the 1st quantized data is obtained by performing inference on the quantization calibration data with the first network layer in the initial model and applying quantization processing and scale reduction to the first initial activation value obtained by the inference; and according to the quantization scale corresponding to the i-th network layer, in combination with the quantization mode and a preset quantization precision, perform quantization processing and scale reduction on the i-th initial activation value to obtain the i-th quantized data, thereby realizing analog quantization of the i-th network layer, until each network layer in the initial model has been analog-quantized to obtain the quantization data set.
In some embodiments, the quantization mode comprises a zero coefficient; the analog quantization module 12 is further configured to round the ratio of the i-th initial activation value to the quantization scale to obtain an i-th initial quantization value; adjust the i-th initial quantization value by using the zero coefficient to obtain an i-th intermediate quantization value; truncate the i-th intermediate quantization value by using the preset quantization precision to obtain an i-th quantization value, completing the quantization processing; and take the product of the i-th quantization value and the quantization scale as the i-th quantized data, completing the scale reduction.
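The per-layer quantization scale and the quantize/restore round trip can be sketched as follows. This is a hedged sketch of standard uniform quantization under stated assumptions: the scale formula, the default 8-bit precision, and the subtraction of the zero coefficient before rescaling are common-practice details, not stated verbatim in the patent.

```python
def quant_scale(cutoff, bits=8):
    """Derive a quantization scale from a layer's activation cutoff value
    (assumed symmetric mapping of the cutoff onto the signed integer range)."""
    return cutoff / ((1 << (bits - 1)) - 1)

def fake_quantize(x, scale, zero_point=0, bits=8):
    """Simulate one activation's quantize -> restore round trip."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = round(x / scale)            # 1. round the ratio to the quantization scale
    q += zero_point                 # 2. adjust by the zero coefficient
    q = max(qmin, min(qmax, q))     # 3. truncate with the preset precision
    # 4. restore the original scale (removing the zero shift here is an
    #    assumption beyond the literal text, but standard for asymmetric schemes)
    return (q - zero_point) * scale
```

With an 8-bit cutoff of 12.7, for example, `quant_scale(12.7)` yields a scale of about 0.1, and activations beyond the cutoff saturate at the truncation bounds rather than overflowing.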
In some embodiments, the error adjustment module 13 is configured to calculate the cosine distance between each initial data in the initial data set and the corresponding quantized data in the quantization data set as the quantization error of each network layer, thereby obtaining the quantization error set.
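The per-layer quantization error can be computed as a cosine distance, i.e., one minus the cosine similarity between the float and simulated-quantized outputs. A minimal sketch follows; representing the layer outputs as flat lists of floats is an assumption for illustration.

```python
import math

def cosine_quant_error(initial, quantized):
    """Cosine distance between a layer's float and simulated-quantized outputs.

    0.0 means the quantized tensor points in the same direction as the
    float tensor; larger values flag layers that quantize poorly.
    """
    dot = sum(a * b for a, b in zip(initial, quantized))
    norm_i = math.sqrt(sum(a * a for a in initial))
    norm_q = math.sqrt(sum(b * b for b in quantized))
    if norm_i == 0.0 or norm_q == 0.0:
        # degenerate all-zero tensors: identical -> no error, else max error
        return 0.0 if norm_i == norm_q else 1.0
    return 1.0 - dot / (norm_i * norm_q)
```

Computing this distance for every network layer yields the quantization error set used by the subsequent analysis and adjustment steps.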
In some embodiments, the error adjustment module 13 is further configured to determine an error adjustment mode according to the platform type of the target deployment platform when it is determined, by performing error numerical analysis and/or error distribution analysis on the quantization error set, that a preset adjustment condition is reached; and to adjust the network parameters of the initial model using the error adjustment mode to obtain the quantization model.
In some embodiments, the error adjusting module 13 is further configured to, when it is determined by performing error numerical analysis and/or error distribution analysis on the quantization error set that a preset adjustment condition is reached, adjust the network parameters of the initial model based on the result of the error numerical analysis and/or the error distribution analysis, so as to obtain the quantization model.
In some embodiments, the error adjusting module 13 is further configured to, when it is determined that a preset adjustment condition is reached by performing error numerical analysis and/or error distribution analysis on the quantization error set, perform tensor comparison on the initial data set and the quantization data set, and adjust a network parameter of the initial model to obtain the quantization model.
In some embodiments, the error adjusting module 13 is further configured to obtain tensor distribution information by comparing the tensor numerical distributions of the initial data set and the quantized data set; and/or obtain quantized grouping information by comparing tensor scale information of the initial data set and the quantized data set, the tensor scale characterizing the grouping form used for quantization; take the tensor distribution information and/or the quantized grouping information as error performance information, and determine an error adjustment mode based on the error performance information or on the combination of the error performance information and the platform type of the target deployment platform; and adjust the network parameters of the initial model using the error adjustment mode to obtain the quantization model.
In some embodiments, the error adjusting module 13 is further configured to obtain the distribution characteristic difference between the tensor values of each channel of the initial data set and the quantized data set according to the tensor distribution information in the error performance information; determine a model cross-layer averaging algorithm as the error adjustment mode in the case that the distribution characteristic difference meets a preset tensor distribution adjustment condition; and/or determine an offset correction algorithm as the error adjustment mode in the case that the quantization grouping information in the error performance information characterizes that at least two quantization groups exist; and/or determine a preset quantization precision corresponding to the platform type in the case that the distribution characteristic difference meets the preset tensor distribution adjustment condition or the quantization grouping information characterizes that at least two quantization groups exist, and determine an adaptive rounding algorithm as the error adjustment mode when the preset quantization precision is higher than a preset precision threshold.
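The selection logic above amounts to a small dispatch over the error-performance signals. The sketch below uses illustrative threshold values and signal names, none of which are specified in the patent; cross-layer equalization, bias correction, and AdaRound-style adaptive rounding are assumed to be the standard names of the algorithm families referenced.

```python
def choose_error_adjustment(channel_dist_gap, num_quant_groups, preset_bits,
                            dist_gap_threshold=0.5, precision_threshold=4):
    """Map error-performance signals to candidate adjustment algorithms.

    All thresholds are illustrative placeholders, not values from the patent.
    """
    modes = []
    if channel_dist_gap > dist_gap_threshold:
        # large per-channel distribution difference -> cross-layer averaging
        modes.append("cross_layer_equalization")
    if num_quant_groups >= 2:
        # at least two quantization groups -> offset (bias) correction
        modes.append("bias_correction")
    if modes and preset_bits > precision_threshold:
        # a trigger fired and precision is high enough -> adaptive rounding
        modes.append("adaptive_rounding")
    return modes
```

The returned list can then be applied to the initial model's network parameters to produce the quantization model.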
In some embodiments, the error adjusting module 13 is further configured to determine that a preset adjusting condition is reached if a maximum value of quantization errors included in the quantization error set is greater than a first preset error threshold; and/or determining that the preset adjustment condition is reached under the condition that the number of quantization errors larger than a second preset error threshold value in the quantization error set is larger than an error number threshold value; the second preset error threshold is smaller than the first preset error threshold.
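The two trigger conditions just described can be sketched as a single predicate. The threshold values below are illustrative assumptions; only the ordering constraint (the second threshold smaller than the first) comes from the text.

```python
def reaches_adjustment_condition(quant_errors,
                                 first_threshold=0.05,
                                 second_threshold=0.01,
                                 count_threshold=3):
    """True when the quantization error set triggers either preset condition."""
    assert second_threshold < first_threshold  # ordering required by the text
    if max(quant_errors) > first_threshold:    # condition 1: worst single layer
        return True
    over = sum(1 for e in quant_errors if e > second_threshold)
    return over > count_threshold              # condition 2: too many bad layers
```

Only when this predicate fires does the error adjustment module go on to select and apply an error adjustment mode.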
In some embodiments, the parameter generating module 14 is configured to perform at least one of parameter screening, serialization, and packaging on the cutoff value of the activation value after simulation quantization of each network layer according to a preset quantization index of the target deployment platform, so as to generate the quantization parameter.
It should be noted that the above description of the embodiment of the apparatus, similar to the description of the embodiment of the method, has similar beneficial effects as the embodiment of the method. For technical details not disclosed in the embodiments of the apparatus of the present disclosure, reference is made to the description of the embodiments of the method of the present disclosure.
An embodiment of the present disclosure further provides an electronic device, fig. 9 is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure, and as shown in fig. 9, the electronic device 2 includes: a memory 22 and a processor 23, wherein the memory 22 and the processor 23 are connected by a communication bus 24; a memory 22 for storing executable instructions; the processor 23, when executing the executable instructions stored in the memory 22, implements the method provided by the embodiments of the present disclosure, for example, the model quantization method provided by the embodiments of the present disclosure.
The embodiment of the present disclosure provides a computer-readable storage medium storing executable model quantization instructions which, when executed, cause the processor 23 to implement the method provided by the embodiments of the present disclosure, for example, the model quantization method provided by the embodiments of the present disclosure.
In some embodiments of the present disclosure, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments of the present disclosure, the executable model quantization instructions may be written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages) in the form of a program, software module, script, or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable model quantization instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, the executable model quantization instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site, or distributed across multiple sites and interconnected by a communication network.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure.

Claims (17)

1. A method of model quantization, comprising:
performing model reasoning by using an initial model to be quantized to obtain an initial data set; the initial data set comprises a cutoff value of an activation value output by each network layer in the initial model;
performing analog quantization on the initial model based on a quantization mode of a target deployment platform to obtain a quantization data set; the simulation quantization is used for quantizing and restoring the activation value output by each network layer in the initial model before inference in the next network layer, and the quantization data set comprises a cutoff value of the activation value after simulation quantization of each network layer;
calculating a set of quantization errors based on the initial data set and the set of quantized data; adjusting the initial model based on the quantization error set to obtain a quantization model;
generating a quantization parameter based on the cutoff value of the activation value after each network layer simulation quantization; the quantization parameters are used for deploying the quantization model to the target deployment platform.
2. The method of claim 1, wherein the performing model inference using the initial model to be quantized to obtain an initial data set comprises:
acquiring quantization calibration data;
performing model inference on the quantization calibration data through each network layer in the initial model to obtain at least one activation value output by each network layer;
and performing statistical truncation processing on the at least one activation value of each network layer by using at least one preset statistical algorithm to obtain a cutoff value of the activation value corresponding to each network layer, wherein the cutoff value is used as the initial data set.
3. The method according to claim 2, wherein the cutoff value for the activation value corresponding to each network layer comprises any one of:
a maximum value of the at least one activation value, a minimum value of the at least one activation value, or a variance value of a median of the at least one activation value.
4. The method according to any one of claims 1 to 3, wherein the performing simulation quantization on the initial model based on the quantization mode of the target deployment platform to obtain a quantized data set comprises:
for each network layer in the initial model, obtaining a quantization scale corresponding to each network layer according to the cutoff value of the activation value corresponding to each network layer;
performing inference on the (i-1)-th quantized data output by the (i-1)-th layer by using the i-th network layer in the initial model to obtain an i-th initial activation value; wherein i is a positive integer greater than or equal to 2; the 1st quantized data is obtained by performing inference on the quantization calibration data with the first network layer in the initial model, and performing quantization processing and scale reduction on the first initial activation value obtained by the inference;
and according to the quantization scale corresponding to the i-th network layer, in combination with the quantization mode and a preset quantization precision, performing quantization processing and scale reduction on the i-th initial activation value to obtain the i-th quantized data, thereby realizing analog quantization of the i-th network layer, until each network layer in the initial model has been subjected to analog quantization to obtain the quantization data set.
5. The method of claim 4, wherein the quantization mode comprises a zero coefficient, and the performing quantization processing and scale reduction on the i-th initial activation value according to the quantization scale corresponding to the i-th network layer, in combination with the quantization mode and preset quantization precision, to obtain the i-th quantization data comprises:
rounding the ratio of the ith initial activation value to the quantization scale to obtain an ith initial quantization value;
adjusting the ith initial quantization value by using a zero coefficient to obtain an ith intermediate quantization value;
utilizing the preset quantization precision to truncate the ith intermediate quantization value to obtain an ith quantization value, and finishing the quantization processing process;
and taking the product of the ith quantization value and the quantization scale as the ith quantization data to complete the scale reduction process.
6. The method of claim 1 or 5, wherein computing a set of quantization errors based on the initial set of data and the set of quantized data comprises:
and calculating the cosine distance between each initial data in the initial data set and each quantized data in the quantized data set as the quantization error corresponding to each network layer, thereby obtaining the quantization error set.
7. The method according to claim 1 or 5, wherein the adjusting the initial model based on the quantization error set to obtain a quantization model comprises:
determining an error adjustment mode according to the platform type of the target deployment platform under the condition that a preset adjustment condition is determined to be reached by carrying out error numerical analysis and/or error distribution analysis on the quantization error set;
and adjusting the network parameters of the initial model by adopting the error adjustment mode to obtain the quantization model.
8. The method according to claim 1 or 5, wherein the adjusting the initial model based on the quantization error set to obtain a quantization model comprises:
and under the condition that a preset adjusting condition is determined to be reached by carrying out error numerical analysis and/or error distribution analysis on the quantization error set, adjusting the network parameters of the initial model based on the analysis result of the error numerical analysis and/or the error distribution to obtain the quantization model.
9. The method according to claim 1 or 5, wherein the adjusting the initial model based on the quantization error set to obtain a quantization model comprises:
and under the condition that a preset adjusting condition is determined to be met by carrying out error numerical analysis and/or error distribution analysis on the quantization error set, carrying out tensor comparison on the initial data set and the quantization data set, and adjusting network parameters of the initial model to obtain the quantization model.
10. The method of claim 9, wherein the adjusting network parameters of the initial model by tensor comparison of the initial data set and the quantized data set to obtain the quantized model comprises:
obtaining tensor distribution information by comparing tensor numerical distributions of the initial data set and the quantized data set;
and/or the like, and/or,
obtaining quantized grouping information by comparing tensor scale information of the initial data set and the quantized data set; the tensor scale characterizes a grouped form for quantization;
taking the tensor distribution information and/or the quantized grouping information as error performance information, and determining an error adjustment mode based on the error performance information or the combination of the error performance information and the platform type of the target deployment platform;
and adjusting the network parameters of the initial model by adopting the error adjustment mode to obtain the quantization model.
11. The method of claim 10, wherein the determining an error adjustment mode based on the error performance information or based on a combination of the error performance information and the platform type of the target deployment platform comprises at least one of:
according to the tensor distribution information in the error performance information, obtaining a distribution characteristic difference between tensor values of the initial data set and the quantized data set in each channel;
determining a model cross-layer averaging algorithm as the error adjustment mode under the condition that the distribution characteristic difference meets a preset tensor distribution adjustment condition;
determining an offset correction algorithm as the error adjustment mode in the case that the quantization grouping information in the error performance information characterizes that at least two quantization groups exist;
determining a preset quantization precision corresponding to the platform type in the case that the distribution characteristic difference meets the preset tensor distribution adjustment condition, or the quantization grouping information characterizes that at least two quantization groups exist;
and under the condition that the preset quantization precision is higher than a preset precision threshold, determining an adaptive rounding algorithm as the error adjusting mode.
12. The method according to any one of claims 7 to 11, wherein determining that a preset adjustment condition is reached by performing an error numerical analysis and/or an error distribution analysis on the quantization error set comprises at least one of:
determining that a preset adjustment condition is reached under the condition that the maximum value of quantization errors contained in the quantization error set is greater than a first preset error threshold;
determining that the preset adjustment condition is reached under the condition that the number of quantization errors larger than a second preset error threshold value in the quantization error set is larger than an error number threshold value; the second preset error threshold is smaller than the first preset error threshold.
13. The method of any one of claims 1, 7 or 11, wherein the generating a quantization parameter based on the cutoff value of the activation value after simulation quantization of each network layer comprises:
performing at least one of parameter screening, serialization and packaging on the cutoff value of the activation value after simulation quantization of each network layer according to a preset quantization index of the target deployment platform to generate the quantization parameter.
14. A model quantization apparatus, comprising:
the reasoning module is used for carrying out model reasoning by utilizing the initial model to be quantized to obtain an initial data set; the initial data set comprises a cutoff value of an activation value output by each network layer in the initial model;
the simulation quantization module is used for performing simulation quantization on the initial model based on a quantization mode of a target deployment platform to obtain a quantization data set; the simulation quantization is used for quantizing and restoring the activation value output by each network layer in the initial model and then reasoning in the next network layer, and the quantization data set comprises a cutoff value of the activation value after simulation quantization of each network layer;
an error adjustment module for calculating a set of quantization errors based on the initial data set and the set of quantized data; adjusting the initial model based on the quantization error set to obtain a quantization model;
a parameter generating module, configured to generate a quantization parameter based on the cutoff value of the activation value after each network layer simulation quantization; the quantization parameters are used for deploying the quantization model to the target deployment platform.
15. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 13 when executing executable instructions stored in the memory.
16. A computer-readable storage medium having stored thereon executable instructions for causing a processor, when executing, to implement the method of any one of claims 1 to 13.
17. A computer program product comprising a computer program or instructions, wherein the computer program or instructions, when executed by a processor, implement the method of any one of claims 1 to 13.
CN202210199396.1A 2022-03-02 2022-03-02 Model quantization method, device, apparatus, computer program and storage medium Pending CN114580280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210199396.1A CN114580280A (en) 2022-03-02 2022-03-02 Model quantization method, device, apparatus, computer program and storage medium


Publications (1)

Publication Number Publication Date
CN114580280A true CN114580280A (en) 2022-06-03

Family

ID=81776826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210199396.1A Pending CN114580280A (en) 2022-03-02 2022-03-02 Model quantization method, device, apparatus, computer program and storage medium

Country Status (1)

Country Link
CN (1) CN114580280A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523045A (en) * 2023-03-13 2023-08-01 之江实验室 Deep learning reasoning simulator oriented to multi-core chip
CN116523045B (en) * 2023-03-13 2023-11-07 之江实验室 Deep learning reasoning simulator oriented to multi-core chip
CN116011569A (en) * 2023-03-28 2023-04-25 山东浪潮科学研究院有限公司 Quantization error debugging method, device, equipment and storage medium
CN116011569B (en) * 2023-03-28 2023-07-18 山东浪潮科学研究院有限公司 Quantization error debugging method, device, equipment and storage medium
CN116759311A (en) * 2023-08-16 2023-09-15 北京市天润中电高压电子有限公司 Manufacturing method of semiconductor avalanche high-voltage diode
CN116759311B (en) * 2023-08-16 2023-11-14 北京市天润中电高压电子有限公司 Manufacturing method of semiconductor avalanche high-voltage diode


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination