CN114065913A - Model quantization method and device and terminal equipment

Model quantization method and device and terminal equipment

Info

Publication number
CN114065913A
Authority
CN
China
Prior art keywords
quantized
layer
quantization
target
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111264248.5A
Other languages
Chinese (zh)
Inventor
刘勇
蔡万伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Yuntian Lifei Technology Co ltd
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Jiangsu Yuntian Lifei Technology Co ltd
Shenzhen Intellifusion Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Yuntian Lifei Technology Co ltd, Shenzhen Intellifusion Technologies Co Ltd filed Critical Jiangsu Yuntian Lifei Technology Co ltd
Priority to CN202111264248.5A
Publication of CN114065913A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application is applicable to the technical field of model quantization and provides a model quantization method, a device, a terminal device and a storage medium, wherein the model quantization method comprises the following steps: processing input data through a floating point model to obtain target output; for each layer to be quantized, carrying out quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized, and obtaining a quantization layer corresponding to the corresponding layer to be quantized; processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output; according to the second output and the first output, optimizing the quantization function corresponding to the corresponding layer to be quantized to obtain a target quantization function corresponding to the corresponding layer to be quantized; and quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model. By the method, the accuracy of the quantization model can be improved.

Description

Model quantization method and device and terminal equipment
Technical Field
The present application belongs to the field of model quantization technologies, and in particular, to a model quantization method, apparatus, terminal device, and computer-readable storage medium.
Background
Artificial intelligence technology has developed rapidly in recent years and has continuously penetrated into application fields represented by computer vision, natural language processing, and speech recognition. However, in practical application scenarios, the huge data volume and computational complexity of deep learning models place severe demands on hardware computing power. Quantization methods for deep learning models have therefore emerged. Quantization can reduce the memory occupied by a neural network model, improve data throughput, and thus reduce inference latency. However, compared with the deep learning model before quantization, a quantization model usually introduces a large precision loss and an increased calculation error.
Disclosure of Invention
In view of this, embodiments of the present application provide a model quantization method, apparatus, terminal device, and computer-readable storage medium, which can improve accuracy of a quantization model.
In a first aspect, an embodiment of the present application provides a model quantization method, including:
processing input data through a floating point model to obtain target output, wherein the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, and each layer to be quantized comprises at least one node to be quantized;
for each layer to be quantized, carrying out quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized, and obtaining a quantization layer corresponding to the corresponding layer to be quantized;
processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output;
according to the second output and the first output, optimizing the quantization function corresponding to the corresponding layer to be quantized to obtain a target quantization function corresponding to the corresponding layer to be quantized;
and quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model.
In a second aspect, an embodiment of the present application provides a model quantization apparatus, including:
the device comprises a first processing module, a second processing module and a third processing module, wherein the first processing module is used for processing input data through a floating point model to obtain target output, the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, and each layer to be quantized comprises at least one node to be quantized;
the first quantization module is used for performing quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized aiming at each layer to be quantized, so as to obtain a quantization layer corresponding to the corresponding layer to be quantized;
the second processing module is used for processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output;
the optimization module is used for optimizing the quantization function corresponding to the corresponding layer to be quantized according to the second output and the first output to obtain a target quantization function corresponding to the corresponding layer to be quantized;
and the second quantization module is used for quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the model quantization method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the model quantization method according to the first aspect.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the model quantization method described in the first aspect.
Compared with the prior art, the embodiment of the application has the advantages that: in the embodiment of the application, input data can be processed through a floating point model to obtain a target output, the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, each layer to be quantized comprises at least one node to be quantized, and for each layer to be quantized, quantization processing is performed on the corresponding layer to be quantized according to a quantization function of the node to be quantized in the corresponding layer to be quantized to obtain a quantization layer corresponding to the corresponding layer to be quantized; processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output; according to the second output and the first output, optimizing the quantization function corresponding to the corresponding layer to be quantized to obtain a target quantization function corresponding to the corresponding layer to be quantized; and quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model. At this time, for each quantization layer, based on the information of the second output of the corresponding quantization layer relative to the first output of the corresponding layer to be quantized, the output condition of the quantization layer in the quantization model can be known relatively comprehensively, so that the quantization function in each quantization layer is optimized in a targeted manner, the hierarchical refinement and optimization of the quantization model are realized, a target quantization model which is more in line with the expectation is obtained, the precision of the quantization model is improved, and the calculation error is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic flow chart of an implementation of a model quantization method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of an optimization of a quantization layer according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a model quantization apparatus provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a terminal device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
Before explaining the embodiments of the present application, a brief description will be first made of some terms in the embodiments of the present application.
The following specifically describes the examples of the present application.
The model quantification method provided by the embodiment of the application can be applied to terminal equipment.
The terminal device may be, for example, a server, a desktop computer, a mobile phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), or the like, and the specific type of the terminal device is not limited in this embodiment.
Specifically, as shown in fig. 1, the model quantization method may include:
step S101, processing input data through a floating point model to obtain target output, wherein the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, and each layer to be quantized comprises at least one node to be quantized.
The floating point model is a deep learning model, and the specific structure of the floating point model is not limited herein. In some examples, the floating point model may include at least one of a convolution layer, a pooling layer, a fully-connected layer, and an activation layer. The floating point model may be used for image processing or text processing. For example, the floating point model may be used for target detection or classification of images, or for text recognition, etc.
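For concreteness, such a floating-point model could be sketched as follows. This is an illustrative toy model in PyTorch; the framework, layer sizes, and the assumed 3x32x32 image input are choices made for illustration, not prescribed by this application:

    import torch.nn as nn

    float_model = nn.Sequential(            # toy floating-point model
        nn.Conv2d(3, 8, kernel_size=3),     # convolution layer
        nn.ReLU(),                          # activation layer
        nn.MaxPool2d(2),                    # pooling layer
        nn.Flatten(),
        nn.Linear(8 * 15 * 15, 10),         # fully-connected layer
    )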
The data in the floating-point model, such as the corresponding weights, input data, and output data, may be floating-point data, for example, data of a float type or a double type.
The input data may contain images, video, and text, among other things. The specific data format of the input data is not limited herein. For example, the input data may be a pixel matrix corresponding to an image, or a data format meeting the input requirement of a floating point model is obtained after the pixel matrix is preprocessed through a preprocessing operation corresponding to the floating point model. The preprocessing may include data processing operations such as denoising, normalization, and the like.
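As an illustration of such preprocessing, the following is a minimal sketch assuming an HxWx3 uint8 pixel matrix; the normalization scheme is an assumption for illustration, since the actual preprocessing is specific to the floating point model:

    import numpy as np

    def preprocess(image_u8):
        # Turn an HxWx3 uint8 pixel matrix into float input data.
        x = image_u8.astype(np.float32) / 255.0   # scale to [0, 1]
        x = (x - x.mean()) / (x.std() + 1e-8)     # simple normalization (illustrative)
        return x[np.newaxis]                      # add a batch dimension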
It should be noted that the number of the input data is not limited here, and for example, the input data may include a plurality of pieces of image data, in this case, the floating point model may process the plurality of pieces of image data respectively, and in the process, the processing data when the floating point model processes each piece of image data may be obtained.
The specific format of the floating-point model may be varied. Illustratively, the floating-point model may be an original model constructed based on a preset deep learning framework (such as caffe, tensorflow, pytorch, or mxnet); in addition, the original model may also be converted into a specified format to obtain the floating-point model, where the floating-point model may describe operators, weight values, topology structures, and the like, which are involved in the corresponding original model. The specified format may be determined based on actual application scenario requirements, and is not limited herein.
In general, a floating point model may include a plurality of layers, which may include at least one layer to be quantized, and each layer to be quantized may include at least one node to be quantized. The specific type of layer to be quantified is not limited herein. Illustratively, the layer to be quantized may include a convolutional layer, a fully-connected layer, and the like.
The specific position and attribute of the node to be quantized are not limited herein, and the node to be quantized may include nodes in the processing step of the floating point model, such as an input node and an output node of a specific layer in the floating point model, and may also include a parameter configuration node in the floating point model.
In the embodiment of the present application, the floating point model may process the input data through forward propagation to obtain the processed data at each node to be quantized, where the first input and the first output of each layer to be quantized in the floating point model may be included. When the input data is processed through the floating point model, the processing data at each node to be quantized can be collected and recorded.
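One possible way to collect and record this processing data is with forward hooks. The sketch below assumes a PyTorch floating-point model and, purely for illustration, treats convolution and fully-connected layers as the layers to be quantized:

    import torch
    import torch.nn as nn

    def collect_layer_io(model, input_data, layer_types=(nn.Conv2d, nn.Linear)):
        # Record the first input and first output of every layer to be
        # quantized while the float model processes the input data.
        records, hooks = {}, []

        def make_hook(name):
            def hook(module, inputs, output):
                records[name] = {"first_input": inputs[0].detach(),
                                 "first_output": output.detach()}
            return hook

        for name, module in model.named_modules():
            if isinstance(module, layer_types):
                hooks.append(module.register_forward_hook(make_hook(name)))

        with torch.no_grad():
            model(input_data)          # forward propagation over the input data
        for h in hooks:
            h.remove()
        return records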
In addition, in one example, a pseudo quantization operation may also be inserted at each node to be quantized of the floating-point model. Through pseudo quantization operation, each node to be quantized in the floating point model can be labeled, so that the node to be quantized can be quickly positioned in the subsequent model quantization process, and a quantization function at the node to be quantized can be efficiently determined.
It should be noted that, in this example, when the pseudo quantization operation is inserted at each node to be quantized, it is not necessary to actually calculate and quantize the quantization parameter in the corresponding quantization function, but it is used to label the node to be quantized, so as to facilitate the subsequent model quantization.
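A minimal sketch of such a labeling-only pseudo quantization operation, again assuming PyTorch, might be a pass-through module that merely tags the node:

    import torch.nn as nn

    class FakeQuantMarker(nn.Identity):
        # Labels a node to be quantized so it can be located quickly later;
        # no quantization parameter is computed or applied here.
        def __init__(self, node_name):
            super().__init__()
            self.node_name = node_name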
Step S102, for each layer to be quantized, according to the quantization function of the node to be quantized in the corresponding layer to be quantized, the corresponding layer to be quantized is quantized, and the quantization layer corresponding to the corresponding layer to be quantized is obtained.
In this embodiment, each layer to be quantized may be processed to obtain a quantization layer corresponding to each layer to be quantized, or the floating point model may be quantized according to a quantization function of each node to be quantized to obtain a quantization model, and then the quantization layer corresponding to each layer to be quantized is obtained in the quantization model.
In the embodiment of the present application, the specific form of the quantization function and the corresponding quantization parameter may be determined by a user in advance, or may be determined based on information such as the first output.
It should be noted that, in the embodiment of the present application, the quantization functions corresponding to the nodes to be quantized may be the same or different.
The structure of the quantization model is generally the same as that of the floating point model, and therefore, the to-be-quantized layer of the floating point model has a corresponding quantized layer in the quantization model, and the to-be-quantized node of the floating point model has a corresponding quantized node in the quantization model.
In some embodiments, before, for each layer to be quantized, performing quantization processing on the corresponding layer to be quantized according to a quantization function of a node to be quantized in the corresponding layer to be quantized, and obtaining a quantization layer corresponding to the corresponding layer to be quantized, the method further includes:
and aiming at each node to be quantized in the corresponding layer to be quantized, determining a quantization function of the corresponding node to be quantized according to target processing data, wherein the target processing data comprises processing data corresponding to the corresponding node to be quantized when the floating point model processes input data.
There are various specific ways to determine the quantization function according to the target processing data. Illustratively, a specified quantization parameter in a quantization function of a respective node to be quantized may be determined based on the target processing data. In one example, the quantization parameter may be a scaling factor that describes a data scaling factor at the time of quantization.
In the embodiment of the application, the corresponding quantization functions can be respectively determined for each node to be quantized in the layer to be quantized, so that the quantization precision is higher when the corresponding quantization operation is performed on each node to be quantized, and the performance of the quantization model is better.
In some embodiments, for each node to be quantized in the corresponding layer to be quantized, determining a quantization function of the corresponding node to be quantized according to target processing data, where the target processing data includes processing data corresponding to the corresponding node to be quantized when the floating point model processes input data, and the determining includes:
and aiming at each node to be quantized in the corresponding layer to be quantized, determining a quantization function of the corresponding node to be quantized according to the target processing data corresponding to the corresponding node to be quantized and the type of the target quantization data corresponding to the corresponding node to be quantized.
According to the target processing data corresponding to the corresponding node to be quantized, the value range of the data at the corresponding node to be quantized can be roughly estimated, and according to the target quantization data type corresponding to the corresponding node to be quantized, the range of the quantized data can be determined, so that according to the target processing data corresponding to the corresponding node to be quantized and the target quantization data type corresponding to the corresponding node to be quantized, the specified quantization parameters such as scaling factors in the quantization function of the corresponding node to be quantized can be determined, and the determined quantization function can meet the actual quantization requirement of the node to be quantized.
In some embodiments, the input data includes a plurality of sets of input subdata, for each node to be quantized, the target processing data includes a plurality of sets of processing data corresponding to the node to be quantized, and the plurality of sets of processing data correspond to the plurality of sets of input subdata one to one;
for each node to be quantized in the corresponding layer to be quantized, determining a quantization function of the corresponding node to be quantized according to the target processing data corresponding to the corresponding node to be quantized and the target quantization data type corresponding to the corresponding node to be quantized, including:
and determining a quantization function of the corresponding node to be quantized according to a first data range and a second data range for each node to be quantized in the corresponding layer to be quantized, wherein the first data range is determined based on the maximum value and/or the minimum value in the multiple groups of processing data corresponding to the corresponding node to be quantized, and the second data range is determined based on the value range of the target quantization data type corresponding to the corresponding node to be quantized.
In this embodiment, multiple sets of processing data of corresponding nodes to be quantized may be obtained based on multiple sets of input sub data, so as to approximately estimate a variation range of data at the nodes to be quantized, that is, a first data range.
And based on the target quantization data type corresponding to the corresponding node to be quantized, a second data range corresponding to the corresponding node to be quantized can be determined. For example, if the target quantization data type is the int8 type, whose value range is [-128, 127], the second data range corresponding to the int8 type may be taken as 128.
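A short sketch of deriving both ranges for one node to be quantized (the data below is random and purely illustrative):

    import numpy as np

    # Multiple groups of processing data, one group per set of input subdata.
    processing_data = [np.random.randn(64).astype(np.float32) for _ in range(8)]
    first_range = (min(d.min() for d in processing_data),
                   max(d.max() for d in processing_data))  # from the observed data
    second_range = 128                                     # int8 assumed, per the example above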
An exemplary implementation of the present embodiment is described below as a specific example.
In a specific example, the quantization function q corresponding to each node to be quantized may be:
q=clip(round(data/Δ))
wherein Δ is a scaling factor used to scale the processing data corresponding to the corresponding node to be quantized into a suitable quantization range; round(·) is a rounding operation that converts a floating-point number to an integer; clip(·) is a truncation operation that limits the range of the quantized integers.
The scaling factor Δ may be determined based on the first data range and the second data range corresponding to the respective node to be quantized.
In particular, the scaling factor Δ may be determined based on the following equation:
Δ = threshold / quantized_range
wherein threshold is a valid data threshold of the corresponding node to be quantized and may be determined based on the first data range. In some examples, the value of the threshold may be the maximum value in the multiple groups of processing data corresponding to the corresponding node to be quantized; in other examples, it may be the difference between the maximum value and the minimum value in those groups of processing data. The quantized_range is the data range of the target quantization data type and may be determined based on the second data range.
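Putting the two formulas together, a minimal int8 sketch of the quantization function might look like this, taking threshold as the maximum absolute value in the collected processing data (one of the options described above) and quantized_range as 128:

    import numpy as np

    def make_quantizer(threshold, quantized_range=128):
        delta = threshold / quantized_range          # Δ = threshold / quantized_range
        def q(data):
            # q = clip(round(data / Δ)); int8 truncation range assumed
            return np.clip(np.round(data / delta), -128, 127).astype(np.int8)
        return q, delta

    samples = np.array([0.8, -1.5, 2.3, -0.4], dtype=np.float32)
    q, delta = make_quantizer(threshold=float(np.abs(samples).max()))
    print(q(samples), delta)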
Step S103, processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized, and obtaining a second output.
In this embodiment, the input of the quantization layer is the same as the input of the corresponding layer to be quantized, so that the error of the obtained second output relative to the first output contains no error introduced by differing inputs; interference terms are thereby reduced, providing a better basis for the subsequent optimization of the related quantization function.
The quantization layer may process the corresponding first input through forward propagation to obtain processing data of a corresponding node of each node to be quantized in the corresponding quantization layer, and information of a second output of the corresponding quantization layer.
And step S104, optimizing the quantization function corresponding to the corresponding layer to be quantized according to the second output and the first output to obtain a target quantization function corresponding to the corresponding layer to be quantized.
In this embodiment, the first output and the second output may be compared, so as to determine an error of the output of the corresponding quantization layer with respect to the output of the corresponding layer to be quantized in the floating point model. Therefore, through the first output and the second output, the quantization loss of the quantization layer relative to the corresponding layer to be quantized can be determined, and then the quantization parameter of the quantization function of the node to be quantized in the corresponding layer to be quantized is adjusted in a targeted manner, so that the precision loss of the corresponding quantization layer is improved in a targeted manner, and the precision loss of the whole quantization model is improved.
There are various ways in which the quantization loss of a quantization layer with respect to the corresponding layer to be quantized can be determined from the first output and the second output. In some examples, for any one layer to be quantized, the mean square error of the second output of the corresponding quantization layer relative to the first output of that layer to be quantized is calculated to evaluate the precision loss of the corresponding quantization layer.
In some examples, the quantization function of the node to be quantized in the layer to be quantized may be iteratively optimized according to the first output and the second output until a quantization loss of a corresponding quantization layer of the layer to be quantized is minimized (e.g., a mean square error of the second output of the corresponding quantization layer of the corresponding layer to be quantized with respect to the first output of the layer to be quantized is minimized), or until the number of iterations reaches a preset number. After the iterative optimization is completed, a target quantization function may be obtained.
In some embodiments, optimizing the quantization function corresponding to the respective layer to be quantized according to the second output and the first output to obtain a target quantization function corresponding to the respective layer to be quantized includes:
optimizing a quantization function corresponding to the corresponding layer to be quantized according to the first output and the second output to obtain a corresponding target quantization function, and optimizing a weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain a target weight value of at least one node to be quantized in the corresponding layer to be quantized;
quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model, comprising:
and obtaining a target quantization model according to the target quantization function, the target weight value and the floating point model corresponding to each layer to be quantized.
For example, if the layer to be quantized is a convolution layer, the weight value of at least one node to be quantized in the layer to be quantized may be a weight value corresponding to a convolution kernel in the layer to be quantized. In addition, the node to be quantized corresponding to the weight value may also be another node in the floating-point model, which is not limited herein.
In this embodiment, not only the quantization parameter in the quantization function may be optimized, but also the weight value of at least one node to be quantized in the layer to be quantized may be optimized. The weight values of the nodes to be quantized can be regarded as part of model configuration parameters in the actual application process, and the optimization of the weight values of the nodes to be quantized can be regarded as the fine adjustment of the model configuration parameters of the quantization model. In this case, the adjustable factors are not limited to the quantization parameters of the quantization function, but also some model configuration parameters of the quantization model itself, so that the quantization model can be adjusted in multiple dimensions, so that the overall performance of the quantization model is further improved.
In some embodiments, optimizing the quantization function corresponding to the respective layers to be quantized according to the first output and the second output to obtain a corresponding target quantization function, and optimizing the weight value of at least one node to be quantized in the respective layers to be quantized to obtain a target weight value of at least one node to be quantized in the respective layers to be quantized includes:
calculating a mean square error of the second output relative to the first output;
and optimizing the quantization function corresponding to the corresponding layer to be quantized according to the mean square error to obtain a corresponding target quantization function, and optimizing the weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain the target weight value of at least one node to be quantized in the corresponding layer to be quantized.
In the embodiment of the present application, the quantization loss of the corresponding quantization layer may be evaluated according to the mean square error of the output of the quantization layer corresponding to the layer to be quantized relative to the output of that layer to be quantized, so as to optimize the quantization function and the specific weight value of the layer to be quantized. In an example, the mean square error may be used as the corresponding precision loss value, or the square root of the mean square error may be taken to obtain a root mean square error, and the root mean square error may be used as the corresponding precision loss value.
Mean Squared Error (MSE) may be calculated based on the following equation:
MSE = E[(F(W, X) − F(Q(W), Q(X)))²]
wherein Q(·) refers to the quantization function, F(·) is the operation the corresponding layer to be quantized performs on its input, W is the weight value of at least one node to be quantized in the layer to be quantized, and X is the processing data of the other nodes to be quantized in the layer to be quantized, where the processing data corresponding to those other nodes may not be a weight value; for example, it may be the input of the corresponding layer to be quantized or other data.
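In code, this per-layer quantization loss could be sketched as follows, assuming layer_fn(W, X) computes F(W, X) and q_w, q_x are quantize-dequantize functions at the weight node and at the other nodes (all names are illustrative assumptions):

    import torch

    def layer_mse(layer_fn, W, X, q_w, q_x):
        # MSE of the quantized layer's output relative to the float layer's output
        return torch.mean((layer_fn(W, X) - layer_fn(q_w(W), q_x(X))) ** 2)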
Fig. 2 is an exemplary diagram illustrating iterative optimization for a layer to be quantized. The example illustrates the iterative optimization of a layer to be quantized, denoted a.
Illustratively, the input of the layer a to be quantized is the same as the input of the corresponding quantization layer a', and is the first input of the layer a to be quantized.
After the first input is processed by the quantization layer a 'to obtain a second output, a mean square error of the second output of the quantization layer a' with respect to the first output of the layer a to be quantized may be calculated, and if the mean square error does not meet a preset condition, a quantization function and a weight value corresponding to the layer a to be quantized are updated in reverse according to the mean square error;
after the quantization function and the weight value corresponding to the layer a to be quantized are updated reversely, the layer a to be quantized is quantized again according to the updated quantization function and weight value, and an updated layer a' can be obtained.
The updated layer a' can then be further evaluated for loss of accuracy. Specifically, the initial input of the layer a' may be used as the input of the updated layer a' to obtain the output of the updated layer a', and the mean square error of the updated layer a' is then recalculated, so as to determine whether the optimization is complete according to that mean square error. Subsequent iterations proceed by analogy. After the optimization is completed, the corresponding layer to be quantized may be processed based on the target quantization function and the target weight value to obtain a target quantization layer corresponding to the layer to be quantized.
In this way, compared with optimizing the whole quantization model at once, independently optimizing each layer to be quantized reduces the data volume of a single optimization, improves the fineness with which the quantization model is optimized, improves the optimization quality, and can improve the precision of the finally obtained target quantization model.
In addition, because the first input of each layer to be quantized can be calculated in advance, the optimization of each layer can be considered to be independent, and therefore, each layer to be quantized can be optimized in parallel, and the data processing efficiency is improved.
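Since each layer's first input is recorded in advance, the per-layer optimizations can be dispatched independently. Below is a self-contained sketch of such parallel dispatch, in which optimize_layer is only a stand-in for the per-layer routine (a fuller optimization sketch follows later in this description):

    from concurrent.futures import ProcessPoolExecutor

    def optimize_layer(record):
        # Stand-in for the per-layer optimization routine; it echoes the
        # record here only so that this sketch runs on its own.
        return record

    def optimize_one(item):
        name, record = item
        return name, optimize_layer(record)

    layer_records = {"conv1": {}, "fc1": {}}   # placeholder per-layer records
    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            optimized = dict(pool.map(optimize_one, layer_records.items()))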
In some embodiments, optimizing a quantization function corresponding to a respective layer to be quantized according to a mean square error to obtain a corresponding target quantization function, and optimizing a weight value of at least one node to be quantized in the respective layer to be quantized to obtain a target weight value of the at least one node to be quantized in the respective layer to be quantized includes:
according to the mean square error, carrying out iterative optimization on a quantization function corresponding to the corresponding layer to be quantized and the weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain a corresponding target quantization function and a corresponding target weight value;
wherein, during each iteration:
if the mean square error corresponding to the iteration meets a preset condition, or the iteration frequency is not less than a preset frequency, taking the quantization function and the weighted value corresponding to the iteration as a corresponding target quantization function and a target weighted value;
otherwise, updating the quantization function and the weight value corresponding to the iteration based on the mean square error corresponding to the iteration, and re-executing the step of performing quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized to obtain the quantization layer corresponding to the corresponding layer to be quantized, together with the subsequent steps.
In this embodiment, the preset condition may be that the mean square error corresponding to the current iteration converges to a preset threshold, or that the mean square error after iterative optimization reaches its minimum, that is:
(Δ'_X, Δ'_W, W') = argmin(MSE)
wherein Δ'_X is the target quantization function corresponding to a node to be quantized such as an input node of the layer to be quantized, Δ'_W is the target quantization function corresponding to a node to be quantized having a weight value, and W' is the target weight value. In some examples, the target quantization function and the target weight value may be obtained through iterative optimization using the reverse (backpropagation-based) optimization algorithm of a framework such as tensorflow or pytorch.
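As one concrete possibility (an assumption; this application does not fix a particular optimizer), the argmin above can be approached in PyTorch by applying a straight-through estimator to the rounding operation, so that gradients flow back to the scaling factors and the weight:

    import torch

    def round_ste(x):
        # round(.) with a straight-through gradient
        return x + (torch.round(x) - x).detach()

    def fake_quantize(x, delta, qmin=-128, qmax=127):
        # quantize-dequantize so the loss is still computed in floating point
        return torch.clamp(round_ste(x / delta), qmin, qmax) * delta

    def optimize_layer_params(layer_fn, W, X, y_float, steps=200, lr=1e-3):
        # Jointly optimize (Δ'_X, Δ'_W, W') = argmin(MSE).
        W = W.detach().clone().requires_grad_(True)
        delta_w = (W.detach().abs().max() / 127).requires_grad_(True)
        delta_x = (X.detach().abs().max() / 127).requires_grad_(True)
        opt = torch.optim.Adam([W, delta_w, delta_x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            y_quant = layer_fn(fake_quantize(W, delta_w),
                               fake_quantize(X, delta_x))
            loss = torch.mean((y_float - y_quant) ** 2)   # the MSE above
            loss.backward()
            opt.step()
        return delta_x.detach(), delta_w.detach(), W.detach()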
And step S105, quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model.
In the embodiment of the application, after the target quantization function is obtained, the corresponding layer to be quantized can be quantized through the target quantization function, and a target quantization layer corresponding to the layer to be quantized is obtained. After the target quantization layer corresponding to each layer to be quantized is obtained, the target quantization model may be obtained according to a non-quantization layer (i.e., a layer that does not need quantization operation) and the target quantization layer in the floating-point model.
The target quantization model functions identically to the corresponding floating-point model. That is, if the floating-point model is used for image processing (e.g., target detection or classification of images), the target quantization model may also be used for image processing; if the floating-point model is used for text processing (e.g., text recognition), the target quantization model may also be used for text recognition. In the embodiment of the application, compared with a terminal device deploying the floating-point model, a terminal device deploying the target quantization model processes images or recognizes text faster and consumes fewer storage and computing resources, while still guaranteeing high data precision with a small calculation error.
In some embodiments, after the target quantization model is obtained, the specified data to be processed may be processed by the target quantization model, and a data processing result is obtained.
For example, the data to be processed may be image data, and the data processing result may be a classification result or a target detection result of the image data, etc., according to the function of the target quantization model; alternatively, the data to be processed may be text data, and the data processing result may be a text recognition result or the like. Compared with the prior art, the target quantization model optimization of this scheme reduces the precision loss and the calculation error of the data, thereby improving the accuracy of the data processing result. When the data to be processed is image data and the data processing result is a classification result or target detection result of the image data, the accuracy of the classification result or target detection result can be improved because the precision loss and calculation error of the data processing in the model are reduced. Likewise, when the data to be processed is text data and the data processing result is a text recognition result, the accuracy of the text recognition result can be improved.
Therefore, in the embodiment of the application, for each quantization layer, based on the information of the second output of the corresponding quantization layer relative to the first output of the corresponding layer to be quantized, the output condition of the quantization layer in the quantization model can be known relatively comprehensively, so that the quantization function in each quantization layer is optimized in a targeted manner, the hierarchical refinement and optimization of the quantization model are realized, a target quantization model which is more in line with the expectation is obtained, the precision of the quantization model is improved, and the calculation error is reduced.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 3 shows a block diagram of a model quantization apparatus provided in the embodiment of the present application, which corresponds to the above-described model quantization method in the above embodiment, and only shows the relevant parts in the embodiment of the present application for convenience of description.
Referring to fig. 3, the model quantization apparatus 3 includes:
the first processing module 301 is configured to process input data through a floating point model to obtain a target output, where the target output includes a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, and each layer to be quantized includes at least one node to be quantized;
a first quantization module 302, configured to, for each layer to be quantized, perform quantization processing on the corresponding layer to be quantized according to a quantization function of a node to be quantized in the corresponding layer to be quantized, so as to obtain a quantization layer corresponding to the corresponding layer to be quantized;
the second processing module 303 is configured to process the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized, so as to obtain a second output;
an optimizing module 304, configured to optimize, according to the second output and the first output, a quantization function corresponding to the corresponding layer to be quantized, to obtain a target quantization function corresponding to the corresponding layer to be quantized;
the second quantization module 305 is configured to quantize the floating point model according to the target quantization function corresponding to each layer to be quantized, so as to obtain a target quantization model.
Optionally, the optimization module 304 is specifically configured to:
optimizing a quantization function corresponding to the corresponding layer to be quantized according to the first output and the second output to obtain a corresponding target quantization function, and optimizing a weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain a target weight value of at least one node to be quantized in the corresponding layer to be quantized;
the second quantization module 305 is specifically configured to:
and obtaining a target quantization model according to the target quantization function, the target weight value and the floating point model corresponding to each layer to be quantized.
Optionally, the optimization module 304 specifically includes:
a calculation unit for calculating a mean square error of the second output relative to the first output;
and the optimization unit is used for optimizing the quantization function corresponding to the corresponding layer to be quantized according to the mean square error to obtain a corresponding target quantization function, and optimizing the weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain the target weight value of at least one node to be quantized in the corresponding layer to be quantized.
Optionally, the optimization unit is specifically configured to:
according to the mean square error, carrying out iterative optimization on a quantization function corresponding to the corresponding layer to be quantized and the weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain a corresponding target quantization function and a corresponding target weight value;
wherein, during each iteration:
if the mean square error corresponding to the iteration meets a preset condition, or the iteration frequency is not less than a preset frequency, taking the quantization function and the weighted value corresponding to the iteration as a corresponding target quantization function and a target weighted value;
otherwise, updating the quantization function and the weight value corresponding to the iteration based on the mean square error corresponding to the iteration, and re-executing the step of performing quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized to obtain the quantization layer corresponding to the corresponding layer to be quantized, together with the subsequent steps.
Optionally, the model quantization apparatus 3 further includes:
and the determining module is used for determining a quantization function of each node to be quantized in the corresponding layer to be quantized according to target processing data, wherein the target processing data comprises processing data corresponding to the corresponding node to be quantized when the floating point model processes the input data.
Optionally, the determining module is specifically configured to:
and aiming at each node to be quantized in the corresponding layer to be quantized, determining a quantization function of the corresponding node to be quantized according to the target processing data corresponding to the corresponding node to be quantized and the type of the target quantization data corresponding to the corresponding node to be quantized.
Optionally, the input data includes multiple sets of input subdata, for each node to be quantized, the target processing data includes multiple sets of processing data corresponding to the node to be quantized, and the multiple sets of processing data correspond to the multiple sets of input subdata one to one;
the determination module is specifically configured to:
and determining a quantization function of the corresponding node to be quantized according to a first data range and a second data range for each node to be quantized in the corresponding layer to be quantized, wherein the first data range is determined based on the maximum value and/or the minimum value in the multiple groups of processing data corresponding to the corresponding node to be quantized, and the second data range is determined based on the value range of the target quantization data type corresponding to the corresponding node to be quantized.
In the embodiment of the application, input data can be processed through a floating point model to obtain a target output, the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, each layer to be quantized comprises at least one node to be quantized, and for each layer to be quantized, quantization processing is performed on the corresponding layer to be quantized according to a quantization function of the node to be quantized in the corresponding layer to be quantized to obtain a quantization layer corresponding to the corresponding layer to be quantized; processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output; according to the second output and the first output, optimizing the quantization function corresponding to the corresponding layer to be quantized to obtain a target quantization function corresponding to the corresponding layer to be quantized; and quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model. At this time, for each quantization layer, based on the information of the second output of the corresponding quantization layer relative to the first output of the corresponding layer to be quantized, the output condition of the quantization layer in the quantization model can be known relatively comprehensively, so that the quantization function in each quantization layer is optimized in a targeted manner, the hierarchical refinement and optimization of the quantization model are realized, a target quantization model which is more in line with the expectation is obtained, the precision of the quantization model is improved, and the calculation error is reduced.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 4, the terminal device 4 of this embodiment includes: a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. The processor 40, when executing the computer program 42, implements the steps in the various model quantization method embodiments described above, such as the steps S101 to S105 shown in fig. 1. Alternatively, the processor 40, when executing the computer program 42, implements the functions of the modules/units in the above-described apparatus embodiments, such as the functions of the modules 301 to 305 shown in fig. 3.
Illustratively, the computer program 42 may be partitioned into one or more modules/units, which are stored in the memory 41 and executed by the processor 40 to implement the embodiments of the present application. One or more of the modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 42 in the terminal device 4. For example, the computer program 42 may be divided into a first processing module, a first quantization module, a second processing module, an optimization module, and a second quantization module, each module having the following specific functions:
the device comprises a first processing module, a second processing module and a third processing module, wherein the first processing module is used for processing input data through a floating point model to obtain target output, the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, and each layer to be quantized comprises at least one node to be quantized;
the first quantization module is used for performing quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized aiming at each layer to be quantized, so as to obtain a quantization layer corresponding to the corresponding layer to be quantized;
the second processing module is used for processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output;
the optimization module is used for optimizing the quantization function corresponding to the corresponding layer to be quantized according to the second output and the first output to obtain a target quantization function corresponding to the corresponding layer to be quantized;
and the second quantization module is used for quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model.
The terminal device 4 may be a wearable device, a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 40, a memory 41. Those skilled in the art will appreciate that fig. 4 is merely an example of a terminal device 4 and does not constitute a limitation of terminal device 4 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The Processor 40 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 41 may be an internal storage unit of the terminal device 4, such as a hard disk or a memory of the terminal device 4. The memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk provided on the terminal device 4, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 41 may also include both an internal storage unit of the terminal device 4 and an external storage device. The memory 41 is used for storing computer programs and other programs and data required by the terminal device. The memory 41 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the above method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may be implemented by a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the above embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the above modules or units is only one logical function division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of model quantization, comprising:
processing input data through a floating point model to obtain a target output, wherein the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, and each layer to be quantized comprises at least one node to be quantized;
for each layer to be quantized, carrying out quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized, and obtaining a quantization layer corresponding to the corresponding layer to be quantized;
processing the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output;
optimizing the quantization function corresponding to the corresponding layer to be quantized according to the second output and the first output to obtain a target quantization function corresponding to the corresponding layer to be quantized;
and quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model.
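As an illustrative, non-limiting sketch of the first step of claim 1 — recording the first input and first output of each layer to be quantized while the floating point model processes the input data — the following assumes a PyTorch-style floating point model and uses forward hooks; the helper name collect_layer_io is hypothetical:

```python
import torch
import torch.nn as nn

def collect_layer_io(model, input_data, layer_types=(nn.Linear, nn.Conv2d)):
    """Record the first input and first output of every layer to be
    quantized during one floating point forward pass."""
    records, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            records[name] = (inputs[0].detach(), output.detach())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, layer_types):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(input_data)          # the target output is gathered as a side effect
    for handle in handles:
        handle.remove()
    return records

# Usage: records maps each layer name to its (first input, first output).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
records = collect_layer_io(model, torch.randn(2, 8))
print({name: tuple(t.shape for t in io) for name, io in records.items()})
```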
2. The model quantization method of claim 1, wherein the optimizing the quantization function corresponding to the respective layer to be quantized according to the second output and the first output to obtain a target quantization function corresponding to the respective layer to be quantized comprises:
optimizing a quantization function corresponding to the corresponding layer to be quantized according to the first output and the second output to obtain a corresponding target quantization function, and optimizing a weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain a target weight value of at least one node to be quantized in the corresponding layer to be quantized;
the quantizing the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model, including:
and obtaining a target quantization model according to the target quantization function, the target weight value and the floating point model corresponding to each layer to be quantized.
3. The model quantization method of claim 2, wherein the optimizing a quantization function corresponding to a respective layer to be quantized according to the first output and the second output to obtain a corresponding target quantization function, and optimizing a weight value of at least one node to be quantized in the respective layer to be quantized to obtain a target weight value of the at least one node to be quantized in the respective layer to be quantized comprises:
calculating a mean square error of the second output relative to the first output;
and optimizing the quantization function corresponding to the corresponding layer to be quantized according to the mean square error to obtain a corresponding target quantization function, and optimizing the weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain the target weight value of at least one node to be quantized in the corresponding layer to be quantized.
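The mean square error of claim 3 is the ordinary elementwise average of squared differences between the two outputs; a minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def mse(second_output, first_output):
    # Mean square error of the quantization layer's second output
    # relative to the floating point layer's first output.
    diff = np.asarray(second_output) - np.asarray(first_output)
    return float(np.mean(diff ** 2))
```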
4. The model quantization method of claim 3, wherein the optimizing a quantization function corresponding to a respective layer to be quantized according to the mean square error to obtain a corresponding target quantization function, and optimizing a weight value of at least one node to be quantized in the respective layer to be quantized to obtain a target weight value of the at least one node to be quantized in the respective layer to be quantized comprises:
according to the mean square error, carrying out iterative optimization on a quantization function corresponding to the corresponding layer to be quantized and a weight value of at least one node to be quantized in the corresponding layer to be quantized to obtain a corresponding target quantization function and a corresponding target weight value;
wherein, during each iteration:
if the mean square error corresponding to the iteration meets a preset condition, or the number of iterations is not less than a preset number, taking the quantization function and the weight value corresponding to the iteration as the corresponding target quantization function and target weight value;
otherwise, updating the quantization function and the weight value corresponding to the iteration based on the mean square error corresponding to the iteration, and re-executing the step of performing quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized to obtain the quantization layer corresponding to the corresponding layer to be quantized, and the subsequent steps.
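As an illustrative, non-limiting sketch of the iteration of claim 4, the following jointly updates a layer weight and its quantization scale by gradient descent with a straight-through estimator, stopping once the mean square error falls below a preset threshold or a preset number of iterations is reached; the optimizer, threshold and learning rate are assumptions, since the claim does not prescribe a particular update rule:

```python
import torch

def iterative_quant_opt(w, first_input, first_output,
                        max_iters=200, mse_threshold=1e-4, lr=1e-3):
    # Jointly optimize the layer weight and the quantization scale until
    # the preset condition is met or the iteration cap is reached.
    w_opt = w.clone().requires_grad_(True)
    scale = torch.tensor(w.abs().max().item() / 127.0, requires_grad=True)
    optimizer = torch.optim.Adam([w_opt, scale], lr=lr)

    for _ in range(max_iters):
        q = torch.clamp(torch.round(w_opt / scale), -128, 127)
        # Straight-through estimator: the forward pass uses the quantized
        # weight, the backward pass treats rounding as the identity.
        w_q = (q - w_opt / scale).detach() * scale + w_opt
        second_output = first_input @ w_q.T
        mse = torch.mean((second_output - first_output) ** 2)
        if mse.item() < mse_threshold:        # preset condition satisfied
            break
        optimizer.zero_grad()
        mse.backward()
        optimizer.step()
        with torch.no_grad():
            scale.clamp_(min=1e-8)            # keep the scale positive

    return w_opt.detach(), scale.detach()     # target weight value and scale

# Usage (shapes: w is (out, in), first_input is (batch, in)):
# w_new, s = iterative_quant_opt(w, x, x @ w.T)
```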
5. The model quantization method of any one of claims 1 to 4, further comprising, before the performing, for each layer to be quantized, quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized to obtain the quantization layer corresponding to the corresponding layer to be quantized:
and aiming at each node to be quantized in the corresponding layer to be quantized, determining a quantization function of the corresponding node to be quantized according to target processing data, wherein the target processing data comprises processing data corresponding to the corresponding node to be quantized when the floating point model processes the input data.
6. The model quantization method of claim 5, wherein the determining, for each node to be quantized in the corresponding layer to be quantized, a quantization function of the corresponding node to be quantized according to target processing data, the target processing data including processing data corresponding to the corresponding node to be quantized when the floating point model processes the input data, comprises:
and aiming at each node to be quantized in the corresponding layer to be quantized, determining a quantization function of the corresponding node to be quantized according to the target processing data corresponding to the corresponding node to be quantized and the type of the target quantization data corresponding to the corresponding node to be quantized.
7. The model quantization method of claim 6, wherein the input data includes a plurality of sets of input sub-data, and for each node to be quantized, the target processed data includes a plurality of sets of processed data corresponding to the node to be quantized, the plurality of sets of processed data corresponding to the plurality of sets of input sub-data one-to-one;
the determining a quantization function of each node to be quantized in the corresponding layer to be quantized according to the target processing data corresponding to the corresponding node to be quantized and the target quantization data type corresponding to the corresponding node to be quantized includes:
and determining a quantization function of each node to be quantized in the corresponding layer to be quantized according to a first data range and a second data range, wherein the first data range is determined based on the maximum value and/or the minimum value in the multiple groups of processing data corresponding to the corresponding node to be quantized, and the second data range is determined based on the value range of the target quantization data type corresponding to the corresponding node to be quantized.
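As an illustrative, non-limiting sketch of claim 7, the following derives a quantization function from a first data range (the minimum and maximum observed over multiple groups of processed data) and a second data range (here the int8 value range as the target quantization data type); all identifiers are illustrative:

```python
import numpy as np

def calibrate_quant_fn(processed_batches, qmin=-128, qmax=127):
    # First data range: extremes of the processed data observed for the
    # node across all groups (assumes hi > lo).
    lo = min(float(b.min()) for b in processed_batches)
    hi = max(float(b.max()) for b in processed_batches)
    # Second data range: value range of the target quantization data type.
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))

    def quant(x):
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale       # dequantized approximation
    return quant

# Usage: multiple groups of processed data, one per group of input sub-data.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(64,)) for _ in range(4)]
quant = calibrate_quant_fn(batches)
```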
8. A model quantization apparatus, comprising:
a first processing module, configured to process input data through a floating point model to obtain a target output, wherein the target output comprises a first input and a first output of each layer to be quantized in the floating point model when the floating point model processes the input data, and each layer to be quantized comprises at least one node to be quantized;
a first quantization module, configured to, for each layer to be quantized, perform quantization processing on the corresponding layer to be quantized according to the quantization function of the node to be quantized in the corresponding layer to be quantized, to obtain a quantization layer corresponding to the corresponding layer to be quantized;
a second processing module, configured to process the first input of the corresponding layer to be quantized through the quantization layer corresponding to the corresponding layer to be quantized to obtain a second output;
an optimization module, configured to optimize the quantization function corresponding to the corresponding layer to be quantized according to the second output and the first output to obtain a target quantization function corresponding to the corresponding layer to be quantized;
and a second quantization module, configured to quantize the floating point model according to the target quantization function corresponding to each layer to be quantized to obtain a target quantization model.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111264248.5A 2021-10-28 2021-10-28 Model quantization method and device and terminal equipment Pending CN114065913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111264248.5A CN114065913A (en) 2021-10-28 2021-10-28 Model quantization method and device and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111264248.5A CN114065913A (en) 2021-10-28 2021-10-28 Model quantization method and device and terminal equipment

Publications (1)

Publication Number Publication Date
CN114065913A true CN114065913A (en) 2022-02-18

Family

ID=80235851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111264248.5A Pending CN114065913A (en) 2021-10-28 2021-10-28 Model quantization method and device and terminal equipment

Country Status (1)

Country Link
CN (1) CN114065913A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341633A (en) * 2023-05-29 2023-06-27 山东浪潮科学研究院有限公司 Model deployment method, device, equipment and storage medium
CN116341633B (en) * 2023-05-29 2023-09-01 山东浪潮科学研究院有限公司 Model deployment method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108510067B (en) Convolutional neural network quantification method based on engineering realization
CN108337000B (en) Automatic method for conversion to lower precision data formats
US20210089922A1 (en) Joint pruning and quantization scheme for deep neural networks
CN108304921B (en) Convolutional neural network training method and image processing method and device
US20190164043A1 (en) Low-power hardware acceleration method and system for convolution neural network computation
US20190087713A1 (en) Compression of sparse deep convolutional network weights
CN110659725B (en) Neural network model compression and acceleration method, data processing method and device
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
JP2021072103A (en) Method of quantizing artificial neural network, and system and artificial neural network device therefor
EP3931756A1 (en) Neural network layer processing with normalization and transformation of data
US20220164666A1 (en) Efficient mixed-precision search for quantizers in artificial neural networks
CN110929862B (en) Fixed-point neural network model quantification device and method
CN111553215A (en) Personnel association method and device, and graph convolution network training method and device
CN111026544A (en) Node classification method and device of graph network model and terminal equipment
WO2022021834A1 (en) Neural network model determination method and apparatus, and electronic device, and medium, and product
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN112085175B (en) Data processing method and device based on neural network calculation
CN113409307A (en) Image denoising method, device and medium based on heterogeneous noise characteristics
CN114065913A (en) Model quantization method and device and terminal equipment
WO2024060839A1 (en) Object operation method and apparatus, computer device, and computer storage medium
CN112561050B (en) Neural network model training method and device
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN110009091B (en) Optimization of learning network in equivalence class space
CN115953651A (en) Model training method, device, equipment and medium based on cross-domain equipment
WO2021244203A1 (en) Parameter optimization method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination