CN114692864A - Quantization method, quantization device, storage medium, and electronic apparatus - Google Patents

Quantization method, quantization device, storage medium, and electronic apparatus

Info

Publication number
CN114692864A
CN114692864A
Authority
CN
China
Prior art keywords
operator
quantized
quantization
quantization parameter
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011640084.7A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202011640084.7A
Publication of CN114692864A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Neurology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a quantization method, which comprises the following steps: acquiring a target operator and data to be quantized corresponding to the target operator; according to the target operator, obtaining a quantization parameter corresponding to the target operator from a quantization parameter manager; and performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator. The quantization method provided by the application reduces the amount of calculation and excessive repeated operations, is convenient to use, improves quantization efficiency, and optimizes the user experience.

Description

Quantization method, quantization device, storage medium, and electronic apparatus
Technical Field
The present application relates to the field of computers, and in particular, to a quantization method, apparatus, storage medium, and electronic device.
Background
In recent years, deep neural networks have achieved excellent performance in tasks such as image classification and detection. However, as network performance improves, the computational complexity of models during training and inference keeps increasing, which complicates the deployment of models on central nodes or edge devices and limits inference speed.
Model quantization is a method for compressing a model. It can reduce the size of the model and the storage space it occupies, reduce memory consumption, and accelerate inference, making it one of the effective techniques for optimizing a model. The TensorFlow framework commonly used for deep learning already provides two quantization tools, TensorFlow-Lite and TensorRT. However, when these two tools are used for model quantization, not only must data preprocessing be implemented separately for different models by writing dedicated functions, but the tools also support only models in the saved_model or frozen_graph_def formats, so a user needs to convert the model format before quantization. As a result, a large amount of repetitive work is required when using these quantization tools, the user experience is poor, and the quantization efficiency is low.
Disclosure of Invention
The application provides a quantization method and related equipment, effectively solving the problem that model quantization in the TensorFlow framework requires a large amount of repetitive work.
In a first aspect, an embodiment of the present application provides a quantization method, where the method includes: acquiring a target operator and data to be quantized corresponding to the target operator; according to the target operator, obtaining a quantization parameter corresponding to the target operator from a quantization parameter manager; and performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
In a second aspect, an embodiment of the present application provides a quantization apparatus, including: an acquisition unit, used for acquiring a target operator and data to be quantized corresponding to the target operator, and further used for acquiring a quantization parameter corresponding to the target operator from a quantization parameter manager according to the target operator; and a processing unit, used for performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
In a third aspect, an embodiment of the present application provides a quantization apparatus, including: a processor and a memory, the processor executing code in the memory to perform a method as provided by any one of the implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method as provided in any one of the implementations of the first aspect.
According to the method and the device, quantization is divided into two stages: obtaining quantization parameters and quantization inference. In the stage of obtaining the quantization parameters, the computation graph is directly modified to obtain the quantization parameters corresponding to a plurality of operators to be quantized, and the quantization parameters are stored in the quantization parameter manager; in the quantization inference stage, the computation graph is again directly modified and quantization inference is carried out according to the quantization parameters. Therefore, the quantization parameters only need to be calculated once and stored in the quantization parameter manager; each time inference is performed on the model to be quantized, the corresponding quantization parameters only need to be obtained from the quantization parameter manager rather than recalculated, which reduces the amount of calculation. Meanwhile, the embodiment of the application directly modifies the computation graph corresponding to the model, so there is no need to consider data preprocessing or to convert the model into the saved_model or frozen_graph_def formats supported by TensorFlow-Lite and TensorRT, which reduces excessive repeated operations, is convenient to use, improves quantization efficiency, and optimizes the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 2a is a schematic flow chart of a quantization method provided by an embodiment of the present application;
fig. 2b is a schematic flow chart of another quantization method provided in the embodiment of the present application;
fig. 3 is a schematic diagram of a computation graph corresponding to a model to be quantized according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating modification of a computation graph corresponding to a model to be quantized according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating modification of a computation graph corresponding to another model to be quantized according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating modification of a computation graph corresponding to another model to be quantized according to an embodiment of the present application;
fig. 7 is a block diagram of functional units of a quantization apparatus provided in an embodiment of the present application;
fig. 8 is a structural diagram of a combined processing device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a board card provided in an embodiment of the present application.
Detailed Description
The terminology used in the examples section of this application is for the purpose of describing particular embodiments of the invention only and is not intended to be limiting of the invention.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second" and "third" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first", "second", "third" may explicitly or implicitly include one or more of the features. It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
First, an application scenario according to the present application will be described.
In recent years, deep neural networks have achieved excellent performance in tasks such as image classification and detection. However, as network performance improves, the computational complexity of models during training and inference keeps increasing, which complicates the deployment of models on central nodes or edge devices and limits inference speed. Model quantization is a method for compressing a model; it can reduce the size of the model and the storage space it occupies, reduce memory consumption, and accelerate inference, making it one of the effective techniques for optimizing a model.
When quantizing a deep neural network model, the TensorFlow framework commonly used for deep learning already provides two quantization tools, TensorFlow-Lite and TensorRT. When a user quantizes a model with these two tools, data preprocessing must first be considered; for example, when the data to be quantized are pictures, the pictures need to be cropped, stretched, and so on. Moreover, for different models, the user needs to write dedicated functions to convert the format of the model to be quantized into one of the two formats supported by TensorFlow-Lite and TensorRT, saved_model or frozen_graph_def, before applying the quantization algorithm to perform quantization inference on the model. This process involves a large number of repetitive steps, so the user experience is poor and the quantization efficiency is low. The embodiment of the application provides a quantization method that quantizes the model directly by modifying the computation graph corresponding to the model to be quantized in both the quantization parameter obtaining stage and the quantization inference stage. For different models, model quantization can be completed with little or even no code modification by the user; models in formats such as ckpt and h5 can be supported, and data preprocessing does not need to be considered. The method therefore provides a convenient way to quantize, reduces excessive repeated operations, improves quantization efficiency, and optimizes the user experience.
The following describes embodiments of the present application in detail.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and as shown in fig. 1, the electronic device 100 includes a processor 110, an input device 120, an output device 130, and a memory 140, where the electronic device 100 may further include a communication bus 150, and the processor 110, the input device 120, the output device 130, and the memory 140 may be connected to each other through the bus.
The processor 110 is configured to implement the following steps when executing the program stored in the memory 140:
acquiring a target operator and data to be quantized corresponding to the target operator from a memory, and acquiring a quantization parameter corresponding to the target operator from a quantization parameter manager according to the target operator; and performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
Further, the processor 110 may be a Central Processing Unit (CPU), a Neural-network Processing Unit (NPU), a Graphics Processing Unit (GPU), or an Image Processing Unit (IPU), which is not limited in the present application. Depending on the processor, the quantization method provided by the embodiment of the application can be applied to artificial intelligence application fields such as image recognition processing, deep learning processing, computer vision processing, intelligent robot processing, and natural language processing, and can execute complex function programs in the artificial intelligence field.
The embodiment of the present application provides a quantization method, which is applied to the electronic device 100 and its processor 110 in fig. 1. The method comprises the following steps: acquiring a target operator and data to be quantized corresponding to the target operator; according to the target operator, obtaining a quantization parameter corresponding to the target operator from a quantization parameter manager; and performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
Referring to fig. 2a, fig. 2a is a schematic flowchart of a quantization method applied to a processor of an electronic device according to the present application. As shown in fig. 2a, the method comprises the steps of:
s101, acquiring a target operator and data to be quantized corresponding to the target operator.
Specifically, when the model to be quantized is subjected to inference, a target operator and data to be quantized corresponding to the target operator are obtained, wherein the target operator is one of a plurality of operators to be quantized in the model to be quantized, the data to be quantized comprises one or more of input data and a weight of the target operator, and the input data comprises one or more of voice data, text data and image data.
And S102, acquiring a quantization parameter corresponding to the target operator from the quantization parameter manager according to the target operator.
The quantization parameter manager is used for storing quantization parameters, wherein the quantization parameters are stored in a key value pair mode in the quantization parameter manager, a key in the key value pair is a quantization parameter identifier of an operator to be quantized, and a value in the key value pair is the content of the quantization parameters. And correspondingly acquiring the quantization parameter corresponding to the target operator from the quantization parameter manager according to the target operator.
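For illustration only, the key-value organization described above can be sketched in Python as follows. This is a minimal sketch, assuming a JSON file as the backing store and an environment variable for the save path; the class name, file format, and variable name are assumptions and are not prescribed by this embodiment.

```python
import json
import os


class QuantParamManager:
    """Minimal key-value store for quantization parameters (illustrative).

    Keys are quantization parameter identifiers of operators to be quantized
    (e.g. "conv0/input"); values are the quantization parameter contents.
    """

    def __init__(self):
        # Save path taken from an environment variable, mirroring the
        # environment-variable-configured storage path described later;
        # the variable name QUANT_PARAM_PATH is an assumption.
        self.save_path = os.environ.get("QUANT_PARAM_PATH", "quant_params.json")
        self.scale_map = {}

    def put(self, key, value):
        self.scale_map[key] = value

    def get(self, key):
        # Retrieve the quantization parameter for one operator identifier.
        return self.scale_map[key]

    def save(self):
        with open(self.save_path, "w") as f:
            json.dump(self.scale_map, f)

    def load(self):
        with open(self.save_path) as f:
            self.scale_map = json.load(f)
```

With such a store, a lookup like manager.get("conv0/input") corresponds to step S102 above.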
S103, performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
Specifically, in the quantization inference stage, a graph optimizer is added to modify the computation graph. The modified computation graph retains all functions of the original computation graph, and given the same input, the modified computation graph and the original computation graph produce the same output values. Besides the functions of the original computation graph, the modified computation graph also has the function of quantizing data according to the saved quantization parameters. The modified computation graph can thus perform quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
In one possible implementation, before performing the calculation, a second graph optimizer is added and started, a third operator is registered, and the target operator is replaced by the third operator, where the third operator includes the target operator. During inference, the third operator first quantizes the data to be quantized corresponding to the target operator according to the quantization parameter corresponding to the target operator to obtain quantized data; the third operator then performs quantization inference according to the quantized data.
Referring to fig. 2b, fig. 2b is a schematic flow chart of another quantization method provided in the present application, which is applied to a processor of an electronic device. As shown in fig. 2b, the method comprises the steps of:
s201, obtaining a quantization instruction, a calibration data set and a plurality of operators to be quantized.
The quantization instruction comprises a storage path for storing the quantization parameters; specifically, the user can specify the quantization parameter storage path by configuring an environment variable. The plurality of operators to be quantized are all the quantifiable operators in the model to be quantized, and the model to be quantized may be a model that has been trained to convergence.
In a possible implementation manner, the operator to be quantized may be one or more of a Local Response Normalization (LRN) operator, a Conv2D operator, a MatMul operator, and a Conv2DBackpropInput operator, and may also be other operators, which is not limited in this embodiment.
In a possible implementation manner, the calibration data set includes input data corresponding to an operator to be quantized and a weight corresponding to the operator to be quantized, where the input data may be selected from a verification data set of a model to be quantized, and may be one or more of voice data, text data, and image data, which is not limited in this embodiment of the present application.
S202, obtaining a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set, and storing the quantization parameter in a quantization parameter manager.
Specifically, in TensorFlow a model is constructed as a computation graph, and the computation graph contains the model's plurality of operators to be quantized. In the embodiment of the application, a graph optimizer custom_optimizer is added before calculation to modify the computation graph. The modified computation graph retains all functions of the original computation graph, and given the same input, the modified computation graph and the original computation graph produce the same output values. Besides all functions of the original computation graph, the modified computation graph also has the functions of calculating and storing the quantization parameters. When the modified computation graph executes the calculation, the quantization parameter corresponding to each operator to be quantized can be calculated according to the calibration data set and each operator to be quantized, yielding for each operator to be quantized the quantization parameter corresponding to its input data and the quantization parameter corresponding to its weight. After the quantization parameter corresponding to each operator to be quantized is obtained, it is stored in the quantization parameter manager in the form of key-value pairs.
In one possible implementation, the quantization parameter may include a scaling factor scale and a decimal point position, where the decimal point position is the position of the decimal point in the quantized data, and the scaling factor is the ratio between the maximum value of the quantized data and the maximum absolute value of the data to be quantized. Further, for asymmetric data to be quantized, the quantization parameter may also include an offset, which may be the middle value among the elements of the data to be quantized.
In a possible implementation manner, the modified computation graph calculates, according to the calibration data set and each operator to be quantized, the quantization parameter corresponding to each operator to be quantized, which may specifically be:

position = ceil(log2(max_abs / (2^(n-1) - 1)))    (1)

scale = max_abs / (2^position × (2^(n-1) - 1))    (2)

where max_abs is the maximum absolute value of the calibration data set corresponding to the operator to be quantized, n is the quantization bit width, position is the decimal point position, scale is the scaling factor, and ceil denotes rounding up. The calculation manner in this embodiment of the present application is only an example, and the modified computation graph may also calculate the quantization parameters in another manner, which is not limited in this embodiment of the present application.
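As a concrete reading of formulas (1) and (2), the following Python sketch computes the two parameters from max_abs; the function name and the default bit width n = 8 are assumptions for illustration.

```python
import math


def compute_quant_params(max_abs: float, n: int = 8):
    """Compute the decimal point position and scaling factor, formulas (1)-(2)."""
    q_max = 2 ** (n - 1) - 1  # largest fixed-point magnitude, 127 for n = 8
    position = math.ceil(math.log2(max_abs / q_max))
    scale = max_abs / (2 ** position * q_max)
    return position, scale
```

For example, compute_quant_params(6.0) returns position = -4 and scale ≈ 0.756, so that the largest representable value 127 × 2^-4 × 0.756 recovers max_abs ≈ 6.0.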
In a possible implementation manner, before performing the calculation, a first graph optimizer is added and started, a corresponding first operator is registered for each operator to be quantized in the plurality of operators to be quantized, and each operator to be quantized is replaced by its corresponding first operator. The graph optimizer may be turned on or off by setting a flag through an environment variable, or by setting tf.ConfigProto of TensorFlow, where tf.ConfigProto is a configurable parameter of TensorFlow, or in other ways, which is not limited in this embodiment of the present invention.
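For the tf.ConfigProto route, the sketch below shows one way a custom graph optimizer could be switched on, using TensorFlow's rewriter configuration together with an environment-variable flag. The optimizer name CambQuantOptimizer and the variable name CAMB_QUANT_OPTIMIZER are assumptions; the optimizer itself would have to be registered with TensorFlow as a plugin beforehand.

```python
import os

import tensorflow as tf

config = tf.compat.v1.ConfigProto()

# Turn the graph optimizer on or off through an environment-variable flag,
# as described above (flag and optimizer names are illustrative).
if os.environ.get("CAMB_QUANT_OPTIMIZER", "0") == "1":
    custom = config.graph_options.rewrite_options.custom_optimizers.add()
    custom.name = "CambQuantOptimizer"  # must match a registered optimizer

with tf.compat.v1.Session(config=config) as sess:
    pass  # build and run the model to be quantized under this configuration
```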
Illustratively, the neural network model shown in fig. 3 includes an activation ReLU operator, a convolution Conv2D operator, and a batch normalization BN operator. If the operator to be quantized is the Conv2D operator, then, taking the Conv2D operator as an example, a CambQuantConv2D operator corresponding to the Conv2D operator is registered, and the CambQuantConv2D operator replaces the Conv2D operator, as shown in fig. 4. The CambQuantConv2D operator can calculate the quantization parameters of the Conv2D operator according to the calibration data set and store them in the quantization parameter manager in the form of key-value pairs, where the key is the quantization parameter identifier of the operator to be quantized and the value is the content of the quantization parameter. For example, in the quantization parameter manager, the quantization parameters corresponding to the Conv2D operators may be stored in the form scale_map{conv0/input: 1.7, conv0/filter: 3.6, conv1/input: 6.7, conv1/filter: 7.7}.
In a possible implementation manner, the first operator may be one operator or a combination of multiple operators. When the first operator is a combination of a plurality of operators, the first operator includes any combination and all possible combinations of the corresponding operator to be quantized, the operator for calculating the quantization parameter, and the operator for storing the quantization parameter, which is not limited in this embodiment of the present application. The calculation quantization parameter operator is used for calculating the quantization parameter of the operator to be quantized corresponding to the first operator, and the storage quantization parameter operator is used for storing the quantization parameter of the operator to be quantized corresponding to the first operator in the storage path of the quantization parameter manager according to the quantization instruction.
In a possible implementation manner, before performing the calculation, a first graph optimizer is added, the first graph optimizer is started, a corresponding second operator is registered for each operator to be quantized in the multiple operators to be quantized, and the second operator corresponding to each operator to be quantized is inserted in front of the corresponding operator to be quantized.
Illustratively, taking the Conv2D operator in the neural network model shown in fig. 3 as an example, a second operator, the CambQuant operator, corresponding to the Conv2D operator is registered, and the CambQuant operator is inserted in front of the Conv2D operator. As shown in fig. 5, the neural network model computes the CambQuant operator first and then the Conv2D operator. The CambQuant operator may calculate the quantization parameter of the Conv2D operator from the calibration data set and store it in the quantization parameter manager.
In a possible implementation manner, the second operator may be a single operator that both calculates and stores the quantization parameter; the second operator may also be two operators, comprising an operator for calculating the quantization parameter and an operator for storing the quantization parameter, which is not limited in this embodiment of the present application. The calculating operator is used for calculating the quantization parameter of the operator to be quantized corresponding to the second operator, and the storing operator is used for storing that quantization parameter under the storage path of the quantization parameter manager according to the quantization instruction.
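To make the two rewriting strategies concrete, here is a simplified GraphDef-level sketch of what such a graph optimizer might do: replace each Conv2D node with a quantization-aware node (the first-operator style), or insert a parameter-computing node in front of it (the second-operator style). The op names CambQuantConv2D and CambQuant come from the examples above; the sketch is schematic, since a production grappler optimizer is normally written in C++ against TensorFlow's internal APIs.

```python
from tensorflow.core.framework import graph_pb2


def rewrite_graph(graph_def, mode="replace"):
    """Schematic GraphDef rewrite for the quantization parameter stage."""
    new_graph = graph_pb2.GraphDef()
    for node in graph_def.node:
        if node.op == "Conv2D" and mode == "replace":
            # First-operator style: swap the op type in place; the custom op
            # computes and stores the quantization parameters, then convolves.
            new_node = new_graph.node.add()
            new_node.CopyFrom(node)
            new_node.op = "CambQuantConv2D"
        elif node.op == "Conv2D" and mode == "insert":
            # Second-operator style: insert a CambQuant node in front of the
            # Conv2D data input so it observes the data and stores the
            # quantization parameters before Conv2D runs.
            quant = new_graph.node.add()
            quant.op = "CambQuant"
            quant.name = node.name + "/camb_quant"
            quant.input.extend(node.input[:1])
            new_node = new_graph.node.add()
            new_node.CopyFrom(node)
            del new_node.input[:]
            new_node.input.append(quant.name)      # rewired data input
            new_node.input.extend(node.input[1:])  # filter input unchanged
        else:
            new_graph.node.add().CopyFrom(node)
    return new_graph
```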
According to the embodiment of the application, model quantization is divided into a quantization parameter obtaining stage and a quantization inference stage. In the quantization parameter obtaining stage, the computation graph is directly modified before the calculation is executed, so the quantization parameters of the operators to be quantized can be obtained during the calculation and stored in the quantization parameter manager. Therefore, the quantization parameters are calculated only once, in the quantization parameter obtaining stage, and do not need to be recalculated every time the model input data change, which reduces the amount of calculation.
S203, acquiring a target operator and data to be quantized corresponding to the target operator, and acquiring a quantization parameter corresponding to the target operator from a quantization parameter manager according to the target operator.
Specifically, when the model to be quantized is subjected to inference, a target operator and data to be quantized corresponding to the target operator are obtained, wherein the target operator is one of a plurality of operators to be quantized in the model to be quantized, the data to be quantized comprises one or more of input data and a weight of the target operator, and the input data comprises one or more of voice data, text data and image data. And according to the target operator, taking out the quantization parameter corresponding to the target operator from the quantization parameter manager.
Illustratively, in the quantization parameter manager, the quantization parameters corresponding to the Conv2D operators are stored in the form scale_map{conv0/input: 1.7, conv0/filter: 3.6, conv1/input: 6.7, conv1/filter: 7.7}. For the operator conv0, the quantization parameter of its input data is 1.7 and that of its weight is 3.6; for the operator conv1, the quantization parameter of its input data is 6.7 and that of its weight is 7.7.
S204, performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
Specifically, in the quantization inference stage, a graph optimizer is added to modify the computation graph. The modified computation graph retains all functions of the original computation graph, and given the same input, the modified computation graph and the original computation graph produce the same output values. Besides the functions of the original computation graph, the modified computation graph also has the function of quantizing data according to the saved quantization parameters. The modified computation graph can thus perform quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
In a possible implementation manner, before performing the calculation, a second graph optimizer is added and started, a third operator is registered, and the target operator is replaced by the third operator. During inference, the third operator first quantizes the data to be quantized corresponding to the target operator according to the quantization parameter corresponding to the target operator to obtain quantized data; the third operator then performs quantization inference according to the quantized data.
In a possible implementation manner, the third operator may be one operator or a combination of multiple operators. When the third operator is a combination of multiple operators, the third operator includes any combination of the corresponding target operator and the quantization operator, and all possible combinations, which is not limited in this embodiment of the present application. The quantization operator is used for quantizing the data to be quantized.
In a possible implementation manner, a specific calculation manner for the third operator to quantize the data to be quantized corresponding to the target operator according to the quantization parameter corresponding to the target operator is as follows:
I_x = round(F_x / (2^position × scale))    (3)

F_x ≈ I_x × 2^position × scale    (4)

where I_x represents the fixed-point quantized data, F_x represents the data before quantization, and round denotes rounding to the nearest integer. The specific manner in which the third operator quantizes the data to be quantized corresponding to the target operator according to the quantization parameter corresponding to the target operator is merely an example in this embodiment of the application; the calculation may also be carried out in another manner, which is not limited in this application.
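The following is a minimal NumPy sketch of formulas (3) and (4), assuming 8-bit signed fixed-point data; the function names are illustrative.

```python
import numpy as np


def quantize(fx, position, scale, n=8):
    """Formula (3): map floating-point data to n-bit fixed-point data."""
    q_max = 2 ** (n - 1) - 1
    ix = np.round(fx / (2.0 ** position * scale))
    return np.clip(ix, -q_max - 1, q_max).astype(np.int8)


def dequantize(ix, position, scale):
    """Formula (4): recover an approximation of the original data."""
    return ix.astype(np.float32) * 2.0 ** position * scale
```

With position = -4 and scale ≈ 0.756 from the earlier example, quantize(np.float32(6.0), -4, 0.756) yields 127, and dequantize maps 127 back to roughly 6.0.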
In the embodiment of the present application, quantization refers to converting data in a first format into data in a second format. The data in the first format may be floating-point data or fixed-point data, and the data in the second format may be fixed-point data, where a representation range of the data in the first format is greater than a representation range of the data in the second format, and the data in the first format has higher precision than the data in the second format. For example, the data in the first format may be 64-bit floating point type data or 32-bit floating point type data, the data in the second format may be 16-bit fixed point type data or 8-bit fixed point type data, and the quantization may be performed by converting 64-bit floating point type data into 16-bit fixed point type data or 8-bit fixed point type data, or may be performed by converting 32-bit floating point type data into 16-bit fixed point type data or 8-bit fixed point type data. Other data types may also be converted, and the embodiment of the present application is not limited in this respect.
In one possible implementation, the neural network model shown in fig. 3 is described below with the target operator being the Conv2D operator as an example. When the model performs inference, the second graph optimizer is started, a third operator intConv2D is registered, and the Conv2D operator is replaced by the intConv2D operator, as shown in fig. 6. The intConv2D operator may quantize the data to be quantized according to the quantization parameter corresponding to the Conv2D operator to obtain quantized data, and perform inference according to the quantized data.
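As a rough illustration of what such a replacement operator does internally, the following self-contained sketch quantizes an input and a weight with their respective parameters, accumulates in the integer domain, and rescales the result. A toy dot product stands in for the convolution, and all names and parameter values are assumptions for illustration.

```python
import numpy as np


def quant_int(fx, position, scale):
    """Formula (3), kept in int32 so the accumulation below cannot overflow."""
    ix = np.round(fx / (2.0 ** position * scale))
    return np.clip(ix, -128, 127).astype(np.int32)


def int_dot(x, w, x_pos, x_scale, w_pos, w_scale):
    """Toy stand-in for an intConv2D-style operator."""
    ix = quant_int(x, x_pos, x_scale)
    iw = quant_int(w, w_pos, w_scale)
    acc = np.dot(ix, iw)  # integer-domain multiply-accumulate
    # Rescale to floating point: formula (4) applied to both factors.
    return acc * 2.0 ** (x_pos + w_pos) * x_scale * w_scale


x = np.array([0.5, -1.2, 3.3], dtype=np.float32)
w = np.array([0.1, 0.4, -0.2], dtype=np.float32)
print(int_dot(x, w, -4, 0.76, -8, 0.9))  # close to np.dot(x, w) ≈ -1.09
```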
In a possible implementation manner, the first graph optimizer and the second graph optimizer may be the same optimizer or different optimizers.
In summary, in the embodiment of the present application, quantization is divided into two stages: a quantization parameter obtaining stage and a quantization inference stage. In the quantization parameter obtaining stage, the computation graph is directly modified to obtain the quantization parameters corresponding to the plurality of operators to be quantized, and the quantization parameters are stored in the quantization parameter manager; in the quantization inference stage, the computation graph is again directly modified and quantization inference is performed according to the quantization parameters. Therefore, the quantization parameters only need to be calculated once and stored in the quantization parameter manager; each time inference is performed on the model to be quantized, the corresponding quantization parameters only need to be obtained from the quantization parameter manager rather than recalculated, which reduces the amount of calculation. Meanwhile, the embodiment of the application directly modifies the computation graph corresponding to the model, so there is no need to consider data preprocessing or to convert the model into the saved_model or frozen_graph_def formats supported by TensorRT and TensorFlow-Lite, which reduces excessive repeated operations, is convenient to use, improves quantization efficiency, and optimizes the user experience.
The above description has introduced the solution of the embodiment of the present application mainly from the perspective of the method-side implementation process. It is understood that, in order to realize the above-mentioned functions, the electronic device comprises corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments provided herein can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Referring to fig. 7, fig. 7 is a block diagram illustrating functional units of a quantization apparatus 700 according to an embodiment of the present application. As shown in fig. 7, the quantization apparatus 700 includes an obtaining unit 710 and a processing unit 720, wherein,
the obtaining unit 710 is configured to obtain a target operator and data to be quantized corresponding to the target operator, and obtain a quantization parameter corresponding to the target operator from a quantization parameter manager according to the target operator;
the processing unit 720 is configured to perform quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
In a specific embodiment of the quantization apparatus of the present invention, for the specific operation of the obtaining unit 710 obtaining the target operator and the data to be quantized corresponding to the target operator and obtaining the quantization parameter corresponding to the target operator from the quantization parameter manager according to the target operator, reference may be made to the related operations in S203, which are not described herein again; for the operations related to the quantization inference on the target operator by the processing unit 720, reference may be made to the related operations in S204, which are not described herein again.
In a specific implementation manner, the obtaining unit 710 is further configured to obtain a quantization instruction, a calibration data set, and a plurality of operators to be quantized; for the specific operations, reference may be made to the related operations in S201, which are not described herein again. The processing unit 720 is further configured to obtain a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set, and store the quantization parameter corresponding to each operator to be quantized under the storage path of the quantization parameter manager according to the quantization instruction; for the specific operations of obtaining the quantization parameters and storing them under the storage path of the quantization parameter manager, reference may be made to the related operations in S202, which are not described herein again.
It is to be understood that the functions of each program module of the quantization apparatus in the embodiment of the present application may be specifically implemented according to the method in the foregoing method embodiment, and for the specific implementation process, reference may be made to the related description of the foregoing method embodiment, which is not repeated here.
Embodiments of the present application also provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute part or all of the steps of any one of the methods as described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package.
Fig. 8 is a block diagram illustrating a combined processing device 800 according to an embodiment of the present disclosure. As shown in fig. 8, the combined processing device 800 includes a computing processing device 802, an interface device 804, other processing devices 806, and a storage device 808. Depending on the application scenario, one or more computing devices 810 may be included in the computing processing device, and may be configured to perform the operations described herein in conjunction with fig. 2a and 2 b.
In various embodiments, a computing processing device of the present disclosure may be configured to perform user specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as part of a hardware structure of an artificial intelligence processor core, computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through an interface device to collectively perform user-specified operations. Depending on the implementation, other processing devices of the present disclosure may include one or more types of general and/or special purpose processors, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), and artificial intelligence processors. These processors may include, but are not limited to, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined based on actual needs. As mentioned previously, the computing processing device of the present disclosure alone can be considered to have a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing processing device and other processing devices may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing device can serve as an interface between the computing processing device of the present disclosure (which can be embodied as an artificial-intelligence-related computing device, e.g., a computing device for neural network operations) and external data and controls, performing basic controls including, but not limited to, data handling and the starting and/or stopping of the computing device. In further embodiments, other processing devices may also cooperate with the computing processing device to collectively perform computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing processing device may obtain input data from other processing devices via the interface device and write the input data into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit the data to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in fig. 8, the storage means is connected to the calculation processing means and the other processing means, respectively. In one or more embodiments, the storage device may be used to hold data for the computing processing device and/or the other processing devices. For example, the data may be data that is not fully retained within internal or on-chip storage of a computing processing device or other processing device.
In some embodiments, the present disclosure also discloses a neural network chip (e.g., chip 902 shown in fig. 9). In one implementation, the chip is a System on Chip (SoC) and is integrated with one or more combined processing devices as shown in fig. 8. The chip may be connected to other associated components through an external interface device (such as the interface device 804 shown in fig. 8). The relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the disclosure also discloses a chip packaging structure, which includes the chip. In some embodiments, the disclosure further discloses a board card including the chip packaging structure. The board card will be described in detail below with reference to fig. 9.
Fig. 9 is a schematic diagram illustrating the structure of a board card 900 according to an embodiment of the disclosure. As shown in fig. 9, the board card includes a memory device 904 for storing data, which includes one or more memory units 910. The memory device may be connected to, and transfer data with, the control device 908 and the chip 902 described above through, for example, a bus. Further, the board card also includes an external interface device 906 configured for a data relay or transfer function between the chip (or the chip in the chip packaging structure) and an external device 912 (such as a server or a computer). For example, the data to be processed may be transferred to the chip by the external device through the external interface device. For another example, the calculation result of the chip may be transmitted back to the external device via the external interface device. According to different application scenarios, the external interface device may have different interface forms; for example, it may adopt a standard PCIE interface.
In one or more embodiments, the control device in the disclosed card may be configured to regulate the state of the chip. Therefore, in an application scenario, the control device may include a single chip Microcomputer (MCU) for controlling the operating state of the chip.
From the above description in conjunction with fig. 8 and 9, it will be understood by those skilled in the art that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above boards, one or more of the above chips and/or one or more of the above combination processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
In other implementation scenarios, the integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical realization of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause A1. A method of quantization, comprising:
acquiring a target operator and data to be quantized corresponding to the target operator;
according to the target operator, obtaining a quantization parameter corresponding to the target operator from a quantization parameter manager;
and performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
Clause A2. The method according to clause A1, wherein before the acquiring of the target operator and the data to be quantized corresponding to the target operator, the method further comprises:
obtaining a quantization instruction, wherein the quantization instruction comprises a saving path used by the quantization parameter manager to store quantization parameters;
acquiring a calibration data set and a plurality of operators to be quantized, wherein the target operator is one of the operators to be quantized;
obtaining a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set, and saving the quantization parameter corresponding to each operator to be quantized under the saving path of the quantization parameter manager according to the quantization instruction.
Clause A3. The method according to clause A2, wherein the obtaining of a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set, and the saving of the quantization parameter corresponding to each operator to be quantized under the saving path of the quantization parameter manager according to the quantization instruction, comprise:
registering a corresponding first operator for each operator to be quantized, wherein each first operator comprises the operator to be quantized corresponding to that first operator;
obtaining a quantization parameter corresponding to each operator to be quantized according to the first operator corresponding to each operator to be quantized and the calibration data set;
and, according to the quantization instruction, the first operator corresponding to each operator to be quantized saves the quantization parameter corresponding to that operator to be quantized in the quantization parameter manager.
Clause A4. The method according to clause A2, wherein the obtaining of a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set, and the saving of the quantization parameter corresponding to each operator to be quantized under the saving path of the quantization parameter manager according to the quantization instruction, comprise:
registering a corresponding second operator for each operator to be quantized;
determining a quantization parameter corresponding to each operator to be quantized according to the second operator corresponding to each operator to be quantized and the calibration data set;
and, according to the quantization instruction, the second operator corresponding to each operator to be quantized saves the quantization parameter corresponding to that operator to be quantized in the quantization parameter manager.
Clause A5. The method according to clause A1, wherein the performing of quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator comprises:
registering a third operator, wherein the third operator comprises the target operator, and the third operator is used to quantize the data to be quantized according to the quantization parameter corresponding to the target operator to obtain quantized data, so that the third operator performs inference according to the quantized data;
and the third operator performs quantization inference according to the data to be quantized corresponding to the target operator.
Clause A6. The method according to clause A5, applied to a TensorFlow architecture, wherein the data to be quantized comprises one or more of input data and weight values of the target operator, and the input data comprises one or more of speech data, text data, and image data.
Clause A7. A quantization apparatus, comprising:
the acquisition unit is used for acquiring a target operator and data to be quantized corresponding to the target operator;
the acquisition unit is further used for acquiring a quantization parameter corresponding to the target operator from a quantization parameter manager according to the target operator;
and the processing unit is used for performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
Clause A8. The apparatus according to clause A7, wherein:
the obtaining unit is further configured to obtain a quantization instruction, where the quantization instruction includes a saving path used by the quantization parameter manager to store a quantization parameter; acquiring a calibration data set and a plurality of operators to be quantized, wherein the target operator is any one of the operators to be quantized;
the processing unit is further configured to obtain a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set, and to save the quantization parameter corresponding to each operator to be quantized under the saving path of the quantization parameter manager according to the quantization instruction.
Clause A9. A quantization apparatus, comprising: a processor and a memory, wherein the processor executes code in the memory to perform the method according to any one of clauses A1-A6.
Clause A10. A computer-readable storage medium storing a computer program for data exchange, wherein the computer program, when executed by a processor, implements the method according to any one of clauses A1-A6.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (12)

1. A method of quantization, comprising:
acquiring a target operator and data to be quantized corresponding to the target operator;
according to the target operator, obtaining a quantization parameter corresponding to the target operator from a quantization parameter manager;
and performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
2. The method according to claim 1, wherein before the acquiring of the target operator and the data to be quantized corresponding to the target operator, the method further comprises:
obtaining a quantization instruction, wherein the quantization instruction comprises a saving path used by the quantization parameter manager to store quantization parameters;
acquiring a calibration data set and a plurality of operators to be quantized, wherein the target operator is one of the operators to be quantized;
obtaining a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set;
and saving the quantization parameter corresponding to each operator to be quantized under the saving path of the quantization parameter manager according to the quantization instruction.
3. The method according to claim 2, wherein obtaining a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set comprises:
registering a corresponding first operator for each operator to be quantized, wherein the first operator is used for realizing the operation of the corresponding operator to be quantized and calculating and storing the quantization parameter of the corresponding operator to be quantized;
and obtaining a quantization parameter corresponding to each operator to be quantized according to the first operator corresponding to each operator to be quantized and the calibration data set.
4. The method according to claim 3, wherein the first operator is a single operator, and the method further comprises:
replacing each operator to be quantized with a first operator corresponding to each operator to be quantized;
and obtaining a quantization parameter corresponding to each operator to be quantized according to the first operator corresponding to each operator to be quantized and the calibration data set.
5. The method according to claim 3, wherein the first operator further comprises at least one of a quantization parameter calculation operator and a quantization parameter saving operator, wherein the quantization parameter calculation operator is used to calculate the quantization parameter of the operator to be quantized corresponding to the first operator, and the quantization parameter saving operator is used to save the quantization parameter of the operator to be quantized corresponding to the first operator under the saving path of the quantization parameter manager according to the quantization instruction.
6. The method according to claim 2, wherein obtaining a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set comprises:
registering a corresponding second operator for each operator to be quantized, wherein the second operator is used for storing the quantization parameters of the operator to be quantized corresponding to the second operator in the storage path of the quantization parameter manager according to the quantization instruction;
and determining a quantization parameter corresponding to each operator to be quantized according to the second operator corresponding to each operator to be quantized and the calibration data set.
7. The method according to claim 1, wherein the performing of quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator comprises:
registering a third operator, wherein the third operator comprises the target operator, and the third operator is used to quantize the data to be quantized according to the quantization parameter corresponding to the target operator to obtain quantized data, so that the third operator performs inference according to the quantized data;
and the third operator performs quantization inference according to the data to be quantized corresponding to the target operator.
8. The method according to claim 7, wherein the method is applied to a TensorFlow architecture, the data to be quantized comprises one or more of input data and weight values of the target operator, and the input data comprises one or more of voice data, text data, and image data.
9. A quantization apparatus, comprising:
the acquisition unit is used for acquiring a target operator and data to be quantized corresponding to the target operator;
the acquisition unit is further used for acquiring a quantization parameter corresponding to the target operator from a quantization parameter manager according to the target operator;
and the processing unit is used for performing quantization inference on the target operator according to the quantization parameter corresponding to the target operator and the data to be quantized corresponding to the target operator.
10. The quantization apparatus according to claim 9, wherein before acquiring the target operator and the data to be quantized corresponding to the target operator, the obtaining unit is further configured to:
obtaining a quantization instruction, wherein the quantization instruction comprises a saving path used by the quantization parameter manager to store quantization parameters;
acquiring a calibration data set and a plurality of operators to be quantized, wherein the target operator is any one of the operators to be quantized;
the processing unit is further configured to obtain a quantization parameter corresponding to each operator to be quantized according to each operator to be quantized in the plurality of operators to be quantized and the calibration data set, and to save the quantization parameter corresponding to each operator to be quantized under the saving path of the quantization parameter manager according to the quantization instruction.
11. A quantization apparatus, comprising: a processor and a memory, wherein the processor executes code in the memory to perform the method according to any one of claims 1 to 6.
12. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 6.
CN202011640084.7A 2020-12-31 2020-12-31 Quantization method, quantization device, storage medium, and electronic apparatus Pending CN114692864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011640084.7A CN114692864A (en) 2020-12-31 2020-12-31 Quantization method, quantization device, storage medium, and electronic apparatus

Publications (1)

Publication Number Publication Date
CN114692864A 2022-07-01

Family

ID=82136515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011640084.7A Pending CN114692864A (en) 2020-12-31 2020-12-31 Quantization method, quantization device, storage medium, and electronic apparatus

Country Status (1)

Country Link
CN (1) CN114692864A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination