CN112085175A - Data processing method and device based on neural network calculation


Info

Publication number
CN112085175A
CN112085175A
Authority
CN
China
Prior art keywords
data
quantization
layer
calculation result
quantized
Prior art date
Legal status
Granted
Application number
CN201910517485.4A
Other languages
Chinese (zh)
Other versions
CN112085175B (en)
Inventor
陈超
徐斌
谢展鹏
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910517485.4A priority Critical patent/CN112085175B/en
Priority to PCT/CN2020/095823 priority patent/WO2020249085A1/en
Publication of CN112085175A publication Critical patent/CN112085175A/en
Application granted granted Critical
Publication of CN112085175B publication Critical patent/CN112085175B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biophysics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a data processing method and device based on a quantized neural network. In the embodiment of the present application, the first calculation result is requantized based on a requantization coefficient, where the requantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient and divided by the second data quantization coefficient. The requantization merges the conventional first inverse quantization operation and second quantization operation, so that the multiple loading processes of data and weights in the first inverse quantization operation and the second quantization operation are combined into a single data loading process and a single requantization coefficient loading process corresponding to the requantization operation, which helps shorten the time spent loading data and weights.

Description

Data processing method and device based on neural network calculation
Technical Field
The present application relates to the field of data processing, and more particularly, to a data processing method and apparatus based on neural network computation.
Background
A computing device based on a neural network consumes significant computing resources when processing data (for example, in convolution operations). This is especially true in two types of data processing layers, the convolutional layer and the fully-connected layer: the data processing in both layers is essentially matrix multiplication based on floating-point numbers, that is, a floating-point data matrix is multiplied by a floating-point weight matrix, so the amount of computation in the data processing process is large and the computing resources occupied are considerable.
In order to reduce the amount of computation when a computing device processes data, a quantization operation is introduced in the industry, that is, floating-point data involved in the data processing process is converted into fixed-point data, taking advantage of the fact that matrix multiplication based on fixed-point data requires fewer computing resources than matrix multiplication based on floating-point data. The quantization operation is inherently a lossy transformation: the smaller the quantization bit width of the quantized fixed-point data, the lower the precision of the data processing performed by the computing device. Generally, quantized data needs to serve as input data for the next calculation after undergoing a linear calculation; therefore, in order to reduce the influence of the quantization operation on data accuracy, the quantized data needs to undergo an inverse quantization operation after the linear calculation.
Generally, the quantization operation and the inverse quantization operation are combined with a "three-level data processing structure" to realize the processing of data. The three-level data processing structure comprises a first linear computation layer, a Rectified Linear Unit (ReLU) layer, and a second linear computation layer. A quantization sublayer and an inverse quantization sublayer are introduced into each of the first linear computation layer and the second linear computation layer, so as to perform a quantization operation on the input data of the first linear computation layer, perform an inverse quantization operation on the first calculation result of the first linear computation layer, perform a quantization operation on the input data of the second linear computation layer, and perform an inverse quantization operation on the second calculation result of the second linear computation layer.
However, in the above data processing process, for each quantization operation, linear calculation, inverse quantization operation, and ReLU calculation, the data and weights required by that operation need to be loaded, which results in a long time spent loading data and weights and affects the data processing performance of the neural network.
Disclosure of Invention
The application provides a data processing method and device based on neural network computation, which help reduce the time spent loading data and weights in a scenario where a quantization operation, an inverse quantization operation, and a three-level data processing structure are combined.
In a first aspect, the present application provides a data processing method based on a quantized neural network, the quantized neural network comprising a three-level data processing structure, the three-level data processing structure comprising: a first linear computation layer, a rectified linear unit (ReLU) layer, and a second linear computation layer; wherein:
the first linear computation layer comprises a first quantization sublayer, a first calculation sublayer and a first inverse quantization sublayer; the first quantization sublayer is used for quantizing the input data according to the first data quantization coefficient to obtain first quantized data; the first calculation sublayer is used for calculating the first quantized data according to the quantized first weight to obtain a first calculation result, and the first inverse quantization sublayer is used for inverse quantizing the first calculation result to obtain first output data; the quantized first weight is obtained by quantization according to the first weight quantization coefficient;
the ReLU layer is used for performing ReLU operation on the first output data to obtain intermediate output data;
the second linear computation layer comprises a second quantization sublayer, a second computation sublayer and a second inverse quantization sublayer; the second quantization sublayer is used for quantizing the intermediate output data according to a second data quantization coefficient to obtain second quantized data; the second calculation sublayer is used for calculating the second quantized data to obtain a second calculation result, and the second inverse quantization sublayer is used for performing inverse quantization on the second calculation result according to a second inverse quantization coefficient to obtain second output data;
the data processing method comprises the following steps:
obtaining the first calculation result by adopting the same method as that in the three-level data processing structure;
carrying out re-quantization on the first calculation result to obtain a re-quantized first calculation result, wherein the re-quantization comprises: multiplying the first calculation result by a requantization coefficient to obtain the re-quantized first calculation result; performing the ReLU operation on the re-quantized first calculation result to obtain the second quantized data; or, performing a ReLU operation on the first calculation result to obtain a first calculation result after the ReLU operation, and carrying out re-quantization on the first calculation result after the ReLU operation, where the re-quantization includes: multiplying the first calculation result after the ReLU operation by the requantization coefficient to obtain the second quantized data;
processing the second quantized data in the same way as in the three-level data processing structure;
wherein the requantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient divided by the second data quantization coefficient.
In an embodiment of the present application, the first calculation result is requantized based on a requantization coefficient, where the requantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient and divided by the second data quantization coefficient. In other words, the requantization merges the conventional first inverse quantization operation and second quantization operation, so that the multiple loading processes of data and weights in the first inverse quantization operation and the second quantization operation are combined into a single data loading process and a single requantization coefficient loading process corresponding to the requantization operation, which helps reduce the time taken to load data and weights.
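As a rough numerical illustration only (not the claimed implementation), the sketch below contrasts the conventional inverse-quantization/ReLU/quantization path with the fused requantization described above; the coefficient values and variable names are arbitrary examples introduced here for illustration.

```python
import numpy as np

# Example coefficients (in practice obtained by offline calibration).
data_scale_1, weight_scale_1, data_scale_2 = 0.02, 0.001, 0.015
requant_scale = data_scale_1 * weight_scale_1 / data_scale_2   # single fused coefficient

first_result = np.array([[-500, 0, 1200], [300, -7, 42]], dtype=np.int32)

# Conventional path: inverse quantization, ReLU, then quantization (two coefficient loads).
dequantized = first_result * (data_scale_1 * weight_scale_1)
conventional = np.round(np.maximum(dequantized, 0) / data_scale_2)

# Fused path: one multiplication by the requantization coefficient, then ReLU.
fused = np.round(np.maximum(first_result * requant_scale, 0))

assert np.array_equal(conventional, fused)
```

Both paths produce the same quantized result because all three coefficients are positive, while the fused path loads only a single coefficient.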
In one possible implementation, the first calculation result is stored in a memory, and performing the ReLU operation on the first calculation result includes: reading the first calculation result from the memory; and passing the first calculation result through a comparator on a data path to complete the ReLU operation, obtaining a first calculation result after the ReLU operation.
In the embodiment of the application, the ReLU operation is completed by the comparator on the data path, so that the computing unit does not have to execute both the ReLU operation and the requantization, which helps reduce the computation load of the computing unit.
In one possible implementation, the requantization is processed by a requantization circuit, the output of the comparator being an input to the requantization circuit.
In the embodiment of the application, the requantization and the ReLU operation on the data are implemented by the requantization circuit and the comparator respectively, that is, the requantization and the ReLU operation are distributed to different units, which facilitates a reasonable distribution of the computation load of the requantization and the ReLU operation and helps improve the data processing speed.
In one possible implementation, the data path is the path that data takes from the memory to an input of the re-quantization circuit.
In one possible implementation, before the obtaining the first calculation result by the same method as that in the three-level data processing structure, the method further includes: acquiring the calibration input data and the weights corresponding to each layer in a full-precision neural network model, wherein the calibration input data of the first layer in the full-precision neural network model is data in a calibration data set prepared in advance, and the calibration input data of each remaining layer is the output data of the previous layer; acquiring the optimal maximum value of the calibration input data and the optimal maximum value of the weights corresponding to each layer; determining a plurality of candidate data quantization coefficients, a plurality of candidate data bit widths, a plurality of candidate data inverse quantization coefficients and a plurality of candidate weight quantization coefficients for each layer according to the optimal maximum value of the calibration input data, the optimal maximum value of the weights and a plurality of selectable data formats for each layer; obtaining a plurality of quantized candidate weights for each layer according to the candidate weight quantization coefficients of each layer and the weights of each layer in the full-precision neural network model; determining a plurality of quantization-based neural network models according to the plurality of candidate data quantization coefficients, the plurality of candidate data inverse quantization coefficients, the plurality of candidate weight quantization coefficients and the plurality of quantized candidate weights of each layer; inputting data in the calibration data set into the plurality of quantization-based neural network models, and collecting a plurality of operation results; and selecting, from the plurality of quantization-based neural network models according to the operation results, the quantization-based neural network model whose operation result meets a preset condition.
In the embodiment of the application, the calibration data set is input to the full-precision neural network model, and the quantization coefficient, the inverse quantization coefficient, the weight quantization coefficient and the weights that meet the preset condition are determined for each layer; that is, the coefficients and weights required by each layer are determined on a per-layer basis, which helps balance the calculation precision and the calculation speed of the neural network. This avoids the problem that every layer in a conventional neural network uses the same quantization coefficient, inverse quantization coefficient and weight quantization coefficient, so that the conventional neural network can only favor one of calculation precision and calculation speed. For example, when the quantization coefficients, inverse quantization coefficients and weight quantization coefficients used in each layer guarantee high calculation precision of the neural network, the amount of computation in data processing is large and the time taken is long. Conversely, when the coefficients used in each layer guarantee a fast calculation speed of the neural network, the precision of the data processing is poor.
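The per-layer calibration flow described above can be pictured, very roughly, as the following sketch. It is a minimal illustration only: the helper names (optimal_max, candidate_scales, select_model, CANDIDATE_FORMATS) and the use of the plain absolute maximum as the "optimal max" are assumptions introduced here, not details specified by this application.

```python
import numpy as np

CANDIDATE_FORMATS = {"INT4": 7, "INT8": 127, "INT16": 32767}  # max fixed-point magnitude per format

def optimal_max(values):
    # Placeholder for the optimal-max search; the simple absolute maximum is used here.
    return np.abs(values).max()

def candidate_scales(calib_inputs, weights):
    """Per-layer candidates: (format, data scale, weight scale, inverse-quantization scale)."""
    max_data, max_weight = optimal_max(calib_inputs), optimal_max(weights)
    candidates = []
    for fmt, q_max in CANDIDATE_FORMATS.items():
        data_scale = max_data / q_max
        weight_scale = max_weight / q_max
        candidates.append((fmt, data_scale, weight_scale, data_scale * weight_scale))
    return candidates

def select_model(candidate_models, calib_set, criterion):
    # Run every candidate quantized model on the calibration set and keep the one whose
    # measured result best satisfies the preset condition (performance, power, or precision).
    scored = [(model, criterion(model, calib_set)) for model in candidate_models]
    return max(scored, key=lambda item: item[1])[0]
```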
In one possible implementation, the data format includes: integer INT4, integer INT8, integer INT16, floating point FP16, or floating point FP32.
In one possible implementation manner, the preset condition includes: optimal performance, optimal power consumption or optimal precision.
In the embodiment of the application, the quantization coefficients, inverse quantization coefficients and weight quantization coefficients used by each layer in the neural network can be customized by setting preset conditions, which helps improve the rationality of the quantization coefficients, inverse quantization coefficients and weight quantization coefficients used by each layer.
In a second aspect, there is provided a quantization-based neural network computing device, the quantization-based neural network comprising a three-level data processing structure, the three-level data processing structure comprising: a first linear computation layer, a ReLU layer, and a second linear computation layer, wherein:
the first linear computation layer comprises a first quantization sublayer, a first calculation sublayer and a first inverse quantization sublayer; the first quantization sublayer is used for quantizing the input data according to the first data quantization coefficient to obtain first quantized data; the first calculation sublayer is used for calculating the first quantized data according to the quantized first weight to obtain a first calculation result, and the first inverse quantization sublayer is used for inverse quantizing the first calculation result to obtain first output data; the quantized first weight is obtained by quantization according to the first weight quantization coefficient;
the ReLU layer is used for performing ReLU operation on the first output data to obtain intermediate output data;
the second linear computation layer comprises a second quantization sublayer, a second computation sublayer and a second inverse quantization sublayer; the second quantization sublayer is used for quantizing the intermediate output data according to a second data quantization coefficient to obtain second quantized data; the second calculation sublayer is used for calculating the second quantized data to obtain a second calculation result, and the second inverse quantization sublayer is used for performing inverse quantization on the second calculation result according to a second inverse quantization coefficient to obtain second output data;
the computing device is configured to implement the functionality of the three-level data processing structure, and the computing device comprises: a first quantization circuit, a first calculation circuit, a re-quantization circuit and a ReLU circuit;
the first calculation circuit is used for obtaining the first calculation result by adopting the same method as that in the three-level data processing structure;
the re-quantization circuit is configured to re-quantize the first calculation result to obtain a re-quantized first calculation result, where the re-quantization includes: multiplying the first calculation result by a requantization coefficient to obtain the re-quantized first calculation result; the ReLU circuit is used for performing the ReLU operation on the re-quantized first calculation result to obtain the second quantized data; or, the ReLU circuit is configured to perform a ReLU operation on the first calculation result to obtain a first calculation result after the ReLU operation, and the re-quantization circuit is configured to re-quantize the first calculation result after the ReLU operation, where the re-quantizing includes: multiplying the first calculation result after the ReLU operation by the requantization coefficient to obtain the second quantized data;
the first quantization circuit is configured to process the second quantized data in the same manner as in the three-level data processing structure;
wherein the requantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient divided by the second data quantization coefficient.
In an embodiment of the present application, the first calculation result is requantized based on a requantization coefficient, where the requantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient and divided by the second data quantization coefficient. That is, the requantization merges the conventional first inverse quantization operation and second quantization operation, so that the multiple loading processes of data and weights in the first inverse quantization operation and the second quantization operation are combined into a single data loading process and a single requantization coefficient loading process corresponding to the requantization operation, which helps reduce the time taken to load data and weights.
In a possible implementation manner, the ReLU circuit includes a comparator, the comparator is disposed in a data path between a memory of the computing device and an input of the requantization circuit, and the comparator is configured to perform the ReLU operation on a first calculation result obtained from the memory to obtain a first calculation result after the ReLU operation; the re-quantization circuit is used for obtaining the first calculation result after the ReLU operation from the comparator.
In the embodiment of the application, the ReLU operation is completed by the comparator on the data path, so that the requantization circuit does not have to execute both the ReLU operation and the requantization, which helps reduce the computation load of the requantization circuit.
In one possible implementation, the output of the comparator is used as an input to the re-quantization circuit.
In one possible implementation, the computing device includes a vector calculation circuit that includes the re-quantization circuit and the ReLU circuit.
In the embodiment of the present application, the ReLU circuit and the re-quantization circuit are implemented by a vector calculation circuit, reusing the functions of an existing vector calculation circuit, so that the modifications required to an existing computing device can be reduced and the applicability of the embodiment of the present application is broadened.
In one possible implementation, the data format of the first linear computation layer or the data format of the second linear computation layer is any one of a plurality of data formats, including integer INT4, integer INT8, integer INT16, floating point FP16, or floating point FP32.
In the embodiment of the present application, because the data format of the first linear computation layer or the data format of the second linear computation layer may be any one of the above data formats, the constraint in the prior art that the data format of every linear computation layer in the neural network must be the same is avoided, which helps improve the flexibility of quantization of the neural network.
In one possible implementation, the first weight quantization coefficient, the first data quantization coefficient, and the second data quantization coefficient are determined based on preset conditions, where the preset conditions include: optimal performance, optimal power consumption or optimal precision.
In the embodiment of the application, the quantization coefficients, inverse quantization coefficients and weight quantization coefficients used by each layer in the neural network can be customized by setting preset conditions, which helps improve the rationality of the quantization coefficients, inverse quantization coefficients and weight quantization coefficients used by each layer.
In one possible implementation, the computing device of the second aspect may be a System on Chip (SoC).
In a third aspect, a computing system is provided, comprising a controller and a computing device, wherein the controller controls the computing device to execute the method by transmitting a plurality of instructions to the computing device.
In a possible implementation, the above computing system further includes a training device of a neural network model, the training device including at least one processor and at least one memory, the at least one processor being configured to:
acquiring calibration input data and weight corresponding to each layer in a full-precision neural network model from the at least one memory, wherein the calibration input data of a first layer in the full-precision neural network model is data in a calibration data set prepared in advance, and the calibration input data of the rest layers is output data of a previous layer;
acquiring the optimal maximum value of the calibration input data and the optimal maximum value of the weight corresponding to each layer;
determining a plurality of candidate data quantization coefficients, a plurality of candidate data bit widths, a plurality of candidate data inverse quantization coefficients and a plurality of candidate weight quantization coefficients for each layer according to the optimal maximum value of the calibration input data, the optimal maximum value of the weights and a plurality of selectable data formats for each layer;
obtaining a plurality of quantized candidate weights of each layer according to the candidate weight quantization coefficients of each layer and the weight of each layer in the full-precision neural network model;
determining a plurality of quantization-based neural network models according to the plurality of candidate data quantization coefficients, the plurality of candidate data inverse quantization coefficients, the plurality of candidate weight quantization coefficients and the plurality of quantized candidate weights of each layer;
inputting data in the calibration data set into the plurality of quantization-based neural network models, and counting a plurality of operation results;
selecting the quantization-based neural network model with an operation result meeting a preset condition from the plurality of quantization-based neural network models according to the operation results.
Optionally, the training device may be a server or a computing cloud.
In a fourth aspect, a computing device is provided, which includes a controller and the computing device according to any one of the possible implementation manners of the second aspect.
In one possible implementation, the computing device may be a computing device including a controller (e.g., a CPU) and the computing device described in any one of the possible implementations of the second aspect.
In a fifth aspect, there is provided a computer readable medium storing program code for execution by a computing device, the program code comprising instructions for performing the method of the first aspect described above.
In a sixth aspect, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the above method.
Drawings
FIG. 1 is a schematic architecture diagram of a convolutional neural network.
Fig. 2 is a schematic architecture diagram of an architecture 100 of another convolutional neural network.
Fig. 3 is a schematic diagram of the hardware architecture of a neural network.
Fig. 4 is a schematic diagram of the linear quantization principle.
Fig. 5 is a schematic diagram of a three-level data processing structure of a conventional neural network.
FIG. 6 is a schematic block diagram of a three-stage processing architecture of an embodiment of the present application.
Fig. 7 is a flowchart of a data processing method based on a neural network according to an embodiment of the present application.
Fig. 8 is a schematic flow diagram of a quantization operation based on a neural network.
Fig. 9 is a flowchart of a selection process of each layer parameter of the quantization-based neural network according to an embodiment of the present application.
Fig. 10 is a schematic diagram of a computing device based on a quantitative neural network according to an embodiment of the present application.
Fig. 11 is a schematic diagram of a quantized neural network-based computing system 1100 according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
For ease of understanding, a convolutional neural network to which the embodiments of the present application are applicable will be briefly described. It should be noted that the embodiment of the present application may also be applied to other types of Neural networks, for example, Deep Neural Networks (DNNs), and the embodiment of the present application is not limited to this.
A Convolutional Neural Network (CNN) can be understood as a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving an input image or a convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer that performs convolution processing on an input signal in a convolutional neural network. In a convolutional layer of a convolutional neural network, one neuron may be connected to only a portion of the neighboring neurons. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangle. Neural units of the same feature plane share weights, where the shared weights are the convolution kernel. A convolution kernel can be initialized as a matrix of random size, and reasonable weights can be learned during the training of the convolutional neural network. In addition, the direct benefit of sharing weights is that the connections between layers of the convolutional neural network are reduced, while the risk of overfitting is also reduced.
The following describes the architecture of the neural network in detail by taking CNN as an example. FIG. 1 is a schematic architecture diagram of a convolutional neural network. The neural network 100 depicted in fig. 1 includes an input layer 110, a convolutional/pooling layer 120, and a neural network layer 130. It should be understood that the pooling layer is optional.
Convolutional layer/pooling layer 120: the convolutional layer/pooling layer 120 may include, for example, layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation. In the case where the neural network does not include a pooling layer, layers 121 to 126 may all be convolutional layers.
Taking the convolutional layer 121 as an example, the convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined; in practical applications, the weight values in the weight matrix need to be obtained through a large amount of training, and each weight matrix formed by the trained weight values can extract feature information from the input data, thereby helping the convolutional neural network 100 make correct predictions.
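As a rough illustration of the convolution operator described above (a minimal sketch, not this application's implementation, and ignoring strides, padding and multiple channels), a single kernel sliding over one input feature map can be written as:

```python
import numpy as np

def conv2d_single(feature_map, kernel):
    """Naive 'valid' convolution of one 2-D feature map with one kernel."""
    kh, kw = kernel.shape
    oh = feature_map.shape[0] - kh + 1
    ow = feature_map.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            # each output value is a weighted sum over one kernel-sized window
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out
```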
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract increasingly complex features, such as features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 121 to 126 illustrated by 120 in fig. 1, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image into an image of smaller size. The average pooling operator may calculate the average of the pixel values in the image within a particular range. The max pooling operator may take the pixel with the largest value within a particular range as the result of the max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
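A minimal sketch of the average and max pooling operators described above (illustrative only; non-overlapping windows are assumed):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over a 2-D feature map; each output pixel summarizes one window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))
```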
Neural network layer 130: after processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is still not able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (class information or other relevant information as required), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes. Accordingly, the neural network layer 130 may include a plurality of hidden layers (131, 132, to 13n as shown in fig. 1) and an output layer 140, and the parameters included in the plurality of hidden layers may be pre-trained according to relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 130, the last layer of the whole convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used for calculating the prediction error. Once the forward propagation (i.e., the propagation from 110 to 140 in fig. 1) of the whole convolutional neural network 100 is completed, the backward propagation (i.e., the propagation from 140 to 110 in fig. 1) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 1 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, and fig. 2 shows another architecture 200 of the convolutional neural network, and compared to the connection manner between the convolutional layers/pooling layers 121 to 126 shown in fig. 1, a plurality of convolutional layers/pooling layers in fig. 2 may be parallel, that is, the features extracted respectively are all input to the global neural network layer 130 for processing. In fig. 1 and 2, the same reference numerals are used for the same elements.
A computing device for implementing the relevant functions of the convolutional neural network described above is described below in conjunction with fig. 3. The computing device shown in FIG. 3 may be a Neural Network Processing Unit (NPU) 310.
The NPU 310 is mounted as a coprocessor to a main CPU (host CPU) 320, and tasks are allocated by the main CPU 320. The core portion of the NPU 310 is the arithmetic circuit 303; the controller 304 controls the arithmetic circuit 303 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 303 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 303 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 303 fetches the data corresponding to matrix B from the weight memory 302 and buffers the data on each PE in the arithmetic circuit 303. The arithmetic circuit 303 takes the matrix A data from the input memory 301 and performs a matrix operation with matrix B, and a partial result or the final result of the obtained matrix is stored in the accumulator 308.
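Functionally, this behaviour can be pictured as a tiled matrix multiplication whose partial sums are accumulated; the sketch below illustrates that accumulation only and is not intended to describe the hardware design.

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """C = A @ B computed tile by tile, with partial results summed into an accumulator."""
    m, k = a.shape
    _, n = b.shape
    acc = np.zeros((m, n), dtype=np.int32)   # plays the role of the accumulator
    for k0 in range(0, k, tile):
        # each pass over a slice of A and B contributes a partial result
        acc += a[:, k0:k0 + tile].astype(np.int32) @ b[k0:k0 + tile, :].astype(np.int32)
    return acc
```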
The unified memory 306 is used to store input data as well as output data. The weight data is directly transferred from the external memory 340 to the weight memory 302 via a Direct Memory Access Controller (DMAC) 305. Input data is also carried from the external memory 340 into the unified memory 306 through the DMAC.
A Bus Interface Unit (BIU) 330 is used for the interaction between the AXI bus and the DMAC 305 as well as the instruction fetch buffer (Instruction Fetch Buffer) 309.
The BIU 330 is used for the instruction fetch memory 309 to fetch instructions from the external memory, and is also used for the DMAC 305 to fetch the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC 305 is mainly used to carry input data in the external memory DDR to the unified memory 306, or carry weight data into the weight memory 302, or carry input data into the input memory 301.
If necessary, a plurality of operation processing units in the vector calculation circuit 307 further process the output of the arithmetic circuit 303, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and the like. The vector calculation circuit 307 is mainly used for non-convolution/non-FC layer computation in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization) and the like.
In some implementations, the vector calculation circuit 307 may store the processed output vector to the unified memory 306. For example, the vector calculation circuit 307 may apply a nonlinear function to the output of the arithmetic circuit 303. In some implementations, the vector calculation circuit 307 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 303, for example, for use in subsequent layers of the neural network.
The controller 304 is coupled to an instruction fetch memory 309 for storing instructions used by the controller 304.
Unified memory 306, input memory 301, weight memory 302, and instruction fetch memory 309 may all be On-Chip (On-Chip) memory, and external memory 340 may be independent of the NPU hardware architecture.
The operations of the layers in the convolutional neural networks shown in fig. 1 and 2 can be performed by the operation circuit 303 or the vector calculation circuit 307. For example, operations corresponding to convolutional layers may be performed by the arithmetic circuit 303, and operations to activate output data based on an activation function may be performed by the vector calculation circuit 307.
As can be seen from the above description of the neural network, the computation amount involved in data processing is very large in the process of data processing by using the full-precision neural network, for example, matrix multiplication of convolutional layers, and the like. In order to reduce the amount of calculation when a computing device performs data processing on data, a quantization operation is introduced in the industry, that is, floating point data involved in the data processing process is converted into fixed point data, and the amount of calculation when the computing device performs data processing on the data is reduced by using the characteristic that the calculation resources required by matrix multiplication based on the fixed point data are smaller than the calculation resources required by matrix multiplication based on the floating point data. Such neural networks employing quantization operations are also known as "quantization-based neural networks".
Accordingly, as described above in relation to the architecture of the neural network, output data of a previous data processing layer may be used as input data of a next data processing layer, and the quantization operation is lossy conversion, so that, in order to ensure the accuracy of input data of the next data processing layer, after the input data is quantized in the previous data processing layer, inverse quantization operation is also required to be performed on the output data in the previous data processing layer. Wherein, the inverse quantization operation may be understood as converting fixed point data into floating point data.
The quantization process in a data processing layer is briefly described below in connection with fig. 4. It should be noted that the data processing layer may be any layer in the neural network that can perform a quantization operation, for example, a convolutional layer, an input layer, a hidden layer, and the like shown in fig. 1 and fig. 2. For data processing in one data processing layer, not only the input data (e.g., a data matrix) but also the weights (e.g., a weight matrix) used in the data processing are quantized. For example, combining the quantization operation with the convolution operation requires not only a quantization operation on the data but also quantization of the weights in the convolution operation.
Fig. 4 is a schematic diagram of the linear quantization principle. Since the quantization operation is a lossy transform, in order to ensure the accuracy of the data processing based on the quantized data, a quantization parameter may be introduced in the quantization process: the optimal maximum value (max). The optimal max accurately represents the value range of the original data (i.e., the data before the quantization operation), and the original data is limited to the range [-|max|, +|max|], so that the original data can be accurately represented within a preset fixed-point data range (see [-127, +127] shown in fig. 4) and a better quantization effect is obtained. This avoids the problem of directly using the maximum value of the input data: noise introduced in the early stage of data processing may cause the maximum value of the input data to differ greatly from the max of the original data, so that the finally determined quantization coefficient is not accurate enough and the quantization effect is reduced.
After the optimal max is determined, the quantization coefficients may be calculated from the optimal max. For example, to quantize the original data into 8-bit fixed-point data, the max value is mapped to +127, and the quantization coefficient quant_data_scale of the data and the quantization coefficient quant_weight_scale of the weights that need to be used in the data processing process of the neural network can be calculated by the following formulas.
quant_data_scale=max_data_value/127 (1)
quant_weight_scale=max_weight_value/127 (2)
Wherein max_data_value represents the optimal max of the data, and max_weight_value represents the optimal max of the weights.
After the quantization coefficients of the data and the quantization coefficients of the weights are determined, the mapping relationship between the original data and the quantized data may be determined according to the quantization coefficients of the data, and the mapping relationship between the original weights and the quantized weights may also be determined according to the quantization coefficients of the weights.
quant_data=round(ori_data/quant_data_scale) (3)
quant_weight=round(ori_weight/quant_weight_scale) (4)
Wherein ori_data represents the original data, quant_data represents the quantized data, ori_weight represents the original weight, quant_weight represents the quantized weight, and round() represents the rounding operation.
In this way, all the original data and original weights can be converted from floating-point numbers into fixed-point numbers in the range of -127 to +127, and the computation on fixed-point numbers can be completed by a fixed-point multiplier, thereby achieving the effect of accelerating the operation speed and reducing power consumption.
quant_result=quant_data*quant_weight (5)
Where quant_result represents the fixed-point result.
After the fixed-point result is obtained, the fixed-point result can be mapped back to a floating-point number through the inverse quantization operation. That is,
ori_result=quant_result*quant_data_scale*quant_weight_scale (6)
Here, ori_result represents the floating-point result obtained by inverse quantization, and the product of quant_data_scale and quant_weight_scale represents the inverse quantization coefficient.
It should be noted that, in formulas (1) to (6) involved in the above quantization and inverse quantization processes, the same symbol is used for the same physical quantity; for brevity, the physical quantity represented by each symbol is explained only where it first appears.
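A minimal numerical sketch of formulas (1) to (6), assuming 8-bit quantization; the variable names follow the formulas above, while the example maximum values and matrix shapes are arbitrary assumptions for illustration.

```python
import numpy as np

def quantize_int8(ori, scale):
    # formulas (3)/(4): map floating-point values into the fixed-point range [-127, +127]
    return np.clip(np.round(ori / scale), -127, 127).astype(np.int8)

# optimal max values, e.g. obtained from an offline calibration process
max_data_value, max_weight_value = 6.35, 0.254
quant_data_scale = max_data_value / 127      # formula (1)
quant_weight_scale = max_weight_value / 127  # formula (2)

ori_data = np.random.uniform(-6.35, 6.35, (4, 8)).astype(np.float32)
ori_weight = np.random.uniform(-0.254, 0.254, (8, 3)).astype(np.float32)

quant_data = quantize_int8(ori_data, quant_data_scale)
quant_weight = quantize_int8(ori_weight, quant_weight_scale)

# formula (5): fixed-point matrix multiplication (accumulated in int32)
quant_result = quant_data.astype(np.int32) @ quant_weight.astype(np.int32)

# formula (6): inverse quantization back to floating point
ori_result = quant_result * (quant_data_scale * quant_weight_scale)
```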
As can be seen from the neural network architectures shown in fig. 1 and fig. 2, the computations mainly involved in the neural network, such as convolution calculation and pooling calculation, can themselves be understood as linear computations; for convenience of description, the convolutional layer, the pooling layer, and so on will be referred to as "linear computation layers" hereinafter, i.e., the output data of each linear computation layer is a linear function of the input data of that layer. In that case, no matter how many linear computation layers the neural network contains, the output data is still a linear combination of the input data, and the neural network has little or no ability to describe a nonlinear model.
Therefore, in order to improve the ability of the neural network to describe nonlinear models, an activation function (Activation Function) layer is added to the neural network. Currently, the ReLU, a common activation function in neural networks, can be placed between two linear computation layers. That is, the output data of the first linear computation layer is used as the input data of the ReLU layer, and the output data of the ReLU layer is used as the input data of the second computation layer. In this way, nonlinear characteristics are introduced into the neural network, so that the neural network can approximate an arbitrary nonlinear function and can therefore be applied to many nonlinear models.
The ReLU function may be represented by the formula
ReLU(x)=max(0,x)
where x represents the input data. As can be seen from the formula, when the input data is a positive number or 0, the ReLU operation does not change the value of the input data; when the input data is negative, the ReLU operation adjusts the input data to 0.
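The point made above, that stacked linear computation layers collapse into a single linear layer unless a ReLU (or another nonlinearity) is inserted between them, can be illustrated with a brief sketch; the matrices below are arbitrary examples.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0)   # the ReLU formula above

x = np.random.randn(5)
w1, w2 = np.random.randn(4, 5), np.random.randn(3, 4)

# Without a ReLU, two linear layers are equivalent to the single linear layer (w2 @ w1).
linear_stack = w2 @ (w1 @ x)
single_layer = (w2 @ w1) @ x
assert np.allclose(linear_stack, single_layer)

# Inserting the ReLU between the two layers breaks this equivalence.
nonlinear_stack = w2 @ relu(w1 @ x)
```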
Based on the quantization principle described above, and taking the linear computation layer as a convolutional layer as an example, the computation flow of the quantization operation, the matrix computation, and the inverse quantization operation in the convolutional layer is described below with reference to fig. 5. The linear computation layer may be any layer in the neural network that requires a quantization operation or an inverse quantization operation, for example, a fully-connected layer, an input layer, and the like, which is not particularly limited in this embodiment of the present application.
Fig. 5 is a schematic diagram of a three-level data processing structure of a conventional neural network. The three-level data processing architecture 500 shown in FIG. 5 includes a first linear computation layer 510, a ReLU layer 520, and a second linear computation layer 530. The data processing functions corresponding to the three-level data processing architecture shown in FIG. 5 may be performed by the computing device shown in FIG. 3 above.
The first linear computation layer 510 includes a first quantization sublayer 511, a first computation sublayer 512, and a first dequantization sublayer 513. The first quantization sublayer 511 is configured to quantize the input data according to the first data quantization coefficient to obtain first quantized data. The first calculation sublayer 512 is configured to calculate the first quantized data according to the quantized first weight to obtain a first calculation result. The first inverse quantization sublayer 513 is configured to perform inverse quantization on the first calculation result to obtain first output data, where the quantized first weight is obtained by quantizing the original weight based on a first weight quantization coefficient.
The ReLU layer 520 is configured to perform a ReLU operation on the first output data to obtain intermediate output data.
The second linear computation layer 530 includes a second quantization sublayer 531, a second computation sublayer 532, and a second inverse quantization sublayer 533. The second quantization sublayer 531 is configured to quantize the intermediate output data according to the second data quantization coefficient to obtain second quantized data. The second calculation sublayer 532 is configured to calculate the second quantized data to obtain a second calculation result. The second dequantization sublayer 533 is configured to dequantize the second calculation result according to the second dequantization coefficient to obtain second output data.
It should be noted that the first linear computation layer may be the convolution layer described above, and the computation sublayer may be used to implement convolution computation and the like, and the specific computation may refer to the description above, and is not described herein again for brevity.
In the data processing procedure described in fig. 5, the parameters required for data processing, such as the data quantization coefficients and the quantized weights (e.g., the first weight), need to be loaded from the storage device at each step. Each loading of the parameters required for data processing takes a certain amount of time, which results in a larger overall delay when the neural network processes data.
In order to reduce the time taken to load the parameters required for data processing while the neural network processes data, the embodiment of the present application provides a data processing method based on a requantization operation. Using the characteristic of the ReLU function described above (namely, that the ReLU operation does not change the value of input data that is a positive number), the inverse quantization operation, the ReLU operation, and the quantization operation in the above steps 3 to 5 are combined, thereby realizing the requantization operation.
The data processing method according to the embodiment of the present application is described below with reference to fig. 6. In the method shown in fig. 6, the data is processed by requantization, that is, the first inverse quantization sublayer, the ReLU layer, and the second quantization sublayer in the above steps 3 to 5 (i.e., 513, 520, 531 in fig. 5) are combined. It should be noted that the above-mentioned "merging" is only a functional merging, and the hardware performing the requantization operation may still be a vector calculation circuit (e.g., the vector calculation circuit 307 in fig. 3).
For the convenience of understanding the schemes shown in fig. 6 and fig. 7, the principle of the re-quantization in the embodiments of the present application will be described.
Combining the formulas in steps 3 to 5 gives the following combined formula:
quant_data_2=round(ReLU(quant_result_1*(data_scale_1*weight_scale_1))/data_scale_2)
The first data quantization coefficient data_scale_1, the first weight quantization coefficient weight_scale_1, and the second data quantization coefficient data_scale_2 are all positive numbers, and, according to the characteristic of the ReLU function described above, the ReLU does not change the value of input data that is a positive number. Therefore, the above formula can be transformed to obtain:
quant_data_2=round(ReLU(quant_result_1*(data_scale_1*weight_scale_1/data_scale_2)))
The product of the first data quantization coefficient and the first weight quantization coefficient, divided by the second data quantization coefficient, is defined as the requantization coefficient, denoted requant_scale.
That is, requant_scale=data_scale_1*weight_scale_1/data_scale_2.
Then, the above transformed formula can be expressed as:
quant_data_2=round(ReLU(quant_result_1*requant_scale)) (7)
As described above, the first data quantization coefficient data_scale_1, the first weight quantization coefficient weight_scale_1, and the second data quantization coefficient data_scale_2 used to determine the requantization coefficient may be determined through an offline calibration process; therefore, in order to reduce the time taken to load the parameters required for data processing, the requantization coefficient is also determined through the offline calibration process. In this way, the requantization coefficient is loaded only once during data processing based on the requantization, whereas in the conventional data processing based on fig. 5, the data processing parameters need to be loaded three times in steps 3 to 5.
Based on the characteristic of the ReLU function introduced above, the transformed equation (7) is also equivalent to equation (8):
quant_data_2=round(ReLU(quant_result_1)*requant_scale) (8)
That is, after the first calculation result quant_result_1 is obtained, the ReLU (linear rectification) operation may be performed first and the re-quantization operation then performed based on the re-quantization coefficient. In other words, the ReLU operation may be performed either before or after the re-quantization operation; the embodiment of the present application does not limit this.
Based on the above description, the embodiments of the present application provide a new three-level data processing structure. FIG. 6 is a schematic block diagram of a three-level data processing architecture of an embodiment of the present application. FIG. 6 illustrates a three-level data processing architecture 600 that includes a first linear computation layer 610, a re-quantization layer 620, and a second linear computation layer 630. The data processing functions corresponding to the three-level data processing architecture 600 shown in fig. 6 may still be performed by the computing device shown in fig. 3 above.
It should be noted that, for ease of comparison with fig. 5, the layers in the three-level data processing architecture 600 of fig. 6 that have the same functions as in the three-level data processing architecture 500 of fig. 5 keep the same reference numerals. In addition, the data processing procedures described for fig. 5 and fig. 6 use the same terms.
The first linear computation layer 610 includes a first quantization sublayer 511 and a first calculation sublayer 512, and is configured to output a first calculation result.
The re-quantization layer 620 performs a re-quantization operation and a ReLU operation on the first calculation result output by the first linear computation layer 610, and outputs second quantized data.
The second linear computation layer 630 includes a second computation sublayer 532 and a second inverse quantization sublayer 533, which are used for processing the second quantized data.
Based on the three-level data processing structure described in fig. 6, a data processing method based on a neural network according to an embodiment of the present application is described. Fig. 7 is a flowchart of a data processing method based on a neural network according to an embodiment of the present application. The method shown in fig. 7 includes steps 710 to 730.
710, obtaining the first calculation result in the same manner as in the three-level data processing structure.
That is, the first calculation result may be obtained by the first quantization sublayer 511 and the first calculation sublayer 512.
720, performing a re-quantization operation and a ReLU operation on the first calculation result to obtain the second quantized data.
Based on formulas (7) and (8) above, there are two implementations of the re-quantization operation and the ReLU operation. Note that, for convenience of description, the embodiments of the present application distinguish between "re-quantization" and the "re-quantization operation": "re-quantization" covers both the "re-quantization operation" and the ReLU operation described above.
In a first implementation, a re-quantization operation is performed on the first calculation result to obtain a re-quantized first calculation result, where the re-quantization operation includes: multiplying the first calculation result by the re-quantization coefficient to obtain the re-quantized first calculation result; the ReLU operation is then performed on the re-quantized first calculation result to obtain the second quantized data.
The first implementation corresponds to the calculation order expressed by equation (7).
In a second implementation, the ReLU operation is performed on the first calculation result to obtain a first calculation result after the ReLU operation, and a re-quantization operation is then performed on it: multiplying the first calculation result after the ReLU operation by the re-quantization coefficient to obtain the second quantized data.
The second implementation corresponds to the calculation order expressed by equation (8).
In the second implementation, because the ReLU operation can be implemented by a comparator, adjusting the order of the ReLU operation and the re-quantization operation allows the first calculation result read from the memory to flow directly through the comparator on its way to the data processing circuit (e.g., the vector calculation circuit 307). The ReLU operation is thus completed in-line on the data path, and the first calculation result after the ReLU operation is then fed directly into the data processing circuit.
Alternatively, the above-described ReLU operation may be implemented using a comparator, and the above-described re-quantization operation may be performed using the vector calculation circuit 307.
730, processing the second quantized data in the same manner as in the three-level data processing structure.
That is, the second quantized data may be processed by the second calculation sublayer 532 and the second inverse quantization sublayer 533.
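As a sketch of steps 710 to 730 end to end, the following assumes, as in the formulas above, that quantization divides by the data quantization coefficient and inverse quantization multiplies by it; the shapes, scale values, and random data are illustrative assumptions only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# Illustrative coefficients, assumed to be produced by offline calibration.
data_scale_1, weight_scale_1 = 0.05, 0.02
data_scale_2, weight_scale_2 = 0.1, 0.03
requant_scale = data_scale_1 * weight_scale_1 / data_scale_2   # computed offline

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8)).astype(np.float32)     # input data
w1 = rng.normal(size=(8, 16)).astype(np.float32)   # first weight (full precision)
w2 = rng.normal(size=(16, 4)).astype(np.float32)   # second weight (full precision)

# 710: first quantization sublayer 511 and first calculation sublayer 512.
quant_data_1 = np.round(x / data_scale_1)
quant_w_1 = np.round(w1 / weight_scale_1)
quant_result_1 = quant_data_1 @ quant_w_1          # first calculation result

# 720: re-quantization operation and ReLU operation (order of equation (8)).
quant_data_2 = np.round(relu(quant_result_1) * requant_scale)

# 730: second calculation sublayer 532 and second inverse quantization sublayer 533.
quant_w_2 = np.round(w2 / weight_scale_2)
quant_result_2 = quant_data_2 @ quant_w_2
second_output = quant_result_2 * (data_scale_2 * weight_scale_2)
print(second_output)
```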
Generally, to save online data processing time, the quantization-based neural network may be determined offline, i.e., the data quantization coefficients, quantized weights, data inverse quantization coefficients, weight inverse quantization coefficients, and the like of each layer are determined in advance.
For ease of understanding, the process of performing the quantization operation on the data online and performing the quantization operation on the weights offline will be briefly described with reference to fig. 8. Fig. 8 is a schematic flow diagram of a quantization operation based on a neural network. The neural network based quantization process may be divided into an offline process 810 and an online process 820.
The offline process 810, also called the "offline calibration process", collects statistics on the calibration data and weights corresponding to each layer of the neural network, determines a quantization bit width, a weight quantization coefficient, and a data quantization coefficient from the calibration data, and quantizes the original weight of each layer according to the determined quantization bit width and weight quantization coefficient to obtain the quantized weight.
It should be noted that the above offline quantization process may be performed by a CPU on the SoC (e.g., the SoC where the NPU shown in fig. 3 is located), or may be completed by the computing cloud.
In the online process 820, the input data is quantized by using the quantization bit width and the data quantization coefficient determined in the offline process 810, and then the quantized input data and the quantized weight are input to a neural network dedicated engine (for example, the operation circuit 303 in fig. 3) for calculation, and finally the calculation result is dequantized.
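The following simplified sketch illustrates the division of labour between the offline process 810 and the online process 820. It assumes symmetric quantization without an offset, an 8-bit width, and the divide-by-scale convention used earlier; all data and shapes are illustrative.

```python
import numpy as np

def calibrate_scale(calib, bit_width=8):
    # Offline: derive a quantization coefficient from the maximum absolute value.
    return np.max(np.abs(calib)) / (2 ** (bit_width - 1) - 1)

def quantize(x, scale, bit_width=8):
    lo, hi = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    return np.clip(np.round(x / scale), lo, hi).astype(np.int32)

rng = np.random.default_rng(0)

# --- offline process 810: statistics, coefficients, and quantized weights ---
calib_data = rng.normal(size=(100, 8)).astype(np.float32)
weights = rng.normal(size=(8, 4)).astype(np.float32)
data_scale = calibrate_scale(calib_data)
weight_scale = calibrate_scale(weights)
quant_weights = quantize(weights, weight_scale)        # stored and loaded online

# --- online process 820: quantize input, compute, inverse-quantize ---
x = rng.normal(size=(1, 8)).astype(np.float32)
quant_x = quantize(x, data_scale)
quant_result = quant_x @ quant_weights                 # neural network dedicated engine
output = quant_result * (data_scale * weight_scale)    # inverse quantization
```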
In a conventional neural network computing architecture, every linear computation layer uses the same quantization bit width for its quantization operation. For example, the quantization bit width in the first linear computation layer and the quantization bit width in the second linear computation layer shown in fig. 5 are both 8 bits. With this way of configuring the quantization bit width, the amount of computation that the quantization operation can save is very limited if the neural network is to maintain a given data processing precision.
As introduced in the description of the quantization principle, quantization is inherently a lossy transformation: it reduces the amount of computation, but it also reduces the data processing precision of the neural network. With the current configuration mode, in which the whole neural network uses a single quantization bit width, a high requirement on computing speed forces the quantization bit width of every linear computation layer to be reduced, so the network very likely cannot maintain basic data processing precision; conversely, a high requirement on precision forces a large quantization bit width in every linear computation layer, so the computation saved by quantization is very limited. Therefore, with the current quantization bit width configuration, the data processing method based on neural network computation cannot meet user requirements on data processing precision or data processing speed, and the user experience is poor.
To enable the data processing method based on neural network computation to meet user requirements on data processing precision, data processing speed, and the like, and to improve user experience, the application provides a new data processing method based on neural network computation: the quantization bit width of each linear computation layer in the neural network is configured flexibly based on parameters required by the user.
The selection process for the parameters of each layer of the quantization-based neural network according to the embodiment of the present application is described in detail below with reference to fig. 9. Fig. 9 is a flowchart of the parameter selection process for each layer of the quantization-based neural network according to an embodiment of the present application. The neural network includes a plurality of linear computation layers (e.g., convolutional layers, input layers, hidden layers, etc., as shown in fig. 1 or fig. 2), and the method is used for data processing of at least one linear computation layer. It should be understood that the method described in fig. 9 may be performed by a computing cloud or a server. The method shown in fig. 9 includes steps 910 to 970.
910, acquiring calibration input data and weights corresponding to each layer of a full-precision neural network model, where the calibration input data of the first layer of the full-precision neural network model is data in a calibration data set prepared in advance, and the calibration input data of each remaining layer is the output data of the previous layer.
The full-precision neural network model can be understood as a neural network that has not undergone quantization or inverse quantization operations, that is, the input data and weights of each layer of the full-precision neural network may be floating-point data, for example FP32.
920, acquiring the optimal maximum value (max) of the calibration input data and the optimal maximum value of the weights corresponding to each layer.
930, determining a plurality of candidate data quantization coefficients, a plurality of candidate data bit widths, a plurality of candidate data inverse quantization coefficients, and a plurality of candidate weight quantization coefficients for each layer according to the optimal maximum value of the calibration input data, the optimal maximum value of the weights, and the plurality of data formats selectable for each layer.
Optionally, the data format includes: integer INT4, integer INT8, integer INT16, floating point FP16, or floating point FP32.
940, obtaining a plurality of quantized candidate weights for each layer according to the plurality of candidate weight quantization coefficients of each layer and the weight of each layer in the full-precision neural network model.
Obtaining the plurality of quantized candidate weights of each layer from the plurality of candidate weight quantization coefficients and the weight of each layer of the full-precision neural network model follows the same process as quantizing a single weight described in the introduction of the quantization principle above; for brevity, it is not detailed here.
950, determining a plurality of quantization-based neural network models according to the plurality of candidate data quantization coefficients, the plurality of candidate data inverse quantization coefficients, the plurality of candidate weight quantization coefficients and the plurality of quantized candidate weights for each layer.
At least one layer of the quantized neural network model includes a quantization operation, a data processing operation, and an inverse quantization operation.
960, inputting the data in the calibration data set to the plurality of quantization-based neural network models, and collecting statistics on a plurality of operation results.
The operation result may include: performance data of the neural network, power consumption data of the neural network, or precision data of the neural network, and the like.
970, selecting a quantization-based neural network model of which the operation result satisfies a preset condition from the plurality of quantization-based neural network models according to the operation results.
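A high-level sketch of the selection loop in steps 910 to 970 is given below. The helper functions and the precision proxy are hypothetical stand-ins for the statistics described above; only the bit width of the candidate data format is varied, and a single layer is used for brevity.

```python
import numpy as np

def build_candidate_model(bit_width, calib_inputs, weights):
    # 930/940/950: derive candidate coefficients and quantized weights for one data format.
    qmax = 2 ** (bit_width - 1) - 1
    data_scale = np.max(np.abs(calib_inputs)) / qmax
    weight_scale = np.max(np.abs(weights)) / qmax
    return {"bit_width": bit_width,
            "data_scale": data_scale,
            "weight_scale": weight_scale,
            "quant_weights": np.round(weights / weight_scale)}

def run_and_measure(model, calib_inputs, weights):
    # 960: run the calibration data through the candidate model and collect results
    # (here a precision proxy: negative error versus the full-precision output).
    quant_x = np.round(calib_inputs / model["data_scale"])
    q_out = (quant_x @ model["quant_weights"]) * model["data_scale"] * model["weight_scale"]
    fp_out = calib_inputs @ weights
    return -np.mean((q_out - fp_out) ** 2)

rng = np.random.default_rng(0)
calib_inputs = rng.normal(size=(64, 8)).astype(np.float32)
weights = rng.normal(size=(8, 4)).astype(np.float32)

candidates = [build_candidate_model(b, calib_inputs, weights) for b in (4, 8, 16)]
results = [run_and_measure(m, calib_inputs, weights) for m in candidates]

# 970: select the candidate whose result satisfies the preset condition
# (here "precision priority": the best precision proxy).
best = candidates[int(np.argmax(results))]
print(best["bit_width"])
```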
Compared with the traditional configuration mode, in which all linear computation layers of the whole neural network must use the same quantization coefficient, the quantization configuration provided by the embodiment of the application improves the flexibility of quantization configuration for the neural network.
Optionally, the quantization-based neural network model satisfying the preset condition may then be used to process data in the online process.
It should be noted that selecting the data quantization coefficient, inverse quantization coefficient, quantized weight, and data bit width of each layer that meet the preset condition may be implemented in various ways; this is not specifically limited in the embodiment of the present application. For example, several candidate values of each of these parameters may meet the preset condition, that is, multiple candidate data quantization coefficients, multiple candidate inverse quantization coefficients, multiple candidate quantized weights, and multiple candidate data bit widths of a layer (for convenience of description, these four parameters are called candidate parameters) may all satisfy the preset condition. In that case, one of the qualifying candidate parameters may be selected at random, or the optimal one of the qualifying candidate parameters may be selected.
When selecting the optimal value among the candidate parameters that meet the preset condition: for example, if the preset condition is speed priority, the candidate parameters with the smallest data bit width and weight bit width are selected; if the preset condition is precision priority, the candidate parameters with the largest data bit width and weight bit width are selected.
The plurality of candidate data quantization coefficients, candidate dequantization coefficients, and candidate quantized weights of each layer may be obtained through a variety of quantization configurations. Each quantization configuration includes a plurality of configuration options, and any two different quantization configurations differ in the parameter chosen for at least one configuration option.
It should be understood that two quantization configurations may differ in the parameter selected for a single configuration option or in the parameters selected for multiple configuration options.
Optionally, the configuration options include quantization bit width, quantization mode, similarity calculation mode, quantization coefficient configuration mode, and the like.
The quantization parameters comprise the quantization bit width and the quantization coefficient, where the quantization bit width includes the weight bit width and the data bit width, and the quantization coefficient includes the weight quantization coefficient and the data quantization coefficient.
Optionally, the selectable parameters for the quantization bit width include 4 bits, 8 bits, 16 bits, and the like.
Optionally, the selectable parameters for the quantization mode include an asymmetric quantization mode with an offset and a symmetric quantization mode without an offset.
Optionally, the similarity calculation mode is used to determine the similarity between output data calculated by the quantized neural network and output data calculated by the unquantized (full-precision) neural network. Specifically, it may include a calculation mode based on KL (Kullback-Leibler) divergence, a calculation mode based on symmetric KL divergence, a calculation mode based on JS (Jensen-Shannon) divergence, and the like.
That is, the above similarity calculation mode can be used to select the optimal max: the result of the neural network processing the raw (full-precision) data is compared, using the similarity calculation mode, with the results of the neural network processing data quantized with different candidate maximum values; the quantized data whose processing result is closest to the result for the raw data is identified, and the maximum value used to obtain that quantized data is taken as the optimal max.
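As an illustration of this similarity-based selection, the sketch below quantizes calibration data with several candidate maximum values, compares the histogram of the quantized-then-dequantized data with that of the raw data using KL divergence, and keeps the candidate with the smallest divergence. The histogramming, bit width, and candidate sweep are simplifying assumptions; a JS-based mode is included as an alternative similarity calculation mode.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    # Alternative similarity calculation mode (Jensen-Shannon divergence).
    m = 0.5 * (p / p.sum() + q / q.sum())
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def fake_quantize(data, max_val, bit_width=8):
    # Quantize with the candidate max, then de-quantize for comparison with the raw data.
    scale = max_val / (2 ** (bit_width - 1) - 1)
    q = np.clip(np.round(data / scale), -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1)
    return q * scale

rng = np.random.default_rng(0)
calib = rng.normal(scale=2.0, size=10000).astype(np.float32)
bins = np.linspace(calib.min(), calib.max(), 129)
p_ref, _ = np.histogram(calib, bins=bins)

best_max, best_div = None, np.inf
for cand_max in np.abs(calib).max() * np.linspace(0.7, 1.3, 13):
    q_hist, _ = np.histogram(fake_quantize(calib, cand_max), bins=bins)
    div = kl_divergence(p_ref.astype(np.float64), q_hist.astype(np.float64))
    if div < best_div:
        best_max, best_div = cand_max, div
print(best_max)
```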
For example, table 1 lists the selectable parameters of the various configuration options according to the embodiment of the present application, and the various quantization configurations can be generated from the selectable parameters corresponding to the configuration options shown in table 1.
TABLE 1
[Table 1, provided as an image in the original publication, lists the selectable parameters for each configuration option, such as the quantization bit width, the quantization mode, the similarity calculation mode, and the quantization coefficient configuration mode.]
Optionally, the above quantization coefficient configuration mode indicates the configuration unit to which a quantization coefficient applies in a linear computation layer. The configuration unit may be the linear computation layer itself, each convolution kernel in the linear computation layer, an input channel corresponding to the linear computation layer, an output channel corresponding to the linear computation layer, a data tensor of the input data corresponding to the linear computation layer, and the like. For example, when quantization coefficients are configured in units of linear computation layers, all quantization coefficients within one linear computation layer are the same. For another example, when quantization coefficients are configured in units of convolution kernels, one linear computation layer corresponds to a plurality of quantization coefficients, with each convolution kernel corresponding to one quantization coefficient.
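The difference between the per-layer and per-kernel configuration units can be sketched as follows, assuming convolution weights of shape (output channels / convolution kernels, input channels, kH, kW); the shapes and the max-based scale derivation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 8, 3, 3)).astype(np.float32)

# Per-layer configuration: one weight quantization coefficient for the whole layer.
scale_layer = np.max(np.abs(weights)) / 127.0
quant_layer = np.round(weights / scale_layer)

# Per-kernel configuration: one weight quantization coefficient per convolution kernel.
scale_kernel = np.max(np.abs(weights), axis=(1, 2, 3)) / 127.0          # shape (16,)
quant_kernel = np.round(weights / scale_kernel[:, None, None, None])

print(scale_layer, scale_kernel.shape)
```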
It should be noted that the parameters corresponding to the configuration options may include all parameters that the configuration options can take in the prior art; for example, the similarity calculation mode may include all existing similarity calculation modes. The parameters corresponding to the configuration options are also compatible with parameters that the configuration options may take in the future.
As described above, the data quantization coefficients and the weight quantization coefficients are determined according to the data optimal max and the weight optimal max, and if the data quantization coefficients are determined directly using the maximum value of the input data and the weight quantization coefficients are determined using the maximum value of the original weights, the accuracy of the quantization operation may be reduced. Therefore, it is necessary to find the data-optimal max and the weight-optimal max based on the calibration data, determine the data quantization coefficient based on the data-optimal max, and determine the weight quantization coefficient based on the weight-optimal max, in order to improve the accuracy of the quantization operation.
The data-optimal max and the weight-optimal max may be determined in an existing manner or in the manner described below. It should be noted that the data-optimal max and the weight-optimal max are determined according to the same principle; for brevity, the following takes the determination of the data-optimal max as an example, and the weight-optimal max may be determined by analogy.
That is, each group of quantization coefficients in the plurality of groups of quantization coefficients includes a quantization coefficient, and the method further includes: acquiring first calibration data, and determining a plurality of candidate optimal maximum values of the first calibration data for each linear computation layer. Determining, based on the plurality of quantization configurations, the group of quantization coefficients corresponding to each of the plurality of linear computation layers then includes: determining a plurality of quantization coefficients for each linear computation layer according to the plurality of quantization configurations and the plurality of candidate optimal maximum values corresponding to that layer, where each candidate optimal maximum value corresponds to one of the plurality of quantization coefficients.
There are many ways of determining the above plurality of candidate optimal max values. For example, the candidate optimal maximum values may be searched for near the actual maximum value |max|_real of the first calibration data, e.g., a plurality of candidate optimal max values may be selected within the value range [0.7|max|_real, 1.3|max|_real]. Alternatively, the maximum value |max|_real of the first calibration data may be used as the initial maximum value and a plurality of candidate optimal max values generated by varying it with a preset step size. The plurality of candidate optimal maximum values may also be selected based on the similarity calculation modes configured in the quantization configuration.
That is, determining the plurality of candidate optimal maximum values of the first calibration data for each linear computation layer includes: determining, according to the first calibration data, a plurality of maximum values corresponding to the first calibration data at each linear computation layer; and selecting the plurality of candidate optimal maximum values from the plurality of maximum values corresponding to each linear computation layer according to a plurality of preset similarity calculation modes.
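A small sketch of generating the candidate optimal maximum values around the actual maximum |max|_real of the first calibration data follows; the range factors 0.7 and 1.3 follow the text, while the step value is an assumption. Each candidate would then be scored with one of the preset similarity calculation modes (e.g., the KL-based selection sketched earlier).

```python
import numpy as np

def candidate_optimal_max(calib_data, low=0.7, high=1.3, step=0.05):
    max_real = np.max(np.abs(calib_data))          # |max|_real of the first calibration data
    return max_real * np.arange(low, high + step / 2, step)

rng = np.random.default_rng(0)
calib_data = rng.normal(size=1000).astype(np.float32)
candidates = candidate_optimal_max(calib_data)
print(len(candidates), candidates[:3])
```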
Optionally, in this embodiment of the application, an optimal max may be selected for each linear computation layer by jointly considering the three similarity calculation modes above; the quantization coefficients corresponding to the different combinations of options in the quantization configuration are then calculated, and a target quantization coefficient for each linear computation layer is selected, as described above, based on the preset data processing precision.
Optionally, the preset condition may be a parameter required by the user, i.e., a performance parameter of the neural network that reflects the user's requirement, for example the data processing precision of the neural network, the data processing speed of the neural network, a combination of the two, or the power consumption of the neural network; this is not specifically limited in the embodiment of the present application.
The user requirement may be provided by the user when the quantization coefficients are configured for the neural network in the offline process, or may be provided before the offline process is performed so as to serve as a reference factor for configuring the quantization coefficients for the neural network in the offline process.
In the embodiment of the application, configuring the data quantization coefficient, the quantized weight, and the inverse quantization coefficient based on the parameters required by the user helps make the quantization process, the data processing process, and the inverse quantization process of the current linear computation layer meet the user requirement, and improves user experience. This solves the problem that the traditional quantization configuration mode cannot configure the data quantization coefficient, the quantized weight, and the inverse quantization coefficient in light of user requirements.
Further, if most or all of the linear computation layers in the neural network are configured according to the method of the embodiment of the application, the quantization process, the inverse quantization process, and the data processing process of the data based on neural network computation can meet the user requirements, and the user experience can be improved. This solves the problem that, in the traditional quantization configuration mode, the whole neural network uses one fixed set of data quantization coefficients and inverse quantization coefficients and cannot be configured according to user requirements.
Optionally, the quantization bit widths corresponding to at least two linear computation layers in the quantization-based neural network may be different.
Optionally, the data quantization coefficients, quantized weights, and inverse quantization coefficients of each layer of the quantization-based neural network may be stored in one parameter file or in several parameter files. For example, the data quantization coefficients and the inverse quantization coefficients may be stored in one parameter file kept in the unified memory 306 shown in fig. 3, while the quantized weights are stored in another parameter file kept in the weight memory 302. This is not specifically limited in the embodiment of the present application.
Optionally, any linear computation layer (e.g., the first linear computation layer or the second linear computation layer) of the quantization-based neural network may be associated with one or more data bit widths for its data quantization coefficient and data inverse quantization coefficient. If there is a single data bit width, the same data bit width is used in the quantization process, the data processing process, and the inverse quantization process of the current linear computation layer. If there are multiple data bit widths, the data bit widths used in the quantization processes, data processing processes, and inverse quantization processes of the current linear computation layer may differ from one another.
For example, when the current linear computation layer has multiple data tensors, each data tensor may correspond to one data bit width. For another example, when the current linear computation layer corresponds to multiple input channels, each input channel may correspond to one data bit width. For another example, when the current linear computation layer corresponds to multiple output channels, each output channel may correspond to one data bit width.
The method shown in fig. 9 may be used in combination with the re-quantization method described in fig. 7; that is, after the first data quantization coefficient and the first weight quantization coefficient of the first linear computation layer and the second data quantization coefficient of the second linear computation layer of the quantization-based neural network are determined by the method of fig. 9, the re-quantization coefficient may be calculated offline. Of course, the method shown in fig. 9 can also be used in combination with the conventional three-level data processing structure shown in fig. 5; this is not limited by the embodiment of the present application.
The method of the embodiment of the present application is described above with reference to fig. 1 to 9, and the apparatus of the embodiment of the present application is described below with reference to fig. 10 and 11. It should be noted that the apparatuses shown in fig. 10 and fig. 11 can implement the steps in the above method, and are not described herein again for brevity.
Fig. 10 is a schematic diagram of a quantization-based neural network computing device according to an embodiment of the present application. The quantization-based neural network includes a three-level data processing structure comprising a first linear computation layer, a re-quantization layer, and a second linear computation layer; for details, refer to the related description of fig. 6, which is not repeated here for brevity. The computing device 1000 illustrated in fig. 10 includes a first quantization circuit 1010, a first calculation circuit 1020, a re-quantization circuit 1030, and a ReLU circuit 1040.
The first calculation circuit 1020 is configured to obtain the first calculation result using the same method as in the three-level data processing structure.
The re-quantization circuit 1030 is configured to perform a re-quantization operation on the first calculation result to obtain a re-quantized first calculation result, where the re-quantization operation includes: multiplying the first calculation result by the re-quantization coefficient to obtain the re-quantized first calculation result; the ReLU circuit 1040 is configured to perform the ReLU operation on the re-quantized first calculation result to obtain the second quantized data. Alternatively, the ReLU circuit 1040 is configured to perform the ReLU operation on the first calculation result to obtain a first calculation result after the ReLU operation, and the re-quantization circuit 1030 is configured to perform a re-quantization operation on the first calculation result after the ReLU operation, i.e., to multiply the first calculation result after the ReLU operation by the re-quantization coefficient to obtain the second quantized data.
The first quantization circuit 1010 is configured to process the second quantized data in the same manner as in the three-level data processing structure. The re-quantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient and divided by the second data quantization coefficient.
Optionally, as an embodiment, the ReLU circuit 1040 includes a comparator, the comparator is disposed in a data path between a memory of the computing device and an input of the requantization circuit 1030, and the comparator is configured to perform the ReLU operation on a first calculation result obtained from the memory to obtain a first calculation result after the ReLU operation; the re-quantization circuit 1030 is configured to obtain the first calculation result after the ReLU operation from the comparator.
Optionally, the first calculation circuit 1020 may include the operation circuit 303, and may further include the accumulator 308.
Alternatively, the above-described re-quantization circuit 1030 may belong to the vector calculation circuit 307.
Optionally, the above ReLU circuit 1040 may also belong to the vector calculation circuit 307. If the ReLU circuit is implemented by a comparator, the comparator may be located between the accumulator 308 and the vector calculation circuit 307. In this case, the above memory can be understood as a storage unit in the accumulator 308.
Optionally, as an embodiment, an output of the comparator is used as an input of the re-quantization circuit.
Optionally, as an embodiment, the computing device includes a vector calculation circuit including the re-quantization circuit and the ReLU circuit.
Optionally, as an embodiment, a quantization bit width in the first linear computation layer is different from a quantization bit width in the second linear computation layer.
Optionally, as an embodiment, the first weight quantization coefficient, the first data quantization coefficient, and the second data quantization coefficient are determined based on a preset condition, where the preset condition includes: optimal performance, optimal power consumption, or optimal precision.
Fig. 11 is a schematic diagram of a quantized neural network-based computing system 1100 according to an embodiment of the present application. The computing system 1100 includes the computing device 1000 shown in fig. 10, and further includes a training device 1110 for a neural network model. The training device 1110 includes at least one processor 1111 and at least one memory 1112.
The at least one processor 1111 is configured to:
acquiring calibration input data and weights corresponding to each layer in a full-precision neural network model from the at least one memory 1112, wherein the calibration input data of a first layer in the full-precision neural network model is data in a calibration data set prepared in advance, and the calibration input data of the rest layers is output data of a previous layer;
acquiring the optimal maximum value of the calibration input data and the optimal maximum value of the weight corresponding to each layer;
determining a plurality of candidate data quantization coefficients, a plurality of candidate data bit widths, a plurality of candidate data inverse quantization coefficients, and a plurality of candidate weight quantization coefficients for each layer according to the optimal maximum value of the calibration input data, the optimal maximum value of the weights, and a plurality of selectable data formats for each layer;
obtaining a plurality of quantized candidate weights of each layer according to the candidate weight quantization coefficients of each layer and the weight of each layer in the full-precision neural network model;
determining a plurality of quantization-based neural network models according to the plurality of candidate data quantization coefficients, the plurality of candidate data inverse quantization coefficients, the plurality of candidate weight quantization coefficients and the plurality of quantized candidate weights of each layer;
inputting data in the calibration data set into the plurality of quantization-based neural network models, and counting a plurality of operation results;
selecting the quantization-based neural network model with an operation result meeting a preset condition from the plurality of quantization-based neural network models according to the operation results.
Optionally, the training device of the neural network may be a server or a computing cloud.
Optionally, as an embodiment, the data format includes: integer INT4, integer INT8, integer INT16, floating point FP16 or floating point FP 32.
Optionally, as an embodiment, the preset condition includes: optimal performance, optimal power consumption or optimal precision.
Optionally, as an embodiment, the operation result includes: performance data, power consumption data, or accuracy data.
It will be appreciated that in embodiments of the present application, the memory may comprise both read-only memory and random access memory, and may provide instructions and data to the processor. A portion of the processor may also include non-volatile random access memory. For example, the processor may also store information of the device type.
It should be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A data processing method based on a quantization-based neural network, wherein the quantization-based neural network comprises a three-level data processing structure, and the three-level data processing structure comprises: a first linear computation layer, a rectified linear unit (ReLU) layer, and a second linear computation layer; wherein,
the first linear computation layer comprises a first quantization sublayer, a first calculation sublayer and a first inverse quantization sublayer; the first quantization sublayer is used for quantizing the input data according to the first data quantization coefficient to obtain first quantized data; the first calculation sublayer is used for calculating the first quantized data according to the quantized first weight to obtain a first calculation result, and the first inverse quantization sublayer is used for inverse quantizing the first calculation result to obtain first output data; the quantized first weight is obtained by quantizing the first weight according to a first weight quantization coefficient;
the ReLU layer is used for performing ReLU operation on the first output data to obtain intermediate output data;
the second linear computation layer comprises a second quantization sublayer, a second computation sublayer and a second inverse quantization sublayer; the second quantization sublayer is used for quantizing the intermediate output data according to a second data quantization coefficient to obtain second quantized data; the second calculation sublayer is used for calculating the second quantized data to obtain a second calculation result, and the second inverse quantization sublayer is used for performing inverse quantization on the second calculation result according to a second inverse quantization coefficient to obtain second output data;
the data processing method comprises the following steps:
obtaining the first calculation result by adopting the same method as that in the three-level data processing structure;
carrying out re-quantization on the first calculation result to obtain a re-quantized first calculation result, wherein the re-quantization comprises: multiplying the first calculation result by a re-quantization coefficient to obtain the re-quantized first calculation result; and performing the ReLU operation on the re-quantized first calculation result to obtain the second quantized data; or, performing the ReLU operation on the first calculation result to obtain a first calculation result after the ReLU operation, and carrying out re-quantization on the first calculation result after the ReLU operation, wherein the re-quantization comprises: multiplying the first calculation result after the ReLU operation by the re-quantization coefficient to obtain the second quantized data;
processing the second quantized data in the same way as in the three-level data processing structure;
wherein the re-quantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient and divided by the second data quantization coefficient.
2. The method of claim 1, wherein the first calculation result is stored in a memory, and wherein performing the ReLU operation on the first calculation result comprises:
reading out the first calculation result from the memory;
and passing the first calculation result through a comparator on a data path to complete the ReLU operation, obtaining the first calculation result after the ReLU operation.
3. The method of claim 2, wherein the requantization is processed by a requantization circuit, an output of the comparator being an input to the requantization circuit.
4. The method of claim 3, wherein the data path is a path between data from the memory to an input of the re-quantization circuit.
5. The method of any of claims 1-4, wherein prior to said obtaining the first computed result using the same method as in the three-level data processing architecture, the method further comprises:
acquiring calibration input data and weight corresponding to each layer in a full-precision neural network model, wherein the calibration input data of a first layer in the full-precision neural network model is data in a calibration data set prepared in advance, and the calibration input data of the rest layers is output data of the previous layer;
acquiring the optimal maximum value of the calibration input data and the optimal maximum value of the weight corresponding to each layer;
determining a plurality of candidate data quantization coefficients, a plurality of candidate data bit widths, a plurality of candidate data inverse quantization coefficients, and a plurality of candidate weight quantization coefficients for each layer according to the optimal maximum value of the calibration input data, the optimal maximum value of the weights, and a plurality of selectable data formats for each layer;
obtaining a plurality of quantized candidate weights of each layer according to the candidate weight quantization coefficients of each layer and the weight of each layer in the full-precision neural network model;
determining a plurality of quantization-based neural network models according to the plurality of candidate data quantization coefficients, the plurality of candidate data inverse quantization coefficients, the plurality of candidate weight quantization coefficients and the plurality of quantized candidate weights of each layer;
inputting data in the calibration data set into the plurality of quantization-based neural network models, and counting a plurality of operation results;
selecting the quantization-based neural network model with an operation result meeting a preset condition from the plurality of quantization-based neural network models according to the operation results.
6. The method of claim 5, wherein the data format comprises: integer INT4, integer INT8, integer INT16, floating point FP16 or floating point FP 32.
7. The method of claim 5 or 6, wherein the preset conditions include: optimal performance, optimal power consumption or optimal precision.
8. The method of any of claims 5-7, wherein the operation results comprise: performance data, power consumption data, or accuracy data.
9. A computing device based on a quantization-based neural network, wherein the quantization-based neural network comprises a three-level data processing structure, and the three-level data processing structure comprises: a first linear computation layer, a data processing layer, and a second linear computation layer, wherein,
the first linear computation layer comprises a first quantization sublayer, a first calculation sublayer and a first inverse quantization sublayer; the first quantization sublayer is used for quantizing the input data according to the first data quantization coefficient to obtain first quantized data; the first calculation sublayer is used for calculating the first quantized data according to the quantized first weight to obtain a first calculation result, and the first inverse quantization sublayer is used for inverse quantizing the first calculation result to obtain first output data; the quantized first weight is obtained by quantizing the first weight according to a first weight quantization coefficient;
the ReLU layer is used for performing ReLU operation on the first output data to obtain intermediate output data;
the second linear computation layer comprises a second quantization sublayer, a second computation sublayer and a second inverse quantization sublayer; the second quantization sublayer is used for quantizing the intermediate output data according to a second data quantization coefficient to obtain second quantized data; the second calculation sublayer is used for calculating the second quantized data to obtain a second calculation result, and the second inverse quantization sublayer is used for performing inverse quantization on the second calculation result according to a second inverse quantization coefficient to obtain second output data;
the computing device to implement functionality of a three-level data processing architecture, the computing device comprising: a first quantization circuit, a first calculation circuit, a re-quantization circuit, a ReLU circuit;
the first computing circuit is used for obtaining the first computing result by adopting the same method as that in the three-level data processing structure;
the re-quantization circuit is configured to re-quantize the first calculation result to obtain a re-quantized first calculation result, wherein the re-quantization comprises: multiplying the first calculation result by a re-quantization coefficient to obtain the re-quantized first calculation result; the ReLU circuit is used for performing the ReLU operation on the re-quantized first calculation result to obtain the second quantized data; or, the ReLU circuit is configured to perform the ReLU operation on the first calculation result to obtain a first calculation result after the ReLU operation, and the re-quantization circuit is configured to re-quantize the first calculation result after the ReLU operation, wherein the re-quantization comprises: multiplying the first calculation result after the ReLU operation by the re-quantization coefficient to obtain the second quantized data;
the first quantization circuit is configured to process the second quantized data in the same manner as in the three-level data processing structure;
wherein the re-quantization coefficient is equal to the first data quantization coefficient multiplied by the first weight quantization coefficient and divided by the second data quantization coefficient.
10. The computing device of claim 9, wherein the ReLU circuit includes a comparator disposed in a data path between the re-quantization circuit and a memory of the computing device,
the comparator is used for carrying out the ReLU operation on the first calculation result acquired from the memory to obtain a first calculation result after the ReLU operation;
the re-quantization circuit is used for obtaining the first calculation result after the ReLU operation from the comparator.
11. The computing device of claim 10, wherein an output of the comparator is an input to the re-quantization circuit.
12. The computing device of claim 9, wherein the computing device comprises a vector calculation circuit comprising the re-quantization circuit and the ReLU circuit.
13. The computing device of any of claims 9-12, wherein the data format of the first linear computing layer or the data format of the second linear computing layer is any of a plurality of data formats, including integer INT4, integer INT8, integer INT16, floating point FP16, or floating point FP 32.
14. The computing device of any one of claims 9-13, wherein the first weight quantization coefficient, the first data quantization coefficient, and the second data quantization coefficient are determined based on preset conditions, the preset conditions including: optimal performance, optimal power consumption, or optimal precision.
15. A computing system comprising a controller and a computing device,
the controller controls the computing device to perform the data processing method of any one of claims 1-8 by transmitting a plurality of instructions to the computing device.
16. A computing system comprising a controller and a computing device as claimed in any one of claims 9 to 14.
CN201910517485.4A 2019-06-14 2019-06-14 Data processing method and device based on neural network calculation Active CN112085175B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910517485.4A CN112085175B (en) 2019-06-14 2019-06-14 Data processing method and device based on neural network calculation
PCT/CN2020/095823 WO2020249085A1 (en) 2019-06-14 2020-06-12 Data processing method and device based on neural network computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910517485.4A CN112085175B (en) 2019-06-14 2019-06-14 Data processing method and device based on neural network calculation

Publications (2)

Publication Number Publication Date
CN112085175A true CN112085175A (en) 2020-12-15
CN112085175B CN112085175B (en) 2024-05-03

Family

ID=73734189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910517485.4A Active CN112085175B (en) 2019-06-14 2019-06-14 Data processing method and device based on neural network calculation

Country Status (2)

Country Link
CN (1) CN112085175B (en)
WO (1) WO2020249085A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10977854B2 (en) 2018-02-27 2021-04-13 Stmicroelectronics International N.V. Data volume sculptor for deep learning acceleration
CN113570033B (en) * 2021-06-18 2023-04-07 北京百度网讯科技有限公司 Neural network processing unit, neural network processing method and device
EP4336409A1 (en) * 2022-09-12 2024-03-13 STMicroelectronics S.r.l. Neural network hardware accelerator circuit with requantization circuits

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951962A (en) * 2017-03-22 2017-07-14 北京地平线信息技术有限公司 Compound operation unit, method and electronic equipment for neutral net
CN108108811A (en) * 2017-12-18 2018-06-01 北京地平线信息技术有限公司 Convolutional calculation method and electronic equipment in neutral net
US20180232621A1 (en) * 2017-02-10 2018-08-16 Kneron, Inc. Operation device and method for convolutional neural network
CN108734272A (en) * 2017-04-17 2018-11-02 英特尔公司 Convolutional neural networks optimize mechanism
CN109615068A (en) * 2018-11-08 2019-04-12 阿里巴巴集团控股有限公司 The method and apparatus that feature vector in a kind of pair of model is quantified
CN109754063A (en) * 2017-11-07 2019-05-14 三星电子株式会社 For learning the method and device of low precision neural network


Also Published As

Publication number Publication date
CN112085175B (en) 2024-05-03
WO2020249085A1 (en) 2020-12-17

Similar Documents

Publication Publication Date Title
CN111652367B (en) Data processing method and related product
US11593658B2 (en) Processing method and device
US11003736B2 (en) Reduced dot product computation circuit
WO2020142223A1 (en) Dithered quantization of parameters during training with a machine learning tool
CN109726806A (en) Information processing method and terminal device
CN114868108A (en) Systolic array component combining multiple integer and floating point data types
CN112085175B (en) Data processing method and device based on neural network calculation
WO2021135715A1 (en) Image compression method and apparatus
US11704556B2 (en) Optimization methods for quantization of neural network models
EP4379607A1 (en) Neural network accelerator, and data processing method for neural network accelerator
US20220092399A1 (en) Area-Efficient Convolutional Block
CN114698395A (en) Quantification method and device of neural network model, and data processing method and device
US20200005125A1 (en) Low precision deep neural network enabled by compensation instructions
CN113238989A (en) Apparatus, method and computer-readable storage medium for quantizing data
CN110337636A (en) Data transfer device and device
CN112789627A (en) Neural network processor, data processing method and related equipment
US11423313B1 (en) Configurable function approximation based on switching mapping table content
CN113238987B (en) Statistic quantizer, storage device, processing device and board card for quantized data
US20230133337A1 (en) Quantization calibration method, computing device and computer readable storage medium
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN112686365A (en) Method and device for operating neural network model and computer equipment
CN114065913A (en) Model quantization method and device and terminal equipment
CN115034225A (en) Word processing method and device applied to medical field, electronic equipment and medium
KR20210116182A (en) Softmax approximation method and apparatus
CN113238988A (en) Processing system, integrated circuit and board card for optimizing parameters of deep neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant