CN112085186B - Method for determining quantization parameter of neural network and related product - Google Patents

Method for determining quantization parameter of neural network and related product

Info

Publication number
CN112085186B
CN112085186B CN201910888626.3A CN201910888626A
Authority
CN
China
Prior art keywords
data
quantization
quantized
bit width
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910888626.3A
Other languages
Chinese (zh)
Other versions
CN112085186A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Publication of CN112085186A publication Critical patent/CN112085186A/en
Application granted granted Critical
Publication of CN112085186B publication Critical patent/CN112085186B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1476Error detection or correction of the data by redundancy in operation in neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0495Quantised networks; Sparse networks; Compressed networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81Threshold
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/865Monitoring of software

Abstract

The embodiments of the present application disclose a method for determining quantization parameters of a neural network and related products. A board card in the related products comprises: a storage device, an interface device, a control device, and an artificial intelligence chip, wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively; the storage device is used for storing data; the interface device is used for implementing data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip. The board card may be used to perform artificial intelligence operations.

Description

Method for determining quantization parameter of neural network and related product
Related applications:
This application claims the priority of the application entitled "A method for determining quantization parameters of a neural network and related products", filed on June 27, 2019, with application number 201910570125.0.
This application claims the priority of the application entitled "A method and device for quantizing a neural network and related products", filed on June 12, 2019, with application number 201910505239.7.
This application claims the priority of the application entitled "A quantization parameter adjustment method and device and related products", filed on June 18, 2019, with application number 201910528537.8.
This application claims the priority of the application entitled "A neural network operation method and device and related products", filed on June 14, 2019, with application number 201910515355.7.
Technical Field
Embodiments of the present disclosure relate to a method for determining quantization parameters of a neural network and related products.
Background
Neural Networks (NNs) are mathematical or computational models that mimic the structure and function of biological neural networks. Through training on sample data, a neural network continuously corrects its network weights and thresholds so that the error function descends along the negative gradient direction and the output approaches the expected output. It is a widely used recognition and classification model, applied to function approximation, pattern recognition and classification, data compression, time-series prediction, and the like.
In practical applications, the data of a neural network is commonly represented with 32 bits. Data occupying this many bits ensures accuracy, but requires large storage space and high processing bandwidth, which increases cost.
Disclosure of Invention
In order to solve the above-mentioned technical problems, the disclosure provides a method for determining quantization parameters of a neural network and related products.
To achieve the above object, the present disclosure provides a quantization parameter determining method of a neural network, the method including:
traversing operators in a computational graph corresponding to the neural network, and selecting a current operator and an operator to be fused from the computational graph;
determining a split size according to the available storage capacity of the on-chip memory of the artificial intelligence processor;
splitting the output data of the operator to be fused into a plurality of data blocks according to the splitting size;
mapping to obtain the size of the data block of the input data of the current operator and the size of the data block of the intermediate data between the current operator and the operator to be fused based on the size of the data block of the output data of the operator to be fused;
using, as the data to be quantized, the data blocks of the output data of the operator to be fused, the corresponding data blocks of the input data of the current operator, and the data blocks of the intermediate data between the current operator and the operator to be fused, and obtaining a statistical result of each type of data to be quantized; the data to be quantized comprises at least one of the neurons, weights, gradients, and biases of the neural network;
determining corresponding quantization parameters by using the statistical result of each type of data to be quantized and the data bit width; the quantization parameters are used by the artificial intelligence processor to correspondingly quantize the data in the operation process of the neural network; the quantization parameter is a point location parameter.
To achieve the above object, the present disclosure provides a quantization parameter determining device of a neural network, including a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor implements the steps of the method when executing the computer program.
To achieve the above object, the present disclosure provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method described above.
In the neural network operation process, the technical solution disclosed in this disclosure is used to determine the quantization parameter for quantization, and the quantization parameter is used by the artificial intelligence processor to quantize the data involved in the neural network operation process, converting high-precision data into low-precision fixed-point numbers, which can reduce the storage space of all data involved in the neural network operation process. For example, converting float32 to fix8 can reduce the model parameters by a factor of 4. Because the data storage space is reduced, a smaller space is used when the neural network is deployed, the on-chip memory of the artificial intelligence processor chip can hold more data, data access by the artificial intelligence processor chip is reduced, and the computing performance is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1 is a schematic diagram of a neural network architecture;
FIG. 2 is a flowchart of a method for determining quantization parameters of a neural network according to the present application;
FIG. 3 is a schematic representation of a symmetric fixed point number;
FIG. 4 is a schematic representation of a fixed point number for introducing an offset;
FIG. 5a is a graph of the magnitude of the variation of the weight data of the neural network during training;
FIG. 5b is a graph showing the magnitude of the variation of the weight data of the neural network during training;
FIG. 6 is one of the flow charts of a method of determining a target iteration interval;
FIG. 7 is a second flowchart of a method for determining a target iteration interval;
FIG. 8 is a third flowchart of a method for determining a target iteration interval;
fig. 9 is a block diagram of a hardware configuration of a quantization parameter determination apparatus of a neural network proposed in the present application;
FIG. 10 is a schematic diagram illustrating an application of the device for determining quantization parameters of a neural network in an artificial intelligence processor chip;
FIG. 11 is a functional block diagram of a quantization parameter determination apparatus of a neural network proposed in the present application;
FIG. 12 is a block diagram of a board card according to an embodiment of the present application;
FIG. 13 is a division of output data of layers to be fused;
FIG. 14 is a schematic diagram of obtaining a data block size of input data of the current layer corresponding to the output block and a data block size of intermediate data between the current layer and a layer to be fused based on the output block mapping according to one embodiment of the present application;
FIG. 15 is a process flow diagram of the determination of data to be quantized.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that may be made by those skilled in the art without the inventive effort are within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification, and drawings of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining", "in response to determining", "upon detecting the [described condition or event]", or "in response to detecting the [described condition or event]".
Definition of technical terms:
Floating point number: the IEEE floating-point standard represents a number in the form V = (-1)^sign × mantissa × 2^E. Here sign is the sign bit, where 0 represents a positive number and 1 represents a negative number; E is the exponent (step code), which weights the floating-point number by a (possibly negative) power of 2; and mantissa is the significand, a binary fraction in the range 1 to 2-ε, or 0 to 1-ε. The representation of a floating-point number in a computer is divided into three fields that are encoded separately:
(1) A single sign bit s directly encodes the sign s.
(2) A k-bit exponent field, exp = e(k-1)...e(1)e(0), encodes the exponent E.
(3) An n-bit fraction field encodes the mantissa, but the encoded value also depends on whether the exponent field is all zeros.
Fixed point number: consists of three parts, namely a shared exponent (exponent), a sign bit (sign), and a mantissa (mantissa). The shared exponent means that the exponent is shared within the set of real numbers to be quantized; the sign bit marks whether the fixed-point number is positive or negative; and the mantissa determines the number of significant digits, i.e., the precision, of the fixed-point number. Taking an 8-bit fixed-point number type as an example, its value is calculated as:
value = (-1)^sign × mantissa × 2^(exponent-127)
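For illustration only, the following is a minimal sketch of evaluating the formula above; the function and its decoded integer fields are assumptions for illustration and are not part of the claimed format:

```python
def fixed_point_value(sign: int, mantissa: int, exponent: int) -> float:
    """Evaluate value = (-1)^sign * mantissa * 2^(exponent - 127).

    sign, mantissa, and exponent are the decoded integer fields of the
    8-bit fixed-point type described above (field widths are assumptions).
    """
    return ((-1) ** sign) * mantissa * (2.0 ** (exponent - 127))

# Example: sign = 0, mantissa = 3, shared exponent = 125 -> 3 * 2^-2 = 0.75
print(fixed_point_value(0, 3, 125))  # 0.75
```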
Binary fraction: any decimal number can be expressed by the formula Σ_i j_i × 10^i. For example, the decimal number 12.34 can be written as 12.34 = 1×10^1 + 2×10^0 + 3×10^-1 + 4×10^-2, where digits to the left of the decimal point are counted with positive powers of 10 and digits to the right of the decimal point with negative powers of 10. Similarly, a binary fraction can be expressed in the same way, with positive powers of 2 to the left of the point and negative powers of 2 to the right; the decimal fraction 5.75 can be expressed as the binary fraction 101.11, i.e., 5.75 = 1×2^2 + 0×2^1 + 1×2^0 + 1×2^-1 + 1×2^-2.
Overflow: in a fixed-point arithmetic unit, the number representation has a limited range. During an operation, if a number exceeds the range that the fixed-point number can represent, this is called "overflow".
KL (Kullback-Leibler) divergence: also known as relative entropy, information divergence, or information gain. KL divergence is a measure of the asymmetry of the difference between two probability distributions P and Q. KL divergence measures the average number of additional bits required to encode samples from P using a code based on Q. Typically, P represents the true distribution of the data, and Q represents the theoretical distribution, the model distribution, or an approximation of P.
Data bit width: the number of bits used to represent the data.
Quantization: the process of converting high-precision numbers, usually represented with 32 or 64 bits, into fixed-point numbers that occupy less memory space; the conversion from high-precision numbers to fixed-point numbers causes a certain loss of precision.
Detailed descriptions of a method for determining quantization parameters of a neural network and specific implementations of related products according to embodiments of the present disclosure are provided below with reference to the accompanying drawings.
Neural Networks (NNs) are mathematical models that mimic the structure and function of biological neural networks and whose computation is carried out through a large number of neuron connections. A neural network is thus a computational model consisting of a large number of nodes (or "neurons") connected to one another. Each node represents a specific output function, called an activation function. The connection between every two neurons represents a weighting of the signal passing through that connection, called a weight, which corresponds to the memory of the neural network. The output of the neural network varies according to the connections between neurons as well as the weights and activation functions. In a neural network, the neuron is the fundamental unit. It takes a certain number of inputs and a bias, and each input is multiplied by a weight when the signal (value) arrives. A connection links one neuron to a neuron in another layer or in the same layer, and each connection is accompanied by an associated weight. In addition, the bias is an extra input to the neuron, which is always 1 and has its own connection weight. This ensures that the neuron will activate even if all other inputs are empty (all 0s).
In application, if a nonlinear function is not applied to the neurons in a neural network, the neural network is simply a linear function and is then no more powerful than a single neuron. Suppose the output of a neural network is made to lie between 0 and 1; then, for example, in the case of distinguishing cats from dogs, an output close to 0 can be regarded as a cat and an output close to 1 as a dog. To accomplish this, an activation function is introduced into the neural network, for example the sigmoid activation function. Regarding this activation function, it is only necessary to know that its return value is a number between 0 and 1. Thus, the activation function is used to introduce nonlinearity into the neural network and confines the results of the neural network operation to a smaller range. In practice, how the activation function is expressed is not important; what matters is that a nonlinear function is parameterized by weights, and it can be changed by changing these weights.
As shown in fig. 1, a schematic diagram of a neural network is presented. The neural network shown in fig. 1 includes three kinds of layers, namely an input layer, hidden layers, and an output layer, and the hidden part shown in fig. 1 has 5 layers. The leftmost layer of the neural network is called the input layer, and the neurons of the input layer are called input neurons. The input layer, as the first layer of the neural network, accepts the input signals (values) and passes them on to the next layer. It generally does not operate on the input signals (values) and has no associated weights or biases. In the neural network shown in fig. 1, there are 4 input signals x1, x2, x3, x4.
The hidden layers contain neurons (nodes). In the neural network shown in fig. 1, there are 5 hidden layers. The first hidden layer has 4 neurons (nodes), layer 2 has 5 neurons, layer 3 has 6 neurons, layer 4 has 4 neurons, and layer 5 has 3 neurons. Finally, the hidden layers pass the operation values of the neurons to the output layer. In the neural network shown in fig. 1, the 5 hidden layers are fully connected, i.e., each neuron in each hidden layer is connected to each neuron in the next layer. It should be noted that the hidden layers of every neural network are not necessarily fully connected.
The rightmost layer of the neural network of fig. 1 is referred to as the output layer, and the neurons of the output layer are referred to as output neurons. The output layer receives the output from the last hidden layer. In the neural network shown in fig. 1, the output layer has 3 neurons and 3 output signals y1, y2, y3.
In practical applications, the initial neural network is pre-trained with a large amount of sample data (including inputs and outputs), and after training is completed, the trained neural network is obtained. The trained neural network can then give a correct output for future inputs from the real environment.
Before beginning the discussion of training of neural networks, a loss function needs to be defined. The loss function is a performance function that measures the performance of a neural network in performing a particular task. In some embodiments, the loss function may be obtained as follows: in the process of training a certain neural network, each sample data is transmitted along the neural network to obtain an output value, then the difference between the output value and an expected value is squared, the calculated loss function is the distance between the predicted value and the actual value, and the aim of training the neural network is to reduce the distance or the value of the loss function. In some embodiments, the loss function may be expressed as:
in the above formula, y represents an expected value,referring to the actual result obtained by the neural network for each sample data in the sample data set, i is the index of each sample data in the sample data set. />Representing the expected value y and the actual result +.>And the error value m is the number of sample data in the sample data set, or the identification of cats and dogs is taken as an example. There is a dataset consisting of pictures of cats and dogs, corresponding tags being 1 if the picture is a dog and 0 if the picture is a cat. The label corresponds to the expected value y in the formula, and when each sample picture is transmitted to the neural network, the identification result is actually obtained through the neural network. In order to calculate the loss function, each sample picture in the sample dataset has to be traversed to obtain the corresponding actual result for each sample picture >The loss function is then calculated as defined above. If the loss function is relatively large, the neural network is not trained, and the weight value needs to be further adjusted.
The weights are randomly initialized when training of the neural network begins. Clearly, an initialized neural network does not provide good results. Through the training process, even a neural network that starts out very poor can be trained into a network with high accuracy.
The training process of the neural network is divided into two stages, wherein the first stage is the forward processing of signals, and the signals pass through an hidden layer from an input layer and finally reach an output layer. The second stage is to counter-propagate gradients from the output layer to the hidden layer and finally to the input layer, and sequentially adjust weights and biases of each layer in the neural network according to the gradients.
During the forward processing, an input value is input to the input layer of the neural network, and an output of a so-called predicted value is obtained from the output layer of the neural network. When an input value is provided to the input layer of the neural network, it does not perform any operation. In the hidden layers, the second hidden layer acquires the predicted intermediate result value from the first hidden layer, performs calculation operation and activation operation, and then transfers the obtained predicted intermediate result value to the next hidden layer. The same operations are performed in the later layers, and finally the output values are obtained in the output layer of the neural network.
After the forward processing, an output value called a predicted value is obtained. In order to calculate the error, the predicted value is compared with the actual output value to obtain a corresponding error value. The back propagation uses the chain law of differentiation in which the derivative of the error value corresponding to the last layer of weights of the neural network is calculated first. These derivatives are called gradients, which are then used to calculate the gradient of the penultimate layer in the neural network. This process is repeated until a gradient is obtained for each weight in the neural network. And finally, subtracting the corresponding gradient from each weight in the neural network, so that the weight is updated once, and the purpose of reducing the error value is achieved.
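As a rough sketch of the weight-update step described above; the learning-rate scaling and the function name are assumptions added for illustration:

```python
def sgd_update(weights, gradients, learning_rate=0.01):
    """Subtract the corresponding gradient from each weight, scaled by a learning rate."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

print(sgd_update([0.5, -0.2], [1.0, -3.0]))  # [0.49, -0.17]
```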
For a neural network, fine-tuning means loading a trained neural network and adjusting it further. The fine-tuning process, like the training process, is divided into two stages: the first stage is the forward processing of signals, and the second stage is the backward propagation of gradients, in which the weights of the trained neural network are updated. The difference between training and fine-tuning is that training starts from a randomly initialized neural network and trains it from scratch, whereas fine-tuning does not.
In the training or fine tuning process of the neural network, each time the neural network goes through the forward processing of the signal and the back propagation process of the corresponding error, the weight in the neural network is updated once by using the gradient, and the process is called iteration (iteration). In order to obtain a neural network with a precision that meets expectations, a very large sample data set is required during the training process. In this case, it is impossible to input the sample data set into the computer at one time. Therefore, in order to solve this problem, it is necessary to divide the sample data set into a plurality of blocks, each block is transferred to the computer, and the weights of the neural network are updated once after each block of data set is processed forward. When a complete sample data set has been processed forward through the neural network and a weight update has been returned accordingly, this process is referred to as a cycle (epoch). In practice, it is not enough to transfer the complete data set once in the neural network, and the complete data set needs to be transferred multiple times in the same neural network, that is, multiple periods are needed, so as to finally obtain the neural network with the accuracy meeting the expectations.
In the training or fine-tuning of neural networks, it is generally desired that the speed be as fast as possible and the accuracy as high as possible. The data of the neural network are represented in a high-precision data format, such as floating-point numbers, so all data involved in the training or fine-tuning process are in the high-precision data format, and the trained neural network is then quantized. Take as an example the case where the quantized objects are the weights of the whole neural network and the quantized weights are 8-bit fixed-point numbers: since there are often millions of connections in a neural network, almost all of the space is occupied by the weights of the neuron connections. Moreover, these weights are all different floating-point numbers. The weights of each layer tend to be normally distributed within a certain interval, e.g., (-3.0, 3.0). The maximum value and the minimum value corresponding to the weights of each layer in the neural network are stored, and each floating-point value is represented by an 8-bit fixed-point number: the range between the maximum value and the minimum value is divided into 256 quantization intervals, each of which is represented by an 8-bit fixed-point number. For example, within the interval (-3.0, 3.0), byte 0 represents -3.0 and byte 255 represents 3.0. Similarly, byte 128 represents 0.
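A minimal sketch of this per-layer 8-bit mapping, assuming simple linear quantization between the stored minimum and maximum; the function names and default interval are illustrative:

```python
def quantize_uint8(x, w_min=-3.0, w_max=3.0):
    """Map a float in [w_min, w_max] onto one of 256 quantization intervals (bytes 0..255)."""
    step = (w_max - w_min) / 255.0
    q = round((x - w_min) / step)
    return max(0, min(255, q))

def dequantize_uint8(q, w_min=-3.0, w_max=3.0):
    """Recover an approximate float value from its byte code."""
    step = (w_max - w_min) / 255.0
    return w_min + q * step

print(quantize_uint8(-3.0))  # 0
print(quantize_uint8(3.0))   # 255
print(quantize_uint8(0.0))   # 128 (after rounding)
```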
For data represented in a high-precision data format, taking floating-point numbers as an example: according to computer architecture, and based on the arithmetic rules of floating-point numbers and of fixed-point numbers, floating-point computation is more complex than fixed-point computation of the same bit length, and more logic devices are needed to build a floating-point arithmetic unit. Thus, in terms of area, a floating-point arithmetic unit is larger than a fixed-point arithmetic unit. Moreover, a floating-point arithmetic unit requires more resources to operate, so that the power-consumption gap between fixed-point operations and floating-point operations is usually an order of magnitude. In short, a floating-point arithmetic unit occupies many times more chip area and power consumption than a fixed-point arithmetic unit.
However, floating-point operations cannot simply be replaced. First, although fixed-point operation is intuitive, it uses a fixed number of bits for the integer part and the fractional part, which makes it difficult to express particularly large numbers and particularly small numbers at the same time and may cause overflow.
Furthermore, floating-point arithmetic units are often preferred when training or fine-tuning is performed with artificial intelligence processor chips, mainly because in supervised-learning neural networks only floating-point operations can record and capture the small increments that occur in training. Therefore, how to greatly improve the training computation capability of the chip, without increasing the chip area and power consumption of the artificial intelligence processor, is a problem that urgently needs to be solved.
It is known to those skilled in the art, from practical feedback, that training with fixed-point numbers of low bit width requires processing the back-propagated gradients with fixed-point numbers of more than 8 bits, which makes training with fixed-point numbers of low bit width extremely complex. How to replace the floating-point arithmetic unit with a fixed-point arithmetic unit, achieving the speed of fixed-point operation while meeting the floating-point precision required by the computation, and thereby improving the peak computing power of an artificial intelligence processor chip, is the technical problem solved by this specification.
Based on the above description of the technical problem, one characteristic of neural networks is their high tolerance to input noise. When recognizing objects in a photograph, a neural network can ignore the dominant noise and focus attention on important similarities. This means that the neural network can treat low-precision computation as a source of noise and still produce accurate predictions with a numerical format that holds less information. To perform low-precision training or fine-tuning, a universal data representation must be found, so that data overflow is reduced and the data near 0 within the target interval is expressed well. This data representation therefore needs to be adaptive, so that it can be adjusted along with the training or fine-tuning process.
In addition, most neural network models require a large number of operations and memory accesses. Some neural network accelerators may provide higher computational performance. However, the computational power of the currently mainstream neural network accelerators is far beyond the bandwidth of the current external memory. The calculation amount and the memory access amount of each layer in the ResNet-18 neural network are taken as examples for the following description.
In a ResNet-18 neural network, the ratio of the amount of computation to the amount of memory access differs from layer to layer, so different layers place different requirements on bandwidth and computing power. Taking an element-wise layer as an example, if the computing power is 1 GFLOPS (Giga Floating-point Operations Per Second, i.e., billions of floating-point operations per second), the required bandwidth is 12 GB/s. Meanwhile, for a convolutional layer, with the same computing power of 1 GFLOPS, the bandwidth requirement is only 10 MB/s. Although the hardware of neural network accelerators has been designed to balance memory bandwidth and computing power as far as possible, optimal performance has still not been achieved. Under the Caffe framework, the inventors of the present application further measured the ratio of computation to memory access for each layer in the entire ResNet-18 neural network, and found that more than 95% of the data traffic occurs in a few kinds of layers (including the convolutional, BatchNorm, Scale, ReLU, and element-wise layers). However, except for the convolutional layers, the amount of computation in these layers is very small, less than 1% of the entire neural network. Thus, memory access is currently a serious bottleneck when artificial intelligence processors execute neural networks.
The operators in the computational graph to which the neural network is mapped are implemented on the CPU and the artificial intelligence processor through kernel functions, in a mode of "load from off-chip storage, compute on chip, store back off chip": the input data and output data of an operator in the neural network are stored in global memory, and the kernel function reads the input data from the global memory, completes the computation, and stores the result back into the global memory. This causes two problems: first, the access of each operator to its input data and output data cannot be avoided by optimization within the operator; second, each operator incurs a startup overhead, especially on heterogeneous computing devices other than the CPU. To solve these problems, the kernel functions of two or more consecutive operators in the computational graph corresponding to the neural network are combined into a new kernel function, so that the computation tasks corresponding to these operators require only one scheduling overhead. In this way, a large amount of data transfer from external memory (DRAM) to on-chip memory, and from on-chip memory back to external memory, can be eliminated. Through testing, the inventors found that in a ResNet-18 neural network, 99.6% of the data transmission could be eliminated if all operators could be fused together.
However, it is difficult to fuse all operators in an actual neural network together. The reasons include the following: in practice, there is a mismatch between the size of the on-chip memory and the size of the data processed by the neural network, because the area overhead of the artificial intelligence processor cannot be too large, and accordingly the area overhead of the on-chip memory of the artificial intelligence processor is limited. Likewise, the power-consumption overhead required for the on-chip memory of an artificial intelligence processor should be within a reasonable range. These reasons impose limits on the size of the data that can be stored on the chip of the artificial intelligence processor. Thus, if all operators in the neural network were fused together, the data size of the intermediate data of those fused operators would not match the storage capacity actually available in the on-chip memory. To alleviate this contradiction, further analysis shows that the intermediate results between these operators fall within the optimization scope of the fused kernel, so part of the memory access for the intermediate results can be optimized; this optimization of intermediate results is usually based on the local independence of data available in the computation process. Based on this principle, in an operator, each point in the output data set depends only on a defined region within the input data set. Therefore, the input data and the output data can be divided, or split, into a plurality of blocks, each block can be computed independently, and more operators in the computational graph corresponding to the neural network can be fused together.
Based on the above description, fig. 2 shows a flowchart of a method for determining quantization parameters of a neural network. The quantization parameter determined by the technical solution shown in fig. 2 is used for the data representation of the data to be quantized, so as to determine the quantized fixed-point numbers. The quantized fixed-point numbers are used for training, fine-tuning, or inference of the neural network. The method comprises the following steps:
step 201): traversing operators in a computational graph corresponding to the neural network, and selecting a current operator and an operator to be fused from the computational graph.
Taking Caffe as an example, the neural network includes a plurality of processing layers, including but not limited to a convolutional layer, a BatchNorm layer, a Scale layer, a ReLU layer, a Pooling layer, an element-wise layer, a fully-connected layer (InnerProduct layer), a SoftMax layer, and the like.
The layer corresponding to the operator to be fused is called the layer to be fused, and the layer corresponding to the current operator is called the current layer; the layer to be fused is located downstream of the current layer. Those skilled in the art will readily appreciate that the layer corresponding to the operator to be fused may also be located upstream of the layer corresponding to the current operator. Taking a convolutional layer and a BatchNorm layer as an example, if the convolutional layer is taken as the current layer and the BatchNorm layer as the layer to be fused, the BatchNorm layer may be located upstream of the convolutional layer, i.e., the output data of the BatchNorm layer is the input data of the convolutional layer; the BatchNorm layer may also be located downstream of the convolutional layer, i.e., the output data of the convolutional layer is the input data of the BatchNorm layer.
In addition, according to a preferred embodiment of the present application, a first layer of the neural network is selected as a current layer, a next layer closely adjacent to the first layer is selected as a layer to be fused, and fusion judgment is performed layer by layer.
Step 202): the split size is determined based on the available storage capacity of the on-chip memory of the artificial intelligence processor.
Step 203): and splitting the output data of the operator to be fused into a plurality of data blocks according to the splitting size.
Fig. 13 shows the output data OD2 of the layer to be fused, which is, for example, data of dimensions M×N. According to a preset split size, the output data OD2 of the layer to be fused is split into m×n output blocks, where m is less than or equal to M and n is less than or equal to N, denoted OD2(1,1), OD2(1,2), up to OD2(m,n). According to a preferred embodiment of the present application, the split size is selected such that the output data OD2 of the layer to be fused L2 can be split uniformly into m×n shares. However, the present application is not limited thereto, and non-uniform splitting may also be implemented; for example, in fig. 13, the output blocks in the m-th row and in the n-th column may be smaller than the remaining output blocks, and such cases all fall within the protection scope of the present application.
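A minimal sketch of such a split, assuming a 2-D array and a uniform block (split) size; NumPy and the function name are used only for illustration:

```python
import numpy as np

def split_into_blocks(od2: np.ndarray, block_rows: int, block_cols: int):
    """Split output data OD2 (shape M x N) into a dict of blocks keyed by (i, j).

    Edge blocks may be smaller when M or N is not a multiple of the block size,
    which corresponds to the non-uniform splitting mentioned above.
    """
    M, N = od2.shape
    blocks = {}
    for i, r in enumerate(range(0, M, block_rows), start=1):
        for j, c in enumerate(range(0, N, block_cols), start=1):
            blocks[(i, j)] = od2[r:r + block_rows, c:c + block_cols]
    return blocks

blocks = split_into_blocks(np.arange(24).reshape(6, 4), block_rows=2, block_cols=2)
print(len(blocks))           # 6 blocks
print(blocks[(1, 1)].shape)  # (2, 2)
```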
Step 204): and mapping to obtain the size of the data block of the input data of the current operator and the size of the data block of the intermediate data between the current operator and the operator to be fused based on the size of the data block of the output data of the operator to be fused.
Fig. 14 illustrates one embodiment of step 204. As shown in fig. 14, the current layer L1 and the layer to be fused L2 are shown in terms of the data transformations they perform; the physical layer structure is not shown. For the current layer L1, the input data is ID1; after the current layer L1 performs a preset transformation on the input data ID1, the output data OD1 is obtained. The output data OD1 is provided as input data to the layer to be fused L2, and the output data OD1 may also be referred to as the intermediate data between the current layer L1 and the layer to be fused L2. After the layer to be fused L2 performs a preset transformation on the intermediate data OD1, the output data OD2 is obtained.
Since the data transformation performed by each of the current layer L1 and the layer to be fused L2 may be preset, the data block of the input data of a layer can be derived in reverse from the output block of its output data. For example, in fig. 14, taking the output block OD2(m,1) of the output data as an example, the data block size of the data block OD1(m,1) in the intermediate data OD1 can be derived according to the transformation performed by the layer to be fused L2; the data block size of the data block OD1(m,1) may be larger than, smaller than, or equal to the size of the output block OD2(m,1), and all of these cases fall within the scope of the present application. Similarly, the size of the data block ID1(m,1) of the input data ID1 of the current layer L1 can be obtained from the data block size of the data block OD1(m,1) in the intermediate data OD1 and from the transformation performed by the current layer L1. In other words, the above procedure derives, in reverse, the data block size of the input data required by the current layer and the data block size of the intermediate data from the output data block size of the layer to be fused.
Fig. 14 shows that the layer L2 to be fused is located downstream of the current layer L1, and the two layers are closely adjacent, and the output of the current layer L1 is the input of the layer L2 to be fused. The protection scope of the present application is not limited thereto, and the layer to be fused L2 and the current layer L1 may be separated by more layers. In this case, the teachings described above can also be applied to obtain, by reverse derivation, the data block size of the input data required in the current layer, and the data block size of the intermediate data, which, of course, in this case, has multiple layers of intermediate data, which are within the scope of the present application.
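As a hedged illustration of this reverse derivation, assume the current layer and the layer to be fused are both convolution-like layers with known kernel size and stride; these parameters and the helper below are assumptions for illustration, not part of the patent:

```python
def input_tile_size(out_tile: int, kernel: int, stride: int) -> int:
    """Size of the input region that one output tile of a convolution-like layer depends on."""
    return (out_tile - 1) * stride + kernel

# Reverse derivation for one output block OD2(m, 1) of size 32:
od2_block = 32
od1_block = input_tile_size(od2_block, kernel=3, stride=1)  # intermediate data block (input of L2)
id1_block = input_tile_size(od1_block, kernel=3, stride=1)  # input data block of the current layer L1
print(od1_block, id1_block)  # 34 36
```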
Fig. 15 shows a process flow chart for determining the data to be quantized. The following describes it in detail with reference to fig. 15.
Step 1: traversing operators in the computation graph corresponding to the neural network, and selecting a current operator and an operator to be fused from the computation graph.
Step 2: the split size is determined based on the available storage capacity of the on-chip memory of the artificial intelligence processor.
Step 3: and splitting the output data of the operator to be fused into a plurality of data blocks according to the splitting size.
Step 4: perform memory allocation. In practical applications, the on-chip memory of the artificial intelligence processor, or a specified portion of the on-chip memory, is allocated to the output block, the data block of the input data, and the data block of the intermediate data.
Step 5: and judging whether the memory allocation is successful or not. For example, the sum of the size of the output block (i.e., the split size), the data block size of the input data, and the data block size of the intermediate data may be compared to the storage space of the on-chip memory available for allocation, and if the storage space is not exceeded, the allocation is successful. Meanwhile, the data to be quantized is determined; if the storage space is exceeded, the allocation fails and the process proceeds to step 6.
Step 6: judge whether the split size can be reduced. Those skilled in the art will readily understand that the split size may be changed dynamically; for example, at the beginning of determining whether an operator to be fused can be fused with the current operator, the split size may be set to a larger value. If, at this split size, the judgment is that fusion is not possible, an attempt may be made to reduce the split size, as shown in step 7. The amount by which the split size is reduced can be set as needed. Of course, those skilled in the art will readily understand that the split size cannot be reduced without limit, and a lower threshold may be set for it. In step 6, when it is judged that the split size has not yet reached the lower threshold, the process proceeds to step 7, the split size is reduced, and the process returns to step 2: the output data of the operator to be fused is re-split into corresponding output blocks according to the reduced split size, and the subsequent processing and judgment are carried out. When it is judged that the split size has already reached the lower threshold, it is determined that the current operator and the operator to be fused cannot be fused together.
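A minimal sketch of steps 2-7, assuming the block sizes for a given split size can be estimated by a mapping function such as the one sketched earlier; all names and the byte-size arithmetic are assumptions:

```python
def try_fuse(split_size: int, available_bytes: int, min_split: int, block_bytes_for):
    """Shrink the split size until the output, input, and intermediate blocks fit on chip.

    block_bytes_for(split_size) returns (out_bytes, in_bytes, mid_bytes), obtained by the
    reverse mapping of data-block sizes described above. Returns the usable split size,
    or None if the operators cannot be fused.
    """
    while split_size >= min_split:
        out_b, in_b, mid_b = block_bytes_for(split_size)
        if out_b + in_b + mid_b <= available_bytes:  # step 5: allocation succeeds
            return split_size
        split_size //= 2                             # step 7: reduce the split size
    return None                                      # fusion is not possible

# Usage with a toy estimator: each block needs roughly 4 bytes per element
estimate = lambda s: (4 * s * s, 4 * (s + 2) ** 2, 4 * (s + 1) ** 2)
print(try_fuse(64, available_bytes=48 * 1024, min_split=4, block_bytes_for=estimate))  # 32
```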
Step 205): the data blocks of the output data of the operator to be fused, the corresponding data blocks of the input data of the current operator, and the data blocks of the intermediate data between the current operator and the operator to be fused are used as the data to be quantized, and a statistical result of each type of data to be quantized is obtained; the data to be quantized comprises at least one of the neurons, weights, gradients, and biases of the neural network.
As described above, during training or fine-tuning of the neural network, each layer of the neural network includes four types of data, namely neurons, weights, gradients, and biases. During inference, each layer of the neural network includes three types of data, namely neurons, weights, and biases. These data are all represented in a high-precision data format; this specification takes floating-point numbers as the example of high-precision data. It should be clear that floating-point numbers are only an exemplary case, not an exhaustive one. Those skilled in the art, while understanding the spirit of the present technical solution, may make other modifications or transformations based on the technical solution of the present application. For example, the high-precision data may be fixed-point numbers with a high data bit width, which have a large representable range but a low minimum representable precision, and the technical solution can also be used to convert them into fixed-point numbers with a low data bit width. However, as long as the functions and technical effects achieved are similar to those of the present application, they should fall within the protection scope of the present application.
Regardless of the neural network structure, the data to be quantized includes at least one of neurons, weights, gradients, and biases of the neural network during training or fine tuning of the neural network, and includes at least one of neurons, weights, and biases of the neural network during reasoning.
The following takes two types of data, namely the neurons and weights of a target layer in the neural network, as an example of the data to be quantized, and describes the technical solution in detail. In this step, the neurons and weights of each layer in the target layer are counted separately to obtain the maximum value and the minimum value of each type of data to be quantized, and the maximum absolute value of each type of data to be quantized can also be obtained. The target layer, as the layer that needs to be quantized in the neural network, may be one layer or multiple layers. Taking one layer as the unit, the maximum absolute value of each type of data to be quantized can be determined from the maximum value and the minimum value of that type of data to be quantized. Alternatively, the absolute value of each datum to be quantized may first be computed, and the results traversed to obtain the maximum absolute value of each type of data to be quantized.
In practical applications, the reason for obtaining the maximum absolute value of each type of data to be quantized from the maximum value and the minimum value of that type of data is that, during quantization, the maximum value and the minimum value of the data to be quantized in each layer of the target layer are conventionally already saved, so no additional resources need to be consumed to compute the absolute values of the data to be quantized; the maximum absolute value can be obtained directly from the saved maximum and minimum values.
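A minimal sketch of this statistics step; the function name is illustrative and NumPy is used only for brevity:

```python
import numpy as np

def layer_statistics(data: np.ndarray):
    """Return (min, max, absolute max) for one type of data to be quantized in a layer."""
    d_min, d_max = float(data.min()), float(data.max())
    abs_max = max(abs(d_min), abs(d_max))  # derived directly from the saved min/max
    return d_min, d_max, abs_max

weights = np.array([-2.7, 0.3, 1.9, -0.4])
print(layer_statistics(weights))  # (-2.7, 1.9, 2.7)
```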
Step 206): determining corresponding quantization parameters by using the statistical result of each type of data to be quantized and the data bit width; the quantization parameter is used for the artificial intelligence processor to correspondingly quantize the data in the operation process of the neural network.
In this step, the quantization parameter can be divided into the following six cases. First case: the quantization parameter is the point location parameter s. In this case, the data to be quantized can be quantized using the following formula (1) to obtain the quantized data I_x:
I_x = round(F_x / 2^s)   (1)
where s is the point location parameter, I_x is the n-bit binary representation value of the data x after quantization, F_x is the floating-point value of the data x before quantization, and round is the rounding operation. It should be noted that this is not limited to round; other rounding methods may be used, for example, rounding up, rounding down, or rounding toward zero may replace the round operation in formula (1). At this time, the maximum floating-point value A that can be represented by an n-bit fixed-point number is 2^s(2^(n-1) - 1); then an n-bit fixed-point number can represent a maximum value of 2^s(2^(n-1) - 1) in the number domain of the data to be quantized, and a minimum value of -2^s(2^(n-1) - 1). It can be seen from formula (1) that, when the data to be quantized is quantized with the quantization parameter of the first case, the quantization interval is 2^s; the quantization interval is denoted C.
Let Z be the maximum absolute value of all floating-point numbers in the number domain of the data to be quantized. Then A needs to contain Z, and Z needs to be greater than A/2, so there is the following constraint of formula (2):
2^s(2^(n-1) - 1) ≥ Z > 2^(s-1)(2^(n-1) - 1)   (2)
Therefore, s = ceil(log2(Z / (2^(n-1) - 1))), and A = 2^s(2^(n-1) - 1).
According to formula (3), the quantized n-bit binary representation value I_x of the data x is dequantized to obtain the dequantized data F̂_x:
F̂_x = I_x × 2^s   (3)
where the dequantized data F̂_x has the same data format as the corresponding pre-quantization data F_x, both being floating-point values.
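A minimal sketch of the first case (point location parameter only), following formulas (1)-(3) above; the helper names are illustrative:

```python
import math

def point_location(abs_max: float, n: int = 8) -> int:
    """Point location parameter s derived from the statistic Z (formula (2))."""
    return math.ceil(math.log2(abs_max / (2 ** (n - 1) - 1)))

def quantize(fx: float, s: int, n: int = 8) -> int:
    """Formula (1): I_x = round(F_x / 2^s), clamped to the n-bit range."""
    limit = 2 ** (n - 1) - 1
    return max(-limit, min(limit, round(fx / (2 ** s))))

def dequantize(ix: int, s: int) -> float:
    """Formula (3): dequantized value = I_x * 2^s."""
    return ix * (2 ** s)

s = point_location(abs_max=2.7, n=8)  # -> -5, so the quantization interval C = 2^-5
print(s, quantize(1.9, s), dequantize(quantize(1.9, s), s))  # -5 61 1.90625
```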
Second case: the quantization parameter is a first scaling factor f_1. In this case, the data to be quantized can be quantized using the following formula (4) to obtain the quantized data I_x:
I_x = round(F_x / f_1)   (4)
where f_1 is the first scaling factor, I_x is the n-bit binary representation value of the data x after quantization, F_x is the floating-point value of the data x before quantization, and round is the rounding operation. It should be noted that this is not limited to round; other rounding methods may be used, for example, rounding up, rounding down, or rounding toward zero may replace the round operation in formula (4). It can be seen from formula (4) that, when the data to be quantized is quantized with the quantization parameter of the second case, the quantization interval is f_1; the quantization interval is denoted C.
For the first scaling factor f_1 there is one case in which the point location parameter s is a fixed, known value that no longer changes; let 2^s = T, where T is a fixed value. Then the maximum floating-point value A that can be represented by an n-bit fixed-point number is (2^(n-1) - 1) × T. In this case, the maximum value A depends on the data bit width n. Let Z be the maximum absolute value of all numbers in the number domain of the data to be quantized; at this time Z = (2^(n-1) - 1) × f_1. An n-bit fixed-point number can represent a maximum value of (2^(n-1) - 1) × f_1 in the number domain of the data to be quantized, and a minimum value of -(2^(n-1) - 1) × f_1. In another case, in engineering applications, 2^s × f_2 is taken as a whole as the first scaling factor f_1; in that case it can be considered that there is no independent point location parameter s, where f_2 is the second scaling factor. Let Z be the maximum absolute value of all numbers in the number domain of the data to be quantized; at this time Z = (2^(n-1) - 1) × f_1. An n-bit fixed-point number can represent a maximum value of (2^(n-1) - 1) × f_1 in the number domain of the data to be quantized, and a minimum value of -(2^(n-1) - 1) × f_1.
According to formula (5), the quantized n-bit binary representation value I_x of the data x is dequantized to obtain the dequantized data F̂_x:
F̂_x = I_x × f_1   (5)
where the dequantized data F̂_x has the same data format as the corresponding pre-quantization data F_x, both being floating-point values.
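A corresponding sketch for the second case (scaling factor only), following formulas (4)-(5); the names are illustrative:

```python
def quantize_scale(fx: float, f1: float, n: int = 8) -> int:
    """Formula (4): I_x = round(F_x / f_1), clamped to the n-bit range."""
    limit = 2 ** (n - 1) - 1
    return max(-limit, min(limit, round(fx / f1)))

f1 = 2.7 / (2 ** 7 - 1)  # f_1 chosen so that Z = (2^(n-1) - 1) * f_1 with Z = 2.7, n = 8
ix = quantize_scale(1.9, f1)
print(ix, ix * f1)       # formula (5): dequantized value I_x * f_1
```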
Third case: the quantization parameter is a point location parameter s and a second scaling factor f 2 . In this case, the quantized data I can be obtained by quantizing the data to be quantized using the following equation (6) x
Here s is the point location parameter, $f_2$ is the second scaling factor, $I_x$ is the n-bit binary value representing data x after quantization, $F_x$ is the floating-point value of data x before quantization, and round denotes rounding to the nearest integer. It should be noted that the operation is not limited to round; rounding up, rounding down, rounding toward zero, and similar rounding operations may replace the round operation in equation (6). The maximum value A in the number domain of the data to be quantized that the n-bit fixed-point number can represent is $2^s(2^{n-1}-1)$. As can be seen from equation (6), when the data to be quantized are quantized with the quantization parameters of the third case, the quantization interval is $2^s \times f_2$; the quantization interval is denoted C.
Let Z be the maximum of the absolute values of all numbers in the number domain of the data to be quantized. According to equation (2), $1 \ge \frac{Z}{2^s(2^{n-1}-1)} > \frac{1}{2}$, i.e. $1 \ge f_2 > \frac{1}{2}$ when $f_2 = \frac{Z}{2^s(2^{n-1}-1)}$. When $f_2 = \frac{Z}{2^s(2^{n-1}-1)}$, Z can, according to equation (2), be represented without loss of accuracy; when $f_2 = 1$, equation (6) coincides with equation (1). The n-bit fixed-point number can represent a maximum of $(2^{n-1}-1)\times 2^s \times f_2$ and a minimum of $-(2^{n-1}-1)\times 2^s \times f_2$ in the number domain of the data to be quantized.
The quantized n-bit binary value $I_x$ of data x is dequantized according to equation (7), $\hat{F}_x = I_x \times 2^s \times f_2$, to obtain the dequantized data $\hat{F}_x$, where the data format of the dequantized data $\hat{F}_x$ is the same as that of the corresponding pre-quantization data $F_x$, both being floating-point values.
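Similarly, a minimal Python sketch of the third case follows (hypothetical names); the choice of $f_2$ so that Z maps exactly onto the largest representable value is an assumption consistent with the lossless case described above.

```python
import numpy as np

def quantize_point_and_scale(data, n=8):
    """Sketch of the third case: point location parameter s plus second scaling factor f2."""
    qmax = 2 ** (n - 1) - 1
    Z = float(np.max(np.abs(data)))
    s = int(np.ceil(np.log2(Z / qmax)))        # point location parameter, as in the first case
    f2 = Z / (2 ** s * qmax)                   # assumed choice so that Z is represented without loss
    I = np.clip(np.round(data / (2 ** s * f2)), -qmax, qmax).astype(np.int32)   # equation (6)
    F_hat = I * (2 ** s) * f2                  # equation (7): dequantization
    return I, s, f2, F_hat
```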
FIG. 3 is a schematic diagram of a symmetric fixed-point representation. The number domain of the data to be quantized shown in FIG. 3 is distributed with "0" as the center of symmetry. Z is the maximum of the absolute values of all floating-point numbers in the number domain of the data to be quantized; in FIG. 3, A is the maximum floating-point value that the n-bit fixed-point number can represent, and the floating-point number A maps to the fixed-point number $2^{n-1}-1$. To avoid overflow, A must cover Z. In practice, the floating-point data encountered in neural network operations tend to be normally distributed within some determined interval, but they do not necessarily have "0" as the center of symmetry, so representing them with fixed-point numbers easily causes overflow. To improve this, an offset is introduced into the quantization parameter, as shown in FIG. 4. In FIG. 4, the number domain of the data to be quantized is not distributed with "0" as the center of symmetry; $Z_{min}$ is the minimum and $Z_{max}$ is the maximum of all floating-point numbers in the number domain of the data to be quantized. P is the center point of $Z_{min} \sim Z_{max}$. The number domain of the data to be quantized is shifted as a whole so that the shifted number domain is distributed with "0" as the center of symmetry, and the maximum absolute value in the shifted number domain is Z. As can be seen from FIG. 4, the offset is the horizontal distance between point "0" and point "P"; this distance is called the offset O, where $O = \frac{Z_{min}+Z_{max}}{2}$ and $Z = \frac{Z_{max}-Z_{min}}{2}$.
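A small Python sketch of the offset computation follows (hypothetical names), using the relations O = (Zmin + Zmax)/2 and Z = (Zmax - Zmin)/2 described above.

```python
import numpy as np

def compute_offset(data):
    """Sketch: offset O and shifted absolute maximum Z for an asymmetric number domain (FIG. 4)."""
    z_min, z_max = float(np.min(data)), float(np.max(data))
    O = (z_min + z_max) / 2        # horizontal distance between point "0" and center point "P"
    Z = (z_max - z_min) / 2        # absolute maximum after the domain is shifted by O
    return O, Z
```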
based on the above description about the offset O, a case of the fourth quantization parameter occurs. Fourth case: the quantization parameters include point location parameters and offsets. In this case, the quantized data I can be obtained by quantizing the data to be quantized using the following equation (8) x
Here s is the point location parameter, O is the offset, $I_x$ is the n-bit binary value representing data x after quantization, $F_x$ is the floating-point value of data x before quantization, and round denotes rounding to the nearest integer. It should be noted that the operation is not limited to round; rounding up, rounding down, and similar rounding operations may replace the round operation in equation (8). In this case, the maximum floating-point value A that the n-bit fixed-point number can represent is $2^s(2^{n-1}-1)$, so the n-bit fixed-point number can represent a maximum of $2^s(2^{n-1}-1)+O$ and a minimum of $-2^s(2^{n-1}-1)+O$ in the number domain of the data to be quantized. As can be seen from equation (8), when the data to be quantized are quantized with the quantization parameters of the fourth case, the quantization interval is $2^s$; the quantization interval is denoted C.
Let Z be the maximum of the absolute values of all floating-point numbers in the number domain of the data to be quantized, i.e. $Z = \frac{Z_{max}-Z_{min}}{2}$. A must then cover Z, and Z is greater than $2^{s-1}(2^{n-1}-1)$. From equation (2), $s = \left\lceil \log_2\frac{Z}{2^{n-1}-1} \right\rceil$ is obtained, and then $A = 2^{\left\lceil \log_2\frac{Z}{2^{n-1}-1} \right\rceil}(2^{n-1}-1)$.
The quantized n-bit binary value $I_x$ of data x is dequantized according to equation (9), $\hat{F}_x = I_x \times 2^s + O$, to obtain the dequantized data $\hat{F}_x$, where the data format of the dequantized data $\hat{F}_x$ is the same as that of the corresponding pre-quantization data $F_x$, both being floating-point values.
Based on the above description of the offset O, a fifth quantization-parameter case arises. Fifth case: the quantization parameters are the first scaling factor $f_1$ and the offset O. In this case, the data to be quantized may be quantized using the following equation (10) to obtain the quantized data $I_x$:

$I_x = \mathrm{round}\left(\frac{F_x - O}{f_1}\right)$ (10)
Here $f_1$ is the first scaling factor, O is the offset, $I_x$ is the n-bit binary value representing data x after quantization, $F_x$ is the floating-point value of data x before quantization, and round denotes rounding to the nearest integer. It should be noted that the operation is not limited to round; rounding up, rounding down, and similar rounding operations may replace the round operation in equation (10). One situation is the following: the point location parameter s is a fixed, known value that no longer changes. Let $2^s = T$, with T a fixed value; then the maximum floating-point value A that the n-bit fixed-point number can represent is $(2^{n-1}-1)\times T$, and A depends only on the data bit width n. Let Z be the maximum of the absolute values of all numbers in the number domain of the data to be quantized; then $f_1 = \frac{Z}{2^{n-1}-1}$, and at this time $Z = (2^{n-1}-1)\times f_1$. The n-bit fixed-point number can represent a maximum of $(2^{n-1}-1)\times f_1 + O$ and a minimum of $-(2^{n-1}-1)\times f_1 + O$ in the number domain of the data to be quantized. In another situation, encountered in engineering applications, $2^s \times f_2$ is treated as a whole as the first scaling factor $f_1$; in that case the independent point location parameter s can be regarded as absent, where $f_2$ is the second scaling factor. Let Z be the maximum of the absolute values of all numbers in the number domain of the data to be quantized; then $f_1 = \frac{Z}{2^{n-1}-1}$, and at this time $Z = (2^{n-1}-1)\times f_1$. The n-bit fixed-point number can represent a maximum of $(2^{n-1}-1)\times f_1 + O$ and a minimum of $-(2^{n-1}-1)\times f_1 + O$ in the number domain of the data to be quantized.
As can be seen from equation (10), when the data to be quantized are quantized with the quantization parameters of the fifth case, the quantization interval is $f_1$; the quantization interval is denoted C.
The quantized n-bit binary value $I_x$ of data x is dequantized according to equation (11), $\hat{F}_x = I_x \times f_1 + O$, to obtain the dequantized data $\hat{F}_x$, where the data format of the dequantized data $\hat{F}_x$ is the same as that of the corresponding pre-quantization data $F_x$, both being floating-point values.
Based on the above description of the offset O, a sixth quantization-parameter case arises. Sixth case: the quantization parameters are the point location parameter, the second scaling factor $f_2$, and the offset O. In this case, the data to be quantized may be quantized using the following equation (12) to obtain the quantized data $I_x$:

$I_x = \mathrm{round}\left(\frac{F_x - O}{2^s \times f_2}\right)$ (12)
Here s is the point location parameter, O is the offset, $f_2$ is the second scaling factor, $I_x$ is the n-bit binary value representing data x after quantization, $F_x$ is the floating-point value of data x before quantization, and round denotes rounding to the nearest integer. It should be noted that the operation is not limited to round; rounding up, rounding down, rounding toward zero, and similar rounding operations may replace the round operation in equation (12). The maximum value A in the number domain of the data to be quantized that the n-bit fixed-point number can represent is $2^s(2^{n-1}-1)$. As can be seen from equation (12), when the data to be quantized are quantized with the quantization parameters of the sixth case, the quantization interval is $2^s \times f_2$; the quantization interval is denoted C.
Let Z be the maximum of the absolute values of all numbers in the number domain of the data to be quantized, i.e. $Z = \frac{Z_{max}-Z_{min}}{2}$. According to equation (2), $1 \ge \frac{Z}{2^s(2^{n-1}-1)} > \frac{1}{2}$, i.e. $1 \ge f_2 > \frac{1}{2}$ when $f_2 = \frac{Z}{2^s(2^{n-1}-1)}$. When $f_2 = \frac{Z}{2^s(2^{n-1}-1)}$, Z can, according to equation (2), be represented without loss of accuracy; when $f_2 = 1$, equation (12) coincides with equation (8). The n-bit fixed-point number can represent a maximum of $(2^{n-1}-1)\times 2^s \times f_2 + O$ and a minimum of $-(2^{n-1}-1)\times 2^s \times f_2 + O$ in the number domain of the data to be quantized.
The quantized n-bit binary value $I_x$ of data x is dequantized according to equation (13), $\hat{F}_x = I_x \times 2^s \times f_2 + O$, to obtain the dequantized data $\hat{F}_x$, where the data format of the dequantized data $\hat{F}_x$ is the same as that of the corresponding pre-quantization data $F_x$, both being floating-point values.
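As an illustration of the sixth case, which subsumes the offset variants, the following Python sketch (hypothetical names) quantizes according to equation (12) and dequantizes according to equation (13); setting $f_2 = 1$ recovers the fourth case.

```python
import numpy as np

def quantize_with_offset(data, n=8):
    """Sketch of the sixth case: point location parameter s, second scaling factor f2, and offset O."""
    qmax = 2 ** (n - 1) - 1
    z_min, z_max = float(np.min(data)), float(np.max(data))
    O = (z_min + z_max) / 2
    Z = (z_max - z_min) / 2
    s = int(np.ceil(np.log2(Z / qmax)))
    f2 = Z / (2 ** s * qmax)                                   # assumed lossless choice, as in the third case
    I = np.clip(np.round((data - O) / (2 ** s * f2)), -qmax, qmax).astype(np.int32)   # equation (12)
    F_hat = I * (2 ** s) * f2 + O                              # equation (13)
    return I, s, f2, O, F_hat
```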
The above detailed description of the six quantization-parameter cases is merely exemplary; in different embodiments the kinds of quantization parameters may differ from the description above. As can be seen from equations (1) through (13), both the point location parameter and the scaling factors are related to the data bit width: different data bit widths lead to different point location parameters and scaling factors, which in turn affect quantization accuracy. During training or fine-tuning, using the same data bit width for quantization within a certain range of iteration counts has little influence on the overall accuracy of the neural network operation; beyond a certain number of iterations, however, quantizing with the same data bit width no longer meets the accuracy requirements of training or fine-tuning. The data bit width n therefore needs to be adjusted as training or fine-tuning proceeds. One simple option is to set the data bit width n manually and, within different iteration ranges, read out a corresponding data bit width n that has been set in advance. However, as noted above, training with fixed-point numbers of low bit width is exceptionally complex, so this way of pre-setting the data bit width manually generally does not meet the needs of practical applications.
In the present technical solution, the data bit width n is adjusted according to the quantization error $diff_{bit}$. In more detail, the quantization error $diff_{bit}$ is compared with a threshold to obtain a comparison result. The threshold comprises a first threshold and a second threshold, with the first threshold greater than the second threshold, and the comparison result falls into one of three cases. In the first case, the quantization error $diff_{bit}$ is greater than or equal to the first threshold, and the data bit width is increased. In the second case, the quantization error $diff_{bit}$ is less than or equal to the second threshold, and the data bit width is reduced. In the third case, the quantization error $diff_{bit}$ lies between the first threshold and the second threshold, and the data bit width remains unchanged. In practical applications, the first threshold and the second threshold may be empirical values or variable hyperparameters; conventional hyperparameter optimization methods apply to both, and such optimization schemes are not repeated here.
It should be emphasized that the data bit width may be adjusted by a fixed step of bits, or by a variable step determined from the difference between the quantization error and the error threshold, and is ultimately made longer or shorter according to the actual needs of the neural network operation. For example: the data bit width n of the current convolution layer is 16, and according to the quantization error $diff_{bit}$ it is adjusted to 12. That is, in practical applications the data bit width n does not need to be 16 to meet the accuracy requirements of the neural network operation, so within the allowable accuracy range the fixed-point operation speed can be greatly improved, which improves the resource utilization of the artificial intelligence processor chip.
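A minimal Python sketch of the threshold comparison described above follows; the threshold values, step size, and bit-width bounds are assumed values for illustration and are not prescribed by this disclosure.

```python
def adjust_bit_width(diff_bit, n, first_threshold=0.5, second_threshold=0.1,
                     step=4, n_min=4, n_max=32):
    """Sketch: adjust the data bit width n according to the quantization error diff_bit."""
    if diff_bit >= first_threshold:      # first case: error too large, increase the bit width
        return min(n + step, n_max)
    if diff_bit <= second_threshold:     # second case: error very small, reduce the bit width
        return max(n - step, n_min)
    return n                             # third case: keep the bit width unchanged
```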
Regarding the quantization error $diff_{bit}$: the quantization error is determined from the quantized data and the corresponding pre-quantization data. In practical applications there are three ways to determine the quantization error, all of which apply to the present technical solution. The first way: the quantization error is determined according to equation (14) from the quantization interval, the number of quantized data, and the corresponding pre-quantization data.
Here C is the quantization interval used during quantization, m is the number of quantized data obtained after quantization, $F_i$ is a floating-point value of the data to be quantized, and i is the index of the data in the data set to be quantized.
The second way: the quantization error $diff_{bit}$ is determined according to equation (15) from the pre-quantization data and the corresponding dequantized data.
Here $F_i$ is a floating-point value of the data to be quantized, i is the index of the data in the data set to be quantized, and $\hat{F}_i$ is the dequantized data corresponding to that floating-point value.
The third way: the quantization error $diff_{bit}$ is determined according to equation (16) from the pre-quantization data and the corresponding dequantized data.
Here $F_i$ is a floating-point value of the data to be quantized, i is the index of the data in the data set to be quantized, and $\hat{F}_i$ is the dequantized data corresponding to that floating-point value.
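Because the bodies of equations (14) to (16) are not reproduced above, the following Python sketch only illustrates the general idea of the second and third ways, namely measuring the deviation between the pre-quantization data and the dequantized data; the particular relative-error metric used here is an assumption for illustration, not the exact formula of this disclosure.

```python
import numpy as np

def quantization_error(F, F_hat, eps=1e-12):
    """Sketch of an assumed relative-error metric between pre-quantization data F and
    dequantized data F_hat; it stands in for equations (15)/(16), whose exact forms are
    not reproduced in the text above."""
    return float(np.sum(np.abs(F_hat - F)) / (np.sum(np.abs(F)) + eps))
```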
It should be emphasized that the ways of obtaining the quantization error $diff_{bit}$ described above are only exemplary and not exhaustive. Those skilled in the art, upon understanding the essence of the technical solution of the present application, may derive other modifications or variations based on it; any variant formula that determines the quantization error from the quantized data and the corresponding pre-quantization data falls within the protection scope of the present application, as long as its function and technical effect are similar to those of the present application.
Regarding the data bit width: FIG. 5a and FIG. 5b are graphs of the fluctuation range of the weight data of a neural network during training. In FIGS. 5a and 5b, the abscissa is the number of iterations and the ordinate is the logarithm of the maximum absolute value of the weights. The weight-data fluctuation curve in FIG. 5a shows, for any convolution layer of the neural network, the fluctuation of the weight data over different iterations within the same epoch. In FIG. 5b, the conv0 layer corresponds to fluctuation curve A, the conv1 layer to curve B, the conv2 layer to curve C, the conv3 layer to curve D, and the conv4 layer to curve e. As can be seen from FIGS. 5a and 5b, in the early stage of training (epoch) the weights change substantially from iteration to iteration, while in the middle and later stages of training the weights change little between iterations. In the middle and later stages of training, because the weight data change little before and after each iteration, the weight data of the corresponding layers are similar within a certain iteration interval, so when quantizing the data involved in each layer during neural network training, each layer can reuse the data bit width used when quantizing the corresponding layer in the previous iteration. In the early stage of training, however, the weight data change substantially before and after each iteration; therefore, to meet the floating-point precision required of quantization, in each iteration of the early training stage the weight data of the corresponding layer of the current iteration are quantized either with the data bit width used when quantizing the corresponding layer in the previous iteration, or with a preset data bit width n of the current layer, to obtain the quantized fixed-point numbers. The quantization error $diff_{bit}$ is then determined from the quantized weight data and the corresponding pre-quantization weight data; according to the comparison of $diff_{bit}$ with the thresholds, the data bit width used in the previous iteration (or the preset data bit width n of the current layer) is adjusted, and the adjusted data bit width is applied to the quantization of the weight data of the corresponding layer in the current iteration. Furthermore, during training or fine-tuning, the weight data of different layers of the neural network are independent of each other and not similar, and because the weight data are not similar, the neuron data of different layers are not similar either. Therefore, during neural network training or fine-tuning, the data bit width of each layer within each iteration applies only to the corresponding neural network layer.
In the training or fine-tuning process of the neural network, the data bit widths corresponding to the neuron data and to the gradient data are handled analogously to the weight-data example above and are not repeated here.
In the neural network inference process, the weight data between layers of the neural network are independent of each other and not similar, and because the weight data are not similar, the neuron data between layers are not similar either. Therefore, in the neural network inference process, the data bit width of each layer of the neural network applies to the corresponding layer. In practical applications, the input neuron data of each inference may well be different or dissimilar, and because the weight data between layers of the neural network are independent of each other, the input neuron data of the hidden layers of the neural network are dissimilar from layer to layer. When quantizing, the data bit width used for the input neuron data of the previous layer may therefore not be suitable for the input neuron data of the current layer. Based on this, to meet the floating-point precision required of quantization, during inference the input neuron data of the current layer are quantized either with the data bit width used when quantizing the input neuron data of the previous layer, or with a preset data bit width n of the current layer, to obtain the quantized fixed-point numbers. The quantization error $diff_{bit}$ is determined from the pre-quantization input neuron data and the corresponding quantized input neuron data; according to the comparison of $diff_{bit}$ with the thresholds, the data bit width used when quantizing the input neuron data of the previous layer (or the preset data bit width n of the current layer) is adjusted, and the adjusted data bit width is applied to the quantization of the input neuron data of the current layer. The same applies to the data bit width corresponding to the weight data, and this is not repeated here.
As can be seen from FIG. 5a, in the early stage of training (epoch) the weights change substantially from iteration to iteration. In the middle and later stages of training, because the weight data change little before and after each iteration, the weight data of the corresponding layers are similar within a certain iteration interval, so when quantizing, the data of each layer in the current iteration can carry over the quantization parameters of the corresponding data of the corresponding layer in the previous iteration. In that case the quantization parameters need not be re-determined in every iteration in the middle and later stages of training; they are determined in each layer of each iteration only in the early stage of training. This still meets the floating-point precision required of the neural network operation while greatly improving quantization efficiency. Furthermore, during training or fine-tuning, the weight data of different layers of the neural network are independent of each other and not similar, and because the weight data are not similar, the neuron data of different layers are not similar either. Therefore, during neural network training or fine-tuning, the quantization parameters of each layer within each iteration apply to the corresponding data to be quantized of the corresponding layer.
In the training or fine-tuning process of the neural network, the quantization parameters corresponding to the neuron data and to the gradient data are handled analogously to the weight-data example above and are not repeated here.
In the neural network inference process, the weight data between layers of the neural network are independent of each other and not similar, and because the weight data are not similar, the neuron data between layers are not similar either. Therefore, in the neural network inference process, the quantization parameter of each layer of the neural network applies to the data to be quantized of the corresponding layer. For example: the current layer of the neural network is a convolution layer. According to the data to be quantized of this convolution layer, the quantization parameter of the data to be quantized of the current convolution layer is obtained by the technical solution shown in FIG. 2; this quantization parameter can only be applied to the current convolution layer and cannot be applied to other layers of the neural network, even if those other layers are also convolution layers.
In summary, the strategy of carrying over (reusing) the data bit width and the quantization parameters is determined based on the similarity between data: if the data are similar, the data bit width and the quantization parameters may be carried over; if the data are not similar, the data bit width or the quantization parameters need to be adjusted. The similarity between data is usually measured by the KL divergence, but it may also be measured by the following equation (17):
$\mathrm{absmax}(A) \approx \mathrm{absmax}(B)$ and $\mathrm{mean}(A) \approx \mathrm{mean}(B)$ (17)
In some embodiments, if data A and data B satisfy equation (17), then it is determined that there is similarity between data A and data B.
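A Python sketch of the similarity test of equation (17) follows (hypothetical names); the tolerance used for the approximate equality is an assumed hyperparameter.

```python
import numpy as np

def is_similar(a, b, rel_tol=0.1):
    """Sketch of the equation (17) test: A and B are similar if absmax(A) ~ absmax(B)
    and mean(A) ~ mean(B)."""
    def close(x, y):
        return abs(x - y) <= rel_tol * max(abs(x), abs(y), 1e-12)
    return close(float(np.max(np.abs(a))), float(np.max(np.abs(b)))) and \
           close(float(np.mean(a)), float(np.mean(b)))
```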
It should be noted that the methods described above for determining the quantization error and adjusting the data bit width, and the strategies for carrying over the data bit width and quantization parameters, are only exemplary cases, not exhaustive. For example, the method for determining the quantization error, the method for adjusting the data bit width, and the carry-over strategies for the data bit width and the quantization parameters are all applicable to the fine-tuning process of the neural network. Likewise, as to measuring the similarity between data, the KL divergence and the measure of equation (17) listed above are only partial examples, not exhaustive; other options include histogram matching, matrix decomposition, feature-point-based image similarity calculation, and proximity metrics. Those skilled in the art, upon understanding the essence of the technical solution of the present application, may derive other modifications or variations based on it, but as long as the functions and technical effects achieved are similar to those of the present application, they shall fall within the protection scope of the present application.
In summary, in the middle and later stages of training, because the weight data change little before and after each iteration, the weight data of the corresponding layers are similar within a certain iteration interval. In order to make the technical solution more general in training or fine-tuning and to make reasonable use of the resources of the artificial intelligence processor chip, a strategy is needed to determine an iteration interval, such that the data bit width n of the corresponding layer remains unchanged within the iteration interval and, beyond the iteration interval, the data bit width n changes, so that it is unnecessary to decide in every iteration whether to adjust the data bit width n. The same holds for the quantization parameters. In this way the peak computing power of the artificial intelligence processor chip is improved while the floating-point precision required of quantization is still met.
As shown in FIG. 6, one flowchart of a method for determining the target iteration interval is provided. In the technical solution shown in FIG. 6, the target iteration interval includes at least one weight-update iteration, and the same data bit width is used for quantization within the same target iteration interval. The steps of determining the target iteration interval include:
Step 601): determining, at a pre-judgment time point, the trend value of the point location parameter corresponding to the data to be quantized involved in the weight-update iteration process; the pre-judgment time point is used to judge whether the data bit width needs to be adjusted, and corresponds to a time point at which a weight-update iteration is completed.
In this step, the trend value of the point location parameter is determined according to equation (18), either from the moving average of the point location parameter over the weight iterations corresponding to the current pre-judgment time point and the moving average of the point location parameter over the weight iterations corresponding to the previous pre-judgment time point, or from the point location parameter over the weight iterations corresponding to the current pre-judgment time point and the moving average of the point location parameter over the weight iterations corresponding to the previous pre-judgment time point. Equation (18) is:
$diff_{update1} = |M^{(t)} - M^{(t-1)}| = \alpha|s^{(t)} - M^{(t-1)}|$ (18)
In equation (18), M is the moving average of the point location parameter s as the training iterations increase. $M^{(t)}$ is the moving average, as training iterations increase, of the point location parameter s corresponding to the t-th pre-judgment time point, and is obtained according to equation (19). $s^{(t)}$ is the point location parameter s corresponding to the t-th pre-judgment time point, $M^{(t-1)}$ is the moving average of the point location parameter s corresponding to the (t-1)-th pre-judgment time point, and α is a hyperparameter. $diff_{update1}$ measures the trend of the point location parameter s; since a change in the point location parameter s correspondingly reflects a change in the maximum value $Z_{max}$ of the data in the current data to be quantized, the larger $diff_{update1}$ is, the more drastically the numeric range is changing, and the shorter the update interval, i.e. the smaller the target iteration interval, that is required.
$M^{(t)} \leftarrow \alpha \times s^{(t-1)} + (1-\alpha)\times M^{(t-1)}$ (19)
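The update of the moving average and of the trend value can be sketched in Python as follows (hypothetical names), following equations (18) and (19) as printed above.

```python
def update_trend(s_prev, M_prev, alpha=0.9):
    """Sketch: equation (19) updates the moving average M of the point location parameter,
    and equation (18) gives the trend value diff_update1."""
    M_t = alpha * s_prev + (1 - alpha) * M_prev   # equation (19) as printed above
    diff_update1 = abs(M_t - M_prev)              # equation (18)
    return M_t, diff_update1
```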
Step 602): determining the corresponding target iteration interval according to the trend value of the point location parameter.
In the present solution, a target iteration interval is determined according to equation (20). For the target iteration interval, the same data bit width is adopted in the quantization process in the same target iteration interval, and the data bit widths adopted in the quantization process in different target iteration intervals can be the same or different.
In equation (20), I is the target iteration interval and $diff_{update1}$ is the trend value of the point location parameter. β and γ are empirical values and may also be variable hyperparameters; conventional hyperparameter optimization methods apply to both, and such optimization schemes are not repeated here.
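Equation (20) itself is not reproduced above; the following Python sketch therefore assumes, purely for illustration, a form in which the interval shrinks as the trend value grows.

```python
def target_interval(diff_update1, beta=100.0, gamma=2.0):
    """Sketch of an assumed form of equation (20): the larger the trend value,
    the smaller the target iteration interval."""
    return max(int(beta / max(diff_update1, 1e-12) - gamma), 1)
```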
For this technical solution, the pre-judgment time points include a first pre-judgment time point, which is determined according to the target iteration interval. Specifically, at the t-th pre-judgment time point during training or fine-tuning, the weight data of the corresponding layer of the current iteration are quantized with the data bit width used when quantizing the corresponding layer in the previous iteration, the quantized fixed-point numbers are obtained, and the quantization error $diff_{bit}$ is determined from the pre-quantization weight data and the corresponding quantized weight data. The quantization error $diff_{bit}$ is compared with the first threshold and the second threshold respectively, and the comparison result is used to decide whether to adjust the data bit width used when quantizing the corresponding layer in the previous iteration. Suppose the t-th first pre-judgment time point corresponds to the 100th iteration, and the 99th iteration uses data bit width $n_1$. At the 100th iteration, the quantization error $diff_{bit}$ is determined based on data bit width $n_1$ and compared with the first and second thresholds to obtain a comparison result. If the comparison result confirms that $n_1$ need not change and equation (20) gives a target iteration interval of 8 iterations, then when the 100th iteration is taken as the starting iteration of the current target iteration interval, the 100th to 107th iterations form the current target iteration interval; when the 100th iteration is taken as the last iteration of the previous target iteration interval, the 101st to 108th iterations form the current target iteration interval. Within the current target iteration interval, each iteration still carries over the data bit width $n_1$ used in the previous target iteration interval; in this case the data bit widths used for quantization in different target iteration intervals may be the same. If the 100th to 107th iterations form the current target iteration interval, the 108th iteration in the next target iteration interval serves as the (t+1)-th first pre-judgment time point; if the 101st to 108th iterations form the current target iteration interval, the 108th iteration in the current target iteration interval serves as the (t+1)-th first pre-judgment time point. At the (t+1)-th first pre-judgment time point, the quantization error $diff_{bit}$ is determined based on data bit width $n_1$ and compared with the first and second thresholds to obtain a comparison result. Suppose the comparison result shows that $n_1$ needs to be changed to $n_2$, and equation (20) gives a target iteration interval of 55 iterations. Then the 108th to 163rd iterations, or the 109th to 163rd iterations, form the next target iteration interval, and each iteration within that interval uses data bit width $n_2$ when quantizing. In this case the data bit widths used for quantization in different target iteration intervals may be different.
For this technical solution, whether the first pre-judgment time point is the starting iteration or the last iteration within the target iteration interval, equation (18) is applicable for obtaining the trend value of the point location parameter. If the current first pre-judgment time point is the starting iteration of the current target iteration interval, then in equation (18), $M^{(t)}$ is the moving average, as training iterations increase, of the point location parameter s corresponding to the time point of the starting iteration of the current target iteration interval, $s^{(t)}$ is the point location parameter s corresponding to the time point of the starting iteration of the current target iteration interval, and $M^{(t-1)}$ is the moving average, as training iterations increase, of the point location parameter s corresponding to the time point of the starting iteration of the previous target iteration interval. If the current first pre-judgment time point is the last iteration of the current target iteration interval, then in equation (18), $M^{(t)}$ is the moving average, as training iterations increase, of the point location parameter s corresponding to the time point of the last iteration of the current target iteration interval, $s^{(t)}$ is the point location parameter s corresponding to the time point of the last iteration of the current target iteration interval, and $M^{(t-1)}$ is the moving average, as training iterations increase, of the point location parameter s corresponding to the time point of the last iteration of the previous target iteration interval.
For this technical solution, the pre-judgment time points may further include second pre-judgment time points in addition to the first pre-judgment time points. The second pre-judgment time points are determined according to the data fluctuation curve. The data fluctuation curve shown in FIG. 5a is obtained from statistics of the data fluctuation of big data during neural network training.
Taking the weight data as an example, the data fluctuation curve of FIG. 5a shows that, in the iteration interval from the start of training to the T-th iteration, the data fluctuation with each weight update is very large. At the current pre-judgment time point, during quantization the current iteration first quantizes with the data bit width $n_1$ of the previous iteration; the quantization result and the corresponding pre-quantization data determine the quantization error, the quantization error is compared with the first threshold and the second threshold respectively, and according to the comparison result the data bit width $n_1$ is adjusted to obtain data bit width $n_2$. The weight data to be quantized involved in the current iteration are then quantized with data bit width $n_2$. A target iteration interval is then determined according to equation (20), which determines the first pre-judgment time point at which it is judged whether and how to adjust the data bit width, and the next target iteration interval is in turn determined according to equation (20) to obtain the next first pre-judgment time point. Because the weight data change very substantially before and after each iteration in the interval from the start of training to the T-th iteration, the weight data of the corresponding layers of adjacent iterations are not similar. To meet the accuracy requirement, in each of the first T iterations the data of each layer of the current iteration cannot carry over the quantization parameters of the corresponding layer of the previous iteration, and the data bit width must be adjusted in every one of the first T iterations; the data bit width used when quantizing therefore differs from iteration to iteration in the first T iterations, and the target iteration interval is 1 iteration. For optimal use of the resources of the artificial intelligence processor chip, the target iteration interval of the first T iterations may be preset according to the regularity revealed by the data fluctuation curve of FIG. 5a: that is, the target iteration interval of the first T iterations is preset directly from the data fluctuation curve, without computing it via equation (20), and the time point at which the weight-update iteration of each of the first T iterations is completed is taken as a second pre-judgment time point. This makes more reasonable use of the resources of the artificial intelligence processor chip. From the T-th iteration onward, the data fluctuation curve of FIG. 5a fluctuates little, so in the middle and later stages of training the quantization parameters need not be re-determined every iteration; at the T-th or (T+1)-th iteration, the quantization error is determined from the pre-quantization data and the quantized data corresponding to the current iteration, whether the data bit width needs to be adjusted is determined from the quantization error, and the target iteration interval is determined according to equation (20).
If the confirmed target iteration interval is 55 iterations, then from the time point corresponding to the T-th or (T+1)-th iteration, every 55 iterations a first pre-judgment time point is set at which it is judged whether and how to adjust the data bit width, and the next target iteration interval is determined according to equation (20), thereby determining the next first pre-judgment time point, until all iterations within the same epoch are completed. On this basis, after each epoch, the data bit width or the quantization parameter is adaptively adjusted, and finally the quantized data are used to obtain a neural network whose accuracy meets expectations.
In particular, suppose that, based on the weight-data fluctuation curve of FIG. 5a, the value of T is determined to be 130 (this value does not correspond to FIG. 5a; for convenience of description T is merely assumed to be 130, and the assumption is not limiting). Then the 130th iteration of the training process is a second pre-judgment time point, and the current first pre-judgment time point is the 100th iteration of the training process; at the 100th iteration, equation (20) gives a target iteration interval of 35 iterations. Within that target iteration interval, training proceeds to the 130th iteration, which reaches the second pre-judgment time point; at the time point corresponding to the 130th iteration it is determined whether and how the data bit width needs to be adjusted, and the target iteration interval is determined according to equation (20). Suppose the target iteration interval determined in this case is 42 iterations. Then the 130th to 172nd iterations form a target iteration interval, and the 135th iteration, which is the first pre-judgment time point determined when the target iteration interval was 35 iterations, lies inside this 42-iteration interval. At the 135th iteration, it may again be judged according to equation (20) whether and how the data bit width needs to be adjusted; alternatively, the evaluation and pre-judgment may be skipped at the 135th iteration and performed only at the 172nd iteration. In short, whether or not evaluation and pre-judgment are performed at the 135th iteration, both options are suitable for the present technical solution.
In summary, second pre-judgment time points are preset according to the data fluctuation curve, so that in the early stage of training or fine-tuning no resources of the artificial intelligence processor chip need to be spent determining target iteration intervals; at the preset second pre-judgment time points, the data bit width is adjusted directly according to the quantization error, and the adjusted data bit width is used to quantize the data to be quantized involved in the current iteration. In the middle and later stages of training or fine-tuning, the target iteration interval is obtained according to equation (20), thereby determining the corresponding first pre-judgment time points, at each of which it is judged whether and how to adjust the data bit width. In this way the floating-point precision required of the neural network operation is met, the resources of the artificial intelligence processor chip are used reasonably, and quantization efficiency is greatly improved.
In practice, to obtain a more accurate target iteration interval for the data bit width, not only the trend value $diff_{update1}$ of the point location parameter may be used; the trend value $diff_{update1}$ of the point location parameter and the trend value $diff_{update2}$ of the data bit width may also be considered simultaneously. As shown in FIG. 7, a second flowchart of the method for determining the target iteration interval is provided. The steps of determining the target iteration interval include:
Step 701): determining, at a pre-judgment time point, the trend value of the point location parameter and the trend value of the data bit width corresponding to the data to be quantized involved in the weight-update iteration process; the pre-judgment time point is used to judge whether the data bit width needs to be adjusted, and corresponds to a time point at which a weight-update iteration is completed.
It should be emphasized that the content of the technical solution shown in FIG. 6 for determining the target iteration interval of the data bit width based on the trend value of the point location parameter also applies to the technical solution shown in FIG. 7 and is not repeated here.
In this step, a trend value of the data bit width is determined using the quantization error according to equation (21).
In equation (21), δ is a hyperparameter, $diff_{bit}$ is the quantization error, and $diff_{update2}$ is the trend value of the data bit width. $diff_{update2}$ measures the trend of the data bit width n used in quantization: the larger $diff_{update2}$ is, the more likely it is that the fixed-point bit width needs updating, and the shorter the update interval that is required.
The trend value of the point location parameter referred to in FIG. 7 can still be obtained from equation (18), with $M^{(t)}$ in equation (18) obtained according to equation (19). $diff_{update1}$ measures the trend of the point location parameter s; since a change in the point location parameter s correspondingly reflects a change in the maximum value $Z_{max}$ of the data in the current data to be quantized, the larger $diff_{update1}$ is, the more drastically the numeric range is changing, and the shorter the update interval, i.e. the smaller the target iteration interval, that is required.
Step 702): determining the corresponding target iteration interval according to the trend value of the point location parameter and the trend value of the data bit width.
In the present solution, the target iteration interval is determined according to formula (22). For the target iteration interval, the same data bit width is adopted in the quantization process in the same target iteration interval, and the data bit widths adopted in the quantization process in different target iteration intervals can be the same or different.
In equation (22), I is the target iteration interval, β and γ are hyperparameters, $diff_{update1}$ is the trend value of the point location parameter, and $diff_{update2}$ is the trend value of the data bit width. β and γ are empirical values and may also be variable hyperparameters; conventional hyperparameter optimization methods apply to both, and such optimization schemes are not repeated here.
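Equations (21) and (22) are likewise not reproduced above; the following Python sketch assumes, purely for illustration, that $diff_{update2}$ is proportional to the quantization error and that the interval is driven by the larger of the two trend values.

```python
def target_interval_combined(diff_update1, diff_bit, delta=0.5, beta=100.0, gamma=2.0):
    """Sketch under assumed forms of equations (21) and (22): diff_update2 is taken as
    delta * diff_bit, and the interval is driven by the larger of the two trend values."""
    diff_update2 = delta * diff_bit
    trend = max(diff_update1, diff_update2)
    return max(int(beta / max(trend, 1e-12) - gamma), 1)
```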
For the present technical solution, $diff_{update1}$ is used to measure the change of the point location parameter s, but the change of s caused by a change of the data bit width n should be excluded from it, because that change is already reflected in $diff_{update2}$. If that effect were not removed from $diff_{update1}$, the target iteration interval I determined according to equation (22) would be inaccurate, leading to too many first pre-judgment time points; during training or fine-tuning, the question of whether and how to update the data bit width n would then be raised too frequently, resulting in unreasonable use of the resources of the artificial intelligence processor chip.
Based on the above, $diff_{update1}$ is determined from $M^{(t)}$. Assume the data bit width corresponding to the (t-1)-th pre-judgment time point is $n_1$, the corresponding point location parameter is $s_1$, and the moving average of the point location parameter as training iterations increase is $m_1$. The data to be quantized are quantized with data bit width $n_1$ to obtain the quantized fixed-point numbers. The quantization error $diff_{bit}$ is determined from the pre-quantization data and the corresponding quantized data, and according to the comparison of $diff_{bit}$ with the thresholds, the data bit width $n_1$ is adjusted to $n_2$; the data bit width is thus adjusted by $|n_1 - n_2|$ bits, and the data bit width used for quantization at the t-th pre-judgment time point is $n_2$. To exclude the change of the point location parameter caused by the change of the data bit width, one of the following two optimization ways may be chosen when determining $M^{(t)}$. The first way: if the data bit width increases by $|n_1 - n_2|$ bits, take $s^{(t-1)} = s_1 - |n_1 - n_2|$ and $M^{(t-1)} = m_1 - |n_1 - n_2|$, and substitute $s^{(t-1)}$ and $M^{(t-1)}$ into equation (19) to obtain $M^{(t)}$, the moving average of the point location parameter corresponding to the t-th pre-judgment time point as training iterations increase; if the data bit width decreases by $|n_1 - n_2|$ bits, take $s^{(t-1)} = s_1 + |n_1 - n_2|$ and $M^{(t-1)} = m_1 + |n_1 - n_2|$, and substitute them into equation (19) to obtain $M^{(t)}$. The second way: whether the data bit width increases or decreases by $|n_1 - n_2|$ bits, take $s^{(t-1)} = s_1$ and $M^{(t-1)} = m_1$ and substitute them into equation (19) to obtain $M^{(t)}$; then, when the data bit width increases by $|n_1 - n_2|$ bits, subtract $|n_1 - n_2|$ from $M^{(t)}$, and when it decreases by $|n_1 - n_2|$ bits, add $|n_1 - n_2|$ to $M^{(t)}$, and take the result as the moving average of the point location parameter corresponding to the t-th pre-judgment time point as training iterations increase. The two ways are equivalent; both exclude the change of the point location parameter caused by the change of the data bit width, yield a more accurate target iteration interval, and improve the resource utilization of the artificial intelligence processor chip.
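The first optimization way described above can be sketched in Python as follows (hypothetical names); it shifts the point location statistics by the number of bits by which the data bit width changed before applying equation (19), so that the trend value is not polluted by the bit-width change.

```python
def update_trend_bitwidth_aware(s1, m1, n1, n2, alpha=0.9):
    """Sketch of the first optimization way: compensate the point location statistics for a
    data bit width change from n1 to n2 before applying equations (19) and (18)."""
    shift = abs(n1 - n2)
    if n2 > n1:                      # bit width increased: the point location parameter drops by `shift`
        s_prev, M_prev = s1 - shift, m1 - shift
    elif n2 < n1:                    # bit width decreased: the point location parameter rises by `shift`
        s_prev, M_prev = s1 + shift, m1 + shift
    else:
        s_prev, M_prev = s1, m1
    M_t = alpha * s_prev + (1 - alpha) * M_prev   # equation (19)
    diff_update1 = abs(M_t - M_prev)              # equation (18)
    return M_t, diff_update1
```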
In practical applications, the data bit width n and the point location parameter s have a large influence on quantization accuracy, while the second scaling factor $f_2$ and the offset O among the quantization parameters have little influence. As for the first scaling factor $f_1$, as mentioned above, in the second case $2^s \times f_2$ is treated as a whole as the first scaling factor $f_1$; since the point location parameter s has a large influence on quantization accuracy, in that case the first scaling factor $f_1$ has a large influence on quantization. Therefore, in the present technical solution, the point location parameter s is variable regardless of whether the data bit width n changes, and determining a target iteration interval for the point location parameter s is also very meaningful; the idea of the technical solution shown in FIG. 6 can be applied to determining the target iteration interval of the point location parameter s. A method for determining the target iteration interval of the point location parameter s is therefore shown in FIG. 8, comprising:
Step 801): determining, at a pre-judgment time point, the trend value of the point location parameter corresponding to the data to be quantized involved in the weight-update iteration process; the pre-judgment time point is a time point used to judge whether the quantization parameter needs to be adjusted, and corresponds to a time point at which a weight-update iteration is completed.
Step 802): determining the corresponding target iteration interval according to the trend value of the point location parameter.
It should be emphasized that the content of the technical solution of FIG. 6 regarding determining the target iteration interval of the quantization parameter based on the trend value of the point location parameter also applies to the technical solution shown in FIG. 8 and is not repeated here. For the solution shown in FIG. 8, the quantization parameter is preferably the point location parameter.
It should be noted that the above descriptions of determining the target iteration interval of the data bit width and determining the target iteration interval of the quantization parameter are only exemplary and not exhaustive. Those skilled in the art, upon understanding the gist of the present application, may derive other modifications or variations based on the technical solution of the present application, for example: re-determining the target iteration interval of the quantization parameter within the target iteration interval of the data bit width is also applicable to the solutions shown in FIGS. 6, 7 and 8. Such variations shall fall within the protection scope of the present application as long as the functions and technical effects achieved are similar to those of the present application.
According to the above technical solution, the quantization parameters are determined, the data bit width or the quantization parameters are adjusted according to the quantization error, and target iteration intervals are determined for deciding whether to adjust the data bit width or the quantization parameters, so that the data bit width or the quantization parameters are adjusted at appropriate time points during neural network operation and appropriate quantization parameters are used at appropriate iterations. This allows the artificial intelligence processor chip to execute the neural network operation at the speed of fixed-point arithmetic, improving the peak computing power of the chip while meeting the floating-point precision required by the operation.
It should be noted that, for simplicity of description, the foregoing method embodiments are all depicted as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
Further, although the steps in the flowcharts of fig. 2, 6, 7, and 8 are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps of fig. 2, 6, 7, 8 may include multiple sub-steps or phases that are not necessarily performed at the same time, but may be performed at different times, nor does the order in which the sub-steps or phases are performed necessarily occur sequentially, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or phases of other steps.
As shown in FIG. 9, a block diagram of a hardware configuration of a quantization parameter determination device of a neural network is provided. In FIG. 9, the quantization parameter determination device 10 of the neural network may include a processor 110 and a memory 120. In the quantization parameter determination device 10 of the neural network of FIG. 9, only the constituent elements related to the present embodiment are shown. It will therefore be apparent to those of ordinary skill in the art that the quantization parameter determination device 10 of the neural network may further include common constituent elements other than those shown in FIG. 9, for example a fixed-point arithmetic unit.
The quantization parameter determination apparatus 10 of the neural network may correspond to a computing device having various processing functions, for example, functions for generating the neural network, training or learning the neural network, quantizing the floating point type neural network into the fixed point type neural network, or retraining the neural network. For example, the quantization parameter determination apparatus 10 of the neural network may be implemented as various types of devices, such as a Personal Computer (PC), a server device, a mobile device, and the like.
The processor 110 controls all functions of the quantization parameter determining device 10 of the neural network. For example, the processor 110 controls all functions of the quantization parameter determination apparatus 10 of the neural network by executing a program stored in the memory 120 on the quantization parameter determination apparatus 10 of the neural network. The processor 110 may be implemented by a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application Processor (AP), an artificial intelligence processor chip (IPU), or the like provided in the quantization parameter determination apparatus 10 of the neural network. However, the present disclosure is not limited thereto.
The memory 120 is hardware for storing various data processed in the quantization parameter determination device 10 of the neural network. For example, the memory 120 may store processed data and data to be processed in the quantization parameter determination device 10 of the neural network. The memory 120 may store data sets involved in the neural network operation process that the processor 110 has processed or is to process, e.g., data of an untrained initial neural network, intermediate data of a neural network generated during training, data of a neural network that has completed all training, data of a quantized neural network, etc. Further, the memory 120 may store an application, a driver, or the like to be driven by the quantization parameter determining device 10 of the neural network. For example: the memory 120 may store various programs related to training algorithms, quantization algorithms, etc. of the neural network to be executed by the processor 110. The memory 120 may be a DRAM, but the present disclosure is not limited thereto. The memory 120 may include at least one of volatile memory or nonvolatile memory. The nonvolatile memory may include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), flash memory, phase change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), and the like. Volatile memory can include Dynamic RAM (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), PRAM, MRAM, RRAM, ferroelectric RAM (FeRAM), and the like. In an embodiment, the memory 120 may include at least one of a Hard Disk Drive (HDD), a Solid State Drive (SSD), a high density flash memory (CF), a Secure Digital (SD) card, a Micro-secure digital (Micro-SD) card, a Mini-secure digital (Mini-SD) card, an extreme digital (xD) card, a cache (cache), or a memory stick.
The processor 110 may generate a trained neural network by iteratively training (learning) a given initial neural network. In this state, to ensure the processing accuracy of the neural network, the parameters of the initial neural network are in a high-precision data representation format, for example, a data representation format with 32-bit floating-point precision. The parameters may include various types of data input to or output from the neural network, such as input/output neurons, weights, and biases of the neural network. Compared with fixed-point operations, floating-point operations require a relatively large amount of computation and relatively frequent memory access; in particular, most of the operations required for neural network processing are various convolution operations. Thus, in mobile devices with relatively low processing capability (such as smartphones, tablets, wearable devices, embedded devices, and the like), high-precision data operations of the neural network may prevent the resources of the mobile device from being used efficiently. As a result, in order to drive the neural network operation within an allowable accuracy loss range while sufficiently reducing the amount of computation in the above-described devices, the high-precision data involved in the neural network operation can be quantized and converted into low-precision fixed-point numbers.
In consideration of the processing performance of devices such as mobile devices and embedded devices in which the neural network is deployed, the quantization parameter determination device 10 of the neural network performs quantization that converts the parameters of the trained neural network into fixed-point numbers having a specific number of bits, and transmits the corresponding quantization parameters to the device in which the neural network is deployed, so that the artificial intelligence processor chip performs fixed-point number operations when carrying out operations such as training, fine-tuning, and the like. The device in which the neural network is deployed may be an autonomous vehicle, a robot, a smartphone, a tablet device, an Augmented Reality (AR) device, an Internet of Things (IoT) device, or the like that performs voice recognition, image recognition, or the like by using the neural network, but the present disclosure is not limited thereto.
The processor 110 retrieves data from the memory 120 during the neural network operation. The data comprises at least one of neurons, weights, biases, and gradients; the corresponding quantization parameters are determined using the technical scheme shown in fig. 2, and the target data in the neural network operation process is quantized using the quantization parameters. The neural network operation is then executed on the quantized data. The arithmetic operations include, but are not limited to, training, fine-tuning, and inference.
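For concreteness, the following is a minimal sketch of such a quantization step, assuming the common power-of-two (point position) formulation in which a floating-point value is scaled by 2^-s and rounded into an n-bit signed integer range; the function names and the exact rounding and clipping choices are illustrative assumptions rather than the formulas of this disclosure.

```python
import numpy as np

def quantize(x, s, n):
    """Map float data x onto an n-bit signed fixed-point grid with point position s (sketch)."""
    qmax = 2 ** (n - 1) - 1
    qmin = -(2 ** (n - 1))
    return np.clip(np.round(x / (2.0 ** s)), qmin, qmax).astype(np.int32)

def dequantize(q, s):
    """Return quantized values to floating point, e.g. for checking the quantization error."""
    return q.astype(np.float32) * (2.0 ** s)
```

With such helpers, the flow described above amounts to: determine the point position (and bit width n) from the statistics of the data, call quantize on the target data, and hand the integer result to the fixed-point arithmetic unit.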
The processor 110 adjusts the data bit width n based on the quantization error diff_bit, and the processor 110 may execute the program of the target iteration interval methods shown in fig. 6, 7, and 8 to determine the target iteration interval of the data bit width or the target iteration interval of the quantization parameter.
In summary, the specific functions of the memory 120 and the processor 110 of the quantization parameter determination device of the neural network provided in the embodiments of the present disclosure may be understood with reference to the previous embodiments of the present disclosure and can achieve the technical effects of the previous embodiments, which will not be repeated here.
In this embodiment, the processor 110 may be implemented in any suitable manner. For example, the processor 110 may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, an embedded microcontroller, and the like.
As shown in fig. 10, an application diagram of the quantization parameter determination device of the neural network provided in the present application to an artificial intelligence processor chip is shown. Referring to fig. 10, as described above, in the quantization parameter determination device 10 of the neural network, such as a PC or a server, the processor 110 performs the quantization operation to quantize floating-point data involved in the neural network operation into fixed-point numbers, and the fixed-point arithmetic unit on the artificial intelligence processor chip performs training, fine-tuning, or inference using the fixed-point numbers obtained by the quantization. An artificial intelligence processor chip is dedicated hardware for driving a neural network. Because the artificial intelligence processor chip is implemented with relatively low power or performance, this technical scheme uses low-precision fixed-point numbers to implement the neural network operation; compared with high-precision data, reading low-precision fixed-point numbers requires less memory bandwidth and makes better use of the caches of the artificial intelligence processor chip, thereby avoiding memory access bottlenecks. Meanwhile, when SIMD instructions are executed on the artificial intelligence processor chip, more computation is completed in one clock cycle, so that the neural network operation is executed faster.
Further, comparing fixed-point operations with high-precision data operations of the same bit length, and especially fixed-point operations with floating-point operations, the computation of a floating-point operation is more complex and requires more logic devices to build the floating-point operator. In terms of size, the floating-point operator is therefore larger than the fixed-point operator. Moreover, the floating-point operator requires more resources to run, and the power consumption gap between fixed-point operations and floating-point operations is usually orders of magnitude.
In summary, according to the technical scheme, the floating-point operator on the artificial intelligence processor chip can be replaced by a fixed-point operator, so that the power consumption of the artificial intelligence processor chip is lower. This is particularly important for mobile devices. In other words, the technical scheme opens the door to a large number of embedded systems that cannot efficiently run floating-point computing code, making widespread application of the Internet of Things possible.
In the present technical solution, the artificial intelligence processor chip may correspond to, for example, a Neural Processing Unit (NPU), a Tensor Processing Unit (TPU), a neural engine, etc., which are dedicated chips for driving the neural network, but the present disclosure is not limited thereto.
In this embodiment, the artificial intelligence processor chip may be implemented in a separate device independent of the quantization parameter determining device 10 of the neural network, and the quantization parameter determining device 10 of the neural network may also be implemented as a part of the functional modules of the artificial intelligence processor chip. The present disclosure is not limited thereto.
In the technical scheme, an operating system of a general-purpose processor (such as a CPU) generates an instruction based on the technical scheme and sends the generated instruction to an artificial intelligence processor chip (such as a GPU), and the artificial intelligence processor chip executes the instruction to determine the quantization parameters of the neural network and perform the quantization process. In another application, the general-purpose processor directly determines the corresponding quantization parameters based on the technical scheme and directly quantizes the corresponding target data according to the quantization parameters, and the artificial intelligence processor chip performs fixed-point operations using the quantized data. Furthermore, the general-purpose processor (such as a CPU) and the artificial intelligence processor chip (such as a GPU) may operate in a pipelined manner: the operating system of the general-purpose processor generates instructions based on the technical scheme, and the artificial intelligence processor chip performs the neural network operation while the target data is being copied, so that part of the time consumption can be hidden. The present disclosure is not limited thereto.
In this embodiment, an embodiment of the present application further provides a readable storage medium having stored thereon a computer program which, when executed, implements the method for determining quantization parameters of a neural network described above.
From the above, during the neural network operation, the technical scheme of the present disclosure is used to determine the quantization parameters, and the quantization parameters are used by the artificial intelligence processor to quantize the data involved in the neural network operation, converting high-precision data into low-precision fixed-point numbers, which can reduce the storage space required for all of the data involved in the neural network operation. For example, converting float32 to fix8 can reduce the model parameters by a factor of 4. Because the data storage space is reduced, a smaller footprint is used when the neural network is deployed, the on-chip memory of the artificial intelligence processor chip can accommodate more data, memory accesses by the artificial intelligence processor chip are reduced, and computing performance is improved.
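As a quick, hypothetical illustration of the factor-of-4 saving mentioned above (the layer size is arbitrary; the ratio follows simply from 32-bit versus 8-bit storage):

```python
import numpy as np

weights_fp32 = np.random.randn(1_000_000).astype(np.float32)   # hypothetical layer parameters
weights_int8 = np.empty_like(weights_fp32, dtype=np.int8)      # container for the quantized values

print(weights_fp32.nbytes)   # 4,000,000 bytes
print(weights_int8.nbytes)   # 1,000,000 bytes -> 4x smaller
```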
Those skilled in the art will also appreciate that, in addition to implementing the client and server purely as computer-readable program code, it is entirely possible to implement the same functions by logically programming the method steps so that the client and server take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a client and server may therefore be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component; or the means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
As shown in fig. 11, a functional block diagram of a quantization parameter determination device of a neural network is provided. The device comprises:
a statistical result obtaining unit a, configured to obtain a statistical result of each type of data to be quantized; the data to be quantized comprises at least one data of neurons, weights, gradients and biases of the neural network;
a quantization parameter determining unit b, configured to determine a corresponding quantization parameter using a statistical result of each type of data to be quantized and a data bit width; the quantization parameter is used for the artificial intelligence processor to correspondingly quantize the data in the operation process of the neural network.
In this embodiment, optionally, the quantization parameter determining device of the neural network further includes:
and the first quantization unit is used for quantizing the data to be quantized by utilizing the corresponding quantization parameters.
In this embodiment, optionally, the quantization parameter determining device of the neural network further includes:
a second quantization unit for quantizing the target data using the corresponding quantization parameter; wherein, the characteristics of the target data and the characteristics of the data to be quantized have similarity.
In this embodiment, the neural network operation process includes at least one operation of neural network training, neural network reasoning, and neural network fine tuning.
In this embodiment, the statistical result obtained by the statistical unit is a maximum value and a minimum value in each data to be quantized.
In this embodiment, the statistical result obtained by the statistical unit is the maximum absolute value in each type of data to be quantized.
In this embodiment, the statistics unit determines the absolute value maximum from the maximum and the minimum in each data to be quantized.
In this embodiment, the quantization parameter determining unit determines the quantization parameter according to the maximum value, the minimum value, and the data bit width in each type of data to be quantized.
In this embodiment, the quantization parameter determining unit determines the quantization parameter according to the maximum absolute value in each type of data to be quantized and the data bit width.
In this embodiment, the quantization parameter determined by the quantization parameter determining unit is a point location parameter or a first scaling factor.
In this embodiment, the quantization parameter determining unit determines the first scaling factor based on a point position parameter and a second scaling factor; the point location parameter used in determining the first scaling factor is a known fixed value, or the result of multiplying the point location parameter by the corresponding second scaling factor is used as a whole as the first scaling factor to be applied to data quantization in the neural network operation process.
In this embodiment, the quantization parameter determined by the quantization parameter determining unit includes a point position parameter and a second scaling factor.
In this embodiment, the quantization parameter determining unit determines the second scaling factor according to the point location parameter, the statistical result, and the data bit width.
In this embodiment, the quantization parameter determined by the quantization parameter determining unit further includes an offset.
In this embodiment, the quantization parameter determining unit determines the offset according to the statistical result of each data to be quantized.
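The paragraphs above derive the point position parameter, scaling factors, and offset from the statistics of each type of data to be quantized and the data bit width. A minimal sketch of one common way to compute such parameters is given below; the exact formulas of this disclosure are not reproduced here, so the expressions (the ceiling of a log2, the refinement scaling factor, the mid-range offset) should be read as illustrative assumptions.

```python
import math
import numpy as np

def determine_params(data, n, symmetric=True):
    """Return (point position s, refinement scaling factor f, offset o) for n-bit quantization."""
    if symmetric:
        z = float(np.max(np.abs(data)))          # statistical result: maximum absolute value
        o = 0.0
    else:
        zmax, zmin = float(np.max(data)), float(np.min(data))
        z = (zmax - zmin) / 2.0                  # half the data range
        o = (zmax + zmin) / 2.0                  # offset shifts the data to be symmetric
    qmax = 2 ** (n - 1) - 1
    s = math.ceil(math.log2(z / qmax)) if z > 0 else 0   # point position parameter
    f = z / (qmax * (2.0 ** s)) if z > 0 else 1.0        # refines the power-of-two scale
    return s, f, o
```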
In this embodiment, the data bit width used by the quantization parameter determining unit is a preset value.
In this embodiment, the quantization parameter determining unit includes an adjustment module and a quantization error determining module; wherein,
the adjusting module is used for adjusting the data bit width according to the corresponding quantization error;
the quantization error determining module is configured to determine the quantization error according to quantized data and corresponding pre-quantized data.
In this embodiment, the adjustment module is specifically configured to:
comparing the quantization error with a threshold value, and adjusting the data bit width according to a comparison result; wherein the threshold comprises at least one of a first threshold and a second threshold.
In this embodiment, the adjustment module includes a first adjustment submodule, where the first adjustment submodule is configured to:
and if the quantization error is greater than or equal to the first threshold value, increasing the data bit width.
In this embodiment, the adjusting module includes a second adjusting submodule, where the second adjusting submodule is configured to:
and if the quantization error is smaller than or equal to the second threshold value, reducing the data bit width.
In this embodiment, the adjustment module includes a third adjustment submodule, where the third adjustment submodule is configured to:
and if the quantization error is between the first threshold and the second threshold, the data bit width remains unchanged.
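A compact sketch of the threshold comparison performed by the three adjustment submodules above; the step size and the threshold values are placeholders, not values fixed by this disclosure:

```python
def adjust_bit_width(n, quant_error, first_threshold, second_threshold, step=1):
    """Increase, decrease, or keep the data bit width n based on the quantization error."""
    if quant_error >= first_threshold:
        return n + step          # error too large: widen the representation
    if quant_error <= second_threshold:
        return n - step          # error comfortably small: narrow it
    return n                     # error between the thresholds: keep n unchanged
```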
In this embodiment, the quantization error determining module includes:
a quantization interval determination submodule for determining a quantization interval according to the data bit width;
and the first quantization error determination submodule is used for determining quantization errors according to the quantization interval, the number of the quantized data and the corresponding data before quantization.
In this embodiment, the quantization error determining module includes:
the inverse quantization data determining submodule is used for carrying out inverse quantization on quantized data to obtain inverse quantization data; wherein, the data format of the inverse quantization data is the same as the data format of the corresponding data before quantization;
And the second quantization error determination submodule is used for determining quantization errors according to the quantized data and the corresponding inverse quantization data.
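A minimal sketch of the second path described above, in which the quantized data is inverse-quantized back into the format of the pre-quantized data and compared with it; the relative L1 error used here is one plausible metric and is not asserted to be the formula of this disclosure:

```python
import numpy as np

def quantization_error(pre_quant, quant, s):
    """Dequantize quant with point position s and compare it against the pre-quantized data."""
    dequant = quant.astype(np.float32) * (2.0 ** s)      # same format as the data before quantization
    denom = np.sum(np.abs(pre_quant)) + 1e-12            # guard against division by zero
    return float(np.sum(np.abs(dequant - pre_quant)) / denom)
```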
In this embodiment, the pre-quantization data used by the quantization error determination module is the data to be quantized.
In this embodiment, the pre-quantization data used by the quantization error determining module is data to be quantized involved in a weight updating iteration process within a target iteration interval; the target iteration interval comprises at least one weight updating iteration, and the same data bit width is adopted in the quantization process in the same target iteration interval.
In this embodiment, the quantization parameter determining apparatus of a neural network further includes a first target iteration interval determining unit; wherein the first target iteration interval determining unit includes:
the first change trend value determining module is used for determining a change trend value of point position parameters of the data to be quantized, which are involved in the weight updating iteration process, at a pre-judging time point; the pre-judging time point is used for judging whether the data bit width needs to be adjusted or not, and corresponds to the time point when the weight updating iteration is completed;
And the first target iteration interval module is used for determining the corresponding target iteration interval according to the change trend value of the point location parameter.
In this embodiment, the first target iteration interval determination unit includes:
the second change trend value determining module is used for determining a change trend value of point position parameters and a change trend value of data bit width of the data to be quantized, which are involved in the weight updating iteration process, at a pre-judging time point; the pre-judging time point is used for judging whether the data bit width needs to be adjusted or not, and corresponds to the time point when the weight updating iteration is completed;
and the second target iteration interval module is used for determining the corresponding target iteration interval according to the change trend value of the point location parameter and the change trend value of the data bit width.
In this embodiment, the first target iteration interval determining unit further includes a first pre-determination time point determining unit; wherein,
the first pre-judgment time point determining unit is used for determining the first pre-judgment time point according to the target iteration interval.
In this embodiment, the first target iteration interval determining unit further includes a second pre-judgment time point determining unit; the second pre-judging time point determining unit is used for determining a second pre-judging time point according to the data fluctuation range curve; the data fluctuation range curve is obtained by counting the data fluctuation range conditions in the weight updating iterative process.
In this embodiment, the first trend value determining module and the second trend value determining module determine the trend value of the point location parameter according to a sliding average value of the point location parameter corresponding to the current pre-determination time point and a sliding average value of the point location parameter corresponding to the previous pre-determination time point.
In this embodiment, the first trend value determining module and the second trend value determining module determine the trend value of the point location parameter according to the point location parameter corresponding to the current pre-determination time point and the sliding average value of the point location parameter corresponding to the previous pre-determination time point.
In this embodiment, the first trend value determining module and the second trend value determining module each include:
the point position parameter determining sub-module is used for determining the point position parameter corresponding to the current pre-judging time point according to the point position parameter corresponding to the last pre-judging time point and the adjustment value of the data bit width;
the adjustment result determining submodule is used for adjusting the sliding average value of the point position parameters corresponding to the previous pre-judging time point according to the adjustment value of the data bit width to obtain an adjustment result;
And the first sliding average value determining sub-module is used for determining the sliding average value of the point position parameter corresponding to the current pre-judging time point according to the point position parameter corresponding to the current pre-judging time point and the adjustment result.
In this embodiment, the first trend value determining module and the second trend value determining module each include:
the middle result determining sub-module is used for determining a middle result of the sliding average value of the point position parameters corresponding to the current pre-judging time point according to the point position parameters corresponding to the previous pre-judging time point and the sliding average value of the point position parameters corresponding to the previous pre-judging time point;
and the second moving average value determining sub-module is used for determining the moving average value of the point position parameter corresponding to the current pre-judging time point according to the intermediate result of the moving average value of the point position parameter corresponding to the current pre-judging time point and the adjustment value of the data bit width.
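The two submodule variants above both maintain a sliding (moving) average of the point position parameter across pre-judgment time points, corrected by the data bit width adjustment. A minimal sketch under an exponential-moving-average assumption follows; alpha and the sign convention for the bit-width correction are illustrative assumptions rather than values fixed by this disclosure:

```python
def update_point_position_average(s_prev_avg, s_prev, bitwidth_delta, s_now=None, alpha=0.9):
    """Return the moving average of the point position parameter at the current time point.

    bitwidth_delta is the adjustment applied to the data bit width; widening the bit width by
    one bit typically lowers the point position by one, so the history is corrected accordingly.
    """
    if s_now is None:
        s_now = s_prev - bitwidth_delta              # point position implied by the new bit width
    adjusted_avg = s_prev_avg - bitwidth_delta       # adjust the previous moving average
    return alpha * adjusted_avg + (1 - alpha) * s_now
```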
In this embodiment, the second trend value determining module determines the trend value of the data bit width according to the quantization error.
In this embodiment, the first target iteration interval determining unit further includes:
a quantization error determining module for determining a corresponding quantization error; the data before quantization corresponding to the quantization error is data to be quantized involved in the weight updating iterative process corresponding to the pre-judging time point;
And the data bit width determining module is used for determining the data bit width adopted in the quantization process in the target iteration interval according to the corresponding quantization error.
In this embodiment, the data bit width determining module is specifically configured to:
and comparing the quantization error with a threshold value, and adjusting the data bit width adopted in the quantization process in the previous target iteration interval according to the comparison result, wherein the adjustment result is used as the data bit width adopted in the quantization process in the current target iteration interval.
In this embodiment, the pre-quantization data used by the quantization error determining module is data to be quantized involved in a weight update iteration within a target iteration interval; the target iteration interval comprises at least one weight updating iteration, and the same quantization parameter is adopted in the quantization process in the same target iteration interval.
In this embodiment, the quantization parameter determining apparatus of a neural network further includes a second target iteration interval determining unit; wherein the second target iteration interval determining unit includes:
the third change trend value determining module is used for determining the change trend value of the point position parameter of the data to be quantized, which is related in the weight updating iteration process, at the pre-judging time point; the pre-judging time point is used for judging whether the quantization parameter needs to be adjusted or not, and corresponds to the time point when the weight updating iteration is completed;
And the third target iteration interval module is used for determining the corresponding target iteration interval according to the change trend value of the point location parameter.
In this embodiment, the quantization parameter determining unit determines the point location parameter based on the statistical result and the data bit width.
It should be understood that the apparatus embodiments described above are illustrative only and that the device of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is merely a logic function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted or not performed.
The units or modules described as separate components may or may not be physically separate. The components described as units or modules may be physical units, may be located in one apparatus, or may be distributed over a plurality of apparatuses. The embodiments of the present disclosure may be implemented by selecting some or all of the units according to actual needs.
In addition, unless specifically stated, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules described above may be implemented either in hardware or in software program modules.
The integrated units/modules, if implemented in hardware, may be digital circuits, analog circuits, etc. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. The artificial intelligence processor may be any suitable hardware processor, unless otherwise specified, such as: CPU, GPU, FPGA, DSP and ASIC, etc. The storage unit may be any suitable magnetic or magneto-optical storage medium, unless otherwise indicated, such as: a resistive Random Access Memory RRAM (Resistive Random Access Memory), a dynamic Random Access Memory DRAM (Dynamic Random Access Memory), a Static Random Access Memory SRAM (Static Random-Access Memory), an enhanced dynamic Random Access Memory EDRAM (Enhanced Dynamic Random Access Memory), a High-Bandwidth Memory HBM (High-Bandwidth Memory), a hybrid Memory cube HMC (Hybrid Memory Cube), and the like.
The integrated units/modules may be stored in a computer-readable memory if implemented in the form of software program modules and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
In this technical scheme, the present disclosure also discloses an artificial intelligence chip, which includes the above-described quantization parameter determination device of the neural network.
In the technical scheme, the disclosure also discloses a board card, which comprises a storage device, an interface device, a control device and the artificial intelligent chip; wherein the artificial intelligent chip is respectively connected with the storage device, the control device and the interface device; the storage device is used for storing data; the interface device is used for realizing data transmission between the artificial intelligent chip and external equipment; the control device is used for monitoring the state of the artificial intelligent chip.
Fig. 12 shows a block diagram of a board card according to an embodiment of the present disclosure. Referring to fig. 12, the board card may include, in addition to the chip 389, other supporting components, including but not limited to: a memory device 390, an interface device 391, and a control device 392;
the memory device 390 is connected to the artificial intelligence chip through a bus and is used for storing data. The memory device may include multiple groups of memory cells 393. Each group of memory cells is connected to the artificial intelligence chip through a bus. It is understood that each group of memory cells may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the memory cells. Each group of the memory cells may include a plurality of DDR4 particles (chips). In one embodiment, the artificial intelligence chip may include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transfer and 8 bits are used for ECC checking. It is understood that when DDR4-3200 particles are used in each group of memory cells, the theoretical bandwidth of data transfer can reach 25600 MB/s.
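As a rough sanity check of the quoted figure (assuming a 64-bit data path per group of memory cells, consistent with the 72-bit controller of which 8 bits are used for ECC):

$$3200\ \text{MT/s} \times \frac{64\ \text{bit}}{8\ \text{bit/byte}} = 25600\ \text{MB/s}$$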
In one embodiment, each set of memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each storage unit.
The interface device is electrically connected with the artificial intelligence chip. The interface device is used for implementing data transmission between the artificial intelligence chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transferred from the server to the chip through the standard PCIE interface to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface, and the present disclosure is not limited to the specific form of that interface, as long as the interface unit can implement the transfer function. In addition, the calculation results of the artificial intelligence chip are still transmitted back to the external device (e.g., a server) by the interface device.
The control device is electrically connected with the artificial intelligence chip. The control device is used for monitoring the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device can be electrically connected through an SPI interface. The control device may comprise a single-chip microcomputer (Micro Controller Unit, MCU). The artificial intelligence chip may comprise multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load. The control device can regulate and control the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
In one possible implementation, an electronic device is disclosed that includes the artificial intelligence chip described above. The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
The foregoing may be better understood in light of the following clauses:
A1. a method of determining quantization parameters for a neural network, the method comprising:
traversing operators in a computational graph corresponding to the neural network, and selecting a current operator and an operator to be fused from the computational graph;
determining a split size according to the available storage capacity of the on-chip memory of the artificial intelligence processor;
splitting the output data of the operator to be fused into a plurality of data blocks according to the splitting size;
mapping to obtain the size of the data block of the input data of the current operator and the size of the data block of the intermediate data between the current operator and the operator to be fused based on the size of the data block of the output data of the operator to be fused;
the data blocks of the output data of the operator to be fused, the corresponding data blocks of the input data of the current operator and the data blocks of the intermediate data between the current operator and the operator to be fused are used as data to be quantized, and a statistical result of each type of data to be quantized is obtained; the data to be quantized comprises at least one data of neurons, weights, gradients and biases of the neural network;
Determining corresponding quantization parameters by using the statistical result of each type of data to be quantized and the data bit width; the quantization parameter is used for correspondingly quantizing the data in the operation process of the neural network by the artificial intelligence processor; the quantization parameter is a point location parameter.
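Clause A1 above combines operator fusion with quantization: the output of the operator to be fused is split into blocks sized to the available on-chip memory, the corresponding input and intermediate blocks are mapped back from each output block, and those blocks become the data to be quantized. The following heavily simplified sketch only illustrates that control flow; the helper names, the block-size budget, and the mapping functions are hypothetical, since the clause does not specify them.

```python
def plan_fused_quantization(current_op, fuse_op, on_chip_capacity, elem_bytes=4):
    """Split fuse_op's output so each block (plus mapped input/intermediate blocks) fits on chip."""
    split_elems = max(1, on_chip_capacity // (3 * elem_bytes))  # crude budget across the three tensors
    blocks = []
    output_size = fuse_op.output_elems                          # hypothetical attribute
    for start in range(0, output_size, split_elems):
        out_block = (start, min(start + split_elems, output_size))
        in_block = current_op.map_input_block(out_block)        # hypothetical mapping helpers
        mid_block = fuse_op.map_intermediate_block(out_block)
        blocks.append((out_block, in_block, mid_block))         # these become the data to be quantized
    return blocks
```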
A2. The method of A1, the method further comprising:
and quantizing the data to be quantized by using the corresponding quantization parameters.
A3. The method of A1 or A2, the method further comprising:
quantizing the target data by using the corresponding quantization parameters; wherein, the characteristics of the target data and the characteristics of the data to be quantized have similarity.
A4. The method of A1, wherein the neural network operation process comprises at least one operation of neural network training, neural network reasoning and neural network fine tuning.
A5. The method of A1, wherein the statistics are maximum and minimum values in each type of data to be quantized.
A6. The method of A1, wherein the statistics are absolute maximum values in each type of data to be quantized.
A7. The method of A6, wherein the absolute maximum is determined based on a maximum and a minimum in each of the data to be quantized.
A8. The method of A5, wherein the quantization parameter is determined according to a maximum value, a minimum value, and the data bit width in each data to be quantized.
A9. The method of A6 or A7, wherein the quantization parameter is determined according to the maximum absolute value in each data to be quantized, and the data bit width.
A10. The method of A1, wherein the data bit width is a predetermined value.
A11. The method of A1, wherein the data bit width is adjusted according to a corresponding quantization error; wherein the quantization error is determined according to the quantized data and the corresponding pre-quantized data.
A12. The method of a11, the step of adjusting the data bit width comprising:
comparing the quantization error with a threshold value, and adjusting the data bit width according to a comparison result; wherein the threshold comprises at least one of a first threshold and a second threshold.
A13. The method of a12, the step of adjusting the data bit width comprising:
and if the quantization error is greater than or equal to the first threshold value, increasing the data bit width.
A14. The method of a12, the step of adjusting the data bit width comprising:
and if the quantization error is smaller than or equal to the second threshold value, reducing the data bit width.
A15. The method of a12, the step of adjusting the data bit width comprising:
and if the quantization error is between the first threshold and the second threshold, the data bit width remains unchanged.
A16. The method of a11, the method for obtaining a quantization error includes:
determining a quantization interval according to the data bit width;
and determining quantization errors according to the quantization intervals, the number of the quantized data and the corresponding data before quantization.
A17. The method of a11, the method for obtaining a quantization error includes:
performing inverse quantization on the quantized data to obtain inverse quantized data; wherein, the data format of the inverse quantization data is the same as the data format of the corresponding data before quantization;
and determining quantization errors according to the quantized data and the corresponding inverse quantized data.
A18. The method of a11, wherein the pre-quantized data is the data to be quantized.
A19. The method of a11, wherein the pre-quantized data is data to be quantized involved in a weight update iteration process within a target iteration interval; the target iteration interval comprises at least one weight updating iteration, and the same data bit width is adopted in the quantization process in the same target iteration interval.
A20. The method of a19, the determining of the target iteration interval comprising:
determining the change trend value of the point position parameter of the data to be quantized, which is involved in the weight updating iterative process, at a pre-judging time point; the pre-judging time point is used for judging whether the data bit width needs to be adjusted or not, and corresponds to the time point when the weight updating iteration is completed;
And determining the corresponding target iteration interval according to the change trend value of the point location parameter.
A21. The method of a19, the determining of the target iteration interval comprising:
determining the change trend value of point position parameters and the change trend value of data bit width of the data to be quantized, which are involved in the weight updating iterative process, at a pre-judging time point; the pre-judging time point is used for judging whether the data bit width needs to be adjusted or not, and corresponds to the time point when the weight updating iteration is completed;
and determining the corresponding target iteration interval according to the change trend value of the point location parameter and the change trend value of the data bit width.
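Clauses A20 and A21 determine the target iteration interval from the change trend value of the point position parameter (and, in A21, also from the change trend value of the data bit width). The mapping from trend value to interval length is not spelled out in these clauses, so the sketch below only encodes a plausible assumption: the faster the quantization statistics change, the shorter the interval; all constants are placeholders.

```python
def target_iteration_interval(point_trend, bitwidth_trend=0.0, beta=20.0, gamma=2, max_interval=100):
    """Map the change-trend values to a number of weight-update iterations between adjustments."""
    trend = max(abs(point_trend), abs(bitwidth_trend))
    if trend <= 0:
        return max_interval                      # data is stable: re-check rarely
    return max(1, min(max_interval, int(beta / trend) - gamma))
```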
A22. The method of a20 or a21, the pre-determined time point comprising a first pre-determined time point; wherein the first predetermined point in time is determined based on the target iteration interval.
A23. The method of a22, the pre-determined time point further comprising a second pre-determined time point; wherein the second pre-judging time point is determined according to a data fluctuation range curve; the data fluctuation range curve is obtained by counting the data fluctuation range conditions in the weight updating iterative process.
A24. The method according to any one of a20 to a23, wherein the change trend value of the point location parameter is determined according to a sliding average value of the point location parameter corresponding to the current pre-determination time point and a sliding average value of the point location parameter corresponding to the previous pre-determination time point.
A25. The method according to any one of a20 to a23, wherein the change trend value of the point location parameter is determined according to the point location parameter corresponding to the current pre-determination time point and the sliding average value of the point location parameter corresponding to the previous pre-determination time point.
A26. The method of a24, wherein the step of determining the sliding average value of the point location parameter corresponding to the current pre-determination time point includes:
determining point position parameters corresponding to the current pre-judging time point according to the point position parameters corresponding to the last pre-judging time point and the adjustment value of the data bit width;
according to the adjustment value of the data bit width, adjusting the sliding average value of the point position parameters corresponding to the last pre-judging time point to obtain an adjustment result;
and determining a sliding average value of the point position parameters corresponding to the current pre-judgment time point according to the point position parameters corresponding to the current pre-judgment time point and the adjustment result.
A27. The method of a24, wherein the step of determining the sliding average value of the point location parameter corresponding to the current pre-determination time point includes:
Determining an intermediate result of the sliding average value of the point position parameter corresponding to the current pre-judging time point according to the point position parameter corresponding to the last pre-judging time point and the sliding average value of the point position parameter corresponding to the last pre-judging time point;
and determining the sliding average value of the point position parameters corresponding to the current pre-judging time point according to the intermediate result of the sliding average value of the point position parameters corresponding to the current pre-judging time point and the adjustment value of the data bit width.
A28. The method of a21, wherein the trend value of the data bit width is determined according to the quantization error.
A29. The method according to any one of a20 to a23, wherein the step of determining the data bit width used in the quantization process in the target iteration interval includes:
determining a corresponding quantization error; the data before quantization corresponding to the quantization error is data to be quantized involved in the weight updating iterative process corresponding to the pre-judging time point;
and determining the data bit width adopted in the quantization process in the target iteration interval according to the corresponding quantization error.
A30. The method of a29, the step of determining the data bit width employed in the quantization process within the target iteration interval comprising:
and comparing the quantization error with a threshold value, and adjusting the data bit width adopted in the quantization process in the previous target iteration interval according to the comparison result, wherein the adjustment result is used as the data bit width adopted in the quantization process in the current target iteration interval.
A31. The method of a11, wherein the pre-quantization data is data to be quantized involved in weight update iteration within a target iteration interval; the target iteration interval comprises at least one weight updating iteration, and the same quantization parameter is adopted in the quantization process in the same target iteration interval.
A32. The method of a31, the step of determining the target iteration interval comprising:
determining the change trend value of the point position parameter of the data to be quantized, which is involved in the weight updating iterative process, at a pre-judging time point; the pre-judging time point is used for judging whether the quantization parameter needs to be adjusted or not, and corresponds to the time point when the weight updating iteration is completed;
and determining the corresponding target iteration interval according to the change trend value of the point location parameter.
A33. The method of A1, wherein the point location parameter is determined based on statistics and the data bit width.
A34. The method of A1, the method further comprising:
judging whether to fuse the current operator with an operator to be fused or not based on the splitting size, the size of a data block of input data of the current operator and the size of a data block of intermediate data between the current operator and the operator to be fused;
And determining the splitting size according to the judging result.
A35. The method of a34, wherein the step of determining the split size according to the determination result includes:
if the judgment result is that the current operator and the operator to be fused cannot be fused together, adjusting the current split size, and splitting the output data of the operator to be fused into corresponding data blocks according to the adjusted split size;
and mapping to obtain a data block of input data of the current operator and a data block of intermediate data between the current operator and the operator to be fused based on the data block of the operator to be fused.
A36. The method of A1, wherein a data flow between the current operator and the operator to be fused is unidirectional.
A37. A quantization parameter determination apparatus for a neural network, comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, the processor implementing the steps of the method of any one of A1 to a36 when the computer program is executed.
A38. A computer readable storage medium having stored therein a computer program which, when executed, implements the steps of the method of any one of A1 to A36.
The embodiments of the present disclosure have been described above, and the above description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (38)

1. A method for determining quantization parameters of a neural network, the method comprising:
traversing operators in a computational graph corresponding to the neural network, and selecting a current operator and an operator to be fused from the computational graph;
determining a split size according to the available storage capacity of the on-chip memory of the artificial intelligence processor;
splitting the output data of the operator to be fused into a plurality of data blocks according to the splitting size;
mapping to obtain the size of the data block of the input data of the current operator and the size of the data block of the intermediate data between the current operator and the operator to be fused based on the size of the data block of the output data of the operator to be fused;
The data blocks of the output data of the operator to be fused, the corresponding data blocks of the input data of the current operator and the data blocks of the intermediate data between the current operator and the operator to be fused are used as data to be quantized, and a statistical result of each type of data to be quantized is obtained; the data to be quantized comprises at least one data of neurons, weights, gradients and biases of the neural network;
determining corresponding quantization parameters by using the statistical result of each type of data to be quantized and the data bit width; the quantization parameter is used for correspondingly quantizing the data in the operation process of the neural network by the artificial intelligence processor; the quantization parameter is a point location parameter.
2. The method of claim 1, wherein the method further comprises:
and quantizing the data to be quantized by using the corresponding quantization parameters.
3. The method of claim 1 or 2, wherein the method further comprises:
quantizing the target data by using the corresponding quantization parameters; wherein, the characteristics of the target data and the characteristics of the data to be quantized have similarity.
4. The method of claim 1, wherein the neural network operation process comprises at least one of neural network training, neural network reasoning, neural network fine tuning.
5. The method of claim 1, wherein the statistics are a maximum and a minimum in each type of data to be quantized.
6. The method of claim 1, wherein the statistics are absolute maximum values in each type of data to be quantized.
7. The method of claim 6, wherein the absolute maximum is determined based on a maximum and a minimum in each of the data to be quantized.
8. The method of claim 5, wherein the quantization parameter is determined based on a maximum value, a minimum value, and the data bit width in each type of data to be quantized.
9. The method of claim 6 or 7, wherein the quantization parameter is determined based on the data bit width, the absolute maximum in each data to be quantized.
10. The method of claim 1, wherein the data bit width is a preset value.
11. The method of claim 1, wherein the data bit widths are adjusted according to corresponding quantization errors; wherein the quantization error is determined according to the quantized data and the corresponding pre-quantized data.
12. The method of claim 11, wherein the step of adjusting the data bit width comprises:
Comparing the quantization error with a threshold value, and adjusting the data bit width according to a comparison result; wherein the threshold comprises at least one of a first threshold and a second threshold.
13. The method of claim 12, wherein the step of adjusting the data bit width comprises:
and if the quantization error is greater than or equal to the first threshold value, increasing the data bit width.
14. The method of claim 12, wherein the step of adjusting the data bit width comprises:
and if the quantization error is smaller than or equal to the second threshold value, reducing the data bit width.
15. The method of claim 12, wherein the step of adjusting the data bit width comprises:
and if the quantization error is between the first threshold and the second threshold, the data bit width remains unchanged.
16. The method of claim 11, wherein the quantization error acquisition method comprises:
determining a quantization interval according to the data bit width;
and determining quantization errors according to the quantization intervals, the number of the quantized data and the corresponding data before quantization.
17. The method of claim 11, wherein the quantization error acquisition method comprises:
Performing inverse quantization on the quantized data to obtain inverse quantized data; wherein, the data format of the inverse quantization data is the same as the data format of the corresponding data before quantization;
and determining quantization errors according to the quantized data and the corresponding inverse quantized data.
18. The method of claim 11, wherein the pre-quantized data is the data to be quantized.
19. The method of claim 11, wherein the pre-quantized data is data to be quantized involved in a weight update iteration process within a target iteration interval; the target iteration interval comprises at least one weight updating iteration, and the same data bit width is adopted in the quantization process in the same target iteration interval.
20. The method of claim 19, wherein the step of determining the target iteration interval comprises:
determining the change trend value of the point position parameter of the data to be quantized, which is involved in the weight updating iterative process, at a pre-judging time point; the pre-judging time point is used for judging whether the data bit width needs to be adjusted or not, and corresponds to the time point when the weight updating iteration is completed;
And determining the corresponding target iteration interval according to the change trend value of the point location parameter.
21. The method of claim 19, wherein the step of determining the target iteration interval comprises:
determining, at a pre-judgment time point, the change trend value of the point location parameter and the change trend value of the data bit width of the data to be quantized involved in the weight update iteration process; wherein the pre-judgment time point is used for judging whether the data bit width needs to be adjusted, and corresponds to a time point at which a weight update iteration is completed;
and determining the corresponding target iteration interval according to the change trend value of the point location parameter and the change trend value of the data bit width.
22. The method of claim 20 or 21, wherein the pre-judgment time point comprises a first pre-judgment time point; wherein the first pre-judgment time point is determined based on the target iteration interval.
23. The method of claim 22, wherein the pre-judgment time point further comprises a second pre-judgment time point; wherein the second pre-judgment time point is determined according to a data fluctuation range curve, and the data fluctuation range curve is obtained by statistics on the data fluctuation range during the weight update iteration process.
24. The method of claim 20 or 21, wherein the change trend value of the point location parameter is determined according to a sliding average value of the point location parameter corresponding to the current pre-judgment time point and a sliding average value of the point location parameter corresponding to the previous pre-judgment time point.
25. The method of claim 20 or 21, wherein the change trend value of the point location parameter is determined according to the point location parameter corresponding to the current pre-judgment time point and the sliding average value of the point location parameter corresponding to the previous pre-judgment time point.
26. The method of claim 24, wherein the step of determining the sliding average value of the point location parameter corresponding to the current pre-judgment time point comprises:
determining the point location parameter corresponding to the current pre-judgment time point according to the point location parameter corresponding to the previous pre-judgment time point and the adjustment value of the data bit width;
adjusting, according to the adjustment value of the data bit width, the sliding average value of the point location parameter corresponding to the previous pre-judgment time point to obtain an adjustment result;
and determining the sliding average value of the point location parameter corresponding to the current pre-judgment time point according to the point location parameter corresponding to the current pre-judgment time point and the adjustment result.
27. The method of claim 24, wherein the step of determining the sliding average value of the point location parameter corresponding to the current pre-judgment time point comprises:
determining an intermediate result of the sliding average value of the point location parameter corresponding to the current pre-judgment time point according to the point location parameter corresponding to the previous pre-judgment time point and the sliding average value of the point location parameter corresponding to the previous pre-judgment time point;
and determining the sliding average value of the point location parameter corresponding to the current pre-judgment time point according to the intermediate result of the sliding average value of the point location parameter corresponding to the current pre-judgment time point and the adjustment value of the data bit width.
28. The method of claim 21, wherein the change trend value of the data bit width is determined based on the corresponding quantization error.
29. The method of claim 20 or 21, wherein the step of determining the data bit width employed in the quantization process within the target iteration interval comprises:
determining a corresponding quantization error, wherein the pre-quantized data corresponding to the quantization error is the data to be quantized involved in the weight update iteration process corresponding to the pre-judgment time point;
and determining, according to the corresponding quantization error, the data bit width adopted in the quantization process within the target iteration interval.
30. The method of claim 29, wherein the step of determining the data bit width employed in the quantization process within the target iteration interval comprises:
comparing the quantization error with a threshold, and adjusting, according to the comparison result, the data bit width adopted in the quantization process within the previous target iteration interval, the adjustment result being used as the data bit width adopted in the quantization process within the current target iteration interval.
31. The method of claim 11, wherein the pre-quantized data is data to be quantized involved in a weight update iteration process within a target iteration interval; the target iteration interval comprises at least one weight update iteration, and the same quantization parameter is adopted in the quantization process within the same target iteration interval.
32. The method of claim 31, wherein the step of determining the target iteration interval comprises:
determining, at a pre-judgment time point, the change trend value of the point location parameter of the data to be quantized involved in the weight update iteration process; wherein the pre-judgment time point is used for judging whether the quantization parameter needs to be adjusted, and corresponds to a time point at which a weight update iteration is completed;
and determining the corresponding target iteration interval according to the change trend value of the point location parameter.
33. The method of claim 1, wherein the point location parameter is determined based on the statistical result and the data bit width.
34. The method of claim 1, wherein the method further comprises:
judging, based on the splitting size, the size of a data block of input data of the current operator, and the size of a data block of intermediate data between the current operator and the operator to be fused, whether to fuse the current operator with the operator to be fused;
and determining the splitting size according to the judgment result.
35. The method of claim 34, wherein the step of determining the splitting size according to the judgment result comprises:
if the judgment result is that the current operator and the operator to be fused cannot be fused, adjusting the current splitting size, and splitting the output data of the operator to be fused into corresponding data blocks according to the adjusted splitting size;
and obtaining, by mapping based on the data blocks of the operator to be fused, a data block of input data of the current operator and a data block of intermediate data between the current operator and the operator to be fused.
36. The method of claim 1, wherein a data flow between the current operator and the operator to be fused is unidirectional.
37. A quantization parameter determination device for a neural network, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 36 when executing the computer program.
38. A computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed, implements the steps of the method according to any one of claims 1 to 36.
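
The sketches below are editorial illustrations only: they show, in Python, one way mechanisms of the kind recited in the claims could be realized, and every formula, constant, and helper name in them is an assumption rather than the patent's own definition. This first sketch relates to claims 5-9 and 33, deriving a point location parameter from a statistic of the data to be quantized (here the absolute maximum) together with the data bit width.

import math
import numpy as np

def point_location(data: np.ndarray, bit_width: int) -> int:
    # Illustrative point location parameter: the power-of-two exponent such that
    # the largest absolute value of the data fits the signed bit_width range.
    abs_max = float(np.max(np.abs(data)))   # statistic of the data to be quantized
    q_max = 2 ** (bit_width - 1) - 1        # largest representable quantized magnitude
    if abs_max == 0.0:
        return 0                            # degenerate case: all-zero data
    return math.ceil(math.log2(abs_max / q_max))  # assumed formula, for illustration only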
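
This second sketch relates to claims 11-17: quantize with a given bit width, inverse-quantize back into the original data format, measure a quantization error, and compare the error against a first and a second threshold to increase, reduce, or keep the data bit width. The error metric, the threshold values, and the adjustment step are assumptions chosen for the example.

import numpy as np

def quantization_error(data: np.ndarray, bit_width: int, point_location: int) -> float:
    # Quantize, inverse-quantize back to the pre-quantized (floating-point) format,
    # and report an assumed error metric: mean absolute deviation relative to the data.
    q_max = 2 ** (bit_width - 1) - 1
    scale = 2.0 ** point_location
    quantized = np.clip(np.round(data / scale), -q_max - 1, q_max)  # quantized data
    dequantized = quantized * scale                                 # inverse-quantized data
    return float(np.mean(np.abs(dequantized - data)) / (np.mean(np.abs(data)) + 1e-12))

def adjust_bit_width(bit_width: int, error: float,
                     first_threshold: float = 0.05,
                     second_threshold: float = 0.01,
                     step: int = 2) -> int:
    # Threshold comparison in the spirit of claims 12-15 (all values illustrative).
    if error >= first_threshold:        # error too large: increase the data bit width
        return bit_width + step
    if error <= second_threshold:       # error comfortably small: reduce the data bit width
        return max(2, bit_width - step)
    return bit_width                    # between the thresholds: keep the bit width unchanged

A caller would typically recompute the point location after each bit-width change and re-evaluate the error on the next batch of data to be quantized.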
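
This third sketch relates to claims 19-32: track a sliding average of the point location parameter across weight update iterations, derive a change trend value from it at each pre-judgment time point, and turn that trend into a target iteration interval during which the same data bit width (or quantization parameter) is reused. The exponential form of the sliding average and the interval formula are assumptions for illustration.

def update_sliding_average(previous_average: float, point_location: float,
                           momentum: float = 0.9) -> float:
    # Sliding average of the point location parameter (assumed exponential form).
    return momentum * previous_average + (1.0 - momentum) * point_location

def target_iteration_interval(current_average: float, previous_average: float,
                              beta: float = 1.0, gamma: float = 0.0,
                              max_interval: int = 100) -> int:
    # Change trend value = |current sliding average - previous sliding average|.
    # A larger trend (the point location drifts quickly) yields a shorter interval
    # between pre-judgment time points; beta, gamma and the clamp are illustrative.
    trend = abs(current_average - previous_average)
    if trend == 0.0:
        return max_interval             # stable statistics: check again as late as allowed
    return int(max(1, min(max_interval, beta / trend - gamma)))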
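
This last sketch relates to claims 34-36: for a candidate splitting size, judge whether the current operator can be fused with the operator to be fused from the data block of the current operator's input data and the data block of the intermediate data between the two operators, and fall back to a smaller splitting size when fusion is not possible. The fitness criterion (an on-chip memory budget) and the callback returning block sizes are hypothetical.

from typing import Callable, Iterable, Tuple

def can_fuse(input_block_bytes: int, intermediate_block_bytes: int,
             on_chip_budget_bytes: int = 256 * 1024) -> bool:
    # Hypothetical fusion test: both data blocks must coexist within the assumed budget.
    return input_block_bytes + intermediate_block_bytes <= on_chip_budget_bytes

def choose_splitting_size(candidate_sizes: Iterable[int],
                          block_sizes_for: Callable[[int], Tuple[int, int]]) -> int:
    # Try splitting sizes from largest to smallest and keep the first one for which
    # the operators can be fused; block_sizes_for(size) is assumed to return
    # (input_block_bytes, intermediate_block_bytes) for that splitting size.
    sizes = sorted(candidate_sizes, reverse=True)
    for size in sizes:
        input_bytes, intermediate_bytes = block_sizes_for(size)
        if can_fuse(input_bytes, intermediate_bytes):
            return size
    return sizes[-1]                    # smallest splitting size if none allows fusion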
CN201910888626.3A 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product Active CN112085186B (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN2019105052397 2019-06-12
CN201910505239 2019-06-12
CN2019105153557 2019-06-14
CN201910515355 2019-06-14
CN2019105285378 2019-06-18
CN201910528537 2019-06-18
CN201910570125 2019-06-27
CN2019105701250 2019-06-27

Publications (2)

Publication Number Publication Date
CN112085186A CN112085186A (en) 2020-12-15
CN112085186B CN112085186B (en) 2024-03-05

Family

ID=69185300

Family Applications (14)

Application Number Title Priority Date Filing Date
CN201910959851.1A Active CN112085191B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910888626.3A Active CN112085186B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910886577.XA Active CN112085181B (en) 2019-06-12 2019-09-19 Neural network quantification method and device and related products
CN201910888150.3A Active CN112085185B (en) 2019-06-12 2019-09-19 Quantization parameter adjustment method and device and related product
CN201910960314.9A Active CN112085192B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910959831.4A Active CN112085190B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910959360.7A Active CN112085189B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910887544.7A Active CN112085183B (en) 2019-06-12 2019-09-19 Neural network operation method and device and related products
CN201910889339.4A Active CN112085188B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201980005061.8A Pending CN112400176A (en) 2019-06-12 2019-09-19 Neural network quantitative parameter determination method and related product
CN201910887861.9A Active CN112085184B (en) 2019-06-12 2019-09-19 Quantization parameter adjustment method and device and related product
CN201910960385.9A Active CN112085193B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN202010402271.5A Active CN111652368B (en) 2019-06-12 2020-05-13 Data processing method and related product
CN202010401876.2A Active CN111652367B (en) 2019-06-12 2020-05-13 Data processing method and related product

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910959851.1A Active CN112085191B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product

Family Applications After (12)

Application Number Title Priority Date Filing Date
CN201910886577.XA Active CN112085181B (en) 2019-06-12 2019-09-19 Neural network quantification method and device and related products
CN201910888150.3A Active CN112085185B (en) 2019-06-12 2019-09-19 Quantization parameter adjustment method and device and related product
CN201910960314.9A Active CN112085192B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910959831.4A Active CN112085190B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910959360.7A Active CN112085189B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201910887544.7A Active CN112085183B (en) 2019-06-12 2019-09-19 Neural network operation method and device and related products
CN201910889339.4A Active CN112085188B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN201980005061.8A Pending CN112400176A (en) 2019-06-12 2019-09-19 Neural network quantitative parameter determination method and related product
CN201910887861.9A Active CN112085184B (en) 2019-06-12 2019-09-19 Quantization parameter adjustment method and device and related product
CN201910960385.9A Active CN112085193B (en) 2019-06-12 2019-09-19 Method for determining quantization parameter of neural network and related product
CN202010402271.5A Active CN111652368B (en) 2019-06-12 2020-05-13 Data processing method and related product
CN202010401876.2A Active CN111652367B (en) 2019-06-12 2020-05-13 Data processing method and related product

Country Status (6)

Country Link
US (2) US11675676B2 (en)
EP (4) EP3772023A1 (en)
JP (3) JP2021530769A (en)
KR (2) KR20210018352A (en)
CN (14) CN112085191B (en)
WO (2) WO2020248423A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11437032B2 (en) 2017-09-29 2022-09-06 Shanghai Cambricon Information Technology Co., Ltd Image processing apparatus and method
US11169803B2 (en) 2018-02-13 2021-11-09 Shanghai Cambricon Information Technology Co., Ltd. Computing device and method
CN116991225A (en) 2018-02-14 2023-11-03 上海寒武纪信息科技有限公司 Control device, method and equipment of processor
EP3798850A4 (en) 2018-06-27 2022-03-23 Shanghai Cambricon Information Technology Co., Ltd On-chip code breakpoint debugging method, on-chip processor, and chip breakpoint debugging system
US11507823B2 (en) * 2019-01-22 2022-11-22 Black Sesame Technologies Inc. Adaptive quantization and mixed precision in a network
US11676028B2 (en) 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products
CN110490309B (en) * 2019-08-14 2022-06-07 中科寒武纪科技股份有限公司 Operator fusion method for neural network and related product thereof
WO2021056180A1 (en) * 2019-09-24 2021-04-01 Baidu.Com Times Technology (Beijing) Co., Ltd. Cursor-based adaptive quantization for deep neural networks
JP7354736B2 (en) * 2019-09-30 2023-10-03 富士通株式会社 Information processing device, information processing method, information processing program
US11775611B2 (en) * 2019-11-01 2023-10-03 Samsung Electronics Co., Ltd. Piecewise quantization for neural networks
JP2021111081A (en) * 2020-01-09 2021-08-02 富士通株式会社 Information processing unit, operation program for neural network and operation method for neural network
US20210241183A1 (en) * 2020-01-31 2021-08-05 Hewlett Packard Enterprise Development Lp Adaptively synchronizing learning of multiple learning models
CN113741619B (en) * 2020-05-27 2024-03-12 安徽寒武纪信息科技有限公司 Clock control device and related product
CN112686001B (en) * 2021-01-05 2021-12-03 中科三清科技有限公司 Transformation method and transmission method of meteorological data, server and data transmission system
CN113220606B (en) * 2021-05-07 2021-11-26 珠海市芯动力科技有限公司 Neural network weight storage method, neural network weight reading method and related equipment
JP2023069780A (en) * 2021-11-08 2023-05-18 富士通株式会社 Arithmetic program, arithmetic method, and computing machine
WO2023128024A1 (en) * 2021-12-30 2023-07-06 한국전자기술연구원 Method and system for quantizing deep-learning network
KR20230136572A (en) * 2022-03-18 2023-09-26 인텔렉추얼디스커버리 주식회사 Neural network-based feature tensor compression method and apparatus
CN114611697B (en) * 2022-05-11 2022-09-09 上海登临科技有限公司 Neural network quantification and deployment method, system, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229681A (en) * 2017-12-28 2018-06-29 郑州云海信息技术有限公司 A kind of neural network model compression method, system, device and readable storage medium storing program for executing
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN109740754A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Neural computing device, neural computing method and Related product
CN109740739A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Neural computing device, neural computing method and Related product

Family Cites Families (244)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63111768A (en) * 1986-10-30 1988-05-17 Nec Corp Image data quantizer
JPH0375860A (en) 1989-08-18 1991-03-29 Hitachi Ltd Personalized terminal
US5052043A (en) 1990-05-07 1991-09-24 Eastman Kodak Company Neural network with back propagation controlled through an output confidence measure
EP0509576B1 (en) * 1991-04-18 1998-01-28 Ampex Systems Corporation Method and apparatus for determining a quantizing factor for processes involving multiple compression/decompression of data
US6144977A (en) 1995-07-10 2000-11-07 Motorola, Inc. Circuit and method of converting a floating point number to a programmable fixed point number
GB9602701D0 (en) 1996-02-09 1996-04-10 Canon Kk Image manipulation
JPH10233691A (en) * 1998-03-30 1998-09-02 Nec Corp Encoding system and decoding system
US7242414B1 (en) 1999-07-30 2007-07-10 Mips Technologies, Inc. Processor having a compare extension of an instruction set architecture
JP2000293371A (en) 1999-04-09 2000-10-20 Hitachi Ltd Method and device for controlling microprogram
US6671796B1 (en) 2000-02-25 2003-12-30 Sun Microsystems, Inc. Converting an arbitrary fixed point value to a floating point value
US6931639B1 (en) 2000-08-24 2005-08-16 International Business Machines Corporation Method for implementing a variable-partitioned queue for simultaneous multithreaded processors
SK286661B6 (en) 2000-09-07 2009-03-05 Nippon Steel Corporation Hexavalent chromium-free surface-treating agent for Sn- or Al-based coated steel sheet, and surface treated steel sheet
US7062445B2 (en) * 2001-01-26 2006-06-13 Microsoft Corporation Quantization loop with heuristic approach
US20020138714A1 (en) 2001-03-22 2002-09-26 Sun Microsystems, Inc. Scoreboard for scheduling of instructions in a microprocessor that provides out of order execution
WO2002086817A1 (en) 2001-04-19 2002-10-31 Telefonaktiebolaget Lm Ericsson (Publ) Adaptive memory allocation
US20030167460A1 (en) 2002-02-26 2003-09-04 Desai Vipul Anil Processor instruction set simulation power estimation method
JP4148356B2 (en) * 2002-11-18 2008-09-10 学校法人東海大学 Quantization step parameter determination device, quantization step parameter determination method, quantization step parameter determination program, and nonlinear quantization method, nonlinear quantization device, and nonlinear quantization program
US7236995B2 (en) 2002-12-27 2007-06-26 Arm Limited Data processing apparatus and method for converting a number between fixed-point and floating-point representations
DE10316381A1 (en) 2003-04-10 2004-10-28 Bayer Technology Services Gmbh Procedure for training neural networks
JP3889738B2 (en) * 2003-09-26 2007-03-07 三洋電機株式会社 Inverse quantization apparatus, audio decoding apparatus, image decoding apparatus, inverse quantization method, and inverse quantization program
JP4202244B2 (en) 2003-12-22 2008-12-24 Necエレクトロニクス株式会社 VLIW DSP and method of operating the same
US20060161375A1 (en) 2004-12-30 2006-07-20 Allen Duberstein Optimizing processing speed based on measured temperatures
KR100762591B1 (en) * 2005-09-29 2007-10-01 엘지전자 주식회사 Quantization parameter decision of video codec
US7721128B2 (en) 2005-11-29 2010-05-18 International Business Machines Corporation Implementation of thermal throttling logic
WO2007116551A1 (en) * 2006-03-30 2007-10-18 Kabushiki Kaisha Toshiba Image coding apparatus and image coding method, and image decoding apparatus and image decoding method
CN1851668A (en) 2006-06-01 2006-10-25 北京天碁科技有限公司 Sheet system chip, sheet system chip tracking debug system and method
JP5224666B2 (en) * 2006-09-08 2013-07-03 株式会社東芝 Audio encoding device
DE102006059156B4 (en) 2006-12-14 2008-11-06 Advanced Micro Devices, Inc., Sunnyvale Method for testing an integrated circuit chip with at least two circuit cores and integrated circuit chip and test system
US20110060587A1 (en) 2007-03-07 2011-03-10 Phillips Michael S Command and control utilizing ancillary information in a mobile voice-to-speech application
US8560591B2 (en) 2007-04-25 2013-10-15 International Business Machines Corporation Detection of potential need to use a larger data format in performing floating point operations
US8051117B2 (en) 2007-04-26 2011-11-01 International Business Machines Corporation Shift significand of decimal floating point data
US8051118B2 (en) 2007-04-26 2011-11-01 International Business Machines Corporation Composition of decimal floating point data
US8190664B2 (en) 2007-04-26 2012-05-29 International Business Machines Corporation Employing a mask field of an instruction to encode a sign of a result of the instruction
JP5184824B2 (en) 2007-06-15 2013-04-17 キヤノン株式会社 Arithmetic processing apparatus and method
JP2009110353A (en) 2007-10-31 2009-05-21 Hitachi Ltd Microcontroller and control system
US7904287B2 (en) 2007-11-13 2011-03-08 International Business Machines Corporation Method and system for real-time prediction of power usage for a change to another performance state
JP4998794B2 (en) 2007-11-29 2012-08-15 Nkワークス株式会社 Image correction method and image correction apparatus
KR101518237B1 (en) * 2008-09-01 2015-05-15 삼성전자주식회사 Method and apparatus for inverse quantization, and method and apparatus for decoding of image
US20100073068A1 (en) 2008-09-22 2010-03-25 Hanwoo Cho Functional block level thermal control
CN101754490B (en) * 2008-12-17 2012-11-07 电信科学技术研究院 Data transmission method, system and device
CN101572829B (en) 2009-06-10 2011-02-02 中国联合网络通信集团有限公司 Method for monitoring IPTV video quality, device thereof and system thereof
EP2336882A1 (en) 2009-12-18 2011-06-22 Telefonaktiebolaget L M Ericsson (PUBL) Technique for run-time provision of executable code using off-device services
WO2011132277A1 (en) 2010-04-21 2011-10-27 トヨタ自動車株式会社 Controller for internal combustion engine
JP2011253374A (en) 2010-06-02 2011-12-15 Sony Corp Information processing device, information processing method and program
US8452463B2 (en) 2010-06-04 2013-05-28 Apple Inc. Adjusting the thermal behavior of a computing system using indirect information about ambient temperature
US8694572B2 (en) 2010-07-06 2014-04-08 Silminds, Llc, Egypt Decimal floating-point fused multiply-add unit
CN102622207B (en) * 2011-01-30 2015-07-22 中兴通讯股份有限公司 Fixed-point processing method and device
US8924455B1 (en) 2011-02-25 2014-12-30 Xilinx, Inc. Multiplication of matrices using systolic arrays
US9031341B2 (en) 2011-02-28 2015-05-12 Megachips Corporation Image coding apparatus
CN102761509B (en) 2011-04-27 2016-01-06 联芯科技有限公司 The receiving system of ofdm system and the method for reduction receiving system internal memory
AU2012253292B2 (en) 2011-05-12 2015-10-29 Apple Inc. Presence sensing
CN102789413B (en) 2011-05-23 2016-02-17 同济大学 A kind of debug system of concurrent program and method
US8594982B2 (en) 2011-06-09 2013-11-26 Pulsar Informatics, Inc. Systems and methods for distributed calculation of fatigue-risk prediction and optimization
CN102291773B (en) * 2011-07-18 2014-12-10 电信科学技术研究院 Data compression method and equipment
CN102404673B (en) 2011-11-24 2013-12-18 苏州上声电子有限公司 Channel balance and sound field control method and device of digitalized speaker system
CN103152673B (en) 2011-12-07 2015-07-08 中国科学院声学研究所 Digital loudspeaker drive method and device based on quaternary code dynamic mismatch reshaping
CN102684701B (en) 2012-04-27 2014-07-09 苏州上声电子有限公司 Method and device for driving digital speaker based on code conversion
DE102012009502A1 (en) 2012-05-14 2013-11-14 Kisters Ag Method for training an artificial neural network
US9417891B2 (en) 2012-06-11 2016-08-16 Vmware, Inc. Unified storage/VDI provisioning methodology
US9224089B2 (en) * 2012-08-07 2015-12-29 Qualcomm Incorporated Method and apparatus for adaptive bit-allocation in neural systems
US9063731B2 (en) 2012-08-27 2015-06-23 Samsung Electronics Co., Ltd. Ultra low power apparatus and method to wake up a main processor
CN102903089B (en) 2012-09-07 2014-12-17 山东大学 Method for generating remote sensing image quick view under Linux environment
US9412366B2 (en) 2012-09-18 2016-08-09 Adobe Systems Incorporated Natural language image spatial and tonal localization
JP5913059B2 (en) 2012-11-13 2016-04-27 日本電信電話株式会社 Distributed wireless communication base station system, signal processing device, wireless device, and operation method of distributed wireless communication base station system
CN102981854A (en) 2012-11-16 2013-03-20 天津市天祥世联网络科技有限公司 Neural network optimization method based on floating number operation inline function library
CN105026445A (en) 2012-11-22 2015-11-04 学校法人庆应义塾 Acrylic copolymer, optical film, polarizing plate and liquid crystal display device
US9851977B2 (en) 2012-12-06 2017-12-26 Kalray Apparatus and method for combining thread warps with compatible execution masks for simultaneous execution and increased lane utilization
US9720732B1 (en) 2013-02-11 2017-08-01 Amazon Technologies, Inc. Parameter selection for optimization of task execution based on execution history for prior tasks
JP2014170295A (en) 2013-03-01 2014-09-18 Honda Motor Co Ltd Object recognition system and object recognition method
US20190138372A1 (en) 2013-04-29 2019-05-09 Moogsoft, Inc. System for managing an instructure with security
US20150063461A1 (en) * 2013-08-27 2015-03-05 Magnum Semiconductor, Inc. Methods and apparatuses for adjusting macroblock quantization parameters to improve visual quality for lossy video encoding
JP6184891B2 (en) 2014-03-12 2017-08-23 東芝メモリ株式会社 Information processing apparatus, semiconductor chip, information processing method, and program
CN105100810B (en) * 2014-05-16 2018-02-13 中国科学院声学研究所 Compression of images decompressing method and system in a kind of imaging sonar real time processing system
US9507405B2 (en) 2014-06-18 2016-11-29 Oracle International Corporation System and method for managing power in a chip multiprocessor using a proportional feedback mechanism
US10318882B2 (en) * 2014-09-11 2019-06-11 Amazon Technologies, Inc. Optimized training of linear machine learning models
US9575537B2 (en) 2014-07-25 2017-02-21 Intel Corporation Adaptive algorithm for thermal throttling of multi-core processors with non-homogeneous performance states
US10282100B2 (en) 2014-08-19 2019-05-07 Samsung Electronics Co., Ltd. Data management scheme in virtualized hyperscale environments
GB2524126B (en) 2014-08-28 2016-07-27 Imagination Tech Ltd Combining paths
US9916130B2 (en) 2014-11-03 2018-03-13 Arm Limited Apparatus and method for vector processing
FR3030077B1 (en) 2014-12-10 2016-12-02 Arnault Ioualalen METHOD OF ADJUSTING THE ACCURACY OF A COMPUTER PROGRAM HANDLING AT LEAST ONE VIRGUL NUMBER
EP3035204B1 (en) 2014-12-19 2018-08-15 Intel Corporation Storage device and method for performing convolution operations
US20170061279A1 (en) 2015-01-14 2017-03-02 Intel Corporation Updating an artificial neural network using flexible fixed point representation
US10262259B2 (en) * 2015-05-08 2019-04-16 Qualcomm Incorporated Bit width selection for fixed point neural networks
US10373050B2 (en) * 2015-05-08 2019-08-06 Qualcomm Incorporated Fixed point neural network based on floating point neural network quantization
US20160328645A1 (en) 2015-05-08 2016-11-10 Qualcomm Incorporated Reduced computational complexity for fixed point neural network
US10083395B2 (en) 2015-05-21 2018-09-25 Google Llc Batch processing in a neural network processor
CN104899641B (en) 2015-05-25 2018-07-13 杭州朗和科技有限公司 Deep neural network learning method, processor and deep neural network learning system
CN115100017A (en) 2015-06-10 2022-09-23 无比视视觉技术有限公司 Image processor and method for processing image
CN104978303B (en) 2015-06-19 2019-06-04 上海兆芯集成电路有限公司 The sensor hub and multisensor-multitarget tracking method of single-chip integration
CN106469291A (en) 2015-08-19 2017-03-01 中兴通讯股份有限公司 Image processing method and terminal
US10970617B2 (en) * 2015-08-21 2021-04-06 Institute Of Automation Chinese Academy Of Sciences Deep convolutional neural network acceleration and compression method based on parameter quantification
US10031765B2 (en) 2015-09-24 2018-07-24 Intel Corporation Instruction and logic for programmable fabric hierarchy and cache
US10812831B2 (en) 2015-09-30 2020-10-20 Piksel, Inc. Video stream delivery via adaptive quality enhancement using error correction models
US11061672B2 (en) 2015-10-02 2021-07-13 Via Alliance Semiconductor Co., Ltd. Chained split execution of fused compound arithmetic operations
CN106570559A (en) * 2015-10-09 2017-04-19 阿里巴巴集团控股有限公司 Data processing method and device based on neural network
JP2019505149A (en) 2015-11-17 2019-02-21 バヤニ, エマンBAYANI, Eman Digital image photographing apparatus system and method
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN106814639A (en) 2015-11-27 2017-06-09 富泰华工业(深圳)有限公司 Speech control system and method
CN105893419A (en) 2015-11-30 2016-08-24 乐视致新电子科技(天津)有限公司 Generation device, device and equipment of multimedia photo, and mobile phone
US10699186B2 (en) 2015-12-02 2020-06-30 Google Llc Determining orders of execution of a neural network
CN106991478B (en) 2016-01-20 2020-05-08 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network reverse training
CN106997236B (en) 2016-01-25 2018-07-13 亮风台(上海)信息科技有限公司 Based on the multi-modal method and apparatus for inputting and interacting
US10803401B2 (en) 2016-01-27 2020-10-13 Microsoft Technology Licensing, Llc Artificial intelligence engine having multiple independent processes on a cloud based platform configured to scale
US10497089B2 (en) 2016-01-29 2019-12-03 Fotonation Limited Convolutional neural network
JP2017156511A (en) 2016-03-01 2017-09-07 ソニー株式会社 Information processing device, information processing method, and program
US10103714B2 (en) 2016-03-01 2018-10-16 Qualcomm Incorporated Adjust voltage for thermal mitigation
US10019779B2 (en) 2016-03-08 2018-07-10 Amazon Technologies, Inc. Browsing interface for item counterparts having different scales and lengths
CN109073339B (en) * 2016-03-31 2020-08-25 可利尔Px科技有限公司 Temperature control device and system with static cooling capability
CN109934331B (en) 2016-04-29 2020-06-19 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network forward operations
US10552119B2 (en) 2016-04-29 2020-02-04 Intel Corporation Dynamic management of numerical representation in a distributed matrix processor architecture
US10187568B1 (en) 2016-05-02 2019-01-22 Bao Tran Video smart phone
US11055063B2 (en) 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
GB201607713D0 (en) * 2016-05-03 2016-06-15 Imagination Tech Ltd Convolutional neural network
CN105978611B (en) 2016-05-12 2019-09-17 京信通信系统(中国)有限公司 A kind of frequency-region signal compression method and device
AU2016203619A1 (en) 2016-05-31 2017-12-14 Canon Kabushiki Kaisha Layer-based operations scheduling to optimise memory for CNN applications
EP3252949B1 (en) 2016-06-01 2020-03-18 Intel IP Corporation Methods and devices for predistortion of signals
US20170357910A1 (en) 2016-06-10 2017-12-14 Apple Inc. System for iteratively training an artificial intelligence using cloud-based metrics
CN107545889B (en) 2016-06-23 2020-10-23 华为终端有限公司 Model optimization method and device suitable for pattern recognition and terminal equipment
CN106156310A (en) 2016-06-30 2016-11-23 努比亚技术有限公司 A kind of picture processing apparatus and method
US20180005111A1 (en) * 2016-06-30 2018-01-04 International Business Machines Corporation Generalized Sigmoids and Activation Function Learning
US10372588B2 (en) 2016-07-08 2019-08-06 International Business Machines Corporation Providing debug information on production containers using debug containers
DE102016214786A1 (en) 2016-08-09 2018-02-15 Fujitsu Limited Application profiling job management system, program and method
US20180046903A1 (en) 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Deep processing unit (dpu) for implementing an artificial neural network (ann)
CN107657316B (en) 2016-08-12 2020-04-07 北京深鉴智能科技有限公司 Design of cooperative system of general processor and neural network processor
CN106354568A (en) 2016-08-23 2017-01-25 京信通信技术(广州)有限公司 Method and device for communication between different processes
CN107797913A (en) 2016-09-07 2018-03-13 大陆汽车电子(连云港)有限公司 A kind of software analysis System and method for of real-time system
US20180075347A1 (en) * 2016-09-15 2018-03-15 Microsoft Technology Licensing, Llc Efficient training of neural networks
US11907760B2 (en) 2016-09-23 2024-02-20 Apple Inc. Systems and methods of memory allocation for neural networks
CN106650922B (en) 2016-09-29 2019-05-03 清华大学 Hardware neural network conversion method, computing device, software and hardware cooperative system
US20180096243A1 (en) 2016-09-30 2018-04-05 General Electric Company Deep learning for data driven feature representation and anomaly detection
WO2018071546A1 (en) 2016-10-11 2018-04-19 The Research Foundation For The State University Of New York System, method, and accelerator to process convolutional neural network layers
US11321609B2 (en) * 2016-10-19 2022-05-03 Samsung Electronics Co., Ltd Method and apparatus for neural network quantization
CN106485316B (en) 2016-10-31 2019-04-02 北京百度网讯科技有限公司 Neural network model compression method and device
CN106502626A (en) 2016-11-03 2017-03-15 北京百度网讯科技有限公司 Data processing method and device
US10216479B2 (en) 2016-12-06 2019-02-26 Arm Limited Apparatus and method for performing arithmetic operations to accumulate floating-point numbers
CN106815551B (en) * 2016-12-08 2019-09-10 新疆农业大学 A kind of optimization method of the variation function parameter fitting of forest inventory control
CN106600070A (en) * 2016-12-20 2017-04-26 郭建峰 Short-period share price prediction algorithm based on IPSO-BP neural network
US10997492B2 (en) 2017-01-20 2021-05-04 Nvidia Corporation Automated methods for conversions to a lower precision data format
CN108345939B (en) * 2017-01-25 2022-05-24 微软技术许可有限责任公司 Neural network based on fixed-point operation
JP7004503B2 (en) * 2017-01-27 2022-01-21 ラピスセミコンダクタ株式会社 Automatic gain control circuit (AGC), despreading circuit and method of reproducing received data
CN106951587A (en) 2017-02-15 2017-07-14 芯启源(南京)半导体科技有限公司 FPGA debugging systems and method
CN106951962B (en) 2017-03-22 2020-09-01 南京地平线机器人技术有限公司 Complex arithmetic unit, method and electronic device for neural network
US10402932B2 (en) 2017-04-17 2019-09-03 Intel Corporation Power-based and target-based graphics quality adjustment
US10332302B2 (en) 2017-04-17 2019-06-25 Intel Corporation Scatter gather engine
KR102258414B1 (en) * 2017-04-19 2021-05-28 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 Processing apparatus and processing method
CN108734287A (en) * 2017-04-21 2018-11-02 展讯通信(上海)有限公司 Compression method and device, terminal, the storage medium of deep neural network model
CN107025629B (en) 2017-04-27 2021-03-26 维沃移动通信有限公司 Image processing method and mobile terminal
KR102034661B1 (en) * 2017-04-28 2019-10-21 서울대학교산학협력단 Method and apparatus for data quantization for neural network
US11842280B2 (en) * 2017-05-05 2023-12-12 Nvidia Corporation Loss-scaling for deep neural network training with reduced precision
US10019668B1 (en) 2017-05-19 2018-07-10 Google Llc Scheduling neural network processing
KR102526650B1 (en) * 2017-05-25 2023-04-27 삼성전자주식회사 Method and apparatus for quantizing data in a neural network
CN115841137A (en) * 2017-06-06 2023-03-24 格兰菲智能科技有限公司 Method and computing device for fixed-point processing of data to be quantized
CN115688877A (en) * 2017-06-06 2023-02-03 格兰菲智能科技有限公司 Method and computing device for fixed-point processing of data to be quantized
US11144828B2 (en) 2017-06-09 2021-10-12 Htc Corporation Training task optimization system, training task optimization method and non-transitory computer readable medium for operating the same
US10944902B2 (en) 2017-06-20 2021-03-09 Adobe Inc. Digital image generation using capture support data
US9916531B1 (en) * 2017-06-22 2018-03-13 Intel Corporation Accumulator constrained quantization of convolutional neural networks
WO2019005088A1 (en) 2017-06-30 2019-01-03 Intel Corporation Heterogeneous multiplier
CN109214509B (en) * 2017-07-05 2021-07-06 中国科学院沈阳自动化研究所 High-speed real-time quantization structure and operation implementation method for deep neural network
CN107451654B (en) 2017-07-05 2021-05-18 深圳市自行科技有限公司 Acceleration operation method of convolutional neural network, server and storage medium
US10427306B1 (en) 2017-07-06 2019-10-01 X Development Llc Multimodal object identification
CN107729990B (en) 2017-07-20 2021-06-08 上海寒武纪信息科技有限公司 Apparatus and method for performing forward operations in support of discrete data representations
CN107451658B (en) 2017-07-24 2020-12-15 杭州菲数科技有限公司 Fixed-point method and system for floating-point operation
CN107480770B (en) * 2017-07-27 2020-07-28 中国科学院自动化研究所 Neural network quantization and compression method and device capable of adjusting quantization bit width
CN107688849B (en) 2017-07-28 2021-04-13 赛灵思电子科技(北京)有限公司 Dynamic strategy fixed-point training method and device
CN107679618B (en) * 2017-07-28 2021-06-11 赛灵思电子科技(北京)有限公司 Static strategy fixed-point training method and device
US11481218B2 (en) 2017-08-02 2022-10-25 Intel Corporation System and method enabling one-hot neural networks on a machine learning compute platform
CN109388779A (en) * 2017-08-03 2019-02-26 珠海全志科技股份有限公司 A kind of neural network weight quantization method and neural network weight quantization device
KR102601604B1 (en) 2017-08-04 2023-11-13 삼성전자주식회사 Method and apparatus for quantizing parameter of neural network
WO2019031858A1 (en) 2017-08-08 2019-02-14 Samsung Electronics Co., Ltd. Method and apparatus for determining memory requirement in a network
US20190050710A1 (en) * 2017-08-14 2019-02-14 Midea Group Co., Ltd. Adaptive bit-width reduction for neural networks
WO2019050771A1 (en) 2017-09-05 2019-03-14 Panasonic Intellectual Property Corporation Of America Execution method, execution device, learning method, learning device, and program for deep neural network
CN107644254A (en) * 2017-09-09 2018-01-30 复旦大学 A kind of convolutional neural networks weight parameter quantifies training method and system
KR20190034985A (en) * 2017-09-25 2019-04-03 삼성전자주식회사 Method and apparatus of artificial neural network quantization
EP3667487B1 (en) 2017-09-29 2023-11-15 Shanghai Cambricon Information Technology Co., Ltd Image processing apparatus and method
US11450319B2 (en) 2017-09-29 2022-09-20 Cambricon (Xi'an) Semiconductor Co., Ltd. Image processing apparatus and method
US10223114B1 (en) 2017-09-29 2019-03-05 Intel Corporation Fixed point to floating point conversion
US11437032B2 (en) 2017-09-29 2022-09-06 Shanghai Cambricon Information Technology Co., Ltd Image processing apparatus and method
US10224954B1 (en) 2017-09-29 2019-03-05 Intel Corporation Floating point to fixed point conversion
CN107679490B (en) * 2017-09-29 2019-06-28 百度在线网络技术(北京)有限公司 Method and apparatus for detection image quality
JP6540770B2 (en) 2017-10-17 2019-07-10 富士通株式会社 Arithmetic processing circuit, arithmetic processing unit including arithmetic processing circuit, information processing apparatus including arithmetic processing unit, and method
KR102564456B1 (en) * 2017-10-19 2023-08-07 삼성전자주식회사 Method and apparatus for quantizing parameter of neural network
US10410121B2 (en) 2017-10-25 2019-09-10 SparkCognition, Inc. Adjusting automated neural network generation based on evaluation of candidate neural networks
US20210061028A1 (en) 2017-10-26 2021-03-04 Applied Mechatronic Products Apparatus and method for vehicular monitoring, analysis, and control
KR20190054454A (en) * 2017-11-13 2019-05-22 삼성전자주식회사 Method and apparatus of artificial neural network quantization
US10783634B2 (en) 2017-11-22 2020-09-22 General Electric Company Systems and methods to deliver point of care alerts for radiological findings
US10803379B2 (en) 2017-12-12 2020-10-13 Amazon Technologies, Inc. Multi-memory on-chip computational network
CN108053028B (en) * 2017-12-21 2021-09-14 深圳励飞科技有限公司 Data fixed-point processing method and device, electronic equipment and computer storage medium
US11636327B2 (en) 2017-12-29 2023-04-25 Intel Corporation Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism
US11373088B2 (en) 2017-12-30 2022-06-28 Intel Corporation Machine learning accelerator mechanism
CN108288089A (en) * 2018-01-29 2018-07-17 百度在线网络技术(北京)有限公司 Method and apparatus for generating convolutional neural networks
CN108229663A (en) * 2018-01-29 2018-06-29 百度在线网络技术(北京)有限公司 For generating the method and apparatus of convolutional neural networks
US20190251429A1 (en) 2018-02-12 2019-08-15 Kneron, Inc. Convolution operation device and method of scaling convolution input for convolution neural network
US11106598B2 (en) 2018-02-13 2021-08-31 Shanghai Cambricon Information Technology Co., Ltd. Computing device and method
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US11169803B2 (en) 2018-02-13 2021-11-09 Shanghai Cambricon Information Technology Co., Ltd. Computing device and method
CN116991225A (en) 2018-02-14 2023-11-03 上海寒武纪信息科技有限公司 Control device, method and equipment of processor
JP7056225B2 (en) 2018-02-26 2022-04-19 富士通株式会社 Arithmetic processing unit, information processing unit, information processing method, and program
US10628275B2 (en) 2018-03-07 2020-04-21 Nxp B.V. Runtime software-based self-test with mutual inter-core checking
US11475306B2 (en) 2018-03-22 2022-10-18 Amazon Technologies, Inc. Processing for multiple input data sets
CN108631727B (en) * 2018-03-26 2019-08-09 河北工业大学 A kind of solar panel defect identification method based on convolutional neural networks
CN108491928B (en) * 2018-03-29 2019-10-25 腾讯科技(深圳)有限公司 Model parameter sending method, device, server and storage medium
CN108509627B (en) * 2018-04-08 2021-08-31 腾讯科技(深圳)有限公司 Data discretization model training method and device and data discretization method
CN108510067B (en) 2018-04-11 2021-11-09 西安电子科技大学 Convolutional neural network quantification method based on engineering realization
US11562213B2 (en) 2018-04-17 2023-01-24 Intel Corporation Methods and arrangements to manage memory in cascaded neural networks
CN108596328B (en) * 2018-04-26 2021-02-02 北京市商汤科技开发有限公司 Fixed point method and device and computer equipment
US10691413B2 (en) 2018-05-04 2020-06-23 Microsoft Technology Licensing, Llc Block floating point computations using reduced bit-width vectors
WO2019218896A1 (en) 2018-05-18 2019-11-21 上海寒武纪信息科技有限公司 Computing method and related product
CN108717570A (en) 2018-05-23 2018-10-30 电子科技大学 A kind of impulsive neural networks parameter quantification method
CN110554500B (en) 2018-05-31 2022-09-16 中强光电股份有限公司 Head-mounted display device
US10360304B1 (en) 2018-06-04 2019-07-23 Imageous, Inc. Natural language processing interface-enabled building conditions control system
CN109062540B (en) 2018-06-06 2022-11-25 北京理工大学 Reconfigurable floating point operation device based on CORDIC algorithm
CN109063820A (en) 2018-06-07 2018-12-21 中国科学技术大学 Utilize the data processing method of time-frequency combination Recognition with Recurrent Neural Network when long
CN108830331A (en) * 2018-06-22 2018-11-16 西安交通大学 A kind of Ground Penetrating Radar object detection method based on full convolutional network
CN109102064B (en) * 2018-06-26 2020-11-13 杭州雄迈集成电路技术股份有限公司 High-precision neural network quantization compression method
CN109146057B (en) * 2018-06-26 2020-12-08 杭州雄迈集成电路技术股份有限公司 High-precision neural network engineering method based on table lookup calculation
CN110728364A (en) 2018-07-17 2020-01-24 上海寒武纪信息科技有限公司 Arithmetic device and arithmetic method
EP3798850A4 (en) 2018-06-27 2022-03-23 Shanghai Cambricon Information Technology Co., Ltd On-chip code breakpoint debugging method, on-chip processor, and chip breakpoint debugging system
CN109002889B (en) * 2018-07-03 2021-12-17 华南理工大学 Adaptive iterative convolution neural network model compression method
CN109214504B (en) * 2018-08-24 2020-09-04 北京邮电大学深圳研究院 FPGA-based YOLO network forward reasoning accelerator design method
WO2020042739A1 (en) 2018-08-28 2020-03-05 中科寒武纪科技股份有限公司 Data preprocessing method and apparatus, computer device, and storage medium
WO2020062392A1 (en) 2018-09-28 2020-04-02 上海寒武纪信息科技有限公司 Signal processing device, signal processing method and related product
CN109472353B (en) 2018-11-22 2020-11-03 浪潮集团有限公司 Convolutional neural network quantization circuit and method
CN109598331A (en) * 2018-12-04 2019-04-09 北京芯盾时代科技有限公司 A kind of fraud identification model training method, fraud recognition methods and device
CN109685202B (en) 2018-12-17 2023-03-21 腾讯科技(深圳)有限公司 Data processing method and device, storage medium and electronic device
CN109754074A (en) * 2018-12-29 2019-05-14 北京中科寒武纪科技有限公司 A kind of neural network quantization method, device and Related product
GB2580171B (en) * 2018-12-21 2021-02-17 Imagination Tech Ltd Methods and systems for selecting quantisation parameters for deep neural networks using back-propagation
CN111383638A (en) 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 Signal processing device, signal processing method and related product
CN109800865B (en) * 2019-01-24 2021-03-23 北京市商汤科技开发有限公司 Neural network generation and image processing method and device, platform and electronic equipment
US20190164057A1 (en) * 2019-01-30 2019-05-30 Intel Corporation Mapping and quantification of influence of neural network features for explainable artificial intelligence
CN109859135B (en) * 2019-01-31 2021-05-07 北京邮电大学 Image enhancement processing method applied to associated imaging
CN109800877B (en) 2019-02-20 2022-12-30 腾讯科技(深圳)有限公司 Parameter adjustment method, device and equipment of neural network
CN109902745A (en) 2019-03-01 2019-06-18 成都康乔电子有限责任公司 A kind of low precision training based on CNN and 8 integers quantization inference methods
CN109993296B (en) 2019-04-01 2020-12-29 安徽寒武纪信息科技有限公司 Quantitative implementation method and related product
CN110059733A (en) 2019-04-01 2019-07-26 苏州科达科技股份有限公司 The optimization and fast target detection method, device of convolutional neural networks
US11847554B2 (en) 2019-04-18 2023-12-19 Cambricon Technologies Corporation Limited Data processing method and related products
CN111832738B (en) 2019-04-18 2024-01-09 中科寒武纪科技股份有限公司 Data processing method and related product
US20200364552A1 (en) * 2019-05-13 2020-11-19 Baidu Usa Llc Quantization method of improving the model inference accuracy
US11531893B2 (en) * 2019-06-03 2022-12-20 Samsung Electronics Co., Ltd. Method and apparatus with neural network parameter quantization
US11676028B2 (en) * 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products
JP7146954B2 (en) 2019-08-23 2022-10-04 安徽寒武紀信息科技有限公司 DATA PROCESSING METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
WO2021036904A1 (en) 2019-08-23 2021-03-04 安徽寒武纪信息科技有限公司 Data processing method, apparatus, computer device, and storage medium
JP7146955B2 (en) 2019-08-23 2022-10-04 安徽寒武紀信息科技有限公司 DATA PROCESSING METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
WO2021036905A1 (en) 2019-08-27 2021-03-04 安徽寒武纪信息科技有限公司 Data processing method and apparatus, computer equipment, and storage medium
CN110780845B (en) 2019-10-17 2021-11-30 浙江大学 Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN108229681A (en) * 2017-12-28 2018-06-29 郑州云海信息技术有限公司 A kind of neural network model compression method, system, device and readable storage medium storing program for executing
CN109740754A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Neural computing device, neural computing method and Related product
CN109740739A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Neural computing device, neural computing method and Related product

Also Published As

Publication number Publication date
JP7167405B2 (en) 2022-11-09
CN112085183B (en) 2024-04-02
JP2021177369A (en) 2021-11-11
CN112085181B (en) 2024-03-29
EP3998554A1 (en) 2022-05-18
CN112085193A (en) 2020-12-15
CN112085186A (en) 2020-12-15
JP7166704B2 (en) 2022-11-08
KR20210011462A (en) 2021-02-01
CN112085190A (en) 2020-12-15
US11675676B2 (en) 2023-06-13
CN112085188A (en) 2020-12-15
WO2020248424A1 (en) 2020-12-17
CN112085191B (en) 2024-04-02
CN112085189A (en) 2020-12-15
US20220261634A1 (en) 2022-08-18
EP3998554A4 (en) 2023-11-15
CN112085185B (en) 2024-04-02
KR20210018352A (en) 2021-02-17
KR20210011461A (en) 2021-02-01
JP2021530769A (en) 2021-11-11
EP3770823A4 (en) 2021-01-27
CN112085184B (en) 2024-03-29
US20210286688A1 (en) 2021-09-16
CN112085193B (en) 2024-03-29
EP3772022A1 (en) 2021-02-03
CN112085183A (en) 2020-12-15
EP3770823A1 (en) 2021-01-27
CN112085185A (en) 2020-12-15
EP3772023A1 (en) 2021-02-03
WO2020248423A1 (en) 2020-12-17
KR102609719B1 (en) 2023-12-04
CN112085181A (en) 2020-12-15
CN112085192B (en) 2024-03-29
CN111652367A (en) 2020-09-11
CN111652368A (en) 2020-09-11
CN112085188B (en) 2024-04-02
CN112085189B (en) 2024-03-29
CN112085184A (en) 2020-12-15
CN112085192A (en) 2020-12-15
CN111652367B (en) 2024-04-09
CN111652368B (en) 2024-03-29
CN112400176A (en) 2021-02-23
CN112085191A (en) 2020-12-15
JP2021179966A (en) 2021-11-18
CN112085190B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN112085186B (en) Method for determining quantization parameter of neural network and related product
US11676028B2 (en) Neural network quantization parameter determination method and related products
JP7146954B2 (en) DATA PROCESSING METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
JP7146953B2 (en) DATA PROCESSING METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
JP7146955B2 (en) DATA PROCESSING METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
JP7146952B2 (en) DATA PROCESSING METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
CN110717585B (en) Training method of neural network model, data processing method and related product
JPWO2020248424A5 (en)
CN113947177A (en) Quantization calibration method, calculation device and computer readable storage medium
CN112085176B (en) Data processing method, device, computer equipment and storage medium
US20220222041A1 (en) Method and apparatus for processing data, and related product
WO2023201424A1 (en) System and method for adaptation of containers for floating-point data for training of a machine learning model
CN112085177A (en) Data processing method, data processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant