WO2022111002A1 - Method, device and computer-readable storage medium for training a neural network - Google Patents

Method, device and computer-readable storage medium for training a neural network

Info

Publication number
WO2022111002A1
Authority
WO
WIPO (PCT)
Prior art keywords: data, quantization, parameter, update, neural network
Prior art date
Application number
PCT/CN2021/119122
Other languages
English (en)
French (fr)
Inventor
周诗怡
刘少礼
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011379857.0A (CN114580624A)
Priority claimed from CN202011379869.3A (CN114580625A)
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2022111002A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This disclosure relates generally to the field of artificial intelligence. More specifically, the present disclosure relates to methods, apparatus, integrated circuits, boards, and computer-readable storage media for training neural networks through hardware platforms.
  • The data range that can be represented by relatively low bit-width data is limited.
  • For a 16-bit floating-point number, for example, the minimum positive value that it can represent is about 5.9604 × 10^-8.
  • When an intermediate result falls below this minimum, the resulting value is beyond what a 16-bit floating-point number can represent.
  • If the value is nevertheless still represented by a 16-bit floating-point number, it becomes 0. Obviously, such a zeroing operation has a significant impact on the accuracy and precision of the calculation.
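  • As a concrete illustration of this underflow (a minimal sketch, not taken from the patent text), the NumPy snippet below shows a small gradient value that is representable as a 32-bit float but flushes to zero when stored as a 16-bit float:

```python
import numpy as np

grad_fp32 = np.float32(2.0e-8)     # representable as a 32-bit float
grad_fp16 = np.float16(grad_fp32)  # below ~5.96e-8, the smallest positive float16

print(grad_fp32)  # 2e-08
print(grad_fp16)  # 0.0 -- the gradient information is lost
```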
  • the present disclosure provides the following technical solutions in various aspects.
  • The present disclosure provides an apparatus for training a neural network, wherein training the neural network includes forward propagation and back propagation performed iteratively, the apparatus comprising: a scaling circuit configured to scale, according to a scaling factor, the loss value obtained by the forward propagation to obtain a scaled loss value; an update circuit configured to perform an update operation in the back propagation based on the scaled loss value; and an adjustment circuit configured to adjust the scaling factor, based on at least the gradient data in the back propagation, for scaling of the loss value in the next-generation back propagation.
  • the present disclosure provides an integrated circuit that includes the apparatus as described above and discussed in various embodiments below.
  • the present disclosure provides a board including the apparatus as described above and discussed in various embodiments below.
  • The present disclosure provides a method for training a neural network, wherein training the neural network comprises iteratively performing forward propagation and back propagation, the method comprising: scaling the loss value obtained by the forward propagation according to a scaling factor to obtain a scaled loss value; performing an update operation in the back propagation based on the scaled loss value; and adjusting, based on at least the gradient data in the back propagation, the scaling factor to be used for scaling of the loss value in the next generation of back propagation.
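  • Purely as an illustration of these three steps (a minimal sketch with an assumed one-parameter model, not the patent's implementation):

```python
def train_step(w, x, target, lr, scale):
    """One illustrative iteration on a toy model y = w * x with squared-error
    loss (an assumption for illustration, not the patent's network), showing:
    scale the loss, update based on the scaled loss, and leave room to adjust
    the scaling factor."""
    # Forward propagation.
    y = w * x
    loss = (y - target) ** 2

    # Scale the loss value obtained by the forward propagation.
    scaled_loss = loss * scale

    # Back propagation based on the scaled loss: the gradient of scaled_loss
    # with respect to w is scale * 2 * (y - target) * x.
    scaled_grad = scale * 2.0 * (y - target) * x

    # Update operation: divide the scaling factor back out before applying it.
    w = w - lr * (scaled_grad / scale)

    # The scaling factor would then be adjusted from the gradient data for the
    # next generation of back propagation (one concrete rule appears later).
    return w, scaled_grad, scaled_loss
```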
  • the present disclosure provides an apparatus for training a neural network.
  • the device includes at least one processor.
  • The apparatus also includes at least one memory having computer program code stored thereon, the at least one memory and the computer program code being configured to, together with the processor, cause the apparatus to perform the aforementioned methods and the methods described below.
  • Various embodiments of the method are possible.
  • The present disclosure provides a computer-readable storage medium having stored thereon a computer program for training a neural network which, when executed by one or more processors, implements the aforementioned methods and the various embodiments of the method described below.
  • The solution of the present disclosure can adaptively and promptly adjust the size of the loss value in the back propagation of training the neural network, so that when low-precision operation data are used for the reverse update operation, result errors caused by the overly small expression range of the low-precision data type are avoided, thereby improving the accuracy and precision of the operation.
  • With the adjustment of the loss value in this scheme, it becomes more practical to quantize high-precision data into low-precision data to participate in the operations, thereby expanding the operation scenarios of the neural network.
  • The solution of the present disclosure supports the quantization of high-precision data into low-precision data to participate in the relevant operations in the neural network.
  • The calculation scenario is therefore also not limited by the number of operation bits that the processor chip can support, thereby expanding the usage scenarios of the processor.
  • the disclosed scheme also speeds up the neural network training process and reduces computational overhead and power consumption.
  • A neural network trained by the solution of the present disclosure can be widely used in various fields such as image processing, speech recognition and data acquisition, and also greatly improves the efficiency and reduces the cost in these fields.
  • FIG. 1 is an exemplary block diagram illustrating a neural network to which technical solutions of the present disclosure may be applied;
  • FIG. 2 is a functional block diagram illustrating an apparatus for training a neural network according to an embodiment of the present disclosure
  • FIG. 3 is an exemplary flow diagram illustrating update operations in forward propagation and back propagation in a neural network according to an embodiment of the present disclosure
  • FIG. 4 is a graph illustrating a principle involving quantization error according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart illustrating a method for training a neural network according to an embodiment of the present disclosure
  • FIG. 6 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • The solution of the present disclosure is mainly applied in the field of artificial intelligence, especially in the efficient training of neural networks. Therefore, in order to facilitate understanding of the solution of the present disclosure, the neural network architecture involved in the present disclosure and its working principle are first introduced below.
  • A neural network ("Neural Network", abbreviated "NN") is a mathematical model that imitates the structure and function of a biological neural network and performs calculations through a large number of interconnected neurons. Therefore, a neural network is a computational model consisting of a large number of nodes (or "neurons") connected to each other. Each node represents a specific output function, called an activation function ("activation function"). The connection between each two neurons represents a weighted value of the signal passing through the connection, called the weight, which is equivalent to the memory of the neural network. The output of the neural network varies according to the way the neurons are connected and according to the weights and activation functions. In a neural network, the neuron is the basic unit.
  • a connection is the connection of a neuron to another layer or to another neuron in the same layer, and the connection is accompanied by a weight associated with it.
  • the bias is an extra input to the neuron, which is always 1 and has its own connection weight. This ensures that the neuron will fire even if all inputs are empty (all 0).
  • In application, if a non-linear function is not applied to the neurons in the neural network, the neural network is just a linear function and is then no more powerful than a single neuron. If the output of a neural network is between 0 and 1, for example in the case of cat and dog discrimination, an output close to 0 can be regarded as a cat and an output close to 1 can be regarded as a dog.
  • an activation function such as a sigmoid activation function, is introduced into the neural network. Regarding this activation function, the return value is usually a number between 0 and 1. Therefore, the activation function is used to introduce nonlinearity into the neural network, which narrows the results of the neural network operation to a smaller range. In fact, it doesn't matter how the activation function is expressed, what matters is that a nonlinear function is parameterized by some weights, and the nonlinear function can be changed by changing these weights.
  • FIG. 1 is an exemplary block diagram illustrating a neural network 100 to which the technical solutions of the present disclosure may be applied.
  • The neural network 100 includes an input layer, an output layer and a plurality of hidden layers located between the input layer and the output layer, which are exemplarily shown in FIG. 1 as a convolutional layer, an activation layer, a pooling layer and a fully connected layer.
  • the neurons of the input layer are called input neurons, in this case 3 input neurons are drawn, which receive 3 input signals x1, x2, x3.
  • the input layer acts as the first layer in a neural network, accepting required input signals (values) and passing them to the next layer.
  • the input layer does not operate on the input signal (value) and has no associated weights and biases.
  • the input layer can handle multidimensional data.
  • The input layer of a one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array, where the one-dimensional array is usually time or spectral samples and a two-dimensional array can contain multiple channels; the input layer of a two-dimensional convolutional neural network receives a 2D or 3D array; the input layer of a 3D convolutional neural network receives a 4D array, and so on.
  • data preprocessing operations can also be performed at the input layer, for example, operations such as de-averaging, normalizing, and dimensionality reduction can be performed on the data.
  • Hidden layers contain neurons (nodes) used to apply different transformations to the input data.
  • The neural network shown in Figure 1 includes four hidden layers, namely a convolutional layer with 4 neurons (nodes), an activation layer with 4 neurons, a pooling layer with 2 neurons, and a fully connected layer with 6 neurons. Finally, the operation value of the fully connected layer is passed to the output layer. The neurons of the output layer are called output neurons. The output layer receives the output from the last hidden layer. In the neural network shown in Figure 1, the output layer has 2 neurons with 2 output signals y1 and y2. Among the hidden layers shown, depending on the particular hidden layer, each neuron of each hidden layer may or may not be connected to every neuron of the next layer; for example, the activation layer and the pooling layer are partially connected, while the pooling layer and the fully connected layer are fully connected.
  • The first hidden layer in this example, the convolutional layer, usually extracts features from the input data. It can contain multiple convolution kernels, and each element of a convolution kernel can correspond to a weight coefficient and a bias, similar to the neurons of a feedforward neural network.
  • Each feature in the image is first perceived locally, and the local information is then aggregated at a higher level to obtain global information.
  • The parameters of the convolutional layer can include the size of the convolution kernel, the stride and the padding, which together determine the size of the output feature map of the convolutional layer and are hyperparameters of the convolutional neural network.
  • Each neuron in the convolutional layer is connected to multiple neurons in a nearby region of the previous layer, and the size of the region depends on the size of the convolution kernel.
  • When the convolution kernel is working, it regularly scans the input features, performs element-wise multiplication and summation (multiply-add) on the input features, and adds the bias.
  • the activation layer that receives the output of the above convolutional layer is actually a non-linear mapping of the output of the convolutional layer.
  • Commonly used activation functions are the Sigmoid function, the Tanh function, the ReLU function, the Leaky ReLU function, the ELU function and the Maxout function. After passing through these activation functions, the output of the previous layer becomes relatively complex, thereby improving the expressive ability of the neural network model.
  • the pooling layer is mainly used for feature dimension reduction, compressing the number of data and parameters, reducing overfitting, and improving the fault tolerance of the model.
  • pooling methods mainly include max pooling and average pooling.
  • the output feature map is passed to the pooling layer for feature selection and information filtering.
  • the pooling layer contains a pre-set pooling function whose function is to replace the result of a single point in the feature map with the feature map statistics of its neighboring regions.
  • the pooling layer selects the pooling region in the same step as the convolution kernel scans the feature map, which can be controlled by the pooling size, stride and padding.
  • the signal processing flow of the neural network reaches the fully connected layer, which is located in the last part of the hidden layer of the neural network in this example.
  • the feature map loses its spatial topology in the fully connected layer, and is expanded into a vector and output through the activation function.
  • the fully connected layer can non-linearly combine the extracted features to obtain the output, that is, the fully connected layer itself is not expected to have feature extraction ability, but tries to use the existing high-order features to complete the learning goal.
  • Operations such as local response normalization (LRN) and data augmentation can also be performed in the fully connected layer to increase the robustness of the neural network.
  • For each layer in the neural network, there are one or more operators (described in detail in conjunction with FIG. 2) associated with the layer to perform the corresponding computational operations.
  • An operator in a neural network is a mapping from a function space to a function space. Broadly speaking, any operation on any function can be considered as an operator.
  • an operator can be a mapping, relation or transformation.
  • For example, the convolution operator for a convolutional layer (or other layers that need to perform convolution operations) can be embodied as one or more convolution calculation formulas.
  • the structure of the neural network shown in FIG. 1 and the functions of its nodes have been exemplarily described above.
  • a large amount of sample data (including input and output) will be provided in advance to train the initial neural network.
  • the trained neural network is obtained.
  • the trained neural network can give a correct output for future real-world input.
  • a loss function is a measure of how well a neural network is performing at a particular task.
  • The loss function can be obtained as follows: in the process of training a neural network, for each sample, the input is passed along the neural network to obtain an output value, and the difference between the output value and the expected value is then squared. The loss function calculated in this way is the distance between the predicted value and the true value, and the purpose of training the neural network is to reduce this distance, that is, the value of the loss function.
  • The loss function can be expressed, for example, in the mean-squared-error form loss = (1/m) · Σ_{i=1}^{m} (y_i − ŷ_i)^2 (1), where ŷ_i is the value predicted by the neural network for sample i, and where:
  • y represents the expected value
  • i is the index of each sample data in the sample data set.
  • m is the number of sample data in the sample data set.
  • a dataset consists of pictures of cats and dogs. If the picture is a dog, the corresponding label is 1, and if the picture is a cat, the corresponding label is 0. This label corresponds to the expected value y in the above formula.
  • What one actually wants to obtain through the neural network is the recognition result, that is, whether the animal in the image is a cat or a dog.
  • To compute the loss function, it is necessary to traverse each sample image in the sample data set to obtain the actual result corresponding to each sample image, and then calculate the loss function as defined above.
  • The value of the loss function is referred to as the loss value.
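  • As a small worked example (illustrative values only, not taken from the patent):

```python
def mse_loss(predicted, expected):
    """Mean squared error over a sample data set, following equation (1):
    the squared difference between predicted and expected values, averaged
    over the m samples."""
    m = len(expected)
    return sum((y_hat - y) ** 2 for y_hat, y in zip(predicted, expected)) / m

# Labels: 1 for "dog", 0 for "cat"; predictions are the network outputs.
expected  = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.1]
print(mse_loss(predicted, expected))  # ~0.055
```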
  • When starting to train a neural network, the weights need to be randomly initialized. In most cases, the initialized neural network does not provide a good result; through training, a network with high accuracy can be obtained from an initially poor one.
  • The training process of the neural network is divided into two stages. The first stage is the forward processing of the signal (referred to in this disclosure as the "forward propagation process"), in which the training passes from the input layer through the hidden layers and finally reaches the output layer.
  • The second stage is the back-propagation of gradients (referred to as the back propagation process in this disclosure), in which training proceeds from the output layer to the hidden layers and finally to the input layer, and the weights and biases of each layer in the neural network are adjusted in turn according to the gradients.
  • The input value is fed into the input layer of the neural network, and the so-called predicted value can be obtained from the output layer of the neural network through the corresponding operations performed by the related operators of the multiple hidden layers.
  • When the input value is provided to the input layer of the neural network, it may be left unchanged or undergo some necessary preprocessing depending on the application scenario.
  • the second hidden layer obtains the predicted intermediate value from the first hidden layer and performs computation and activation operations, and then passes the obtained predicted intermediate value to the next hidden layer. Do the same in later layers, and finally get the output value in the output layer of the neural network.
  • an output value called the predicted value is usually obtained.
  • the predicted value can be compared with the actual output value to obtain the corresponding error value.
  • the chain rule of differential calculus can be used to update the weights of each layer, in order to obtain a lower error value relative to the previous one in the next forward propagation process.
  • The derivatives of the error value with respect to the weights of the last layer of the neural network are calculated first (these derivatives are called gradients). These gradients are then used to calculate the gradients of the penultimate layer in the neural network. This process is repeated until the gradient corresponding to each weight in the neural network is obtained. Finally, the corresponding gradient is subtracted from each weight in the neural network to update the weights once and reduce the error value.
  • Similar to the use of various types of operators (referred to as forward operators in this disclosure) in the forward propagation process, the corresponding back propagation process also uses reverse operators corresponding to those forward operators.
  • For example, the convolution operator of the aforementioned convolutional layer includes a forward convolution operator used in the forward propagation process and a deconvolution operator used in the back propagation process.
  • Completing one forward propagation process of forward signal processing and the corresponding back propagation process of the error, and updating the weights in the neural network once using the gradients, is called one iteration. In order to obtain a neural network with the desired accuracy, a huge sample data set is required during the training process, and it is almost impossible to input the entire sample data set into a computing device (such as a computer) at one time. Therefore, to solve this problem, the sample data set needs to be divided into multiple blocks and transmitted to the computer block by block. After each block of the data set has been processed in the forward propagation process, the corresponding operation of updating the weights of the neural network is carried out in the back propagation process.
  • the data of the neural network is represented by a high-precision data format (such as floating point numbers).
  • Based on the arithmetic representations of floating-point numbers and fixed-point numbers, for floating-point and fixed-point operations of the same length, the floating-point operation has a more complex computational mode and requires more logic devices to form a floating-point arithmetic unit.
  • In terms of size, a floating-point arithmetic unit is therefore larger than a fixed-point arithmetic unit.
  • Moreover, a floating-point arithmetic unit consumes more resources, so that the power consumption gap between fixed-point and floating-point operations is usually of an order of magnitude, resulting in a significant difference in computational cost.
  • Furthermore, fixed-point operations are faster than floating-point operations and the loss of precision is not large, so it is a feasible scheme to use fixed-point operations in artificial intelligence chips to process a large number of neural network operations (such as convolution and fully connected operations).
  • floating-point data involving inputs, weights, and gradients of forward convolution, forward full connection, reverse convolution, and reverse full connection operators can be quantized and then subjected to fixed-point operations. After the operation is completed, the low-precision data is converted into high-precision data.
  • Taking the case where the quantized weights are all 8-bit fixed-point numbers (a low-precision type relative to floating-point numbers) as an example: since there are often millions of connections in a neural network, almost all of the space is occupied by the weights of the neuron connections, and these weights may all be different floating-point numbers.
  • the weights of each layer tend to be normally distributed in a certain interval, such as (-3.0, 3.0).
  • the maximum and minimum values corresponding to the weights of each layer in the neural network are saved, and each floating-point value is represented by an 8-bit fixed-point number.
  • Each quantization interval is represented by an 8-bit fixed-point number. For example, in the interval (-3.0, 3.0), byte 0 represents -3.0 and byte 255 represents 3.0; by analogy, byte 128 represents 0.
  • Two parameters are usually involved in the quantization: shift and n, where shift is the point position of the fixed-point number (that is, the "point position parameter" in this disclosure) and n is the bit width of the fixed-point number (that is, the "bit width parameter" in this disclosure). n can be set manually at the beginning, and shift is calculated from the distribution range of the data to be quantized and n by the following formula:
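  • Purely as an illustrative sketch (the rule below is a common choice and an assumption, not necessarily the patent's exact formula), one way to compute shift is to pick the smallest point position at which the largest magnitude in the data still fits the n-bit signed range, i.e. shift = ceil(log2(max|x| / (2^(n-1) − 1))):

```python
import math

def compute_shift(data, n):
    # Illustrative point-position rule (assumption): choose shift so that the
    # largest magnitude in the data fits into an n-bit signed fixed-point number.
    absmax = max(abs(v) for v in data)
    return math.ceil(math.log2(absmax / (2 ** (n - 1) - 1)))

def quantize(data, shift, n):
    # Fixed-point value = round(x / 2^shift), clamped to the n-bit signed range.
    lo, hi = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    return [max(lo, min(hi, round(v / 2 ** shift))) for v in data]

def dequantize(qdata, shift):
    # Back to (approximate) floating point: x ≈ q * 2^shift.
    return [q * 2 ** shift for q in qdata]

weights = [-2.7, -0.3, 0.0, 1.2, 2.9]   # roughly in (-3.0, 3.0)
shift = compute_shift(weights, n=8)     # -5 here: step size 2^-5 = 0.03125
q = quantize(weights, shift, n=8)
print(q, dequantize(q, shift))
```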
  • the above quantized fixed-point number is beneficial to the accelerated training of the neural network, reducing the chip size and significantly reducing the computational cost.
  • the above-mentioned quantization operation may be performed on neuron data and weight data in the forward propagation of training the neural network, and the quantization operation may be performed on the gradient data used for the update operation in the backward propagation of the training neural network.
  • Some related operators are described in detail later in conjunction with FIG. 3.
  • the quantization operation can be optimized to make full use of the quantization operation without introducing too much quantization overhead, thereby speeding up the neural network training, improving training accuracy and reducing computational overhead.
  • the present disclosure proposes an effective loss value adjustment scheme to overcome the above-mentioned defects, improve the training accuracy of the neural network, and thus speed up the training.
  • FIG. 2 is a functional block diagram illustrating an apparatus 200 for training a neural network according to an embodiment of the present disclosure.
  • the apparatus 200 includes a scaling circuit 202 , an update circuit 204 and an adjustment circuit 206 .
  • apparatus 200 may also include quantization circuitry 208 .
  • The scaling circuit may be configured to scale, according to the scaling factor, the loss value obtained by forward propagation in the process of training the neural network, so as to obtain the scaled loss value. By amplifying the loss value, errors in subsequent update operations due to the limited numerical expression range of low-precision data, which would otherwise delay the entire training process and make it inefficient, can be effectively avoided. For ease of understanding, assuming that the loss value of the present disclosure is denoted as loss, it can be scaled by the following formula:
  • Loss_scale = loss × scale (6)
  • scale represents the scaling factor of the present disclosure
  • Loss_scale represents the loss value after scaling.
  • the update circuit may be configured to perform an update operation in backpropagation based on the above-mentioned scaled loss value.
  • the update operations here may involve updates to the weights and updates of gradient data passed from the previous layer to the next layer in the back-propagation direction.
  • The update operation of back propagation involves quantization operations on various types of data, such as the quantization operation from high-precision data (such as floating-point numbers) to low-precision data (such as fixed-point numbers) described above, and the inverse quantization operation from low-precision data back to high-precision data.
  • the quantization circuit 208 of the device 200 may be configured to perform a quantization operation on the operational data according to the quantization parameters, and to determine whether to update the aforementioned quantization parameters based on the operational data. Further, the adjustment circuit may be configured to adjust the scaling factor at least according to the gradient data in the back-propagation for scaling of the loss value in the next-generation back-propagation. In an implementation scenario, the adjustment circuit may further be configured to adjust the aforementioned scaling factor for scaling the loss value in the next generation backpropagation when the quantization circuit determines to update the quantization parameter. In some embodiments, the scaling factor may be adjusted based on gradient data in backpropagation.
  • the above-mentioned operation data may include various types of data in the neural network training process.
  • the operational data may include gradient data.
  • the quantization parameter may include a first point position parameter or a first bit width parameter applied to the quantized gradient data
  • The adjustment circuit may be configured to adjust the scaling factor, when the first point position parameter or the first bit width parameter is updated, for scaling of the loss value in the next generation of back propagation.
  • the operational data may include gradient data and neuron data
  • the quantization parameter may include a second point position parameter or a second bit width parameter for the gradient data and neuron data.
  • the adjustment circuit of the present disclosure may be configured to adjust the scaling factor for scaling of the loss value in the next generation backpropagation when the second point position parameter or the second bit width parameter is updated.
  • the operation data may include gradient data, neuron data and weight data
  • the quantization parameters include a third point position parameter or a third bit width parameter for the three
  • The adjustment circuit may be configured to adjust the scaling factor for scaling of the loss value in the next-generation back propagation when the third point position parameter or the third bit width parameter is updated.
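  • The control flow shared by these variants can be sketched as follows (a hypothetical helper for illustration; the concrete adjustment rule is discussed further below):

```python
def maybe_adjust_scale(quant_param_updated, grads, scale, adjust_rule):
    """Illustrative control flow only (an assumption drawn from the description
    above): the loss-scaling factor is re-adjusted from the gradient data only
    when the quantization circuit has decided to update a quantization
    parameter (a point position parameter or a bit width parameter)."""
    return adjust_rule(grads) if quant_param_updated else scale
```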
  • The quantization circuit of the present disclosure may include a forward quantization circuit and an inverse quantization circuit, wherein the forward quantization circuit may be configured to quantize operation data of a high-precision data type (e.g., floating-point type) into operation data of a low-precision data type (e.g., fixed-point type) according to a quantization parameter.
  • the inverse quantization circuit may be configured to inverse quantize the operation data of the low-precision data type into the operation data of the high-precision data type according to the quantization parameter.
  • the quantization circuit of the present disclosure can also make a judgment on whether to update the quantization parameter.
  • the quantization circuit may be configured to determine whether to update the quantization parameter based on the quantization error of the operational data.
  • the quantization circuit of the present disclosure may be configured to perform an operation based on the mean value of the operation data before and after quantization to determine the aforementioned quantization error.
  • the update circuit of the present disclosure may be configured to utilize the scaled loss values to obtain weight gradient data.
  • the inverse quantization circuit in the above quantization circuit may be configured to inverse quantize the weight gradient data from a low-precision data type (eg, fixed-point number) to weight gradient data of a high-precision data type (eg, floating-point number).
  • the update circuit may be configured to update the weights using the scaling factor and the weight gradient data of the high precision data type.
  • For example, when the obtained loss value is a floating-point number with a length of 16 bits, it can be enlarged by the scaling factor before performing the reverse update operation (including quantization into fixed-point numbers for calculation). When the weight gradient is calculated, it can be converted into high-precision data (such as a 32-bit floating-point number), and the 32-bit floating-point weight gradient data can then be scaled down by the reciprocal of the scaling factor to update the weights.
  • the weight gradient w_grad for updating the weights of each layer can be obtained.
  • w_grad can be converted into high-precision data (for example, from a relatively low-precision 16-bit floating-point number to a relatively high-precision 32-bit floating-point number), and the actual weight gradient is then calculated by the following formula (7):
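  • Consistent with the description above, formula (7) presumably amounts to dividing the high-precision weight gradient by the scaling factor; the sketch below treats that reading as an assumption (values are illustrative only):

```python
import numpy as np

def unscale_weight_gradient(w_grad_fp16, scale):
    # First convert the weight gradient to high precision, then divide out the
    # loss-scaling factor to recover the actual weight gradient
    # (assumed content of formula (7)).
    w_grad_fp32 = np.float32(w_grad_fp16)
    return w_grad_fp32 / np.float32(scale)

w = np.float32(0.5)
lr = np.float32(0.01)
w_grad = np.float16(4.8)   # gradient obtained from the scaled loss
w = w - lr * unscale_weight_gradient(w_grad, scale=1024.0)
print(w)
```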
  • the scaling factor "scale” of the present disclosure can be adjusted in a number of ways.
  • the scaling factor may be adjusted according to one or more hyperparameters.
  • the scaling factor may be determined according to the data distribution of the gradient data.
  • the scaling factor may also be determined based on a preset threshold and the maximum value of the gradient data.
  • the apparatus of the present disclosure was described above in conjunction with FIG. 2 . It is to be understood that the apparatus of the present disclosure can be applied to one or more layers in a neural network.
  • The quantization circuit of the present disclosure may be configured to determine, for each layer, whether to update the quantization parameters. When the quantization circuit determines at any layer, for example based on the aforementioned quantization error, that the quantization parameters need to be updated, the adjustment circuit adjusts the scaling factor used for the scaling of the loss value.
  • FIG. 3 is an exemplary flowchart illustrating update operations in forward propagation and back propagation in neural network 300 according to an embodiment of the present disclosure.
  • the neural network 300 can be implemented to include an operation block 301 in forward propagation, a gradient update block 302 and a weight update block 303 in back propagation.
  • Although the neural network shown in FIG. 3 can be regarded as a network including only a single hidden layer (such as a convolutional layer) or only one type of operation (only the convolution operation), those skilled in the art will understand from the above and following description that the solutions of the present disclosure are also applicable to the case where the hidden layer includes multiple layers or multiple other types of operations.
  • FIG. 3 Further shown in FIG. 3 are the aforementioned multiple operators, which may specifically include a quantization operator “quantify”, a forward convolution operator “convFwd”, a weight gradient operator “convBpFilter” and an input data gradient operator “convBpData”.
  • quantify: the quantization operator
  • convFwd: the forward convolution operator
  • convBpFilter: the weight gradient operator
  • convBpData: the input data gradient operator
  • input neuron data x[fp32] and initial weights w[fp32] can be received.
  • both are floating point numbers with a length of 32 bits. It can be understood that 32 bits here is merely exemplary, and it can also be a floating point number of 16 bits or other bit widths. Both can be quantized into fixed-point numbers by a quantization operation such as that performed by the quantization circuit of the present disclosure, as previously described.
  • a quantization operator "quantify" may be implemented on the quantization circuit of the present disclosure.
  • the quantization operator may include a quantization strategy operator and a quantization parameter operator.
  • the quantization strategy operator may be at least used to determine whether to perform an update operation of a quantization parameter
  • The quantization parameter operator may be at least used to determine a quantization parameter and to use the quantization parameter to perform the quantization operation on neural network data of a high-precision data type (floating point in the example of the present disclosure).
  • The above-mentioned quantization strategy operator may be responsible for calculating the quantization error diff_bit and the quantization period trend value diff_update. Since the determination of the quantization error is of great significance to the quantization period, the adjustment of the data bit width, etc., it will be described in detail below.
  • the quantization error can be calculated by the following formula:
  • the quantized mean is:
  • the slope of the tangent line at c is k.
  • That is, the more concentrated the data distribution within a quantization interval, the greater the gap between the mean before quantization and the mean f after quantization. It is known from experiments that the more concentratedly distributed the data being quantized is, the greater the error brought to the final training result, so the difference between the mean values before and after quantization can be used to approximate the actual error that quantization brings to training.
  • When the quantization error is too large, the quantization interval should be reduced, that is, the quantization bit width should be increased.
  • Based on this theoretical basis, and after considering the influence of quantization error on training accuracy and effect, the present disclosure proposes a solution with a variable quantization period and a variable data bit width.
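  • A minimal sketch of such a mean-based error check (the relative difference of the means before and after quantization is used here as an assumed stand-in for diff_bit):

```python
def quantization_error(original, dequantized):
    # Compare the mean absolute values of the operation data before quantization
    # and after a quantize/dequantize round trip; the relative gap serves as an
    # assumed stand-in for the quantization error diff_bit.
    mean_before = sum(abs(v) for v in original) / len(original)
    mean_after = sum(abs(v) for v in dequantized) / len(dequantized)
    return abs(mean_after - mean_before) / (abs(mean_before) + 1e-12)

def should_increase_bit_width(original, dequantized, threshold=0.01):
    # When the error is too large, reduce the quantization interval,
    # i.e. increase the quantization bit width.
    return quantization_error(original, dequantized) > threshold
```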
  • The quantization parameters need not be calculated from the current data to be quantized in every generation; instead, the quantization parameters may be updated at intervals of a certain number of generations.
  • the stored quantization parameters obtained from the last update can be used when quantizing data. As long as the update interval is selected appropriately, this will not bring about loss of training accuracy, because the changes in the data to be quantized (such as weights and gradient data) during the training process are relatively stable, with certain continuity and similarity.
  • a simple way is to use a fixed update cycle, but the fixed update cycle is less adaptable, so the present disclosure also proposes an adaptive update cycle adjustment.
  • diff_update = max(diff_update1, diff_update2) (15)
  • ⁇ , ⁇ , ⁇ , ⁇ , t, and th can be hyperparameters, and ⁇ and ⁇ can be either empirical or hyperparameters.
  • conventional hyperparameter optimization methods are suitable for both ⁇ and ⁇ .
  • The input of the quantization strategy operator of the present disclosure may include the data before quantization and the data after quantization, the quantization parameters (mainly the moving average value m of shift), the quantization period I (which can be either an input or an output) and the output quantization bit width, where the quantization period and the output quantization bit width can be passed as inputs to the quantization parameter operator.
  • The input of the quantization parameter operator may include the data to be quantized, the quantization parameters (including the point position shift, the moving average value m of the point position, the scaling factor, etc.), the data bit width (representing which bit width is used for the output quantized data) and the quantization period.
  • the quantization period may be a variable that controls whether the quantization operator should calculate the quantization parameter. For example, when the quantization period is equal to 0, statistics of the quantization parameters can be performed.
  • the settings here are only exemplary, and those skilled in the art can also give other meanings to the quantization period based on the teachings herein, or use different forms to control.
  • If the quantization parameters are recomputed, the new quantization parameters need to be written to the address of the old quantization parameters; otherwise, the quantization operation will still use the old quantization parameters.
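  • Combining the points above, the period-controlled parameter update can be sketched as follows (a hypothetical wrapper; compute_shift and quantize are the helper functions sketched earlier, and the counter convention is an assumption):

```python
class QuantifyOperator:
    """Illustrative sketch of the period-controlled parameter update described
    above; the attribute names and the counter convention are assumptions."""

    def __init__(self, bit_width=8, update_interval=100):
        self.bit_width = bit_width
        self.update_interval = update_interval
        self.period = 0        # 0 means: recompute the quantization parameters now
        self.shift = None      # stored point position from the last update

    def __call__(self, data, compute_shift, quantize):
        if self.period == 0:
            # Recompute the quantization parameter and overwrite the old value.
            self.shift = compute_shift(data, self.bit_width)
            self.period = self.update_interval
        else:
            # Reuse the quantization parameter stored at the last update.
            self.period -= 1
        return quantize(data, self.shift, self.bit_width)
```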
  • The quantization operator can quantize the data of the same layer across the entire current board.
  • the quantization parameters thus obtained may or may not be subsequently synchronized between multiple machines and multiple cards.
  • synchronization is not performed, a copy of the quantitative parameters can be maintained in each board.
  • each processor core performs synchronization after calculating the quantization parameters, and comprehensively obtains the final global quantization parameters.
  • After quantization, new quantization parameters and quantized fixed-point numbers are obtained, namely the quantization parameters paramx and paramw and the quantized data x[int8] (corresponding to 8-bit fixed-point neuron data) and w[int8] (corresponding to 8-bit fixed-point weights). These four items are sent as inputs to the forward convolution operator convFwd to perform operations, so as to obtain the floating-point result y[fp16] (corresponding to 16-bit floating-point data).
  • the output of the forward convolution operator convFwd here can also be 32-bit floating point data.
  • the forward convolution operator convFwd can perform operations such as multiply-add operations on fixed-point neuron data and weight data.
  • The convolution operation can be the element-wise multiplication and summation of the corresponding image matrix and the filter at corresponding positions, with the bias b finally added to obtain a feature map as the output.
  • the convolution operator of the present disclosure can also be integrated with an inverse quantization operator, which can be implemented, for example, by an inverse quantization circuit in the quantization circuit of the present disclosure.
  • the output result can be inverse-quantized into 16-bit floating-point data y[fp16].
  • the inverse quantization operation here may involve using the aforementioned quantization parameters paramx and paramw to determine the step size during inverse quantization, that is, the step in the aforementioned formula (5), so as to inverse quantize the fixed-point number into a high-precision floating-point number.
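  • As an illustration only (the step sizes, values and the simple dot product below are assumptions; the patent's operators work on full convolutions), the quantize, fixed-point multiply-add and inverse-quantize flow looks like this:

```python
import numpy as np

def quantize_int8(data, step):
    # Forward quantization: float -> 8-bit fixed point with the given step size.
    return np.clip(np.round(data / step), -128, 127).astype(np.int8), step

# Toy neuron data x and weights w (fp32), with per-tensor step sizes standing in
# for the quantization parameters paramx / paramw (illustrative values).
x = np.array([0.50, -1.25, 2.00], dtype=np.float32)
w = np.array([0.75, 0.10, -0.40], dtype=np.float32)
x_q, paramx = quantize_int8(x, step=0.02)
w_q, paramw = quantize_int8(w, step=0.01)

# Fixed-point multiply-add in a wide accumulator, then inverse quantization of
# the result back to 16-bit floating point using both step sizes.
acc = np.sum(x_q.astype(np.int32) * w_q.astype(np.int32))
y_fp16 = np.float16(acc * paramx * paramw)
print(y_fp16)  # close to the float32 dot product of x and w (-0.55)
```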
  • the loss function LossDiff can be determined based on the training results obtained in the forward propagation process, which can be obtained, for example, by the method of formula (1) described in conjunction with FIG. 1 , and details are not repeated here.
  • Once the loss value of LossDiff is obtained, according to the solution of the present disclosure it can be scaled by the scaling circuit (i.e., the "scaling" operation shown in the figure is performed) for the operations in back propagation.
  • Equation (6) can be used to determine the scaled loss value.
  • the training process will enter into a backpropagation process, which involves a reverse gradient update block 302 and a weight update block 303 to perform the backpropagation process.
  • The present disclosure implements two operators with the update circuit, namely the weight gradient operator "convBpFilter" and the input data gradient operator "convBpData" shown in the figure.
  • the function of convBpData may be to calculate the gradient of the input neuron data x.
  • the gradient calculation formula of x can be obtained as:
  • the function of convBpFilter can be to calculate the gradient of the weight w.
  • the gradient calculation formula of w can be obtained as:
  • w, x and δ represent the weight, the input and the input gradient data from the previous layer, respectively,
  • the rot180 function represents the rotation of the data by 180 degrees.
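  • Consistent with these definitions, the standard convolution back-propagation relations (stated here as an assumption about the two formulas referenced above) are dx = conv(dy, rot180(w)), computed as a full convolution, and dw = conv(x, dy), where dy denotes the input gradient coming from the previous layer in the back-propagation direction.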
  • The input gradient data, for example a vector (equivalent to the adjusted dy[fp16]), is combined with the weight matrix of this layer in a weighted summation to calculate the output gradient vector of this layer (equivalent to dx[fp16] in the figure), which involves the inverse quantization from fixed point to floating point performed by the inverse quantization circuit.
  • The input gradient vector (equivalent to dy[fp16] in the figure) is operated with the input neuron data from the forward propagation process (for example, element-wise multiplication), and the gradient of the weights of this layer is obtained (equivalent to dw[fp32] in the figure). It can then be scaled using the previous formula (7) to obtain the actual weight gradient, and the weights of this layer (equivalent to w[fp32] in the figure) can be updated according to the obtained actual gradients.
  • The solution of the present disclosure can obtain the current input gradient data after scaling the loss value of the loss function LossDiff. This gradient data is then quantized by the quantization operator quantify to obtain 8-bit fixed-point input gradient data dy[int8] and a quantization parameter paramdy for the gradient data. Next, the weight quantization parameter paramw and the 8-bit fixed-point weights w[int8] obtained in the forward propagation process of this layer, together with the input gradient dy[int8] and the gradient quantization parameter paramdy, can be provided as inputs.
  • The weights w and the input neuron data x can reuse the data used in the forward propagation process, so only the gradient dy needs to be quantized in the back propagation process; the result of quantizing the gradient dy is the quantization parameter paramdy and the quantized data dy[int8]. There is no need to quantize the input neuron data and the weight data again, which reduces repeated quantization of the data and shortens the training time.
  • the adjustment circuit of the present disclosure will adjust the scaling factor accordingly.
  • the adjusted scaling factor can be calculated based on:
  • th in the above formula is a hyperparameter, which can be set according to the size of neuron data, for example, 512.
  • m1 can be the maximum value of the gradient data in backpropagation.
  • floor() represents the rounding down function
  • The pow(x, y) function represents x raised to the power y.
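  • Given the components described above (th, the maximum m1 of the gradient data, floor and pow), one plausible form of the adjustment, treated here as an assumption, is scale = pow(2, floor(log2(th / m1))); a sketch:

```python
import math

def adjust_scaling_factor(grad_data, th=512.0):
    # Assumed reconstruction of the adjustment rule: choose a power-of-two
    # scaling factor from th and the maximum magnitude m1 of the gradient data.
    m1 = max(abs(g) for g in grad_data)
    return math.pow(2.0, math.floor(math.log2(th / m1)))

print(adjust_scaling_factor([0.001, -0.03, 0.0002]))  # 2**14 = 16384.0
```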
  • FIG. 5 is a flowchart illustrating a method 500 for training a neural network, wherein the process of training a neural network here includes forward propagation and back propagation performed iteratively, according to an embodiment of the present disclosure.
  • the method 500 scales the loss value obtained by the forward propagation according to a scaling factor to obtain a scaled loss value.
  • method 500 performs an update operation in the backpropagation based on the scaled loss value.
  • the update operation here may include gradient update and weight update.
  • method 500 adjusts the scaling factor based on at least the gradient data in the back-propagation for scaling of the loss value in the next-generation back-propagation.
  • The output of step 506 is the adjusted scaling factor, which is fed back to step 502 for a new scaling of the loss value.
  • the method 500 can be repeatedly performed until the loss value of the neural network reaches the expected value, thereby completing the training of the neural network.
  • The solution of the present disclosure further proposes to perform a quantization operation on the operation data according to the quantization parameter after the update operation of step 504 is performed, and to determine, based on the operation data, whether to update the above-mentioned quantization parameter. In response to determining to update the quantization parameter, the scaling factor is adjusted for scaling of the loss value in the next generation of back propagation.
  • The method 500 can be performed by the device 200 described in the present disclosure in conjunction with FIG. 2, so the description of the specific operation of the device 200 in conjunction with FIG. 2 is also applicable to the steps performed by the method 500 and is not repeated here.
  • FIG. 6 is a structural diagram illustrating a combined processing apparatus 600 according to an embodiment of the present disclosure.
  • the combined processing device 600 includes a computing processing device 602, an interface device 604, other processing devices 606, and a storage device 608.
  • one or more computing devices 610 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIGS. 1-5 .
  • the computing device 610 may include the device 200 described in conjunction with FIG. 2 of the present disclosure, and perform the steps described in conjunction with FIG. 5 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • One or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure or circuit of an artificial intelligence processor core, so as to implement, for example, the various types of circuits disclosed in the present disclosure, such as the scaling circuit, the update circuit, the quantization circuit or the adjustment circuit.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • These processors may include, but are not limited to, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • The other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial-intelligence-related computing device, e.g., for neural network operations) and external data and control, performing basic controls including but not limited to data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • The storage device may be used to store data of the computing processing device and/or the other processing device. Such data may be, for example, the operation data of the present disclosure, including but not limited to pre-quantization or post-quantization neuron data, weight data and/or gradient data.
  • the data may be data that cannot be fully stored in the internal or on-chip storage of the computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 702 shown in FIG. 7).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 6 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 706 shown in FIG. 7 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
  • The chip may further integrate other processing units (such as video codecs) and interface modules (such as DRAM interfaces).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 7 .
  • FIG. 7 is a schematic structural diagram illustrating a board 700 according to an embodiment of the present disclosure.
  • the board includes a storage device 704 for storing data, which includes one or more storage units 710 .
  • The storage device can be connected to the control device 708 and the above-described chip 702 through, for example, a bus for data transmission.
  • the board also includes an external interface device 706, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 712 (such as a server or a computer, etc.).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • The present disclosure also discloses an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or more of the above-mentioned combined processing devices.
  • The electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic device or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, terminal, etc.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
  • The hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby completing the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, based on the disclosure or teachings herein, those of ordinary skill in the art will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, depending on the solution, the description of some embodiments in the present disclosure also has different emphases. In view of this, for parts that are not described in detail in a certain embodiment of the present disclosure, those skilled in the art may refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (e.g., a personal computer, a server or a network device) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disc, and other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.
  • An apparatus for training a neural network wherein training the neural network comprises forward propagation and back propagation performed iteratively, the apparatus comprising:
  • a scaling circuit configured to scale the loss value obtained by the forward propagation according to a scaling factor to obtain a scaled loss value
  • an update circuit configured to perform an update operation in the backpropagation based on the scaled loss value
  • An adjustment circuit configured to adjust the scaling factor for scaling of the loss value in next generation backpropagation based on at least the gradient data in the backpropagation.
  • adjustment circuit is further configured to adjust the scaling factor for scaling of the loss value when the quantization circuit determines to update the quantization parameter.
  • Clause A3 The apparatus of clause A2, wherein the operational data includes the gradient data, the quantization parameter includes a first point position parameter or a first bit width parameter, and the adjustment circuit is configured to adjust the scaling factor for scaling of the loss value in next generation backpropagation when the first point position parameter or the first bit width parameter is updated.
  • Clause A4 The apparatus of clause A2, wherein the operational data includes the gradient data and neuron data, the quantization parameter includes a second point position parameter or a second bit width parameter, and the adjustment circuit is configured to adjust the scaling factor for scaling of the loss value in next generation backpropagation when the second point position parameter or the second bit width parameter is updated.
  • Clause A5. The apparatus of clause A2, wherein the operational data includes the gradient data, neuron data and weight data, the quantization parameter includes a third point position parameter or a third bit width parameter, and the The adjustment circuit is configured to adjust the scaling factor for scaling of the loss value in next generation backpropagation when the third point position parameter or the third bit width parameter is updated.
  • the quantization circuit comprises a positive quantization circuit and an inverse quantization circuit
  • the positive quantization circuit is configured to quantize operational data of a high precision data type into operational data of a low precision data type according to the quantization parameter
  • the inverse quantization circuit is configured to inverse quantize the operation data of the low-precision data type into operation data of the high-precision data type according to the quantization parameter.
  • Clause A7 The apparatus of Clause A6, wherein the high precision data type is a floating point data type and the low precision data type is a fixed point data type.
  • Clause A8 The apparatus of Clause A6, wherein the update circuit is configured to utilize the scaled loss value to obtain weight gradient data in the gradient data, the inverse quantization circuit is configured to inverse quantize the weight gradient data into weight gradient data of a high precision data type, and the update circuit is configured to update the weights using the scaling factor and the weight gradient data of the high precision data type.
  • Clause A12 The apparatus of Clause A2, wherein the quantization circuit is configured to determine whether to update the quantization parameter based on a quantization error of the operational data.
  • Clause A13 The apparatus of clause A12, wherein the quantization circuit is configured to perform an operation to determine the quantization error based on a pre-quantization and post-quantization mean of the operational data.
  • Clause A14 The apparatus of any of clauses A2-A13, wherein the neural network comprises a multi-layer structure formed by a plurality of neuron connections, and wherein the quantization circuit is configured to determine for each layer whether to update the quantization parameter, and when the quantization circuit determines to update the quantization parameter at any layer, the adjustment circuit is configured to dynamically adjust the scaling factor for scaling of the loss value in the next generation of backpropagation.
  • a method for training a neural network comprising forward propagation and back propagation performed iteratively, the method comprising:
  • scaling the loss value obtained by the forward propagation according to a scaling factor to obtain a scaled loss value; performing an update operation in the backpropagation based on the scaled loss value; and adjusting the scaling factor according to at least the gradient data in the backpropagation for scaling of the loss value in the next generation of backpropagation.
  • Clause A18 The method of Clause A17, wherein the method further comprises performing a quantization operation, the quantization operation comprising: performing quantization, according to a quantization parameter, on operational data involved in the update operation; determining, based on the operational data, whether to update the quantization parameter; and, when it is determined to update the quantization parameter, adjusting the scaling factor for scaling of the loss value.
  • Clause A19 The method of Clause A18, wherein the operational data comprises the gradient data, the quantization parameter comprises a first point position parameter or a first bit width parameter, and the adjusting comprises:
  • when the first point position parameter or the first bit width parameter is updated, the scaling factor is adjusted for scaling of the loss value in next-generation backpropagation.
  • Clause A20 The method of Clause A18, wherein the operational data includes the gradient data and neuron data, the quantization parameter includes a second point position parameter or a second bit width parameter, and the adjusting includes:
  • when the second point position parameter or the second bit width parameter is updated, the scaling factor is adjusted for scaling of the loss value in next-generation backpropagation.
  • Clause A21 The method of Clause A18, wherein the operational data includes the gradient data, neuron data and weight data, the quantization parameter includes a third point position parameter or a third bit width parameter, and the adjusting includes: when the third point position parameter or the third bit width parameter is updated, the scaling factor is adjusted for scaling of the loss value in the next generation of backpropagation.
  • Clause A22 The method of Clause A18, wherein the quantization operation comprises a positive quantization operation and an inverse quantization operation, wherein the positive quantization operation comprises quantizing operational data of a high precision data type into operational data of a low precision data type according to the quantization parameter, and the inverse quantization operation comprises inverse quantizing operational data of a low precision data type into operational data of a high precision data type according to the quantization parameter.
  • Clause A23 The method of Clause A22, wherein the high precision data type is a floating point data type and the low precision data type is a fixed point data type.
  • Clause A24 The method of Clause A22, wherein the update operation in the backpropagation comprises utilizing the scaled loss value to obtain weight gradient data in the gradient data, the inverse quantization operation comprises inverse quantizing the weight gradient data into weight gradient data of a high precision data type, and the update operation in the backpropagation comprises updating the weights using the scaling factor and the weight gradient data of the high precision data type.
  • Clause A25 The method of Clause A17, wherein the adjusting comprises adjusting the scaling factor according to one or more hyperparameters.
  • Clause A27 The method of Clause A17, wherein the adjusting comprises determining the scaling factor based on a maximum value of the gradient data and a preset threshold.
  • Clause A28 The method of Clause A18, wherein the quantizing comprises determining whether to update the quantization parameter based on a quantization error of the operational data.
  • Clause A29 The method of Clause A28, wherein the quantizing comprises performing an operation based on a pre-quantization and post-quantization mean of the operational data to determine the quantization error.
  • Clause A30 The method of any of Clauses A18-A29, wherein the neural network comprises a multi-layer structure formed by a plurality of neuron connections, and wherein the method comprises: determining for each layer whether to update the quantization parameter; and, when it is determined at any layer to update the quantization parameter, dynamically adjusting the scaling factor for scaling of the loss value in the next generation of backpropagation.
  • Clause A31 A computer-readable storage medium having stored thereon a computer program for training a neural network, which when executed by one or more processors implements the method according to any of clauses A17-A30 .
  • An apparatus for training a neural network comprising:
  • at least one processor; and at least one memory having computer program code stored thereon, the at least one memory and the computer program code being configured to, with the at least one processor, cause the apparatus to perform the method according to any of clauses A17-A30.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

A device (200), method and integrated circuit board for training a neural network, wherein the device (200) is embodied as a computing device (610) included in a combined processing device (600), and the combined processing device (600) may further comprise a universal interconnection interface and other processing devices (606). The computing device (610) interacts with the other processing devices (606) to jointly complete a computing operation specified by a user. The combined processing device (600) may further comprise a storage device (608), which is connected to the computing device (610) and the other processing devices (606), respectively, and is used for storing data of the computing device (610) and the other processing devices (606). Training of the neural network can be accelerated.

Description

用于训练神经网络的方法、设备和计算机可读存储介质
相关申请的交叉引用
本申请要求于2020年11月30日提交,申请号为202011379869.3,名称为“用于训练神经网络的方法、设备和计算机可读存储介质”的中国专利申请和于2020年11月30日提交,申请号为202011379857.0,名称为“用于训练神经网络的方法、设备和计算机可读存储介质”的中国专利申请的优先权,在此将其全文引入作为参考。
技术领域
本披露一般地涉及人工智能领域。更具体地,本披露涉及用于通过硬件平台来训练神经网络的方法、设备、集成电路、板卡和计算机可读存储介质。
背景技术
随着人工智能领域技术的不断发展,如何高效地训练神经网络以获得良好的神经网络模型成为当前关注的一个焦点。现有的神经网络在训练中通常采用浮点型数据来执行运算以期获得好的训练结果。尽管浮点型数据具有相对较高的数据精度,但在训练过程中会对运行神经网络的硬件平台提出更高的硬件要求,例如更大的存储空间及更高的功耗。另外,在一些训练场景中,使用精度相对较低的定点型数据也同样可以达到与浮点型数据相同或近似的训练效果,从而使得应用浮点型数据在一些情况下并不必要。
另外,相对较低位宽的数据所能表示的数据范围有限。以16位的浮点型数据为例,其所能表示的最小精度的数值是5.9604×10-8。然而,在执行神经网络训练的反向传播时,通常有可能会得到小于5.9604×10-8的数值。换句话说,该得到的值超出了16位浮点数所能表示的范围。此时,如果依然采用16位浮点数来表示该值,则该值即为0。显然,这样的归零操作会对计算的准确性和精度产生明显的影响。
发明内容
为了解决在上文中所提到的一些或全部的问题,提供一种对神经网络进行高效训练的方式,本披露在多个方面中提供了如下的技术方案。
在一个方面中,本披露提供一种用于训练神经网络的设备,其中训练所述神经网络包括迭代执行的前向传播和反向传播,所述设备包括:缩放电路,其配置成根据缩放因子对所述前向传播获得的损失值进行缩放,以获得缩放的损失值;更新电路,其配置成基于所述缩放的损失值来执行所述反向传播中的更新操作;以及调整电路,其配置成至少根据所述反向传播中的梯度数据来调整所述缩放因子,以便用于下一代反向传播中所述损失值的缩放。
在又一个方面中,本披露提供一种集成电路,其包括如上所述并且将在下面多个实施例中讨论的设备。
在又一个方面中,本披露提供一种板卡,其包括如上所述并且将在下面多个实施例中讨论的设备。
在另一个方面中,本披露提供一种用于训练神经网络的方法,其中训练所述神经 网络包括迭代执行的前向传播和反向传播,所述方法包括:根据缩放因子对所述前向传播获得的损失值进行缩放,以获得缩放的损失值;基于所述缩放的损失值来执行所述反向传播中的更新操作;以及至少根据所述反向传播中的梯度数据来调整所述缩放因子,以便用于下一代反向传播中所述损失值的缩放。
在又一方面中,本披露提供一种用于训练神经网络的设备。该设备包括至少一个处理器。该设备还包括至少一个存储器,其存储有计算机程序代码,所述至少一个存储器和所述计算机程序代码被配置为利用所述处理器,以使得所述设备执行前述的方法和在下文所描述的该方法的多个实施例。
在一个方面中,本披露提供一种计算机可读存储介质,其存储有用于训练神经网络的计算机程序,当所述计算机程序由一个或多个处理器运行时,实现前述的方法和在下文所描述的该方法的多个实施例。
通过上述用于训练神经网络的设备、方法、集成电路、板卡和计算机可读存储介质,在训练神经网络的反向传播中,本披露的方案可以自适应地或者适时地调整损失值的大小,以便在利用低精度类型的操作数据进行反向更新操作时,免受由于低精度数据类型数据表达范围过小而导致结果产生误差的影响,由此提高了运算的准确性和精度。另外,由于本方案对损失值的调整,也使得将高精度数据量化成低精度数据以参与运算更具有适用性,由此扩展了神经网络的运算场景。进一步,由于本披露的方案支持将高精度数据量化成低精度数据以参与神经网络内的相关运算,也令计算场景不受处理器芯片所能支持的运算位数的限制,从而扩展了处理器的使用场景。另外,由于低精度数据类型的使用(例如使用定点数来进行神经网络的相关运算例如乘加操作),从而本披露的方案也加速了神经网络的训练过程并且减小了计算开销和功耗。另外,通过经本披露方案所训练的神经网络,其可以被广泛运用于图像处理、语音识别、数据采集等各类领域,也极大地改善相关领域的效率成本。
附图说明
通过结合附图,可以更好地理解本发明的上述特征,并且其众多目的,特征和优点对于本领域技术人员而言是显而易见的,其中相同的附图标记表示相同的元件,并且其中:
图1是示出可以应用本披露的技术方案的神经网络的示例性框图;
图2是示出根据本披露实施例的用于训练神经网络的设备的功能框图;
图3是示出根据本披露实施例的神经网络中前向传播以及反向传播中的更新操作的示例性流程图;
图4是示出根据本披露实施例的涉及量化误差原理的曲线图;
图5是示出根据本披露实施例的用于训练神经网络的方法的流程图;
图6是示出根据本披露实施例的一种组合处理装置的结构图;以及
图7是示出根据本披露实施例的一种板卡的结构示意图。
具体实施方式
现在将参考附图描述本发明的实施例。应当理解,为了说明的简单和清楚,在认为合适的情况下,可以在附图中重复附图标记以指示对应或类似的元件。另外,本申请阐述了许多具体细节以便提供对本文所述实施例的透彻理解。然而,本领域普通技 术人员在本公开的教导下,可以在没有这些具体细节的情况下实施本文所描述的多个实施例。在其他情况下,本方没有详细描述公知的方法、过程和组件,以免不必要地模糊本文描述的实施例。而且,该描述不应被视为限制本文描述的实施例的范围。
如前所述,本披露的方案主要应用于人工智能领域,特别是应用于对神经网络进行高效的训练中,因此为了便于理解本披露的方案,下面将首先对本披露所涉及的神经网络架构及其工作原理进行介绍。
神经网络(“Neural Network”,简称“NN”)是一种模仿生物神经网络的结构和功能的数学模型,神经网络由大量的神经元连接进行计算。因此,神经网络是一种计算模型,由大量的节点(或称“神经元”)相互连接构成。每个节点代表一种特定的输出函数,称为激活函数(“activation function”)。每两个神经元之间的连接都代表一个通过该连接信号的加权值,称之为权值,这相当于神经网络的记忆。神经网络的输出则依神经元之间的连接方式以及权值和激活函数的不同而不同。在神经网络中,神经元是神经网络的基本单位。它获得一定数量的输入和一个偏置,当信号(值)到达时会乘以一个权值。连接是将一个神经元连接到另一层或同一层的另一个神经元,连接伴随着与之相关联的权值。另外,偏置是神经元的额外输入,它始终为1,并具有自己的连接权值。这确保即使所有的输入都为空(全部为0),神经元也会激活。
在应用中,如果不对神经网络中的神经元应用一个非线性函数,神经网络只是一个线性函数而已,那么它并不比单个神经元强大。如果让一个神经网络的输出结果在0到1之间,例如,在猫狗鉴别的例子中,可以把接近于0的输出视为猫,将接近于1的输出视为狗。为了完成这个目标,在神经网络中引入激活函数,比如:sigmoid激活函数。关于这个激活函数,其返回值通常是一个介于0到1的数字。因此,激活函数用于将非线性引入神经网络,它会将神经网络运算结果缩小到较小的范围内。实际上,激活函数怎样表达并不重要,重要的是通过一些权值将一个非线性函数参数化,可以通过改变这些权值来改变这个非线性函数。
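As an illustration of the weighted-sum-plus-bias neuron and the sigmoid activation described above, the following minimal Python sketch (not taken from the disclosure; the input values and weights are purely illustrative) computes the output of a single neuron:

```python
import numpy as np

def sigmoid(z):
    # Squash the pre-activation into the range (0, 1), as described for the sigmoid activation.
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    # Weighted sum of the inputs plus the bias, followed by the nonlinear activation.
    return sigmoid(np.dot(w, x) + b)

# Example with three inputs, matching the three input neurons x1, x2, x3 of Figure 1.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 1.0
print(neuron_output(x, w, b))
```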
图1是示出可以应用本披露的技术方案的神经网络100的示例性框图。如图1中所示,该神经网络100包括输入层和输出层以及位于该输入层和输出层之间的多个隐藏层,在图1中示例性示为卷积层、激活层、池化层和全连接层。
输入层的神经元被称为输入神经元,在本例中绘出3个输入神经元,其接收3个输入信号x1,x2,x3。输入层作为神经网络中的第一层,接受需要输入信号(值)并将它们传递到下一层。通常情况下,输入层不会对输入信号(值)做操作,并且没有关联的权值和偏置。对于特定的神经网络,例如卷积神经网络,其输入层可以处理多维数据。常见地,一维卷积神经网络的输入层接收一维或二维数组,其中一维数组通常为时间或频谱采样;二维数组可以包含多个通道;二维卷积神经网络的输入层接收二维或三维数组;三维卷积神经网络的输入层接收四维数组,以此类推。在一些特定的应用场景中,也可以在输入层处对数据进行预处理操作,例如可以对数据去均值、归一化及降维等操作。
隐藏层包含用于对输入数据应用不同变换的神经元(节点)。在图1所示出的神经网络中包括了四个隐藏层,即包括4个神经元(节点)的卷积层、4个神经元的激活层、2个神经元的池化层、6个神经元的全连接层。最后,由全连接层的运算值传递给输出层。输出层的神经元被称为输出神经元。输出层接收来自最后一个隐藏层的 输出。在图1所示的神经网络中,输出层有2个神经元,有2个输出信号y1和y2。从所示出的隐藏层可以看出,基于特定的隐藏层,每个隐藏层的每一个神经元可以或可以不与下一层的每一个神经元进行连接,例如激活层和池化层的神经元是部分连接,而池化层与全连接层之间是全连接。
下面对于本例中的示例性隐藏层进行简要的描述。需要理解的是这里关于上述各个隐藏层的描述仅仅是示例性的而非限制性的,本披露的技术方案并不受图1所示神经网络隐藏层结构的限制,并且本领域技术人员根据本披露的教导可以对图1所示出的神经网络结构进行修改,例如根据应用的需要增加一个或多个层,或去除图1所示结构中的一个或多个层,而这些操作依然涵盖于本披露所涵盖的技术方案之内。
作为本例中的第一个隐藏层-卷积层,其功能通常是对输入数据进行特征提取,其内部可以包含多个卷积核,组成卷积核的每个元素可以对应于一个权重系数和一个偏差量,类似于一个前馈神经网络的神经元。当处理图片数据时,在卷积层中,图片中的每一个特征首先局部感知,然后更高层次地对局部进行综合操作,从而得到全局信息。卷积层参数可以包括卷积核大小、步长和填充,三者共同决定卷积层输出特征图的尺寸,是卷积神经网络的超参数。在应用中,卷积层内每个神经元都与前一层中位置接近的区域的多个神经元相连,该区域的大小取决于卷积核的大小。卷积核在工作时,会有规律地扫过输入特征,对输入特征做矩阵元素乘法求和(乘加)并叠加偏差量。
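The scan-and-multiply-accumulate behaviour of the convolution kernel described above can be sketched as follows (a single-channel, unit-stride, no-padding toy example; it is only an illustration, not the convFwd operator of the disclosure):

```python
import numpy as np

def conv2d_single(x, k, b=0.0):
    # Slide the kernel over the input; at each position, multiply element-wise,
    # sum the products (multiply-accumulate) and add the bias.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return out

x = np.arange(25, dtype=np.float32).reshape(5, 5)    # toy "image"
k = np.ones((3, 3), dtype=np.float32) / 9.0          # toy convolution kernel (filter)
print(conv2d_single(x, k, b=0.1))
```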
接收上述卷积层输出的激活层实际上是对卷积层的输出结果做一次非线性映射。常用的激励函数有:Sigmoid函数、Tanh函数、ReLU函数、Leaky、ReLU函数、ELU函数及Maxout函数等。通过这些激活函数之后,前一层的输出将变得相对复杂,从而提升了神经网络模型的表达能力。
池化层主要用于特征降维,压缩数据和参数的数量,减小过拟合,同时提高模型的容错性。通常,池化方法主要包括最大池化和平均池化。在卷积层进行特征提取并且经激活层处理后,输出的特征图会被传递至池化层进行特征选择和信息过滤。池化层包含预设定的池化函数,其功能是将特征图中单个点的结果替换为其相邻区域的特征图统计量。池化层选取池化区域与卷积核扫描特征图步骤相同,该步骤可以由池化大小、步长和填充进行控制。
经过前面的卷积+激活+池化后,神经网络的信号处理流程到达全连接层,其位于本例神经网络隐藏层的最后部分。特征图在全连接层中会失去空间拓扑结构,被展开为向量并通过激励函数输出。全连接层可以对提取的特征进行非线性组合以得到输出,即全连接层本身不被期望具有特征提取能力,而是试图利用现有的高阶特征完成学习目标。另外在全连接层还可以进行局部归一化(LRN)、数据增强等操作,以便增加神经网络的鲁棒性。
尽管在图1中未示出,在神经网络中的每一层处具有与该层关联的一个或多个算子(将结合图2来具体描述)以执行相对应的计算操作。算子在神经网络中是一个函数空间到函数空间上的映射。广义上来讲,对任何函数进行某一项操作都可以认为是一个算子。简言之,算子可以是映射、关系或者变换。例如,对于卷积层(或其他需要执行卷积操作的层)存在卷积算子,该卷积算子可以具体化为一个或多个卷积计算公式的表达。通过利用该卷积算子将输入数据与卷积核进行计算,可以获得卷积操作 后的结果值。
上面对图1所示出神经网络结构及其节点的功能进行了示例性的描述。在实际应用中,为了获得良好的神经网络模型,预先会提供大量的样本数据(包含输入和输出)对初始神经网络进行训练。训练完成后,就获得训练后的神经网络。该受训后的神经网络可以对于将来的真实环境的输入给出一个正确的输出。
在开始讨论神经网络的训练之前,需要定义损失函数。损失函数是一个衡量神经网络在执行某个特定任务的表现函数。在有些实施例中,损失函数可以如此得到:在训练某神经网络过程中,对每一个样本数据,都沿着神经网络传递得到输出值,然后将这个输出值与期望值做差再求平方,这样计算出来的损失函数就是预测值与真实值之间的距离,而训练神经网络目的就是将这个距离或损失函数的取值减小。在某些实施例中,损失函数可以表示为:
L = (1/m) × Σ_{i=1..m} (y_i − ŷ_i)^2    (1)
In the above formula, y denotes the expected value, ŷ_i denotes the actual result obtained for each sample in the sample data set through the neural network, i is the index of each sample in the sample data set, (y_i − ŷ_i) represents the error between the expected value y and the actual result ŷ_i, and m is the number of samples in the sample data set.
以猫狗鉴别的实际应用场景为例。假定一个数据集由猫和狗的图片组成,如果图片是狗,对应的标签是1,如果图片是猫,对应的标签是0。这个标签就是对应上述公式中的期望值y,在向神经网络传递每一张样本图片的时候,实际是想通过神经网络来获得识别结果,即图片中的动物是猫还是狗。为了计算损失函数,必须遍历样本数据集中的每一张样本图片,获得每一张样本图片对应的实际结果
ŷ_i，
然后按照上面的定义计算损失函数。如果损失函数的数值(简称“损失值”)比较大,例如超过一个预定的阈值,则说明神经网络还没有训练好,此时就需要借助于前述的反向传播过程来对权值进一步调整。
在开始训练神经网络的时候,需要对权值进行随机初始化。在大多数的情况下,初始化的神经网络并不能提供一个很好的训练结果。在训练的过程中,假设以一个很糟糕的神经网络开始,通过训练可以得到一个具有高准确率的网络。
神经网络的训练过程分为两个阶段,第一阶段是信号的正向处理操作(在本披露中称为“前向传播过程”),训练从输入层经过隐藏层,最后到达输出层。第二阶段是反向传播梯度操作(在本披露中称为后向传播过程),训练从输出层到隐藏层,最后到输入层,根据梯度依次调节神经网络中每层的权值和偏置。
在前向传播过程中,将输入值输入到神经网络的输入层,经过多个隐藏层的相关算子执行的相应运算,可以从神经网络的输出层得到所谓的预测值的输出。当输入值提供给神经网络的输入层时,其可以不进行任何操作或依应用场景做一些必要的预处理。在隐藏层中,第二个隐藏层从第一个隐藏层获取预测中间结果值并进行计算操作和激活操作,然后将得到的预测中间结果值传递给下一个隐藏层。在后面的层中执行相同的操作,最后在神经网络的输出层得到输出值。在经过前向传播过程的正向处理后,通常可以得到一个被称为预测值的输出值。为了计算误差,可以将预测值与实际输出值进行比较,获得对应的误差值。
在反向传播过程中,可以使用微分学的链式法则来对各层的权值进行更新,以期 在下一次的前向传播过程中获得相对于前次较低的误差值。在链式法则中,首先计算对应神经网络的最后一层权值的误差值的导数(称这些导数为梯度)。然后,使用这些梯度来计算神经网络中的倒数第二层的梯度。重复此过程,直到得到神经网络中每个权值对应的梯度。最后,将神经网络中每个权值减去对应的梯度,从而对权值进行一次更新,以达到减少误差值的目的。与前向传播过程中利用各类算子(在本披露中称为前向算子)类似,在对应的反向传播过程中也存在与前向传播过程中的前向算子相对应的反向算子。例如,对于前述卷积层中的卷积算子,其包括前向传播过程中的前向卷积算子和反向传播过程中的反卷积算子。
在神经网络进行训练的过程中,神经网络每经过一次信号的正向处理的前向传播过程以及对应一次误差的反向传播过程,神经网络中的权值利用梯度进行一次更新,此时称为一次迭代(iteration)。为了获得精度符合预期的神经网络,在训练过程中需要很庞大的样本数据集,而一次性将样本数据集输入进计算设备(例如计算机)几乎是不可能的。因此,为了解决这个问题,需要将样本数据集划分成多个块,按块传递给计算机,每块数据集经过前向传播过程的正向处理后,对应进行一次反向传播过程中的更新神经网络的权值操作。当一个完整的样本数据集通过了神经网络一次正向处理并且对应返回了一次权值更新,这个过程称为一个周期(epoch)。实际中,在神经网络中传递一次完整的数据集是不够的,需要将完整的数据集在同一神经网络中传递多次,即需要多个周期,最终获得精度符合预期的神经网络。
在神经网络进行训练的过程中,通常用户希望训练的速度越快越好,并且准确率越高越好,但这样的期望通常受神经网络数据的数据类型的影响。在很多应用场景中,神经网络的数据通过高精度数据格式表示(例如浮点数)。以前向传播过程中的卷积操作和反向传播过程中的反向卷积操作为例,当在计算设备中央处理单元(“CPU”)和图形处理单元(“GPU”)上执行这两项操作时,为了确保数据精度,几乎所有的输入、权重和梯度都是浮点类型数据。
以浮点类型格式作为高精度数据格式为例,根据计算机体系结构可知,基于浮点数的运算表示法则、定点数的运算表示法则,对于同样长度的浮点运算和定点运算来说,浮点运算计算模式更为复杂,需要更多的逻辑器件来构成浮点运算器。这样从体积上来说,浮点运算器的体积比定点运算器的体积要大。进一步,浮点运算器需要消耗更多的资源去处理,使得定点运算和浮点运算二者之间的功耗差距通常是数量级的,由此造成显著的计算成本差异。然而,根据实验发现,定点运算比浮点运算执行速度快,而且精度损失并不大,因此在人工智能芯片中采用定点运算处理大量的神经网络运算(例如卷积和全连接运算)是可行的方案。例如,可以将涉及前向卷积、前向全连接、反向卷积和反向全连接算子的输入、权重和梯度的浮点型数据均进行量化后进行定点数运算,并且在算子运算完成后将低精度的数据转换成高精度数据。
以量化对象是神经网络的权值、且量化后的权值均为8-bit定点数(相对于浮点数的低精度类型)为例,由于一个神经网络中常常有数百万连接,几乎所有空间都被神经元连接的权值所占据,并且这些权值有可能都是不同的浮点数。每层权值都趋向于某个确定区间的正态分布,例如(-3.0,3.0)。将神经网络中每层的权值对应的最大值和最小值保存下来,将每个浮点数值采用8-bit定点数表示。其中,在最大值、最小值范围内区间线性划分256个量化间隔,每个量化间隔用一个8-bit定点数表示。例如: 在(-3.0,3.0)区间内,字节0表示-3.0,字节255表示3.0。以此类推,字节128表示0。
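The per-layer 8-bit mapping described in this paragraph can be sketched as follows. This is an illustrative linear min-max mapping only; the function names and the handling of the degenerate constant-data case are assumptions, and it is distinct from the shift-based quantization introduced next.

```python
import numpy as np

def quantize_minmax_uint8(w):
    # Save the per-layer minimum and maximum, then map the range linearly onto 256 levels,
    # so that byte 0 represents w_min and byte 255 represents w_max.
    w_min, w_max = float(w.min()), float(w.max())
    step = (w_max - w_min) / 255.0 or 1.0   # avoid division by zero for constant data
    q = np.round((w - w_min) / step).astype(np.uint8)
    return q, w_min, step

def dequantize_minmax_uint8(q, w_min, step):
    # Recover approximate floating-point weights from the byte codes.
    return (w_min + q.astype(np.float32) * step).astype(np.float32)

w = np.random.uniform(-3.0, 3.0, 1000).astype(np.float32)   # weights roughly in (-3.0, 3.0)
q, w_min, step = quantize_minmax_uint8(w)
w_hat = dequantize_minmax_uint8(q, w_min, step)
print(np.abs(w - w_hat).max())   # reconstruction error is at most about half a step
```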
在执行量化操作过程中,通常会涉及到两个量化参数:shift和n,其中shift是定点数点的位置(即本披露的“点位置参数”),n是定点数比特位宽(即本披露的“位宽参数”),n初始可以人为设定,shift则通过待量化数据的分布范围和n,利用下式来计算得到:
shift = ceil( log2( Z / (2^(n-1) - 1) ) )    (2)
where Z is the maximum absolute value of the data F to be quantized, that is, Z = max(|F|). Using F to denote the floating-point data before quantization and I to denote the quantized n-bit fixed-point number, the fixed-point conversion from F to I satisfies:
F ≈ I × 2^shift    (3)
where step = 2^shift is the quantization step size (the minimum quantization interval), and the resulting fixed-point number I can be expressed as:
I = round( F / 2^shift )    (4)
When the fixed-point number obtained from quantization needs to be converted back into a floating-point number, an inverse quantization operation can be performed, and the inverse-quantized value F̂ can be expressed as:
F̂ = round( F / 2^shift ) × 2^shift    (5)
可以看出,通过上述量化的定点数有利于神经网络的加速训练、减小芯片体积并显著减小计算开销。特别地,可以在训练神经网络的前向传播中对神经元数据和权值数据进行上述的量化操作,并且在训练神经网络的反向传播中对用于更新操作的梯度数据进行量化操作。当在上述量化操作中引入一些相关算子(稍后结合图3来详细描述)时,可以优化量化操作,从而在充分利用量化操作的同时尽量不引入过多的量化开销,由此加速神经网络训练、提高训练精度并减小计算开销。
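A minimal Python sketch of the shift-based quantization and inverse quantization of equations (2) to (5) is given below. The rounding mode and the clamping to the signed n-bit range are assumptions made for illustration; the text itself only fixes the relations shown in the equations.

```python
import numpy as np

def compute_shift(F, n):
    # Equation (2): choose the point position so that max(|F|) fits into n-bit fixed point.
    Z = np.max(np.abs(F))
    return int(np.ceil(np.log2(Z / (2 ** (n - 1) - 1))))

def quantize(F, shift, n):
    # Equation (4): I = round(F / 2^shift), clamped to the signed n-bit range (clamping assumed).
    I = np.round(F / (2.0 ** shift))
    return np.clip(I, -(2 ** (n - 1)), 2 ** (n - 1) - 1).astype(np.int32)

def dequantize(I, shift):
    # Equation (5): F_hat = I * 2^shift, so that F ≈ I * 2^shift as in equation (3).
    return I.astype(np.float32) * (2.0 ** shift)

F = np.random.randn(4).astype(np.float32)
n = 8
shift = compute_shift(F, n)
I = quantize(F, shift, n)
print(shift, I, dequantize(I, shift))
```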
如前所述,尽管上述的量化操作给神经网络的训练带来明显的技术优势,但在一些应用场景中,当损失值由相对低精度的数据类型来表达时,例如以前面例子的16位浮点数(其相对于32位、64位或更高位的浮点数来说相对精度较低)来表达时,由于数值表达范围的限制,必然对反向传播中的更新操作产生不良的影响。为此,本披露提出了一种有效的损失值调整方案以克服上述缺陷,提高神经网络的训练准确性,从而加速训练。
图2是示出根据本披露实施例的用于训练神经网络的设备200的功能框图。如图2所示,该设备200包括缩放电路202、更新电路204和调整电路206。在一个或多个实施例中,设备200还可以包括量化电路208。根据本披露的方案,缩放电路可以配置成根据缩放因子对训练神经网络过程中的前向传播获得的损失值进行缩放,以获得缩放的损失值。通过将损失值进行放大,可以有效地避免由于低精度数据的数值表达范围受限而造成后续的更新操作出现错误,从而导致整个训练过程的延缓和低效。为了便于理解,假定本披露的损失值表示为loss,则可以利用下式来对其进行缩放:
Loss_scale=loss×scale     (6)
在上面的式(6)中,scale表示本披露的缩放因子,而Loss_scale表示缩放后的缩放因子。
在对损失值执行缩放操作后,更新电路可以配置成基于上述缩放的损失值来执行 反向传播中的更新操作。在一个实施例中,这里的更新操作可以涉及对权值的更新以及在反向传播方向上从前一层向后一层传送的梯度数据更新。如前所述,反向传播的更新操作中涉及到各种类型数据的量化操作,例如前文描述的从高精度数据(如浮点数)到低精度数据(如定点数)的量化操作和从低精度数据到高精度数据的反量化操作。为此,设备200的量化电路208可以配置成根据量化参数来对操作数据执行量化操作,以及基于所述操作数据来确定是否对前述的量化参数进行更新。进一步,调整电路可以配置成至少根据反向传播中的梯度数据来调整缩放因子,以便用于下一代反向传播中所述损失值的缩放。在一个实施场景中,调整电路还可以配置成在量化电路确定更新量化参数时,调整前述的缩放因子以用于下一代反向传播中损失值的缩放。在一些实施例中,可以根据反向传播中的梯度数据来调整该缩放因子。
根据不同的应用场景,上述的操作数据可以包括神经网络训练过程内的各种类型的数据。例如,在一个实施例中,所述操作数据可以包括梯度数据。基于此,量化参数可以包括应用于量化梯度数据的第一点位置参数或第一位宽参数,并且调整电路可以配置成当所述第一点位置参数或第一位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
在一个实施例中,所述操作数据可以包括梯度数据和神经元数据,所述量化参数可以包括用于梯度数据和神经元数据的第二点位置参数或第二位宽参数。基于此,本披露的调整电路可以配置成当第二点位置参数或第二位宽参数更新时,调整缩放因子以用于下一代反向传播中所述损失值的缩放。
在一个实施例中,所述操作数据可以包括梯度数据、神经元数据和权值数据三者,所述量化参数包括用于三者的第三点位置参数或第三位宽参数,并且调整电路可以配置成当第三点位置参数或第三位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
根据计算场景的不同,量化操作可以具有不同的执行方式。鉴于此,本披露的量化电路可以包括正量化电路和反量化电路,其中正量化电路可以配置成根据量化参数将高精度数据类型(例如浮点型)的操作数据量化成低精度数据类型(例如定点型)的操作数据,而反量化电路可以配置成根据量化参数将低精度数据类型的操作数据反量化成高精度数据类型的操作数据。在一些实施例中,除了执行量化操作以外,本披露的量化电路还可以对是否进行量化参数的更新做出判断。为此,在一个实施例中,量化电路可以配置成根据操作数据的量化误差来确定是否对量化参数进行更新。在一个实施场景中,本披露的量化电路可以配置成基于操作数据在量化前和量化后的均值来执行运算,以确定前述的量化误差。
在反向传播的更新操作中,关键性的任务是如何有效和准确地更新权值数据。为此,本披露的更新电路可以配置成利用缩放的损失值来获得权值梯度数据。相应地,上述量化电路中的反量化电路可以配置成将所述权值梯度数据从低精度数据类型(例如定点数)反量化成高精度数据类型(例如浮点数)的权值梯度数据。进一步,更新电路可以配置成利用缩放因子和高精度数据类型的权值梯度数据对权值进行更新。例如,当得到的损失值是16位长度的浮点数时,可以通过缩放因子将其放大后进行反向更新操作(其中包括量化成定点数进行计算的操作),并且在针对于权值梯度的计算时,可以将其转换成高精度数据(例如32位长度的浮点数)。然后,可以以缩放 因子成反比的形式来缩小该32位长度的浮点数,以计算权值梯度数据,从而对权值进行更新。关于该量化和更新过程,下文还将做进一步的描述。
以前述式(6)为基础,当利用Loss_scale进行反传时(包括对相应层的操作数据进行量化操作和计算),可以得到用于各层权值更新的权值梯度w_grad。接着,为了保证计算的精度,可以将w_grad转换成高精度类型数据(例如将其从相对低精度的16位浮点数转换成相对高精度的32位浮点数),并且通过下式(7)计算权值梯度的实际值,并且利用该实际值来对权值进行更新:
W_grad_real=w_grad/scale    (7)
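Equations (6) and (7) can be put together as the following sketch. It is schematic only: `backward_fn` stands for whatever routine produces the low-precision weight gradient from the scaled loss, and is not an API of any particular framework.

```python
import numpy as np

def scaled_update_step(loss, weights, backward_fn, scale, lr=0.01):
    # Equation (6): enlarge the loss so that small gradients stay inside the
    # representable range of the low-precision data type used in backpropagation.
    loss_scaled = loss * scale
    # Backpropagate with the scaled loss; the resulting weight gradient may have been
    # computed with low-precision (e.g. fp16 / int8) arithmetic.
    w_grad = backward_fn(loss_scaled)
    # Convert to high precision, then apply equation (7) to recover the real gradient.
    w_grad_real = np.asarray(w_grad, dtype=np.float32) / scale
    # Ordinary gradient-descent update with the unscaled, high-precision gradient.
    return weights - lr * w_grad_real
```

For instance, calling this step with scale=1024.0 would multiply the loss by 1024 before backpropagation and divide the resulting weight gradient by the same factor before the weights are updated.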
就本披露的缩放因子“scale”而言,可以通过多种方式来对其进行调整。在一个实施例中,可以根据一个或多个超参数来对缩放因子来进行调整。在另一个实施例中,可以根据梯度数据的数据分布来确定该缩放因子。在一个实施例中,还可以基于预设的阈值和梯度数据的最大值来确定缩放因子。
上面结合图2对本披露的设备进行了描述。需要理解的是本披露的设备可以应用于神经网络中的一层或多层。特别地,当神经网络是多层的层结构时,即神经网络包括多个中间的隐藏层时,本披露的量化电路可以配置成针对每一层来确定是否对量化参数进行更新。当量化电路在任意一层处确定需要更新量化参数时,例如基于前述的量化误差确定需要更新量化参数时,此时调整电路将相应地动态调整缩放因子,以用于下一代反向传播中的损失值的缩放。
图3是示出根据本披露实施例的神经网络300中前向传播和反向传播中的更新操作的示例性流程图。
如图3中的虚框所示,该神经网络300可以实施为包括前向传播中的运算块301,反向传播过程中的梯度更新块302和权值更新块303。为了便于理解和描述本披露的方案,图3所示出的神经网络可以视为仅包括单个的隐藏层(例如卷积层)的网络或仅包括一类操作(仅卷积操作)的网络,而本领域技术人员根据上文和下面的描述,将理解本披露的方案同样地也适用于隐藏层包括多个层或多种其他类型操作的情形。
进一步示出于图3中的是前述的多个算子,其具体可以包括量化算子“quantify”,前向卷积算子“convFwd”,权重梯度算子“convBpFilter”和输入数据梯度算子“convBpData”。下面将按照训练神经网络的前向传播过程和反向传播过程(包括权值更新和梯度更新的操作)的顺序来描述图3中的流程。这里需要指出的是,在图3中,“x”表示输入神经元数据、“w”表示权值、“dx”表示输入梯度、“dy”表示输出梯度,“[]”内的内容表示具体的数据类型,“paramx”表示神经元数据的量化参数,“paramw”表示权值的量化参数,而“paramdy”表示梯度的量化参数。
首先,可以接收输入的神经元数据x[fp32]和初始的权值w[fp32]。如括号中所示出的,二者都是具有32位长度的浮点数。可以理解的是这里32位仅仅是示例性的,其也可以是16位或其他位宽的浮点数。如前所述,通过例如本披露的量化电路执行的量化操作,可以将二者量化为定点数。为此,在一个实施例中,可以在本披露的量化电路上实现量化算子“quantify”。该量化算子可以包括量化策略算子和量化参数算子。在一个实施例中,所述量化策略算子可以至少用于确定是否执行量化参数的更新操作,而所述量化参数算子可以至少用于确定量化参数,并使用所述量化参数对所述高精度数据类型(在本披露的例子中是浮点数)的神经网络数据执行所述量化操作。
在一些应用场景中,上述的量化策略算子可以负责计算量化误差diff bit、和量化周期趋势值diff update。由于量化误差的确定对于量化周期、数据位宽的调整等方面具有重要意义,下面将对其进行具体描述。
Assume in one scenario that the data to be quantized is F = [f_1, f_2, ..., f_m], and that the data obtained after quantizing with n-bit fixed-point numbers is F̂ = [f̂_1, f̂_2, ..., f̂_m]. The quantization error diff_bit can then be computed by equations (8) and (9), which measure the gap between the mean of the data before quantization and the mean of the data after quantization.
当diff bit大于阈值th时,则可以考虑将量化位宽增加t位,从而新的量化位宽为n=n+t,这里th和t都是可变的超参数。
可以看出,上面量化误差的确定涉及利用均值函数mean()的计算,而该量化误差的计算方式具有如下的意义:
如图4中所示出的曲线1和2的两种浮点数数据分布,假设其中一个量化间隔为[a,b],[a,c]之间的浮点数量化到a,[c,b]之间的浮点数量化到b,假定数据满足高斯分布P(x)~G(0,σ),则量化前的均值为
mean_f = ∫_a^b x·P(x) dx    (10)
量化后的均值为:
mean_f̂ = a·∫_a^c P(x) dx + b·∫_c^b P(x) dx    (11)
从图4中可以看出,分布在c处的切线斜率为k。经过推导与近似计算,|K|越大(即分布越集中)并且量化间隔越大,则mean f
mean_f̂
的差距越大。通过实验可以知道,分布越集中的数据量化后对训练最终结果带来的误差越大,所以可以使用量化前后的均值的差距来模拟量化给训练带来的实际误差。为了保持该误差不增大,在量化分布更为集中的数据(|K|)时,应使量化间隔减小,即增加量化位宽。本披露正是基于这样的理论基础,在考虑了量化误差对训练精度和效果所带来的影响后,提出可变的量化周期和数据位宽的方案。
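The effect discussed around Figure 4 can be checked with a small numerical experiment (illustrative only; the interval boundaries and σ values are arbitrary choices): restricted to a single quantization interval [a, b] with decision point c, a more concentrated Gaussian distribution and a wider interval both enlarge the gap between the pre-quantization mean and the post-quantization mean.

```python
import numpy as np

def interval_mean_gap(sigma, a, b, c, num_samples=1_000_000):
    # Sample Gaussian data G(0, sigma) and keep only values falling in one interval [a, b].
    x = np.random.normal(0.0, sigma, num_samples)
    x = x[(x >= a) & (x <= b)]
    # Quantize: values in [a, c) are mapped to a, values in [c, b] are mapped to b.
    x_hat = np.where(x < c, a, b)
    # Gap between the mean before quantization and the mean after quantization.
    return abs(x.mean() - x_hat.mean())

for sigma in (0.3, 3.0):              # concentrated vs. spread-out data
    for width in (0.5, 2.0):          # narrow vs. wide quantization interval
        a, b = 1.0, 1.0 + width
        print(sigma, width, interval_mean_gap(sigma, a, b, c=(a + b) / 2))
```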
为减小训练过程中计算量化参数所带来的计算消耗,可以不每代都根据当前待量化数据计算量化参数,而是间隔一定代数进行量化参数更新。在不更新的代里,量化数据时可以使用存储下来的上一次更新得到的量化参数。只要更新间隔选择合适,这并不会带来训练精度损失,这是因为在训练过程中待量化数据(例如权值和梯度数据)的变化是相对稳定的,具有一定连续性和相似性。一种简单的方式是使用固定的更新周期,但是固定的更新周期适应性较差,因此本披露还提出自适应的更新周期调整。
设间隔“Interval”代(即量化周期)来更新量化参数,其计算方法如下:
首先引入shift随着训练迭代周期的滑动平均值m
m^(i) ← α×shift + (1−α)×m^(i−1)    (12)
引入衡量shift变化趋势的diff update1:
diff_update1 = |m^(i) − m^(i−1)|    (13)
diff update1越大,则说明数值范围变化越激烈,需要更高的更新频率,即Interval越小。
衡量定点位宽n变化趋势diff update2
(Equation (14): diff_update2, a measure of the trend of the fixed-point bit width n.)
diff update2越大则越需要更大的量化位宽,需要更新位宽,间隔频率更高。
将上述的两种衡量同时考虑,则得到前述的量化周期趋势值diff update如下:
diff_update = max(diff_update1, diff_update2)    (15)
最后计算得到Interval:
(Equation (16): the update interval Interval, computed from diff_update; the larger diff_update is, the smaller Interval becomes.)
在上面的等式中,α、β、γ、δ、t和th可以是超参数,并且β和γ既可以为经验值,也可以为超参数。另外,常规的超参数的优化方法均适于β和γ。
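Equations (12), (13) and (15) translate directly into code; the exact forms of diff_update2 (equation (14)) and of Interval (equation (16)) are not reproduced above, so the sketch below fills them in with assumed illustrative forms (δ·|diff_bit| and β/diff_update − γ) purely to make the control flow concrete. The hyperparameter values are likewise placeholders.

```python
def quantization_update_interval(shift, m_prev, diff_bit,
                                 alpha=0.9, beta=20.0, gamma=2.0, delta=1.0):
    # Equation (12): sliding average of the point position over training iterations.
    m = alpha * shift + (1.0 - alpha) * m_prev
    # Equation (13): how strongly the numeric range of the data is drifting.
    diff_update1 = abs(m - m_prev)
    # ASSUMED form of equation (14): a bit-width trend measure driven by the quantization error.
    diff_update2 = delta * abs(diff_bit)
    # Equation (15): take the stronger of the two trends.
    diff_update = max(diff_update1, diff_update2)
    # ASSUMED form of equation (16): the stronger the drift, the shorter the update interval.
    interval = max(1, int(beta / diff_update - gamma)) if diff_update > 0 else int(beta)
    return interval, m
```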
上面对量化策略算子如何计算量化误差diff bit和量化周期趋势值diff update进行了详细地描述。在一个实现场景中,本披露的量化策略算子的输入可以包括量化前的数据和量化后的数据、量化参数(主要用到shift的滑动平均值m)、量化周期I(既可以是输入,也可以是输出)和输出量化位宽,其中量化周期和输出量化位宽可以作为输入传递给量化参数算子。
进一步,量化参数算子的输入可以包括待量化数据、量化参数(包括点数shift、点数的滑动平均值m、缩放系数等)、数据位宽(表示输出的量化后数据采用哪种位宽)以及量化周期。在一些应用场景中,量化周期可以是控制量化算子是否要计算量化参数的变量。例如,当量化周期等于0时,可以执行量化参数的统计。当然,这里的设置也仅仅是示例性的,并且本领域技术人员基于本文的教导也可以赋予量化周期其他的含义,或采用不同的形式来进行控制。在另外一些应用场景中,如果对量化参数进行了统计,则还需把新的量化参数更新到旧的量化参数的地址。否则,此次量化操作将依然采用旧的量化参数。
根据不同的实现方式或应用场景,量化算子可以对当前整个板卡内同一层数据进行量化。由此得到的量化参数既可以接着在多机多卡之间同步,也可以不进行这样的同步。当不进行同步时,每个板卡内可以维护一份量化参数。附加地,每个处理器核在计算出量化参数后进行同步,综合得到最终的全局的量化参数。
返回到图3的处理流程,再经过上述的量化算子的量化操作后,将获得新的量化参数和量化后的定点数,即量化参数“paramx”、“paramw”和量化后的数据“x[int8]”(对应8位定点型的神经元)、w[int8](对应8位定点型的权重),并且将该四项作为输入送入到前向卷积算子convFwd中执行运算,以获得浮点型的结果y[fp16](对应16位浮点型数据)。在一些场景中,此处前向卷积算子convFwd的输出也可以是32位浮点型数据。在该卷积运算过程中,前向卷积算子convFwd可以对定点型的神经元数 据和权值数据执行例如乘加操作的运算。例如,当输入的神经元数据是图像数据而对应的权值是卷积核(filter),则卷积操作可以是对应的图像矩阵和filter的对应位置元素的相乘再求和,最后再加上偏置b,从而获得一个特征图作为输出结果。为了保持输出的数据仍是浮点型数据,本披露的卷积算子也可以融合有反量化算子,其例如可以通过本披露的量化电路中的反量化电路来实施。由此,可以将输出结果反量化为16位的浮点型数据y[fp16]。这里的反量化操作可以涉及利用前述的量化参数paramx和paramw来确定反量化时的步长,即前述式(5)中的step,从而将定点数反量化为高精度的浮点数。
如前所述,基于前向传播过程所获得的训练结果,可以确定损失函数LossDiff,其例如可以通过结合图1所描述的式(1)的方式来获得,此处不再赘述。在获得该LossDiff的损失值后,根据本披露的方案,可以利用缩放电路对其进行缩放(即执行图中所示出的“scaling”操作),以用于反向传播中的运算。例如,可以使用式(6)来确定缩放后的损失值。
接着,训练流程将进入到反向传播过程,这其中涉及反向梯度更新块302和权值更新块303来执行反向传播过程。为此,本披露通过更新电路来实施两个算子,即图中所示出的权重梯度算子“convBpFilter”和输入数据梯度算子“convBpData”。在一个或多个实施例中,convBpData的功能可以是计算输入神经元数据x的梯度。根据链式求导法则推导,可以得到x的梯度计算公式为:
(Equation (17): the gradient computation formula for the input x, derived by the chain rule.)
进一步,convBpFilter的功能可以是计算权重w的梯度,根据链式求导法则推导,可得w的梯度计算公式为:
(Equation (18): the gradient computation formula for the weight w, derived by the chain rule.)
在上面的两个公式中,w,x,δ分别表示权重、输入以及来自上一层的输入梯度数据,
⊛
表示卷积操作,rot180函数则表示将数据旋转180度。
下面对反向传播过程所涉及的具体操作进行示例性描述,以便于理解反向梯度更新块和权值更新块中所涉及到的操作。
在反向传播过程中,对于包括两层或两层以上的多个神经元的神经网络来说,针对每一层,首先对输入梯度数据例如向量(相当于图中的经调整后的dy[fp16])和本层的权值矩阵进行加权求和,以计算出本层的输出梯度向量(相当于图中的dx[fp16]),其中涉及到由反量化电路执行的从定点数到浮点数的反量化操作。另外,该输入梯度向量(相当于图中的dy[fp16])与前向传播过程时的输入神经元数据进行运算(例如对位相乘),可以得到本层权值的梯度(相当于图中的dw[fp32])。接着,可以利用前式(7)对其进行缩放来获得实际的权值梯度。然后,可以根据所得到的本层权值的实际梯度来更新本层的权值(相当于图中的w[fp32])。
基于上述的处理过程,本披露的方案在通过对损失函数LossDiff的损失值进行缩放后,可以获得当前的输入梯度数据。接着,将其经过量化算子quantify进行量化,以得到8位定点型输入梯度数据dy[int8]和关于该梯度数据的量化参数paramdy。接着,可以将本层在前向传播过程中的对应权值量化参数paramw、8位定点型权值w[int8]、连同前述获得的输入梯度dy[int8]和梯度量化参数paramdy作为输入送入到输入数据 梯度算子convBpData进行运算,以获得本层的输出梯度dx[fp16]作为反向传播方向上的下一层(如果存在的话)的输入梯度数据(即dy[fp16])。进一步,可以将输入梯度dy[int8]、梯度数据的量化参数paramdy连同本层前向传播过程中的对应量化参数paramx和前述量化后的神经元数据x[int8]作为输入送入到权重梯度算子convBpFilter,以获得本层的权值梯度dw[fp32]。接着,通过求解器solver,可以基于dw[fp32](经如前所述的相应缩放后)来计算得到本层更新的权值w[fp32],以用于下一次前向传播过程中的运算。
通过上面的描述,本领域技术人员可以理解,在反向传播过程中,权重w和输入神经元数据x可以复用前向传播过程中使用的数据,因此反向传播过程中只需要量化梯度dy,梯度dy量化后的结果是量化参数paramdy和量化数据dy[int8],而无需对输入的神经元数据和权重数据再次量化,从而减少了数据的多次量化,缩短了训练的时间。
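Putting the data flow of the reverse-propagation blocks of Figure 3 together, one backward step through a convolution layer can be sketched as below. This is schematic only: `quantify`, `conv_bp_data` and `conv_bp_filter` are passed in as callables standing for the operators of the same names in the figure, and their exact signatures are assumptions rather than the disclosed interfaces.

```python
def backward_conv_layer(dy_fp16, x_int8, param_x, w_int8, param_w, w_fp32,
                        scale, lr, quantify, conv_bp_data, conv_bp_filter):
    # Only the incoming gradient is quantized here; the neuron data x and the weights w
    # reuse the quantized data and quantization parameters saved from the forward pass.
    dy_int8, param_dy = quantify(dy_fp16)

    # Output gradient of this layer, dequantized to fp16; it becomes the incoming
    # gradient dy of the next layer in the backpropagation direction.
    dx_fp16 = conv_bp_data(dy_int8, param_dy, w_int8, param_w)

    # Weight gradient, produced in high precision (fp32) for the update.
    dw_fp32 = conv_bp_filter(dy_int8, param_dy, x_int8, param_x)

    # Equation (7): undo the loss scaling, then apply the solver (here plain gradient descent).
    w_fp32 = w_fp32 - lr * (dw_fp32 / scale)
    return dx_fp16, w_fp32
```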
上面结合图3对本披露的训练方案及其所涉及的量化操作进行了详细的描述。如前所述,当本披露的量化电路在反向传播过程中确定需要对量化参数进行更新时,则本披露的调整电路将相应地对缩放因子进行调整。在一个实施例中,可以基于下式来计算调整的缩放因子:
t = floor( log2( th / m1 ) )    (19)
scale_new=scale×pow(2,t)   (20)
其中上式中的th是超参数,其可以例如根据神经元数据的大小来设置,例如为512。m1可以是反向传播中梯度数据的最大值。floor()表示向下取整函数,而pow(x,y)函数表示求x的y次幂(或次方)。
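Equations (19) and (20) can be written directly as the following sketch (th = 512 is only the example value mentioned in the text, and m1 is the maximum of the gradient data):

```python
import math

def adjust_scaling_factor(scale, m1, th=512.0):
    # Equation (19): number of powers of two between the threshold and the largest gradient.
    t = math.floor(math.log2(th / m1))
    # Equation (20): shift the scaling factor by that many powers of two.
    return scale * math.pow(2.0, t)

# Example: very small gradients enlarge the scaling factor for the next iteration,
# while gradients larger than th would shrink it (t becomes negative).
print(adjust_scaling_factor(scale=1024.0, m1=1e-3))
```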
图5是示出根据本披露实施例的用于训练神经网络的方法500的流程图,其中这里训练神经网络的过程包括迭代执行的前向传播和反向传播。如图5中所示,在步骤502处,方法500根据缩放因子对所述前向传播获得的损失值进行缩放,以获得缩放的损失值。接着,在步骤504处,方法500基于所述缩放的损失值来执行所述反向传播中的更新操作。在一个实施例中,这里的更新操作可以包括梯度更新和权值更新。在步骤506处,方法500至少根据所述反向传播中的梯度数据来调整所述缩放因子,以便用于下一代反向传播中所述损失值的缩放。根据前文结合图3的描述,本领域技术人员可以理解量化操作穿插于前述的更新操作中,从而有利地加速更新操作的执行。进一步,步骤506的输出即为调整后的缩放因子,并且该缩放因子被反馈至步骤502,以用于对损失值进行新的缩放。可以看出,可以反复执行方法500,直至神经网络的损失值达到预期值,从而完成神经网络的训练。在一个实施例中,作为步骤506的替代或补充,本披露的方案还提出在执行步骤504的更新操作后,根据量化参数对操作数据执行量化操作,以及基于所述操作数据来确定是否对所述量化参数进行更新。响应于确定对量化参数进行更新,调整缩放因子以用于下一代反向传播中损失值的缩放。
基于上面的描述,本领域技术人员可以理解方法500可以由本披露结合图2所描述的设备200来执行,因此结合图2对设备200具体操作的描述也适用于方法500所执行的步骤,此处不再赘述。
图6是示出根据本披露实施例的一种组合处理装置600的结构图。如图6中所示,该组合处理装置600包括计算处理装置602、接口装置604、其他处理装置606和存 储装置608。根据不同的应用场景,计算处理装置中可以包括一个或多个计算装置610,该计算装置可以配置用于执行本文结合附图1-5所描述的操作。特别地,在一些应用场景中,计算装置610可以包括本披露结合图2所描述的设备200,并且执行结合图5所描述的步骤。
在不同的实施例中,本披露的计算处理装置可以配置成执行用户指定的操作。在示例性的应用中,该计算处理装置可以实现为单核人工智能处理器或者多核人工智能处理器。类似地,包括在计算处理装置内的一个或多个计算装置可以实现为人工智能处理器核或者人工智能处理器核的部分硬件结构或电路,以实现例如本披露所公开的各类电路,例如缩放电路、更新电路、量化电路或调整电路。当多个计算装置实现为人工智能处理器核或人工智能处理器核的部分硬件结构时,就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。
在示例性的操作中,本披露的计算处理装置可以通过接口装置与其他处理装置进行交互,以共同完成用户指定的操作。根据实现方式的不同,本披露的其他处理装置可以包括中央处理器(Central Processing Unit,CPU)、图形处理器(Graphics Processing Unit,GPU)、人工智能处理器等通用和/或专用处理器中的一种或多种类型的处理器。这些处理器可以包括但不限于数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算处理装置和其他处理装置共同考虑时,二者可以视为形成异构多核结构。
在一个或多个实施例中,该其他处理装置可以作为本披露的计算处理装置(其可以具体化为人工智能例如神经网络运算的相关运算装置)与外部数据和控制的接口,执行包括但不限于数据搬运、对计算装置的开启和/或停止等基本控制。在另外的实施例中,其他处理装置也可以和该计算处理装置协作以共同完成运算任务。
在一个或多个实施例中,该接口装置可以用于在计算处理装置与其他处理装置间传输数据和控制指令。例如,该计算处理装置可以经由所述接口装置从其他处理装置中获取输入数据,写入该计算处理装置片上的存储装置(或称存储器)。进一步,该计算处理装置可以经由所述接口装置从其他处理装置中获取控制指令,写入计算处理装置片上的控制缓存中。替代地或可选地,接口装置也可以读取计算处理装置的存储装置中的数据并传输给其他处理装置。
附加地或可选地,本披露的组合处理装置还可以包括存储装置。如图中所示,该存储装置分别与所述计算处理装置和所述其他处理装置连接。在一个或多个实施例中,存储装置可以用于保存所述计算处理装置和/或所述其他处理装置的数据,该数据例如可以是本披露的操作数据,包括但不限于量化前或量化后的神经元数据、权值数据和/或梯度数据。在一些实施例中,该数据可以是在计算处理装置或其他处理装置的内部或片上存储装置中无法全部保存的数据。
在一些实施例里,本披露还公开了一种芯片(例如图7中示出的芯片702)。在一种实现中,该芯片是一种系统级芯片(System on Chip,SoC),并且集成有一个或 多个如图6中所示的组合处理装置。该芯片可以通过对外接口装置(如图7中示出的对外接口装置706)与其他相关部件相连接。该相关部件可以例如是摄像头、显示器、鼠标、键盘、网卡或wifi接口。在一些应用场景中,该芯片上可以集成有其他处理单元(例如视频编解码器)和/或接口模块(例如DRAM接口)等。在一些实施例中,本披露还公开了一种芯片封装结构,其包括了上述芯片。在一些实施例里,本披露还公开了一种板卡,其包括上述的芯片封装结构。下面将结合图7对该板卡进行详细地描述。
图7是示出根据本披露实施例的一种板卡700的结构示意图。如图7中所示,该板卡包括用于存储数据的存储器件704,其包括一个或多个存储单元710。该存储器件可以通过例如总线等方式与控制器件708和上文所述的芯片702进行连接和数据传输。进一步,该板卡还包括对外接口装置706,其配置用于芯片(或芯片封装结构中的芯片)与外部设备712(例如服务器或计算机等)之间的数据中继或转接功能。例如,待处理的数据可以由外部设备通过对外接口装置传递至芯片。又例如,所述芯片的计算结果可以经由所述对外接口装置传送回外部设备。根据不同的应用场景,所述对外接口装置可以具有不同的接口形式,例如其可以采用标准PCIE接口等。
在一个或多个实施例中,本披露板卡中的控制器件可以配置用于对所述芯片的状态进行调控。为此,在一个应用场景中,该控制器件可以包括单片机(Micro Controller Unit,MCU),以用于对所述芯片的工作状态进行调控。
根据上述结合图6和图7的描述,本领域技术人员可以理解本披露也公开了一种电子设备或装置,其可以包括一个或多个上述板卡、一个或多个上述芯片和/或一个或多个上述组合处理装置。
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺 序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行划分,而实际实现时也可以有另外的划分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如CPU、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
虽然本文已经示出和描述了本披露的多个实施例,但对于本领域技术人员显而易见的是,这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本 披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中,可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围,并因此覆盖这些权利要求范围内的等同或替代方案。
依据以下条款可更好地理解前述内容:
条款A1、一种用于训练神经网络的设备,其中训练所述神经网络包括迭代执行的前向传播和反向传播,所述设备包括:
缩放电路,其配置成根据缩放因子对所述前向传播获得的损失值进行缩放,以获得缩放的损失值;
更新电路,其配置成基于所述缩放的损失值来执行所述反向传播中的更新操作;以及
调整电路,其配置成至少根据所述反向传播中的梯度数据来调整所述缩放因子,以便用于下一代反向传播中所述损失值的缩放。
条款A2、根据条款A1所述的设备,其中所述设备还包括量化电路,其配置成:
根据量化参数对参与所述更新操作的操作数据执行量化操作;以及
基于所述操作数据来确定是否对所述量化参数进行更新,
其中所述调整电路还配置成在所述量化电路确定更新所述量化参数时,调整所述缩放因子,以用于所述损失值的缩放。
条款A3、根据条款A2所述的设备,其中所述操作数据包括所述梯度数据,所述量化参数包括第一点位置参数或第一位宽参数,并且所述调整电路配置成当所述第一点位置参数或所述第一位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
条款A4、根据条款A2所述的设备,其中所述操作数据包括所述梯度数据和神经元数据,所述量化参数包括第二点位置参数或第二位宽参数,并且所述调整电路配置成当所述第二点位置参数或所述第二位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
条款A5、根据条款A2所述的设备,其中所述操作数据包括所述梯度数据、神经元数据和权值数据,所述量化参数包括第三点位置参数或第三位宽参数,并且所述调整电路配置成当所述第三点位置参数或所述第三位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
条款A6、根据条款A2所述的设备,其中所述量化电路包括正量化电路和反量化电路,其中所述正量化电路配置成根据所述量化参数将高精度数据类型的操作数据量化成低精度数据类型的操作数据,而所述反量化电路配置成根据所述量化参数将低精度数据类型的操作数据反量化成高精度数据类型的操作数据。
条款A7、根据条款A6所述的设备,其中所述高精度数据类型是浮点型数据类型并且所述低精度数据类型是定点型数据类型。
条款A8、根据条款A6所述的设备,其中所述更新电路配置成利用所述缩放的损失值来获得所述梯度数据中的权值梯度数据,并且所述反量化电路配置成将所述权值梯度数据反量化成高精度数据类型的权值梯度数据,并且所述更新电路配置成利用所述缩放因子和所述高精度数据类型的权值梯度数据对权值进行更新。
条款A9、根据条款A1所述的设备,其中所述调整电路配置成根据一个或多个 超参数来调整所述缩放因子。
条款A10、根据条款A1所述的设备,其中所述调整电路配置成根据所述梯度数据的数据分布来确定所述缩放因子。
条款A11、根据条款A1所述的设备,其中所述调整电路配置成根据所述梯度数据的最大值和预设的阈值来确定所述缩放因子。
条款A12、根据条款A2所述的设备,其中所述量化电路配置成根据所述操作数据的量化误差来确定是否对所述量化参数进行更新。
条款A13、根据条款A12所述的设备,其中所述量化电路配置成基于所述操作数据在量化前和量化后的均值来执行运算,以确定所述量化误差。
条款A14、根据条款A2-A13的任意一项所述的设备,其中所述神经网络包括由多个神经元连接形成的多层结构,并且其中所述量化电路配置成针对每一层来确定是否对所述量化参数进行更新,以及当所述量化电路在任意一层处确定更新所述量化参数时,所述调整电路配置成动态地调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
条款A15、一种集成电路,其包括根据条款A1-A14的任意一项所述的设备。
条款A16、一种板卡,其包括根据条款A1-A14的任意一项所述的设备。
条款A17、一种用于训练神经网络的方法,其中训练所述神经网络包括迭代执行的前向传播和反向传播,所述方法包括:
根据缩放因子对所述前向传播获得的损失值进行缩放,以获得缩放的损失值;
基于所述缩放的损失值来执行所述反向传播中的更新操作;以及
至少根据所述反向传播中的梯度数据来调整所述缩放因子,以便用于下一代反向传播中所述损失值的缩放。
条款A18、根据条款A17所述的方法,其中所述方法还包括执行量化操作,该量化操作包括:
根据量化参数对参与所述更新操作的操作数据执行量化;以及
基于所述操作数据来确定是否对所述量化参数进行更新,
其中在确定对所述量化参数进行更新时,调整所述缩放因子以用于所述损失值的缩放。
条款A19、根据条款A18所述的方法,其中所述操作数据包括所述梯度数据,所述量化参数包括第一点位置参数或第一位宽参数,并且所述调整包括:
当所述第一点位置参数或所述第一位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
条款A20、根据条款A18所述的方法,其中所述操作数据包括所述梯度数据和神经元数据,所述量化参数包括第二点位置参数或第二位宽参数,并且所述调整包括:
当所述第二点位置参数或所述第二位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
条款A21、根据条款A18所述的方法,其中所述操作数据包括所述梯度数据、神经元数据和权值数据,所述量化参数包括第三点位置参数或第三位宽参数,并且所述调整包括:
当所述第三点位置参数或所述第三位宽参数更新时,调整所述缩放因子以用于 下一代反向传播中所述损失值的缩放。
条款A22、根据条款A18所述的方法,其中所述量化操作包括正量化操作和反量化操作,其中所述正量化操作包括根据所述量化参数将高精度数据类型的操作数据量化成低精度数据类型的操作数据,而所述反量化操作包括根据所述量化参数将低精度数据类型的操作数据反量化成高精度数据类型的操作数据。
条款A23、根据条款A22所述的方法,其中所述高精度数据类型是浮点型数据类型并且所述低精度数据类型是定点型数据类型。
条款A24、根据条款A22所述的方法,其中所述反向传播中的所述更新操作包括利用所述缩放的损失值来获得所述梯度数据中的权值梯度数据,并且所述反量化操作包括将所述权值梯度数据反量化成高精度数据类型的权值梯度数据,并且所述反向传播中的更新操作包括利用所述缩放因子和所述高精度数据类型的权值梯度数据对权值进行更新。
条款A25、根据条款A17所述的方法,其中所述调整包括根据一个或多个超参数来调整所述缩放因子。
条款A26、根据条款A17所述的方法,其中所述调整包括根据所述梯度数据的数据分布来确定所述缩放因子。
条款A27、根据条款A17所述的方法,其中所述调整包括根据所述梯度数据的最大值和预设的阈值来确定所述缩放因子。
条款A28、根据条款A18所述的方法,其中所述量化包括根据所述操作数据的量化误差来确定是否对所述量化参数进行更新。
条款A29、根据条款A28所述的方法,其中所述量化包括基于所述操作数据在量化前和量化后的均值来执行运算,以确定所述量化误差。
条款A30、根据条款A18-A29的任意一项所述的方法,其中所述神经网络包括由多个神经元连接形成的多层结构,并且其中所述方法包括:
针对每一层来确定是否对所述量化参数进行更新;以及
当在任意一层处确定更新所述量化参数时,动态地调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
条款A31、一种计算机可读存储介质,其存储有用于训练神经网络的计算机程序,当所述计算机程序由一个或多个处理器运行时实现根据条款A17-A30的任意一项所述的方法。
条款A32、一种用于训练神经网络的设备,包括:
至少一个处理器;
至少一个存储器,其存储有计算机程序代码,所述至少一个存储器和所述计算机程序代码被配置为利用所述处理器,以使得所述设备执行根据条款A17-A30的任意一项所述的方法。
应当理解,本披露的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明仅用于帮助理解本披露的方法及其核心思想。同时,本领域技术人员依据本披露的思想,基于本披露的具体实施方式及应用范围上做出的改变或变形之处,都属于本披露保护的范围。综上所述,本说明书内容不应理解为对本披露的限制。

Claims (32)

  1. 一种用于训练神经网络的设备,其中训练所述神经网络包括迭代执行的前向传播和反向传播,所述设备包括:
    缩放电路,其配置成根据缩放因子对所述前向传播获得的损失值进行缩放,以获得缩放的损失值;
    更新电路,其配置成基于所述缩放的损失值来执行所述反向传播中的更新操作;以及
    调整电路,其配置成至少根据所述反向传播中的梯度数据来调整所述缩放因子,以便用于下一代反向传播中所述损失值的缩放。
  2. 根据权利要求1所述的设备,其中所述设备还包括量化电路,其配置成:
    根据量化参数对参与所述更新操作的操作数据执行量化操作;以及
    基于所述操作数据来确定是否对所述量化参数进行更新,
    其中所述调整电路还配置成在所述量化电路确定更新所述量化参数时,调整所述缩放因子,以用于所述损失值的缩放。
  3. 根据权利要求2所述的设备,其中所述操作数据包括所述梯度数据,所述量化参数包括第一点位置参数或第一位宽参数,并且所述调整电路配置成当所述第一点位置参数或所述第一位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  4. 根据权利要求2所述的设备,其中所述操作数据包括所述梯度数据和神经元数据,所述量化参数包括第二点位置参数或第二位宽参数,并且所述调整电路配置成当所述第二点位置参数或所述第二位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  5. 根据权利要求2所述的设备,其中所述操作数据包括所述梯度数据、神经元数据和权值数据,所述量化参数包括第三点位置参数或第三位宽参数,并且所述调整电路配置成当所述第三点位置参数或所述第三位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  6. 根据权利要求2所述的设备,其中所述量化电路包括正量化电路和反量化电路,其中所述正量化电路配置成根据所述量化参数将高精度数据类型的操作数据量化成低精度数据类型的操作数据,而所述反量化电路配置成根据所述量化参数将低精度数据类型的操作数据反量化成高精度数据类型的操作数据。
  7. 根据权利要求6所述的设备,其中所述高精度数据类型是浮点型数据类型并且所述低精度数据类型是定点型数据类型。
  8. 根据权利要求6所述的设备,其中所述更新电路配置成利用所述缩放的损失值来获得所述梯度数据中的权值梯度数据,并且所述反量化电路配置成将所述权值梯度数据反量化成高精度数据类型的权值梯度数据,并且所述更新电路配置成利用所述缩放因子和所述高精度数据类型的权值梯度数据对权值进行更新。
  9. 根据权利要求1所述的设备,其中所述调整电路配置成根据一个或多个超参数来调整所述缩放因子。
  10. 根据权利要求1所述的设备,其中所述调整电路配置成根据所述梯度数据的数据分布来确定所述缩放因子。
  11. 根据权利要求1所述的设备,其中所述调整电路配置成根据所述梯度数据的 最大值和预设的阈值来确定所述缩放因子。
  12. 根据权利要求2所述的设备,其中所述量化电路配置成根据所述操作数据的量化误差来确定是否对所述量化参数进行更新。
  13. 根据权利要求12所述的设备,其中所述量化电路配置成基于所述操作数据在量化前和量化后的均值来执行运算,以确定所述量化误差。
  14. 根据权利要求2-13的任意一项所述的设备,其中所述神经网络包括由多个神经元连接形成的多层结构,并且其中所述量化电路配置成针对每一层来确定是否对所述量化参数进行更新,以及当所述量化电路在任意一层处确定更新所述量化参数时,所述调整电路配置成动态地调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  15. 一种集成电路,其包括根据权利要求1-14的任意一项所述的设备。
  16. 一种板卡,其包括根据权利要求1-14的任意一项所述的设备。
  17. 一种用于训练神经网络的方法,其中训练所述神经网络包括迭代执行的前向传播和反向传播,所述方法包括:
    根据缩放因子对所述前向传播获得的损失值进行缩放,以获得缩放的损失值;
    基于所述缩放的损失值来执行所述反向传播中的更新操作;以及
    至少根据所述反向传播中的梯度数据来调整所述缩放因子,以便用于下一代反向传播中所述损失值的缩放。
  18. 根据权利要求17所述的方法,其中所述方法还包括执行量化操作,该量化操作包括:
    根据量化参数对参与所述更新操作的操作数据执行量化;以及
    基于所述操作数据来确定是否对所述量化参数进行更新,
    其中在确定对所述量化参数进行更新时,调整所述缩放因子以用于所述损失值的缩放。
  19. 根据权利要求18所述的方法,其中所述操作数据包括所述梯度数据,所述量化参数包括第一点位置参数或第一位宽参数,并且所述调整包括:
    当所述第一点位置参数或所述第一位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  20. 根据权利要求18所述的方法,其中所述操作数据包括所述梯度数据和神经元数据,所述量化参数包括第二点位置参数或第二位宽参数,并且所述调整包括:
    当所述第二点位置参数或所述第二位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  21. 根据权利要求18所述的方法,其中所述操作数据包括所述梯度数据、神经元数据和权值数据,所述量化参数包括第三点位置参数或第三位宽参数,并且所述调整包括:
    当所述第三点位置参数或所述第三位宽参数更新时,调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  22. 根据权利要求18所述的方法,其中所述量化操作包括正量化操作和反量化操作,其中所述正量化操作包括根据所述量化参数将高精度数据类型的操作数据量化成低精度数据类型的操作数据,而所述反量化操作包括根据所述量化参数将低精度数据类型的操作数据反量化成高精度数据类型的操作数据。
  23. 根据权利要求22所述的方法,其中所述高精度数据类型是浮点型数据类型并且所述低精度数据类型是定点型数据类型。
  24. 根据权利要求22所述的方法,其中所述反向传播中的所述更新操作包括利用所述缩放的损失值来获得所述梯度数据中的权值梯度数据,并且所述反量化操作包括将所述权值梯度数据反量化成高精度数据类型的权值梯度数据,并且所述反向传播中的更新操作包括利用所述缩放因子和所述高精度数据类型的权值梯度数据对权值进行更新。
  25. 根据权利要求17所述的方法,其中所述调整包括根据一个或多个超参数来调整所述缩放因子。
  26. 根据权利要求17所述的方法,其中所述调整包括根据所述梯度数据的数据分布来确定所述缩放因子。
  27. 根据权利要求17所述的方法,其中所述调整包括根据所述梯度数据的最大值和预设的阈值来确定所述缩放因子。
  28. 根据权利要求18所述的方法,其中所述量化包括根据所述操作数据的量化误差来确定是否对所述量化参数进行更新。
  29. 根据权利要求28所述的方法,其中所述量化包括基于所述操作数据在量化前和量化后的均值来执行运算,以确定所述量化误差。
  30. 根据权利要求18-29的任意一项所述的方法,其中所述神经网络包括由多个神经元连接形成的多层结构,并且其中所述方法包括:
    针对每一层来确定是否对所述量化参数进行更新;以及
    当在任意一层处确定更新所述量化参数时,动态地调整所述缩放因子以用于下一代反向传播中所述损失值的缩放。
  31. 一种计算机可读存储介质,其存储有用于训练神经网络的计算机程序,当所述计算机程序由一个或多个处理器运行时实现根据权利要求17-30的任意一项所述的方法。
  32. 一种用于训练神经网络的设备,包括:
    至少一个处理器;
    至少一个存储器,其存储有计算机程序代码,所述至少一个存储器和所述计算机程序代码被配置为利用所述处理器,以使得所述设备执行根据权利要求17-30的任意一项所述的方法。
PCT/CN2021/119122 2020-11-30 2021-09-17 用于训练神经网络的方法、设备和计算机可读存储介质 WO2022111002A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011379857.0A CN114580624A (zh) 2020-11-30 2020-11-30 用于训练神经网络的方法、设备和计算机可读存储介质
CN202011379857.0 2020-11-30
CN202011379869.3A CN114580625A (zh) 2020-11-30 2020-11-30 用于训练神经网络的方法、设备和计算机可读存储介质
CN202011379869.3 2020-11-30

Publications (1)

Publication Number Publication Date
WO2022111002A1 true WO2022111002A1 (zh) 2022-06-02

Family

ID=81753983

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119122 WO2022111002A1 (zh) 2020-11-30 2021-09-17 用于训练神经网络的方法、设备和计算机可读存储介质

Country Status (1)

Country Link
WO (1) WO2022111002A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894469A (zh) * 2023-09-11 2023-10-17 西南林业大学 端边云计算环境中的dnn协同推理加速方法、设备及介质
CN117454948A (zh) * 2023-12-25 2024-01-26 福建亿榕信息技术有限公司 一种适用于国产硬件的fp32模型转换方法
CN117556208A (zh) * 2023-11-20 2024-02-13 中国地质大学(武汉) 多模态数据的智能卷积通用网络预测方法、设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121796A1 (en) * 2016-11-03 2018-05-03 Intel Corporation Flexible neural network accelerator and methods therefor
CN110070119A (zh) * 2019-04-11 2019-07-30 北京工业大学 一种基于二值化深度神经网络的手写数字图像识别分类方法
CN110073371A (zh) * 2017-05-05 2019-07-30 辉达公司 用于以降低精度进行深度神经网络训练的损失缩放
CN110428042A (zh) * 2018-05-01 2019-11-08 国际商业机器公司 往复地缩放神经元的连接权重和输入值来挫败硬件限制
CN110619392A (zh) * 2019-09-19 2019-12-27 哈尔滨工业大学(威海) 一种面向嵌入式移动设备的深度神经网络压缩方法
CN111612147A (zh) * 2020-06-30 2020-09-01 上海富瀚微电子股份有限公司 深度卷积网络的量化方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121796A1 (en) * 2016-11-03 2018-05-03 Intel Corporation Flexible neural network accelerator and methods therefor
CN110073371A (zh) * 2017-05-05 2019-07-30 辉达公司 用于以降低精度进行深度神经网络训练的损失缩放
CN110428042A (zh) * 2018-05-01 2019-11-08 国际商业机器公司 往复地缩放神经元的连接权重和输入值来挫败硬件限制
CN110070119A (zh) * 2019-04-11 2019-07-30 北京工业大学 一种基于二值化深度神经网络的手写数字图像识别分类方法
CN110619392A (zh) * 2019-09-19 2019-12-27 哈尔滨工业大学(威海) 一种面向嵌入式移动设备的深度神经网络压缩方法
CN111612147A (zh) * 2020-06-30 2020-09-01 上海富瀚微电子股份有限公司 深度卷积网络的量化方法

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894469A (zh) * 2023-09-11 2023-10-17 西南林业大学 端边云计算环境中的dnn协同推理加速方法、设备及介质
CN116894469B (zh) * 2023-09-11 2023-12-15 西南林业大学 端边云计算环境中的dnn协同推理加速方法、设备及介质
CN117556208A (zh) * 2023-11-20 2024-02-13 中国地质大学(武汉) 多模态数据的智能卷积通用网络预测方法、设备及介质
CN117556208B (zh) * 2023-11-20 2024-05-14 中国地质大学(武汉) 多模态数据的智能卷积通用网络预测方法、设备及介质
CN117454948A (zh) * 2023-12-25 2024-01-26 福建亿榕信息技术有限公司 一种适用于国产硬件的fp32模型转换方法

Similar Documents

Publication Publication Date Title
WO2022111002A1 (zh) 用于训练神经网络的方法、设备和计算机可读存储介质
CN109949255B (zh) 图像重建方法及设备
CN111652368A (zh) 一种数据处理方法及相关产品
WO2019238029A1 (zh) 卷积神经网络系统和卷积神经网络量化的方法
CN111027691B (zh) 用于神经网络运算、训练的装置、设备及板卡
WO2022067508A1 (zh) 一种神经网络加速器、加速方法以及装置
KR20190107766A (ko) 계산 장치 및 방법
US20220108150A1 (en) Method and apparatus for processing data, and related products
TW202022798A (zh) 處理卷積神經網路的方法
CN113238989A (zh) 将数据进行量化的设备、方法及计算机可读存储介质
WO2021082725A1 (zh) Winograd卷积运算方法及相关产品
CN114781618A (zh) 一种神经网络量化处理方法、装置、设备及可读存储介质
CN113238987B (zh) 量化数据的统计量化器、存储装置、处理装置及板卡
WO2022012233A1 (zh) 一种量化校准方法、计算装置和计算机可读存储介质
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
CN112085175A (zh) 基于神经网络计算的数据处理方法和装置
CN117351299A (zh) 图像生成及模型训练方法、装置、设备和存储介质
CN112561050B (zh) 一种神经网络模型训练方法及装置
WO2023109748A1 (zh) 一种神经网络的调整方法及相应装置
CN114580625A (zh) 用于训练神经网络的方法、设备和计算机可读存储介质
US20230259780A1 (en) Neural network sparsification apparatus and method and related product
CN114580624A (zh) 用于训练神经网络的方法、设备和计算机可读存储介质
WO2021082746A1 (zh) 运算装置及相关产品
Ren et al. Hardware implementation of KLMS algorithm using FPGA
CN114692865A (zh) 一种神经网络量化训练方法、装置及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896506

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896506

Country of ref document: EP

Kind code of ref document: A1