CN117332830A - Identifying one or more quantization parameters for quantizing values to be processed by the neural network

Info

Publication number
CN117332830A
Authority
CN
China
Prior art keywords
layer
input
output
quantization
implementation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310791759.5A
Other languages
Chinese (zh)
Inventor
S·塞法尔瓦伊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Imagination Technologies Ltd
Original Assignee
Imagination Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB2216947.8A external-priority patent/GB202216947D0/en
Application filed by Imagination Technologies Ltd filed Critical Imagination Technologies Ltd
Publication of CN117332830A publication Critical patent/CN117332830A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The present application relates to identifying one or more quantization parameters for quantizing values to be processed by a neural network. A computer-implemented method of identifying quantization parameters comprises: (a) determining an output of a model of the NN, the model including quantization blocks, each quantization block transforming a set of values into a corresponding fixed point number format defined by quantization parameters before the model processes the set of values according to a layer of the NN; (b) determining a cost metric of the NN, the cost metric being a combination of an error metric and an implementation metric, the implementation metric representing an implementation cost of the NN based on the quantization parameters according to which the sets of values have been transformed, the implementation metric depending on, for each layer of the NN: a first contribution representing an implementation cost of the output of the layer; and a second contribution representing an implementation cost of the output of a layer preceding the layer; (c) back-propagating the derivative of the cost metric to at least one quantization parameter to generate a gradient of the cost metric for the at least one quantization parameter; and (d) adjusting the at least one quantization parameter based on the gradient.

Description

Identifying one or more quantization parameters for quantizing values to be processed by the neural network
Cross reference to related applications
The present application claims priority from UK patent application 2209612.7 filed on 30 June 2022, which is incorporated herein by reference in its entirety. The present application also claims priority from UK patent application 2216948.6 filed on 14 November 2022, which is incorporated herein by reference in its entirety. The present application also claims priority from UK patent application 2209616.8 filed on 30 June 2022, which is incorporated herein by reference in its entirety. The present application also claims priority from UK patent application 2216947.8 filed on 14 November 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to identifying one or more quantization parameters for quantizing values to be processed by a neural network.
Background
Neural Networks (NNs) are a form of artificial network comprising a plurality of interconnected layers that may be used in machine learning applications. In particular, an NN may be used in signal processing applications, including, but not limited to, image processing applications and computer vision applications. FIG. 1 illustrates an example NN 100 comprising a plurality of layers 102-1, 102-2, 102-3. Each layer 102-1, 102-2, 102-3 receives input activation data, which is processed according to the layer to produce output data. The output data is provided to another layer as input activation data or output as the final output data of the NN. For example, in the NN 100 of FIG. 1, the first layer 102-1 receives the raw input activation data 104 for the NN 100 and processes the input activation data according to the first layer 102-1 to produce output data. The output data of the first layer 102-1 becomes the input activation data for the second layer 102-2, which processes the input activation data according to the second layer 102-2 to generate output data. The output data of the second layer 102-2 becomes the input activation data for the third layer 102-3, which processes the input activation data according to the third layer 102-3 to generate output data. The output data of the third layer 102-3 is output as the output data 106 of the NN.
The processing performed on the activation data input to the layer depends on the type of layer. For example, each layer of NN may be one of a plurality of different types. Example NN layer types include, but are not limited to: convolution layer, activation layer, normalization layer, pooling layer and full connection layer. It will be apparent to those skilled in the art that these are example NN layer types and that this is not an exhaustive list and that other NN layer types may exist.
In the convolution layer, the activation data input to the layer is convolved with the weight data input to the layer. The output of convolving the activation data with the weight data may optionally be combined with one or more offset biases input to the convolution layer.
FIG. 2A shows an example overview of the data format in a convolutional layer of an NN. The activation data input to the convolutional layer includes a plurality of data values. Referring to FIG. 2A, the activation data input to the convolution layer may have dimensions B×C_in×H_a×W_a. In other words, the activation data may be arranged as C_in input channels (sometimes referred to as "data channels"), each of which has a spatial dimension H_a×W_a, where H_a and W_a are the height dimension and the width dimension, respectively. In FIG. 2A, the activation data is shown as including four input channels (i.e., C_in=4). Each input channel is a set of input data values. The activation data input to the convolutional layer may also be defined by a batch size B. The batch size B is not shown in FIG. 2A, but defines the number of data batches input to the convolutional layer. For example, in an image classification application, the batch size may refer to the number of individual images in the data input to the convolutional layer.
The weight data input to the convolutional layer includes a plurality of weight values, which may also be referred to as filter weights, coefficients, or weights. The weight data is arranged in one or more input channels and one or more output channels. An output channel may alternatively be referred to as a kernel or a filter. Referring again to FIG. 2A, the weight data may have dimensions C_out×C_in×H_w×W_w. In general, the number of input channels in the weight data corresponds to (e.g., is equal to) the number of input channels in the activation data with which the weight data is to be combined (e.g., C_in=4 in the example shown in FIG. 2A). Each input channel of each filter of the weight data input to the convolution layer has a spatial dimension H_w×W_w, where H_w and W_w are the height dimension and the width dimension, respectively. Each input channel is a set of weight values. Each output channel is a set of weight values. Each weight value is included in (e.g., included by or part of) one input channel and one output channel. The C_out dimension (e.g., the number of output channels) is not shown in FIG. 2A, but represents the number of channels in the output data generated by combining the weight data with the activation data. As shown in FIG. 2A, in the convolutional layer, the weight data may be combined with the activation input data according to a convolution operation across multiple steps in directions s and t.
FIG. 2B schematically illustrates an example convolutional layer 202 arranged to combine input activation data 206 with input weight data 208. FIG. 2B also illustrates the use of an optional offset bias 212 within layer 202. In FIG. 2B, the activation data 206 input to the layer 202 is arranged in three input channels 1, 2, 3. The number of input channels in the weight data 208 corresponds to (e.g., is equal to) the number of input channels in the activation data 206 with which the weight data 208 is to be combined. Thus, the weight data 208 is arranged in three input channels 1, 2, 3. The weight data 208 is also arranged in four output channels (e.g., filters) A, B, C, D. The number of output channels in the weight data 208 corresponds to (e.g., is equal to) the number of channels (e.g., data channels) in the output data 210. Each weight value is included in (e.g., included by or part of) one input channel and one output channel. For example, the weight value 216 is included in input channel 1 and output channel A. The input activation data 206 is convolved with the input weight data 208 to generate output data 210 having four data channels A, B, C, D. The first input channel of each filter in the weight data 208 is convolved with the first input channel of the activation data 206, the second input channel of each filter in the weight data 208 is convolved with the second input channel of the activation data 206, and the third input channel of each filter in the weight data 208 is convolved with the third input channel of the activation data 206. The results of the convolutions of each filter with each input channel of the activation data may be summed (e.g., accumulated) to form an output data value for each data channel of the output data 210. If the convolutional layer 202 were not configured to use offset biases, the output data 210 would be the output of the convolutional layer. In FIG. 2B, output data 210 is intermediate output data to be combined with offset biases 212. Each of the four output channels A, B, C, D of the weight data 208 input to the layer 202 is associated with a respective bias A, B, C, D. In the convolutional layer, the biases A, B, C, D are summed with the corresponding data channels A, B, C, D of the intermediate output data 210 to generate the output data 214 having four data channels A, B, C, D.
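As an illustration of the data formats described above, the following NumPy sketch (tensor sizes are illustrative, not taken from FIG. 2A or 2B) convolves activation data of shape B×C_in×H_a×W_a with weight data of shape C_out×C_in×H_w×W_w and adds one bias per output channel:

```python
import numpy as np

# Illustrative sizes only (not taken from FIG. 2A/2B).
B, C_in, H_a, W_a = 1, 3, 8, 8      # activation data: batch, input channels, height, width
C_out, H_w, W_w = 4, 3, 3           # weight data: output channels (filters), kernel height/width

activations = np.random.randn(B, C_in, H_a, W_a)
weights = np.random.randn(C_out, C_in, H_w, W_w)
biases = np.random.randn(C_out)     # one bias per output channel

# Naive stride-1, no-padding convolution (cross-correlation, as typically implemented
# in NN frameworks): each filter is combined with all input channels, the per-channel
# results are accumulated, and the bias for that output channel is added.
H_o, W_o = H_a - H_w + 1, W_a - W_w + 1
output = np.zeros((B, C_out, H_o, W_o))
for b in range(B):
    for co in range(C_out):
        for y in range(H_o):
            for x in range(W_o):
                window = activations[b, :, y:y + H_w, x:x + W_w]
                output[b, co, y, x] = np.sum(window * weights[co]) + biases[co]

print(output.shape)  # (1, 4, 6, 6) -> C_out data channels in the output
```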
An activation layer, typically but not necessarily following a convolutional layer, performs one or more activation functions on the activation data input to the layer. An activation function takes a single number and performs a specific nonlinear mathematical operation on it. In some examples, the activation layer may act as a rectified linear unit (ReLU) by implementing a ReLU function (i.e., f(x)=max(0,x)), or may act as a parametric rectified linear unit (PReLU) by implementing a PReLU function.
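By way of illustration, minimal sketches of the two activation functions mentioned above (the PReLU slope parameter name and value are illustrative):

```python
import numpy as np

def relu(x):
    # ReLU: f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def prelu(x, alpha=0.25):
    # PReLU: identity for positive inputs, learned slope `alpha` for negative inputs
    return np.where(x > 0, x, alpha * x)
```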
The normalization layer is configured to perform a normalization function, such as a Local Response Normalization (LRN) function, on activation data input to the layer. A pooling layer, typically but not necessarily interposed between successive convolution layers, performs a pooling function (such as a maximum or mean function) to aggregate a subset of the activation data input to that layer. The purpose of the pooling layer is therefore to reduce the spatial size of the representation to reduce the number of parameters and calculations in the network and thus also to control the overfitting.
The fully connected layer, typically but not necessarily after multiple convolution and pooling layers, takes the three-dimensional input activation data set and outputs an N-dimensional vector. In the case where NN is used for classification, N may be the number of categories, and each value in the vector may represent the probability of a certain category. The N-dimensional vector is generated by matrix multiplication with weight data, optionally followed by a bias offset. Thus, the fully connected layer receives the activation data, the weight data, and optionally the offset bias. As known to those skilled in the art, the activation data input to the fully-connected layer may be arranged in one or more input channels and the weight data input to the fully-connected layer may be arranged in one or more input channels and one or more output channels, with each of those output channels optionally being associated with a respective offset bias, in a manner equivalent to that described herein for the convolutional layer.
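A minimal sketch of the fully connected computation described above (sizes are illustrative):

```python
import numpy as np

def fully_connected(x, weights, bias=None):
    # x: flattened activation data, shape (N_in,)
    # weights: shape (N_out, N_in); bias: optional, shape (N_out,)
    y = weights @ x
    return y + bias if bias is not None else y

scores = fully_connected(np.random.randn(512), np.random.randn(10, 512), np.random.randn(10))
print(scores.shape)  # (10,) -> e.g. one score per class
```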
Thus, as shown in fig. 3, each layer 302 of NN receives input activation data and generates output data; and some layers (such as convolutional layers and fully-connected layers) also receive weight data and/or offsets.
The hardware used to implement the NN (e.g., an NN accelerator) includes hardware logic that may be configured to process input data for the NN according to a layer of the NN. In particular, the hardware for implementing the NN includes hardware logic that may be configured to process activation data input to each layer according to the layer and generate output data for the layer that becomes input activation data for another layer or becomes output of the NN. For example, if the NN includes a convolution layer followed by an activation layer, the hardware logic that may be configured to implement the NN includes hardware logic that may be configured to perform convolution on activation data input to the NN using weight data and optionally offsets input to the convolution layer to produce output data for the convolution layer, and hardware logic that may be configured to apply an activation function to activation data input to the activation layer (i.e., output data of the convolution layer) to produce output data for the NN.
As known to those skilled in the art, for hardware that processes a set of values, each value is represented in a number format. The two most common number formats are the fixed point number format and the floating point number format. As known to those skilled in the art, a fixed point number format has a fixed number of digits after a radix point (e.g., a decimal point or binary point). In contrast, a floating point number format does not have a fixed radix point (i.e., the radix point may "float"). In other words, the radix point may be placed at multiple locations within the representation. While representing values input to and output from the layers of the NN in a floating point number format may allow for more accurate or precise output data to be generated, processing values in a floating point number format in hardware is more complex, which tends to increase the silicon area, power consumption, and complexity of the hardware compared to processing values in a fixed point number format. Thus, the hardware used to implement the NN may be configured to represent the values input to the layers of the NN in fixed point number formats to reduce the area of the hardware logic, power consumption, and memory bandwidth.
In general, the fewer the number of bits that can be used to represent values input to and output from the layers of the NN, the more efficient the NN is implemented in hardware. However, in general, the fewer bits used to represent values input to and output from the layers of NN, the less accurate the NN becomes. Accordingly, it is desirable to identify a fixed point number format for representing the value of NN that balances the number of bits used to represent the value of NN with the accuracy of NN.
The embodiments described below are provided by way of example only and are not limiting of implementations which solve any or all of the disadvantages of known methods and systems for identifying fixed point number formats for representing the values of an NN.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to a first aspect of the present invention, there is provided a computer-implemented method of identifying one or more quantization parameters for transforming values to be processed by a neural network "NN" so as to implement the NN in hardware, the method comprising, in at least one processor: (a) Determining an output of a model of the NN in response to training data, the model of the NN comprising one or more quantization blocks, each of the one or more quantization blocks configured to transform one or more sets of values input to a layer of the NN into a respective fixed-point number format defined by one or more quantization parameters before the model processes the one or more sets of values according to the layer; (b) Determining a cost metric for the NN, the cost metric being a combination of an error metric and an implementation metric, the implementation metric representing an implementation cost of the NN based on the one or more quantization parameters according to which the one or more sets of values have been transformed, the implementation metric depending on, for each of a plurality of layers of the NN: a first contribution representing an implementation cost of outputs from the layer; and a second contribution representing an implementation cost of outputs from a layer preceding the layer; (c) Back-propagating the derivative of the cost metric to at least one of the one or more quantization parameters to generate a gradient of the cost metric for the at least one of the one or more quantization parameters; and (d) adjusting the at least one of the one or more quantization parameters based on the gradient for the at least one of the one or more quantization parameters.
Each quantization parameter of the one or more quantization parameters may include a respective bit width, and the method may further include: after the adjusting step (d), when the adjusted bit width for the set of values or corresponding set of values is zero, the set of values is removed from the model of NN.
The first contribution may be formed from the implementation cost of one or more output channels of the weight data input to the layer, and the second contribution may be formed from the implementation cost of one or more input channels of the activation data input to the layer.
Each of the one or more quantization parameters may include a respective bit width, each of the one or more sets of values may be a channel of values input to the layer, and the method may include transforming each of the one or more input channels of activation data input to the layer according to the respective bit width, and transforming each of the one or more output channels of weight data input to the layer according to the respective bit width.
The method may include: after the adjusting step (d), when the adjusted bit width for the corresponding input channel of the activation data input to the layer is zero, the output channel of the weight data input to the previous layer is removed from the model of NN.
The first contribution may be formed from an implementation cost of one or more output channels of the weight data input to the layer, and the second contribution may be formed from an implementation cost of one or more input channels of the weight data input to the layer.
Each of the one or more quantization parameters may include a respective bit width, each of the one or more sets of values may be a channel of values input to the layer, and the method may include determining a respective bit width for each of one or more input channels of weight data input to the layer, and determining a respective bit width for each of one or more output channels of weight data input to the layer.
For each weight value input to the layer, a first bit width and a second bit width may be determined, respectively, and the method may comprise transforming each weight value according to its respective first bit width and/or its respective second bit width, optionally according to the smaller of its respective first bit width and its respective second bit width.
The method may include: after the adjusting step (d), when the adjusted bit width for the corresponding input channel of the weight data input to the layer is zero, the output channel of the weight data input to the preceding layer is removed from the model of NN.
The first contribution may be formed from an implementation cost of one or more output channels of weight data input to a layer and an implementation cost of one or more biases input to a layer, and the second contribution may be formed from an implementation cost of one or more output channels of weight data input to a preceding layer and an implementation cost of one or more biases input to a preceding layer.
Each of the one or more quantization parameters may include a respective bit width, the one or more sets of values may include one or more output channels and associated offsets of weight data input to a layer, and one or more output channels and associated offsets of weight data input to a preceding layer, and the method may include transforming each of the one or more output channels of weight data input to the layer according to the respective bit width, transforming each of the one or more offsets input to the layer according to the respective bit width, transforming each of the one or more output channels of weight data input to the preceding layer according to the respective bit width, and transforming each of the one or more offsets input to the preceding layer according to the respective bit width.
The same bit width may be used to transform the output channels of the weight data and the associated bias of the output channels.
The method may include: after the adjusting step (d), removing the output channel from the model of NN when the adjusted bit width of the output channel and the associated bias of the output channel for weight data input to the layer are zero.
The first contribution may be formed from the implementation cost of one or more output channels of weight data input to a layer, and the second contribution may be formed from the implementation cost of one or more output channels of weight data input to a preceding layer.
Each of the one or more quantization parameters may include a respective bit width, the one or more sets of values may include one or more output channels of weight data input to a layer and one or more output channels of weight data input to a preceding layer, and the method may include transforming each of the one or more output channels of weight data input to the layer according to the respective bit width and transforming each of the one or more output channels of weight data input to the preceding layer according to the respective bit width.
The method may include: after the adjusting step (d), when the adjusted bit width for the output channel of the weight data input to the previous layer is zero, the output channel is removed from the model of NN.
For each of the plurality of layers of the NN, the implementation metric may further depend on an additional contribution representing an implementation cost of one or more biases input to the preceding layer.
The method may include: after the adjusting step (d), removing the output channel from the model of NN when the absolute value of the adjusted bit width for the output channel and the associated bias of the output channel for weight data input to the preceding layer is zero.
A layer of NN may receive activation input data that has been derived from activation output data of more than one preceding layer, and implementation metrics for the layer may depend on: a first contribution representing implementation costs of outputs from the layers; a second contribution representing implementation costs of outputs from a first layer preceding the layer; and a third contribution representing implementation costs of outputs from a second layer preceding the layer.
A layer of NN may output activation data input to a first subsequent layer and a second subsequent layer, the method may further include adding a new layer to the NN between the layer and the first subsequent layer, and an implementation metric of the first subsequent layer may depend on: a first contribution representing implementation costs of outputs from the first subsequent layer; and a second contribution representing implementation costs of the output from the new layer.
The new layer may not perform any calculations on the output activation data of the layer.
The second contribution may represent an implementation cost of output from a layer immediately preceding the layer.
The method may include repeating (a), (b), (c), and (d) using the adjusted at least one of the one or more quantization parameters.
The method may also include outputting the adjusted at least one of the one or more quantization parameters for configuring the hardware logic to implement the NN.
The method may also include configuring hardware logic to implement the NN using the adjusted quantization parameter.
The hardware logic may include a neural network accelerator.
According to a second aspect of the present invention, there is provided a computing-based device configured to identify one or more quantization parameters for transforming values to be processed by a neural network "NN" so as to implement the NN in hardware, the computing-based device comprising: at least one processor; and a memory coupled to the at least one processor, the memory comprising computer readable code which, when executed by the at least one processor, causes the at least one processor to: (a) determine an output of a model of the NN in response to training data, the model of the NN comprising one or more quantization blocks, each of the one or more quantization blocks configured to transform one or more sets of values input to a layer of the NN into a respective fixed-point number format defined by one or more quantization parameters before the model processes the one or more sets of values according to the layer; (b) determine a cost metric for the NN, the cost metric being a combination of an error metric and an implementation metric, the implementation metric representing an implementation cost of the NN based on the one or more quantization parameters according to which the one or more sets of values have been transformed, the implementation metric depending on, for each of a plurality of layers of the NN: a first contribution representing an implementation cost of outputs from the layer; and a second contribution representing an implementation cost of outputs from a layer preceding the layer; (c) back-propagate the derivative of the cost metric to at least one of the one or more quantization parameters to generate a gradient of the cost metric for the at least one of the one or more quantization parameters; and (d) adjust the at least one of the one or more quantization parameters based on the gradient for the at least one of the one or more quantization parameters.
According to a third aspect of the present invention, there is provided a computer-implemented method of processing data using a neural network "NN" implemented in hardware, the NN comprising a plurality of layers, each layer being configured to operate on activation data input to the layer so as to form output data for the layer, the data being arranged in data channels, the method comprising: for an identified channel of the output data for the layer, operating on the activation data input to the layer such that the output data for the layer does not include the identified channel; and, in dependence on information indicating the structure of the output data for the layer that should include the identified channel, inserting a replacement channel into the output data for the layer in place of the identified channel before an operation of the NN that is configured to operate on the output data for the layer.
According to a fourth aspect of the present invention, there is provided a computing-based device configured to process data using a neural network "NN" implemented in hardware, the NN comprising a plurality of layers, each layer configured to operate on activation data input to the layer so as to form output data for the layer, the data being arranged in data channels, the computing-based device comprising at least one processor configured to: for an identified channel of the output data for the layer, operate on the activation data input to the layer such that the output data for the layer does not include the identified channel; and, in dependence on information indicating the structure of the output data for the layer that should include the identified channel, insert a replacement channel into the output data for the layer in place of the identified channel before an operation of the NN that is configured to operate on the output data for the layer.
Hardware logic configurable to implement an NN (e.g., an NN accelerator) may be embodied in hardware on an integrated circuit. A method of manufacturing hardware logic configurable to implement an NN (e.g., an NN accelerator) at an integrated circuit manufacturing system may be provided. An integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, configures the system to manufacture hardware logic that may be configured to implement an NN (e.g., an NN accelerator). A non-transitory computer readable storage medium having stored thereon a computer readable description of hardware logic configurable to implement NN (e.g., NN accelerator) may be provided that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the hardware logic configurable to implement NN (e.g., NN accelerator).
An integrated circuit manufacturing system may be provided, the integrated circuit manufacturing system comprising: a non-transitory computer-readable storage medium having stored thereon a computer-readable description of hardware logic configurable to implement NN (e.g., NN accelerator); a layout processing system configured to process the computer-readable description to generate a circuit layout description of an integrated circuit embodying hardware logic configurable to implement an NN (e.g., NN accelerator); and an integrated circuit generation system configured to fabricate hardware logic configurable to implement NN (e.g., NN accelerator) in accordance with the circuit layout description.
Computer program code for performing the methods as described herein may be provided. A non-transitory computer readable storage medium having stored thereon computer readable instructions may be provided which, when executed at a computer system, cause the computer system to perform a method as described herein.
As will be apparent to those skilled in the art, the above features may be suitably combined and combined with any of the aspects of the examples described herein.
Drawings
Examples will now be described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of an example Neural Network (NN);
FIG. 2A illustrates an example overview of a data format in a convolutional layer for NN;
fig. 2B schematically illustrates an example convolutional layer.
FIG. 3 is a schematic diagram showing data input to and output from layers of an NN;
FIG. 4 is a schematic diagram showing an example model of NN with and without quantization blocks;
FIG. 5 is a flow chart of an example method for identifying quantization parameters for NN;
FIG. 6 is a schematic diagram illustrating a first example method for generating an error metric;
FIG. 7 is a schematic diagram illustrating a second example method for generating an error metric;
FIG. 8 is a graph illustrating an example gradient of an example cost metric with respect to bit width;
FIG. 9 is a schematic diagram showing interactions between two adjacent layers of NN;
fig. 10A is a schematic diagram showing NN including a remaining layer.
FIG. 10B is a flow chart of an example method for inserting an alternate channel.
Fig. 10C to 10E are schematic diagrams showing NNs including remaining layers.
FIG. 11 is a flow chart of an example method for identifying quantization parameters and weights for NNs;
FIG. 12 is a schematic diagram illustrating quantization into an example fixed point number format;
FIG. 13 is a block diagram of an example NN accelerator;
FIG. 14 is a block diagram of an example computing-based device;
FIG. 15 is a block diagram of an example computer system in which NN accelerators may be implemented; and is also provided with
FIG. 16 is a block diagram of an example integrated circuit manufacturing system for generating an integrated circuit embodying an NN accelerator as described herein.
The figures illustrate various examples. Skilled artisans will appreciate that element boundaries (e.g., blocks, groups of blocks, or other shapes) illustrated in the figures represent one example of boundaries. In some examples, it may be the case that one element may be designed as a plurality of elements, or that a plurality of elements may be designed as one element. Where appropriate, common reference numerals have been used throughout the various figures to indicate like features.
Detailed Description
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art. Embodiments are described by way of example only.
Since the number of bits that efficiently represent a set of values is based on a range of values in the set, NN can be efficiently implemented without significantly degrading its accuracy by dividing the values input to NN into sets and selecting a fixed point number format on a per set basis. Since the values input to the same layer tend to be related, each set may be all or part of a particular type of input for a layer. For example, each set may be all or a portion of the input activation data values of the layer; all or a portion of the input weight data for the layer; or all or a portion of the bias of the layers. Whether these sets include all or only a portion of the specific types of inputs for the layers depends on the hardware used to implement the NN. For example, some hardware for implementing an NN may support only a single fixed-point number format per input type per layer, while other hardware for implementing an NN may support multiple fixed-point number formats per input type per layer.
Each fixed point number format is defined by one or more quantization parameters. A common fixed point number format is the Q format, which specifies a predetermined number of integer bits a and fractional bits b. Accordingly, a number may be denoted Qa.b, which requires a total of a+b+1 bits (including the sign bit). Example Q formats are shown in Table 1 below.
TABLE 1
Q format | Description | Example
Q4.4 | 4 integer bits and 4 fractional bits | 0110.1110₂
Q0.8 | 0 integer bits and 8 fractional bits | .01101110₂
In the case where the Q format is used to represent the values of an NN, the quantization parameters may include, for each fixed point number format, the number of integer bits a and the number of fractional bits b.
In other cases, instead of using the Q format to represent the values input to the layers of the NN, a fixed-point number format defined by a fixed integer exponent exp and a b-bit mantissa m may be used, such that a value z is equal to z = 2^exp * m. In some cases, the mantissa m may be represented in two's complement format. However, in other cases, other signed or unsigned integer formats may be used. In these cases, the exponent exp and the number of mantissa bits b need only be stored once for a set of values represented in this format. In the case where such a fixed point number format is used to represent the values of the NN, the quantization parameters may include, for each fixed point number format, a mantissa bit length b (which may also be referred to herein as a bit width or bit length) and an exponent exp.
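A small sketch of quantizing a value into this exponent/mantissa format (the rounding mode and clamping behaviour shown are assumptions for illustration):

```python
def quantize_exp_mantissa(z, exp, b):
    """Quantize z to the format z_q = 2**exp * m, with m a signed b-bit (two's complement) integer."""
    m_max = 2 ** (b - 1) - 1          # largest representable mantissa
    m_min = -2 ** (b - 1)             # smallest representable mantissa
    m = round(z / 2 ** exp)           # round-to-nearest (one possible rounding mode)
    m = max(m_min, min(m_max, m))     # clamp to the representable mantissa range
    return 2 ** exp * m               # value actually represented in the format

print(quantize_exp_mantissa(0.7, exp=-4, b=8))  # 0.6875 = 2**-4 * 11
```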
In yet other cases, an 8-bit asymmetric fixed point (Q8A) format may be used to represent values input to the layers of the NN. This format includes a minimum representable number r_min, a maximum representable number r_max, a zero point z, and an 8-bit number for each value which identifies a linear interpolation factor between the minimum and maximum numbers. In other cases, a variant of the Q8A format may be used in which the number of bits used to store the interpolation factor is variable (e.g., the number of bits used to store the interpolation factor may be one of a number of possible integers). A floating point value d_float can be constructed from this format as shown in equation (1), where b is the number of bits used by the quantized representation and z is the quantized zero point which will always map exactly back to 0.f:

d_float = ((r_max - r_min) / (2^b - 1)) * (d_quant - z)        (1)

In the case where such fixed point number formats are used to represent the values of the NN, the quantization parameters may include, for each fixed point number format, a maximum representable number or value r_max, a minimum representable number or value r_min, a quantization zero point z, and optionally a mantissa bit length b (i.e., when the bit length is not fixed at 8).
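A sketch of the Q8A-style reconstruction and quantization described above, assuming the standard asymmetric scheme in which the stored b-bit number is a linear interpolation factor between r_min and r_max (function and variable names are illustrative):

```python
def q8a_dequantize(d_q, r_min, r_max, z, b=8):
    """Reconstruct a floating point value from a Q8A-style quantized value d_q."""
    scale = (r_max - r_min) / (2 ** b - 1)   # size of one quantization step
    return scale * (d_q - z)                 # the zero point z maps exactly back to 0.0

def q8a_quantize(d_float, r_min, r_max, z, b=8):
    """Inverse operation: map a floating point value to the nearest b-bit level."""
    scale = (r_max - r_min) / (2 ** b - 1)
    d_q = round(d_float / scale) + z
    return max(0, min(2 ** b - 1, d_q))      # clamp to the representable range

print(q8a_dequantize(q8a_quantize(0.3, 0.0, 1.0, 0), 0.0, 1.0, 0))  # ~0.3
```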
While the fixed-point number format (and more specifically, the quantization parameters thereof) for efficiently representing a set of values may be determined only from the range of values in the set, since the layers of NN are interconnected, a better tradeoff may be achieved between the number of bits used to represent the values of NN and the performance (e.g., accuracy) of NN by considering the interactions between the layers when selecting the fixed-point number format (and more specifically, the quantization parameters thereof) for representing the values of NN.
Accordingly, described herein are methods and systems for identifying a fixed-point number format for representing the value of NN using back propagation, and in particular, the quantization parameters thereof (e.g., exponent and mantissa bit lengths). As known to those skilled in the art, back propagation is a technique that can be used to train NNs. Training the NN includes identifying appropriate weights to configure the NN to perform particular functions.
Specifically, to train the NN via back propagation, a model of the NN is configured to use a particular set of weights, training data is then applied to the model, and the output of the model in response to the training data is recorded. A differentiable error metric is then calculated from the recorded output, the differentiable error metric quantitatively indicating the performance of the NN using the particular set of weights. In some cases, the error metric may be a distance (e.g., a mean square distance) between the recorded output and the expected output for the training data. However, this is merely an example, and any suitable error metric may be used. The derivative of the error metric is then back propagated to the weights of the NN to produce a gradient/derivative of the error metric with respect to each weight. The weights are then adjusted based on the gradients to reduce the error metric. This process may be repeated until the error metric converges.
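As a concrete illustration of this training loop, a minimal sketch using PyTorch (the network, error metric, training data, and learning rate are placeholders, not taken from the examples herein):

```python
import torch

model = torch.nn.Sequential(              # placeholder NN: two layers
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(64, 16)              # placeholder training data
targets = torch.randint(0, 10, (64,))     # placeholder expected outputs

for step in range(100):
    outputs = model(inputs)                                       # forward pass: record the output
    em = torch.nn.functional.cross_entropy(outputs, targets)      # differentiable error metric
    optimizer.zero_grad()
    em.backward()                                                 # back-propagate d(em)/d(weight)
    optimizer.step()                                              # adjust weights using the gradients
```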
NNs are typically trained using a model of NNs in which the values of NNs (e.g., activation data, weight data, and offsets) are represented and processed in a floating point number format. NNs that use a floating point format to represent and process the values of NNs are referred to herein as floating point NNs. The model of the floating point NN may be referred to herein as the floating point model of NN. However, as described above, hardware for implementing the NN (e.g., NN accelerator) may use a fixed-point number format to represent the values of the NN (e.g., activation data, weight data, and bias) to reduce the size of the hardware and to increase the efficiency of the hardware. NN for which at least some of the values use a fixed point number format is referred to herein as fixed point NN. To train the fixed-point NN, quantization blocks may be added to the floating-point model of the NN that quantize (or simulate the quantization of) the NN values to a predetermined fixed-point number format prior to processing the values. This allows for the quantization of values into a fixed point number format to be considered in training the NN. A model of an NN that includes one or more quantization blocks to quantize one or more sets of input values (or to simulate quantization of the sets of input values) is referred to herein as a quantization model of the NN.
For example, FIG. 4 shows an example NN 400 comprising a first layer 402 and a second layer 404. The first layer 402 processes a first set of input activation data values X_1 according to a first set of weight data W_1 and a first set of biases B_1; the second layer 404 processes a second set of input activation data values X_2 (the output of the first layer 402) according to a second set of weight data W_2 and a second set of biases B_2. Such a floating-point model of NN 400 may be augmented with one or more quantization blocks that each quantize (or simulate the quantization of) one or more sets of values input to a layer of the NN, such that the quantization of the values of the NN may be taken into account in training the NN. For example, as shown in FIG. 4, a quantization model 420 of the NN may be generated from the floating point model of the NN by adding: a first quantization block 422 that quantizes the first set of input activation data values X_1 (or simulates the quantization of the first set of input activation data values) into one or more fixed point number formats defined by respective sets of quantization parameters; a second quantization block 424 that quantizes the first set of weight data W_1 and the first set of biases B_1 (or simulates their quantization) into one or more fixed point number formats defined by respective sets of quantization parameters; a third quantization block 426 that quantizes the second set of input activation data values X_2 (or simulates the quantization of the second set of input activation data values) into one or more fixed point number formats defined by respective sets of quantization parameters; and a fourth quantization block 428 that quantizes the second set of weight data W_2 and the second set of biases B_2 (or simulates their quantization) into one or more fixed point number formats defined by respective sets of quantization parameters.
Adding quantization blocks to the floating-point model of the NN allows the quantization parameters (e.g., mantissa bit lengths and exponents) themselves to be determined via back propagation, as long as the quantization parameters are differentiable. In particular, this may be achieved by making the quantization parameters (e.g., bit length b and exponent exp) learnable and generating a cost metric based on an error metric and the implementation cost of the NN. The derivatives of the cost metric may then be back-propagated to the quantization parameters (e.g., bit length b and exponent exp) to produce gradients/derivatives of the cost metric with respect to each of the quantization parameters. Each gradient indicates whether the corresponding quantization parameter (e.g., bit length or exponent) should be higher or lower than it currently is to reduce the cost metric. The quantization parameters may then be adjusted based on the gradients to minimize the cost metric. Similar to training the NN (i.e., identifying the weights of the NN), this process may be repeated until the cost metric converges.
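To make the idea concrete, the following PyTorch sketch holds a bit width and an exponent as learnable parameters, simulates quantization with them in the forward pass, and back-propagates a weighted cost metric combining an error metric and an implementation metric (as described below) to them. It is an illustrative simplification under stated assumptions (straight-through rounding, one quantization block per tensor, and an implementation metric that simply sums bit widths), not the method as claimed:

```python
import torch

class QuantizationBlock(torch.nn.Module):
    """Simulates quantization of a set of values with a learnable bit width and exponent."""
    def __init__(self, init_bits=20.0, init_exp=-8.0):
        super().__init__()
        self.bits = torch.nn.Parameter(torch.tensor(init_bits))
        self.exp = torch.nn.Parameter(torch.tensor(init_exp))

    def forward(self, x):
        step = 2.0 ** self.exp                         # quantization step size
        max_int = 2.0 ** (self.bits - 1) - 1           # largest two's complement mantissa
        scaled = x / step
        # Straight-through estimator: round in the forward pass, identity in the backward pass.
        rounded = scaled + (torch.round(scaled) - scaled).detach()
        clamped = torch.maximum(torch.minimum(rounded, max_int), -max_int - 1)
        return clamped * step

def cost_metric(output, baseline, qblocks, alpha=1.0, beta=1e-3):
    # cm = alpha * em + beta * sm. Here sm simply sums the learnable bit widths; a real
    # implementation metric would weight each bit width by the number of values it covers
    # and include the per-layer contributions described later in this document.
    em = torch.nn.functional.l1_loss(output, baseline)
    sm = sum(qb.bits for qb in qblocks)
    return alpha * em + beta * sm

# Toy quantization model of a single fully connected layer.
layer = torch.nn.Linear(16, 10)
q_act, q_wgt = QuantizationBlock(), QuantizationBlock()
optimizer = torch.optim.SGD(list(q_act.parameters()) + list(q_wgt.parameters()), lr=0.1)

x = torch.randn(64, 16)
baseline = layer(x).detach()                           # floating point (baseline) output

for _ in range(50):
    out = torch.nn.functional.linear(q_act(x), q_wgt(layer.weight.detach()), layer.bias.detach())
    cm = cost_metric(out, baseline, [q_act, q_wgt])
    optimizer.zero_grad()
    cm.backward()        # back-propagate d(cm)/d(bits) and d(cm)/d(exp)
    optimizer.step()     # adjust the quantization parameters using the gradients

print(float(q_act.bits), float(q_wgt.bits))
```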
Tests have shown that using back propagation to identify quantization parameters of the NN can generate a fixed point NN with a good level of performance (e.g., with accuracy above a predetermined threshold) and a minimum number of bits, which allows the NN to be implemented efficiently in hardware.
Referring now to fig. 5, an example methodology 500 for identifying quantization parameters of NN via back propagation is illustrated. In an example, the method 500 of fig. 5 may be used to identify quantization parameters of a Deep Neural Network (DNN), which is one type of NN, via back propagation. The method 500 may be implemented by a computing-based device, such as the computing-based device 1400 described below with respect to fig. 14. For example, there may be a computer-readable storage medium having stored thereon computer-readable instructions that, when executed at a computing-based device, cause the computing-based device to perform the method 500 of fig. 5.
The method begins at block 502, where the output of a quantization model of the NN in response to training data is determined. The model of the NN is a representation of the NN that may be used to determine the output of the NN in response to input data. The model may be, for example, a software implementation of the NN or a hardware implementation of the NN. Determining the output of the model of the NN in response to the training data includes passing the training data through the layers of the NN and obtaining the output thereof. This may be referred to as a forward pass of the NN because the computational flow is from the input through the NN to the output. The model may be configured to use a trained set of weights (e.g., a set of weights obtained by training a floating point model of the NN).
The quantization model of the NN is a model of the NN that includes one or more quantization blocks (e.g., as shown in fig. 4). Each quantization block is configured to transform (e.g., quantize or simulate quantization of) one or more sets of values input to a layer of NN before the model processes the one or more sets of values according to the layer. The quantization block allows to quantize the effect of one or more sets of values of the NN on the output of the NN to be measured.
Quantization is the process of converting a number in a higher precision digital format to a lower precision digital format, as known to those skilled in the art. Quantizing the digits in the higher precision format to the lower precision format generally includes selecting one of the representable digits in the lower precision format to represent the digit in the higher precision format based on a particular rounding mode, such as, but not limited to, round-near (RTN), round-to-zero (RTZ), round-near tie-even (RTE), round-to-positive-infinity (RTP), and round-to-negative-infinity (RTNI).
For example, equation (2) gives a formula for quantizing a value z in a first number format to a value z_q in a second, lower precision number format, where X_max is the highest representable number in the second number format, X_min is the lowest representable number in the second number format, and RND(z) is a rounding function:

z_q = X_max, if z > X_max
z_q = X_min, if z < X_min
z_q = RND(z), otherwise        (2)
The formula given in equation (2) quantizes a value in the first number format to one of the representable numbers in the second number format, selected based on the rounding mode RND (e.g., RTN, RTZ, RTE, RTP or RTNI).
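A direct sketch of equation (2); the rounding mode shown is round-to-nearest, and any of the modes listed above could be substituted:

```python
def quantize(z, x_min, x_max, rnd=round):
    # Equation (2): clamp out-of-range values to the representable extremes,
    # otherwise round z to a representable number using the rounding mode RND.
    if z > x_max:
        return x_max
    if z < x_min:
        return x_min
    return rnd(z)

print(quantize(3.7, -8, 7))   # 4
print(quantize(12.2, -8, 7))  # 7 (clamped to X_max)
```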
In the examples described herein, the lower precision format is a fixed point format, and the higher precision format may be a floating point format or a fixed point format. In other words, each quantization block is configured to receive one or more sets of values in an input digital format (which may be a floating point number format or a fixed point number format) and quantize (or simulate quantization of) the sets of values to one or more lower precision output fixed point number formats.
As described above with respect to fig. 3, each layer of NN receives input activation data and generates output data. The layer may also receive weight data and/or bias. Thus, the set of values transformed by the quantization block may be all or a subset of the activation data values input to the layer, all or a subset of the weight data values input to the layer, or all or a subset of the offsets input to the layer. By way of example, any one or more of the following may be considered as a set of values to be transformed by a quantization block: an input channel for activation data input to a layer, an input channel for weight data input to a layer, an output channel for weight data input to a layer, a bias input to a layer, and/or an output channel for weight data input to a layer and its associated bias.
Each quantization block may be configured to transform (e.g., quantize or simulate the quantization of) different subsets of values of a particular input type into a different output fixed point number format. For example, the quantization block may transform a first subset of input activation values for a layer into a first output fixed-point number format and a second subset of input activation values for the layer into a second, different output fixed-point number format. In other words, in an example, one quantization block may transform each of the input channels of activation data input to a layer, each of those input channels being transformed into a corresponding (e.g., possibly different) output fixed point number format. In other cases, there may be multiple quantization blocks per input type. For example, there may be multiple quantization blocks for transforming the activation data of the layer, wherein each of these quantization blocks transforms only a portion (or only a subset) of the activation data values of the layer. In other words, in an example, each quantization block may transform one input channel of activation data into an output fixed point number format.
Each output fixed point number format used by the quantization block is defined by one or more quantization parameters. The quantization parameters defining a particular output fixed point number format may be based on the particular fixed point number format supported by the hardware logic that will implement the NN. For example, each fixed point number format may be defined by an exponent exp and a mantissa bit length b.
In the first iteration of block 502, the quantization parameters used by the quantization block may be randomly selected from the supported quantization parameters, or these quantization parameters may be selected in another manner. For example, in some cases, the mantissa bit length may be set to a value higher than the highest bit length supported by the hardware to be used to implement NN so that information is not lost due to initial quantization. For example, in the case where the hardware to implement NN supports a maximum bit length of 16 bits, then the mantissa bit length may be initially set to a value higher than 16 (e.g., 20).
Once the model of the NN has been determined in response to the output of the training data, the method 500 proceeds to block 504.
At block 504, a cost metric cm for the set of quantization parameters used in block 502 is determined from (i) the output of the quantization model of the NN in response to the training data and (ii) the implementation cost of the NN based on the set of quantization parameters. The cost metric cm is a quantitative measure of the quality of the set of quantization parameters. In the examples described herein, the quality of the set of quantization parameters is based on the error of the NN when the set of quantization parameters is used to quantize the values of the NN (or simulate the quantization of those values), and on the implementation cost of the NN (e.g., expressed in number of bits or bytes) when the set of quantization parameters is used. Thus, in some cases, the cost metric cm may be a combination of an error metric em and an implementation metric sm. The implementation metric may be referred to as an implementation cost metric or a size metric. In some examples, the cost metric cm may be calculated as a weighted sum of the error metric em and the implementation metric sm, as shown in equation (3), where α and β are weights applied to the error metric em and the implementation metric sm, respectively. The weights α and β are chosen to achieve some balance between the error metric and the implementation metric. In other words, the weights are used to indicate whether the error or the implementation cost is more important. For example, if the implementation metric weight β is smaller, the cost metric will be dominated by the error metric, resulting in a more accurate network. In contrast, if the implementation metric weight β is larger, the cost metric will be dominated by the implementation metric, resulting in a smaller network with lower accuracy. However, in other examples, the error metric em and the implementation metric sm may be combined in another suitable manner to generate the cost metric cm.
cm=(α*em)+(β*sm) (3)
The error metric em may be any metric that provides a quantitative measure of the error in the output of the NN's quantization model when the values of the NN are quantized (or the quantization of those values is simulated) using a particular set of quantization parameters. In some examples, the error in the output of the quantization model of the NN in response to the training data may be calculated as the error in that output relative to a baseline output. In some cases, as shown at 600 of FIG. 6, the baseline output may be the output of a floating point model of the NN (i.e., a model of the NN in which the values of the NN are in a floating point number format). Because values can generally be represented more accurately, or more precisely, in a floating point number format, the floating point model of the NN represents the model of the NN that will produce the most accurate output. Thus, the output generated by the floating point model of the NN may be used as a reference or baseline output against which to estimate the accuracy of the output data generated by the quantized model of the NN.
In other examples, as shown at 700 of fig. 7, the baseline output may be a ground truth output for training data. In these examples, an error in the output of the quantized model of the NN may indicate the accuracy of the output of the quantized model of the NN relative to the known results of the training data.
The error between the baseline output and the output of the quantized model of NN may be determined in any suitable manner. Where NN is a classification network, the output of NN may be a set of scores. As known to those skilled in the art, the classification network determines the probability that the input data falls into each of a plurality of categories. Classification NNs generally output data vectors, with one element corresponding to each class, and each of these elements is referred to as a score. For example, a classification network with 1425 potential class labels may output a vector of 1425 scores. In these cases, the error between the baseline output and the output of the quantized model of NN may be calculated as the L1 distance between the corresponding scores. This is shown in equation (4), where r is the set of scores in the baseline output and r' is the set of scores in the output of the quantization model of NN:
em = ∑_i |r_i − r′_i|    (4)
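A minimal sketch of this L1 error metric is given below for illustration; the function name and the use of plain Python lists are assumptions.

def l1_error(baseline_scores, quantized_scores):
    # Error metric of equation (4): the L1 distance between the scores of the
    # baseline output and the scores of the quantized model's output.
    return sum(abs(r - r_q) for r, r_q in zip(baseline_scores, quantized_scores))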
In other examples, the output of the classification NN may instead be the output of a SoftMax function applied to the scores. As known to those skilled in the art, the SoftMax function is a transformation applied to the scores output by the NN such that the values associated with the classes sum to 1. This allows the output of the SoftMax function to represent a probability distribution over the classes. The output of the SoftMax function may be referred to as SoftMax normalized scores. The SoftMax function may be represented as shown in equation (5) (with or without an additional temperature parameter T), where s_i is the SoftMax output for class i, r_i is the score for class i, and i and j are vector indices corresponding to the classes. Increasing the temperature T makes the SoftMax values "softer" (i.e., less saturated towards 0 and 1) and thus more amenable to training.

s_i = exp(r_i / T) / ∑_j exp(r_j / T)    (5)
In the case where the output of the classification NN is a SoftMax normalized score set, the error between the baseline output and the output of the quantized model of NN may be calculated as the L1 distance between the outputs of the SoftMax function.
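For illustration, the sketch below shows a SoftMax function with a temperature parameter in the style of equation (5), and an L1 error computed on the SoftMax normalized scores; the function names and the use of plain Python lists are assumptions.

import math

def softmax(scores, temperature=1.0):
    # SoftMax with an optional temperature T; a higher T gives "softer"
    # (less saturated) outputs. The max is subtracted for numerical stability.
    m = max(scores)
    exps = [math.exp((r - m) / temperature) for r in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_l1_error(baseline_scores, quantized_scores, temperature=1.0):
    # L1 distance between the SoftMax normalized scores of the baseline output
    # and of the quantized model's output.
    s = softmax(baseline_scores, temperature)
    s_q = softmax(quantized_scores, temperature)
    return sum(abs(a - b) for a, b in zip(s, s_q))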
In other cases, the error in the output of the quantized model of the NN in response to the training data may be the top-N classification accuracy, where N is an integer greater than or equal to one. As known to those skilled in the art, the top-N classification accuracy is a measure of how often the correct classification appears in the top N classifications output by the NN. Frequently used top-N classification accuracies are the top-1 and top-5 classification accuracies, but any top-N classification accuracy may be used.
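A minimal sketch of a top-N classification accuracy computation is given below for illustration; the function name and data layout (one score vector per sample) are assumptions.

def top_n_accuracy(score_vectors, true_labels, n=5):
    # Fraction of samples for which the correct class appears among the n
    # highest-scoring classes output by the network.
    hits = 0
    for scores, label in zip(score_vectors, true_labels):
        top_n = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
        hits += int(label in top_n)
    return hits / len(true_labels)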
In general, it is advantageous for the quantization parameters to be selected using the same error metric that was used to train the NN (i.e., to select its weights).
The implementation metrics sm are metrics that provide a quantitative measure of the hardware-related costs of implementing NN when a particular set of quantization parameters is used. The implementation metrics represent the cost of implementing the NN based on one or more quantization parameters according to which one or more sets of values are transformed in block 502. The implementation metrics may be referred to as implementation cost metrics or size metrics. The hardware-related costs of implementing an NN may include, for example, the cost of transferring data from memory to an NNA chip. When using a particular set of quantization parameters, the implementation metrics may reflect some measure of performance of the NN, such as: how fast NN runs on a certain hardware; or how much power the NN consumes on a certain hardware. For example, the implementation metrics may be hardware specific (e.g., specific to the NN accelerator that will implement the NN) such that the implementation metrics may be customized to reflect the characteristics of the hardware so that NN training effectively optimizes the set of quantization parameters for the hardware. The implementation metrics may be expressed, for example, in physical units (e.g., joules (Joules)) or information units (e.g., bits or bytes).
In a simple approach, the implementation metric may depend on the total number of bits or bytes used to represent certain sets of values (e.g., the input activation data set, the weight data set, or the bias set) for each of the layers of the NN. However, the inventors have discovered that this simple approach can be improved by considering interactions between layers (e.g., particularly adjacent layers) when used in a method for identifying one or more quantization parameters as described herein. For example, consider an illustrative network that includes a first layer configured to output 5 data channels (e.g., using weight data arranged in 5 output channels) to a second layer configured to output 1000 data channels (e.g., using weight data arranged in 1000 output channels). A simple method for evaluating the implementation cost of the network may be to evaluate the sum of the sizes (e.g., in bits) of the output channels of the weight data input to each layer. That is, the implementation cost of each layer may be estimated from the sum of the number of bits used to encode each of the output channels of the weight data input to that layer, and the implementation cost of the network may be represented by the sum of the implementation costs of the layers. Assuming that each output weight channel includes a comparable number of weight values, this simple method would determine that the first layer (using weight data arranged in 5 output channels) is relatively small and the second layer (using weight data arranged in 1000 output channels) is relatively large. Thus, a training method based on such an implementation metric may "target" the output channels of the weight data input to the second layer (e.g., because the second layer appears larger, and so reducing its size would apparently have a greater impact on the implementation cost of the NN). However, this simple method does not take into account that each of the 5 channels of the output data generated by the first layer will be convolved with the 1000 output channels of the weight data input to the second layer. Thus, reducing the implementation cost of any one of those 5 channels of output data generated by the first layer (e.g., by reducing the size of an output channel of the weight data input to the first layer) can have a significant impact on the total inference time of the NN. This concept can be illustrated by means of an extreme example: reducing the size of any one of the 5 output channels of the weight data input to the first layer to zero enables the corresponding channel of the output data to be omitted from the NN, which removes 1000 channel convolution operations (one for each output channel of the weight data input to the second layer) from the computation to be performed in the second layer. The simple method for evaluating the implementation cost of a network does not take this type of interaction between layers into account. It will be appreciated that similar disadvantages arise if alternative simple methods are used in which the implementation cost of the network is assessed in terms of: the size (e.g., number of bits) of the output channels of the data generated by each layer; the size (e.g., number of bits) of the input channels of the weight data input to each layer; or the size (e.g., number of bits) of the input channels of the activation data to be input to each layer.
In accordance with the principles described herein, for each of the plurality of layers of the NN, the implementation metric depends on a first contribution (e.g., based on the output channels of the layer) representing an implementation cost of the output from that layer, and a second contribution (e.g., based on the input channels of the layer for which the implementation cost is determined) representing an implementation cost of the output from a layer preceding that layer. That is, each of the plurality of layers may provide a respective first contribution and second contribution. The implementation metric may be a sum of the implementation costs of each of the plurality of layers determined from the first contribution and the second contribution. In this way, interactions between layers (e.g., in particular adjacent layers) can be better taken into account. A training method based on such an implementation metric, which better accounts for interactions between layers, may better "target" the sets of values that have a greater impact on the implementation cost of the NN, such as those involved in a greater number of multiply-add operations.
It should be appreciated that the implementation cost of every layer of the NN need not be determined in this manner in order to be included in the implementation metric. For example, the implementation cost of the last layer of the NN need not depend on a first contribution representing the implementation cost of the output from that layer, and/or the implementation cost of the first layer of the NN need not depend on a second contribution representing the implementation cost of the output from a layer preceding that layer. Alternatively or additionally, the implementation metric may include first and second contributions only for those layers of the NN that receive weight data and/or biases as inputs (e.g., convolution and/or fully connected layers). That is, the plurality of layers may comprise a plurality of convolution and/or fully connected layers. In other words, the implementation metric may not include contributions from layers that do not receive weight data and/or biases as input (e.g., an activation layer, a normalization layer, or a pooling layer).
In an example, the layer preceding the layer for which the implementation cost is determined may be: the layer immediately preceding that layer (e.g., the layer outputting the data that is the input activation data of the layer for which the implementation cost is determined); the preceding layer in the NN that also receives weight data and/or biases as input (e.g., the preceding convolution layer and/or fully connected layer of the NN); or the preceding layer in the NN that is of the same type as the layer for which the implementation cost is determined (e.g., the previous convolution layer in the NN if the layer for which the implementation cost is determined is a convolution layer). In other words, the layer for which the implementation cost is determined may be separated from the layer preceding it by other types of layers (e.g., an activation layer, a normalization layer, or a pooling layer) and/or by intermediate operations (such as a summation block between the layer and the layer preceding it). In particular, the layer for which the implementation cost is determined may be separated from the layer preceding it by other types of layers that do not change the number of data channels in the input activation data received by the layer, such that the input activation data of the layer for which the implementation cost is determined and the output data of the layer preceding it are arranged in the same number of data channels. Put another way, the layer for which the implementation cost is determined may be separated from the layer preceding it by other types of layers that process the data channels independently (e.g., that do not cause a "mixing" of data values between the input data channels and the output data channels).
In the following, nine specific examples are provided, wherein the implementation metrics depend on the first contribution and the second contribution of each of the plurality of layers as described herein. It is to be understood that these specific implementations are provided by way of example only and that the principles described herein may be implemented differently.
Example 1
In example 1, the first contribution is formed from an implementation cost of one or more output channels of the weight data input to the layer, and the second contribution is formed from an implementation cost of one or more input channels of the activation data input to the layer. As described herein, the number of output channels in the weight data of a layer corresponds to (e.g., is equal to) the number of channels (e.g., data channels) in the output data of the layer. Thus, the implementation cost of one or more output channels of weight data input to a layer may be considered to represent the implementation cost of output from that layer. As described herein, the activation data input to a layer is output data from a layer preceding the layer (or derived directly from the output data, e.g., in the case of an intermediate operation such as a summation block between layers). Thus, the implementation cost of one or more input channels of activation data input to a layer may be considered to represent the implementation cost of an output of a layer preceding the layer.
In example 1, in block 502 of FIG. 5, each of the one or more sets of values transformed by the one or more quantization blocks is a channel of values input to the layer. Each quantization parameter of the one or more quantization parameters includes a respective bit width, and the one or more channels of values are transformed according to the one or more quantization parameters. The one or more quantization blocks are arranged to transform each of the one or more input channels i of the activation data input to the layer according to a respective bit width b^x_i (where the bit widths b^x_i can be expressed as a vector with I elements), and to transform each of the one or more output channels j of the weight data input to the layer according to a respective bit width b^w_j (where the bit widths b^w_j can be expressed as a vector with O elements). More specifically, the input activation data x and the input weight data w may be transformed according to equations (6) and (7), wherein the respective bit widths b^x_i and exponents e^x_i of the activation data are encoded with vectors each having I elements, and the respective bit widths b^w_j and exponents e^w_j of the weight data are encoded with vectors each having O elements. That is, b^x_i and e^x_i quantize each input channel of the activation data x using a separate pair of quantization parameters, and b^w_j and e^w_j quantize each output channel of the weight data w using a separate pair of quantization parameters. Examples of a suitable quantization function q are described below (e.g., with reference to equations (37A), (37B), or (37C)).

x′_i = q(x_i, b^x_i, e^x_i)    (6)
w′_j = q(w_j, b^w_j, e^w_j)    (7)
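Purely for illustration, the sketch below shows per-channel quantization in the spirit of equations (6) and (7). The quantization function used here is a simple stand-in (round to a grid defined by the exponent and clamp to the range representable with the given number of signed bits) and is not intended to reproduce equations (37A) to (37C), which are described later; the array shapes and names are assumptions.

import numpy as np

def q(values, bits, exponent):
    # Stand-in quantization (assumption, not equations (37A)-(37C)): round to
    # the grid 2**exponent and clamp to the range of a 'bits'-bit signed value.
    step = 2.0 ** exponent
    lo = -(2 ** (bits - 1)) * step
    hi = (2 ** (bits - 1) - 1) * step
    return np.clip(np.round(values / step) * step, lo, hi)

def quantize_layer_inputs(x, w, bx, ex, bw, ew):
    # x: activation data with shape (I, H, W), one (bit width, exponent) pair per input channel i.
    # w: weight data with shape (O, I, Hw, Ww), one (bit width, exponent) pair per output channel j.
    x_q = np.stack([q(x[i], bx[i], ex[i]) for i in range(x.shape[0])])
    w_q = np.stack([q(w[j], bw[j], ew[j]) for j in range(w.shape[0])])
    return x_q, w_q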
In example 1, in block 504 of FIG. 5, the implementation cost s_l of the layer may be defined according to equation (8), which is a differentiable function. In equation (8), the first contribution depends on the number of input channels i of the activation data that are transformed according to a bit width b^x_i greater than zero, multiplied by the sum of the bit widths b^w_j according to which each of the one or more output channels j of the weight data is transformed. The second contribution depends on the number of output channels j of the weight data that are transformed according to a bit width b^w_j greater than zero, multiplied by the sum of the bit widths b^x_i according to which each of the one or more input channels i of the activation data is transformed. In equation (8), the terms max(0, b^w_j) and max(0, b^x_i) can be used to ensure that the bit widths b^w_j and b^x_i, respectively, are not adjusted to below zero in a subsequent step of the method, as described in further detail below. The implementation cost s_l of the layer is determined from the sum of the first contribution and the second contribution, multiplied by the product of the height H_w and width W_w dimensions of the weight data input to the layer. The implementation metric for the NN may be formed by summing the implementation costs of the plurality of layers of the NN as determined according to equation (8).

s_l = H_w · W_w · [ (∑_i 1{b^x_i > 0}) · ∑_j max(0, b^w_j) + (∑_j 1{b^w_j > 0}) · ∑_i max(0, b^x_i) ]    (8)

where the notation 1{·} takes the value 1 when its argument holds and 0 otherwise.
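A small sketch of how the layer implementation cost described above might be computed from per-channel bit widths is given below; it follows the word description of the first and second contributions, and the use of plain Python lists and the function name are assumptions.

def layer_cost_example1(bx, bw, Hw, Ww):
    # bx: bit widths of the activation input channels (I elements)
    # bw: bit widths of the weight output channels (O elements)
    first = sum(1 for b in bx if b > 0) * sum(max(0.0, b) for b in bw)
    second = sum(1 for b in bw if b > 0) * sum(max(0.0, b) for b in bx)
    return Hw * Ww * (first + second)

# The implementation metric sm would then be the sum of layer_cost_example1
# over the plurality of layers of the NN.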
Example 2
In example 2, as in example 1, the first contribution is formed from the implementation cost of the one or more output channels of the weight data input to the layer, and the second contribution is formed from the implementation cost of the one or more input channels of the activation data input to the layer.
In example 2, the transformation of the set of input values by one or more of the quantization blocks in block 502 of fig. 5 is the same as the transformation described with reference to example 1. In other words, the input activation data x and the input weight data w may be transformed according to equations (6) and (7), as described herein with reference to example 1.
In example 2, in block 504 of FIG. 5, the implementation cost s_l of the layer may be defined according to equation (9), which is a differentiable function. In equation (9), the first contribution depends on the sum of the bit widths b^w_j according to which each of the one or more output channels j of the weight data is transformed. The second contribution depends on the sum of the bit widths b^x_i according to which each of the one or more input channels i of the activation data is transformed. In equation (9), the terms max(0, b^w_j) and max(0, b^x_i) can be used to ensure that the bit widths b^w_j and b^x_i, respectively, are not adjusted to below zero in a subsequent step of the method, as described in further detail below. The implementation cost s_l of the layer is determined from the product of the first contribution and the second contribution, multiplied by the product of the height H_w and width W_w dimensions of the weight data input to the layer. The implementation metric for the NN may be formed by summing the implementation costs of the plurality of layers of the NN as determined according to equation (9).

s_l = H_w · W_w · ( ∑_j max(0, b^w_j) ) · ( ∑_i max(0, b^x_i) )    (9)
Example 3
In example 3, the first contribution is formed from an implementation cost of one or more output channels of the weight data input to the layer, and the second contribution is formed from an implementation cost of one or more input channels of the weight data input to the layer. As described herein, the number of output channels in the weight data of a layer corresponds to (e.g., is equal to) the number of channels (e.g., data channels) in the output data of the layer. Thus, the implementation cost of one or more output channels of weight data input to a layer may be considered to represent the implementation cost of output from that layer. As described herein, the number of input channels in the weight data of a layer corresponds to (e.g., is equal to) the number of input channels in the activation data with which the weight data is to be combined. Furthermore, as described herein, the activation data input to a layer is output data from a layer preceding the layer (or derived directly from the output data, e.g., in the case of an intermediate operation such as a summation block between layers). Thus, the implementation cost of one or more input channels of weight data input to a layer may be considered to represent the implementation cost of the output of the layer preceding the layer.
In example 3, in block 502 of FIG. 5, each of the one or more sets of values transformed by the one or more quantization blocks is a channel of values input to the layer. Each quantization parameter of the one or more quantization parameters includes a respective bit width, and the one or more channels of values are transformed according to the one or more quantization parameters. In example 3, for the purpose of step 504 of FIG. 5, a respective bit width b_i is determined for each of the one or more input channels i of the weight data input to the layer (where the bit widths b_i can be expressed as a vector with I elements), and a respective bit width b_j is determined for each of the one or more output channels j of the weight data input to the layer (where the bit widths b_j can be expressed as a vector with O elements). More specifically, the input weight data w may be transformed according to equation (10A), (10B) or (10C), wherein the bit widths b_i of the input channels of the weight data are encoded with a vector of I elements and the bit widths b_j of the output channels of the weight data are encoded with a vector of O elements. In equation (10A), the exponents e_ij for the input and output channels of the weight data are encoded in a two-dimensional matrix. In other words, b_i and e_ij quantize each input channel of the weight data w using a separate pair of quantization parameters, and b_j and e_ij quantize each output channel of the weight data w using a separate pair of quantization parameters. Examples of a suitable quantization function q are described below (e.g., with reference to equations (37A), (37B), or (37C)).
w′ = q(w, min(b_i, b_j), e_ij)    (10A)
w′ = q(q(w, b_i, e_i), b_j, e_j)    (10B)
w′ = q(w, min(b_i, b_j), e_j)    (10C)
It should be understood that each weight value belongs to one input channel and one output channel of the weight data, as described herein. This means that, for each weight value input to the layer, a first bit width b_i and a second bit width b_j are determined separately. For the purposes of block 502 of FIG. 5, as shown in equation (10A), each weight value input to the layer may be transformed according to its respective first or second bit width, and the exponent associated with that bit width (e.g., e_i if b_i is selected, or e_j if b_j is selected). Optionally, the smaller (e.g., smallest) of its respective first and second bit widths may be selected; this is represented in equation (10A) by the term min(b_i, b_j). Alternatively, as shown in equation (10B), each weight value input to the layer may be transformed according to both its respective first and second bit widths, and the exponents associated with those bit widths, e.g., in two passes. That is, the input weight data w may alternatively be transformed according to equation (10B). Also alternatively, as shown in equation (10C), each weight value input to the layer may be transformed according to its respective first or second bit width, and the exponent e_j associated with the output channel j comprising that weight value. Again, the smaller (e.g., smallest) of its respective first and second bit widths may be selected; this is represented in equation (10C) by the term min(b_i, b_j). The exponents e_j of the output channels of the weight data may be encoded with a vector having O elements. Storing the vector e_j may consume less memory than the two-dimensional exponent matrix e_ij described with reference to equation (10A). Using the exponent e_j associated with output channel j for every transformation, regardless of which of the first and second bit widths is selected (as shown in equation (10C)), may also be more robust (e.g., less likely to cause training errors) than selecting between the exponent associated with input channel i and the exponent associated with output channel j depending on which of the first and second bit widths is selected (as shown in equation (10A)). This is because an exponent is less likely to "jump out of range" during training (e.g., become too large or too small for the quantization to give a reasonable output due to a "large jump") if that exponent is used to quantize more values during training (i.e., since e_j is always used instead of e_ij).
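For illustration, the sketch below transforms each weight value according to the smaller of its input-channel and output-channel bit widths and the exponent of its output channel, in the style of equation (10C). The quantization function q is passed in and is assumed to behave like the stand-in shown in the earlier sketch; the array layout is an assumption.

import numpy as np

def quantize_weights_10c(w, b_in, b_out, e_out, q):
    # w: weight data with shape (O, I, Hw, Ww); b_in has I elements,
    # b_out and e_out have O elements; q is a quantization function
    # q(values, bits, exponent).
    w_q = np.empty_like(w, dtype=float)
    for j in range(w.shape[0]):          # output channels
        for i in range(w.shape[1]):      # input channels
            bits = min(b_in[i], b_out[j])      # min(b_i, b_j) as in equation (10C)
            w_q[j, i] = q(w[j, i], bits, e_out[j])
    return w_q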
In example 3, the implementation cost s_l of the layer may be defined according to equation (11), which is a differentiable function. In equation (11), the first contribution depends on the sum of the bit widths b_j determined for each of the one or more output channels j of the weight data. The second contribution depends on the sum of the bit widths b_i determined for each of the one or more input channels i of the weight data. In equation (11), the terms max(0, b_i) and max(0, b_j) can be used to ensure that the bit widths b_i and b_j, respectively, are not adjusted to below zero in a subsequent step of the method, as described in further detail below. The implementation cost s_l of the layer is determined from the product of the first contribution and the second contribution, multiplied by the product of the height H_w and width W_w dimensions of the weight data input to the layer. The implementation metric for the NN may be formed by summing the implementation costs of the plurality of layers of the NN as determined according to equation (11).

s_l = H_w · W_w · ( ∑_j max(0, b_j) ) · ( ∑_i max(0, b_i) )    (11)
Example 4
In example 4, the first contribution is formed from an implementation cost of the one or more output channels of the weight data input to the layer and an implementation cost of the one or more biases input to the layer. The second contribution is formed from the implementation cost of the one or more output channels of the weight data input to the preceding layer and the implementation cost of the one or more biases input to the preceding layer. As described herein, the number of output channels in the weight data of a layer corresponds to (e.g., is equal to) the number of channels (e.g., data channels) in the output data of the layer. Furthermore, in a layer using biases, each of the output channels of the weight data is associated with a respective bias. Thus, the implementation cost of the one or more output channels of the weight data input to a layer and the implementation cost of the one or more biases input to that layer may be considered to represent the implementation cost of the output from that layer. For the same reason, the implementation cost of the one or more output channels of the weight data input to a preceding layer and the implementation cost of the one or more biases input to the preceding layer may be considered to represent the implementation cost of the output from the preceding layer.
In example 4, in block 502 of FIG. 5, each of the one or more quantization parameters includes a respective bit width. The one or more sets of values transformed by the one or more quantization blocks include the one or more output channels and associated biases of the weight data input to the layer, and the one or more output channels and associated biases of the weight data input to the preceding layer. The one or more quantization blocks are configured to transform each of the one or more output channels j of the weight data input to the layer according to a respective bit width b^w_j (where the bit widths b^w_j can be represented as a vector with O elements), to transform each of the one or more biases j input to the layer according to a respective bit width b^β_j (where the bit widths b^β_j can be represented as a vector with O elements), to transform each of the one or more output channels i of the weight data input to the preceding layer according to a respective bit width b^w_i (where the bit widths b^w_i can be represented as a vector with I elements), and to transform each of the one or more biases i input to the preceding layer according to a respective bit width b^β_i (where the bit widths b^β_i can be represented as a vector with I elements). Optionally, the same bit width may be used to transform an output channel of the weight data and the associated bias of that output channel. That is, b^w_j may be equal to b^β_j, and/or b^w_i may be equal to b^β_i. More specifically, the weight data w_j input to the layer may be transformed according to equation (12), the biases β_j input to the layer may be transformed according to equation (13), the weight data w_i input to the preceding layer may be transformed according to equation (14), and the biases β_i input to the preceding layer may be transformed according to equation (15). In equations (12) to (15), e^w_j, e^β_j, e^w_i and e^β_i are the exponents used to transform w_j, β_j, w_i and β_i, respectively. The exponents e^w_j and e^β_j may each be encoded with a vector having O elements, and the exponents e^w_i and e^β_i may each be encoded with a vector having I elements. Examples of a suitable quantization function q are described below (e.g., with reference to equations (37A), (37B), or (37C)).
In example 4, the implementation cost s_l of the layer may be defined according to equation (16), which is a differentiable function. In equation (16), the first contribution depends on the number of instances in which one or both of an output channel of the weight data input to the preceding layer and its associated bias input to the preceding layer are transformed according to a bit width greater than zero, multiplied by a weighted sum of the bit widths according to which each of the one or more output channels of the weight data input to the layer is transformed and the bit widths according to which each of the one or more associated biases input to the layer is transformed. In equation (16), that weighted sum is weighted by the term α. The second contribution depends on the number of instances in which one or both of an output channel of the weight data input to the layer and its associated bias input to the layer are transformed according to a bit width greater than zero, multiplied by a weighted sum of the bit widths according to which each of the one or more output channels of the weight data input to the preceding layer is transformed and the bit widths according to which each of the one or more associated biases input to the preceding layer is transformed. In equation (16), that weighted sum is also weighted by the term α. In equation (16), terms of the form max(0, b) can be used to ensure that the bit widths are not adjusted to below zero in a subsequent step of the method, as described in further detail below. The implementation cost s_l of the layer is determined from the sum of the first contribution and the second contribution, multiplied by the product of the height H_w and width W_w dimensions of the weight data input to the layer. The implementation metric for the NN may be formed by summing the implementation costs of the plurality of layers of the NN as determined according to equation (16).
Example 5
In example 5, the first contribution is formed from the implementation cost of one or more output channels of weight data input to a layer, and the second contribution is formed from the implementation cost of one or more output channels of weight data input to a preceding layer. As described herein, the number of output channels in the weight data of a layer corresponds to (e.g., is equal to) the number of channels (e.g., data channels) in the output data of the layer. Thus, the implementation cost of one or more output channels of weight data input to a layer may be considered to represent the implementation cost of output from that layer. Example 5 may be used in preference to example 4 in response to determining that the layer and the previous layer did not receive the bias.
In example 5, in block 502 of FIG. 5, each of the one or more quantization parameters includes a respective bit width. The one or more sets of values transformed by the one or more quantization blocks include the one or more output channels of the weight data input to the layer and the one or more output channels of the weight data input to the preceding layer. The one or more quantization blocks are configured to transform each of the one or more output channels j of the weight data input to the layer according to a respective bit width b_j (where the bit widths b_j can be represented as a vector with O elements), and to transform each of the one or more output channels i of the weight data input to the preceding layer according to a respective bit width b′_i (where the bit widths b′_i can be represented as a vector with I elements). More specifically, the weight data w_j input to the layer may be transformed according to equation (17), and the weight data w′_i input to the preceding layer may be transformed according to equation (18). In equations (17) and (18), e_j and e′_i are the exponents used to transform w_j and w′_i, respectively. The exponents e_j may be encoded with a vector having O elements, and the exponents e′_i may be encoded with a vector having I elements. Examples of a suitable quantization function q are described below (e.g., with reference to equations (37A), (37B), or (37C)).
In example 5, the implementation cost s_l of the layer may be defined according to equation (19), which is a differentiable function. In equation (19), the first contribution depends on the number of instances in which an output channel of the weight data input to the preceding layer is transformed according to a bit width greater than zero, multiplied by the sum of the bit widths according to which each of the one or more output channels of the weight data input to the layer is transformed. The second contribution depends on the number of instances in which an output channel of the weight data input to the layer is transformed according to a bit width greater than zero, multiplied by the sum of the bit widths according to which each of the one or more output channels of the weight data input to the preceding layer is transformed. In equation (19), the terms max(0, b_j) and max(0, b′_i) can be used to ensure that the bit widths b_j and b′_i, respectively, are not adjusted to below zero in a subsequent step of the method, as described in further detail below. The implementation cost s_l of the layer is determined from the sum of the first contribution and the second contribution, multiplied by the product of the height H_w and width W_w dimensions of the weight data input to the layer. The implementation metric for the NN may be formed by summing the implementation costs of the plurality of layers of the NN as determined according to equation (19).

s_l = H_w · W_w · [ (∑_i 1{b′_i > 0}) · ∑_j max(0, b_j) + (∑_j 1{b_j > 0}) · ∑_i max(0, b′_i) ]    (19)
Example 6
In example 6, the first contribution and the second contribution are the same as the first contribution and the second contribution described for example 5. In addition, in example 6, relative to example 5, the implementation cost s_l of the layer further depends on an additional contribution representing the implementation cost of the biases (β′_i) input to the preceding layer. Example 6 may be used in preference to example 5 in response to determining that the preceding layer receives a bias.
In example 6, the transformation of the sets of input values by the one or more quantization blocks in block 502 of FIG. 5 is the same as the transformation described with reference to example 5. In other words, the weight data w_j input to the layer and the weight data w′_i input to the preceding layer may be transformed according to equations (17) and (18), as described herein with reference to example 5.
In example 6, the implementation cost s_l of the layer may be defined according to equation (20), which is a differentiable function. In equation (20), the first contribution and the second contribution are the same as the first contribution and the second contribution shown in equation (19). In equation (20), the sum of the first contribution and the second contribution is multiplied by the product of the height H_w dimension and the width W_w dimension of the weight data input to the layer. In equation (20), the additional contribution depends on the number of instances in which an output channel of the weight data input to the preceding layer is transformed according to a bit width of zero or less, multiplied by the number of instances in which an output channel of the weight data input to the layer is transformed according to a bit width greater than zero, multiplied by the absolute value of the bias (β′_i). It will be appreciated that the biases (β′_i) input to the preceding layer may or may not be quantized. As shown in equation (20), this additional contribution is optionally multiplied by the product of the height H_w dimension and the width W_w dimension of the weight data input to the layer. Optionally, this additional contribution is weighted by the term α, as shown in equation (20). The implementation metric for the NN may be formed by summing the implementation costs of the plurality of layers of the NN as determined according to equation (20).
Example 7
In many NN structures, the activation input of each layer is derived from the activation output of only one preceding layer. However, in some NN structures, the activation input of a layer may be derived from the activation outputs of more than one preceding layer. An example of such an NN is shown in FIG. 10C, which is a schematic diagram showing an NN comprising a residual layer. In FIG. 10C, a summation operation 1020 receives input from both layer E 1012 and layer F 1016. The output of the summation operation 1020 is input to layer G 1018. That is, the activation input of layer G 1018 is derived from the activation outputs of two preceding layers, layer E 1012 and layer F 1016. Example 7 relates to determining the implementation cost of a layer that receives activation input data derived from the activation outputs of more than one preceding layer.
In example 7, the implementation metrics of a layer (e.g., layer G1018) depend on: a first contribution representing implementation costs of output from the layer (e.g., layer G1018); a second contribution representing implementation costs of the output of the first layer (e.g., layer E1012) preceding the layer; and a third contribution representing implementation costs of the output of a second layer (e.g., layer F1016) preceding the layer. The first contribution may be formed according to the same factors as the first contribution described with reference to any of examples 1 to 6. The second contribution may be formed according to the same factors as the second contribution described with reference to any of examples 1 to 6. The third contribution may be formed according to the same factors as the second contribution described with reference to any of examples 1 to 6. In example 7, the implementation metrics for the layers may further depend on additional contributions representing implementation costs of the bias input to the first previous layer and the second previous layer, according to principles described herein with reference to example 6.
To give one specific example in which the contributions to the implementation metric are based on the same factors as described with reference to example 5, in example 7 the implementation cost s_l of the layer may be defined according to equation (21), which is a differentiable function. In equation (21), superscripts E, F and G are used to refer to terms associated with the first preceding layer (e.g., layer E 1012), the second preceding layer (e.g., layer F 1016), and the layer for which the implementation cost is determined (e.g., layer G 1018). The one or more quantization blocks are configured to transform each of the one or more output channels j of the weight data input to the layer (e.g., layer G) according to a respective bit width b^G_j (where the bit widths b^G_j can be represented as a vector with O elements), to transform each of the one or more output channels i of the weight data input to the first preceding layer (e.g., layer E) according to a respective bit width b^E_i (where the bit widths b^E_i can be represented as a vector with I elements), and to transform each of the one or more output channels i of the weight data input to the second preceding layer (e.g., layer F) according to a respective bit width b^F_i (where the bit widths b^F_i can be represented as a vector with I elements). Referring to equations (17) and (18) as described herein with reference to example 5, one of ordinary skill in the art will understand how these transforms may be performed. The implementation cost s_l of the layer is determined from the sum of the first, second and third contributions, multiplied by the product of the height H_w dimension and the width W_w dimension of the weight data input to the layer for which the implementation cost is determined (e.g., layer G 1018). The implementation metric for the NN may be formed by summing the implementation costs of the plurality of layers of the NN as determined according to equation (21).
Example 8
In many NN structures, the output of each layer is input to only one other layer (or is output from the NN). However, in some NN structures, the output of a layer may be input to more than one subsequent layer. An example of such an NN is shown in FIG. 10D, which is a schematic diagram showing an NN comprising a residual layer. In FIG. 10D, the output of layer T 1032 is input to both layer U 1038 and layer V 1036.
Referring to FIG. 10D, it may not be desirable to determine an implementation cost for, for example, layer V 1036 that depends on a second contribution representing the implementation cost of the output from layer T 1032. This is because, in a later stage of the method (described in further detail below), one or more quantization parameters are adjusted based at least in part on this second contribution, and sets of values are optionally removed from the model of the NN in dependence on the adjusted quantization parameters. Adjusting the quantization parameters used to transform the weight data input to layer T 1032, adjusting the quantization parameters used to transform the activation data output from layer T 1032, or even removing sets of values from the input/output of layer T 1032 may affect the calculations performed at layer U 1038.
Example 8 may be used in order to prevent implementation metrics formed for layer V1036 from potentially affecting the calculations performed at layer U1038. Referring to fig. 10E, in example 8, a new layer X1034 is added to NN between layer T1032 and layer V1036. Layer X1034 may be configured to receive the activation data output by layer T1032 and output the activation data to layer V1036. That is, layer X1034 does not need to perform any computation on the activation data output by layer T1032. In other words, layer X1034 does not receive any weight data or bias. One or more quantization blocks may be added to the quantization model of the NN to transform the set of values input to the new layer X according to the corresponding quantization parameters. The implementation metrics for layer V1036 may then be formed using layer X1034 (i.e., instead of layer T1032) as the previous layer. The implementation metrics may be formed using the principles described herein with reference to any of examples 1-3. Since the output of layer X1034 is provided only to layer V1036 (i.e., and not to layer U1038), any subsequent adjustments to quantization parameters used to transform the activation data output from layer X1034 or any removal of the value sets from the activation data output by layer X1034 do not affect the calculations performed at layer U1038.
Although not shown in fig. 10E, the same steps may be performed in order to form the implementation metrics of layer U1038. That is, a new layer may be added between layer T1032 and layer U1038. This new layer may be considered a preceding layer for the purpose of calculating the implementation cost of layer U1038.
Example 9
In some NN structures, the methods described herein with reference to examples 7 and 8 may be combined. An example of such an NN structure is shown in FIG. 10A, which is a schematic diagram showing an NN comprising a residual layer. In FIG. 10A, the output of layer A 1002 is input to layer B 1004 and to a summation operation 1010. The output of layer B 1004 is input to layer C 1006. The summation operation 1010 receives inputs from both layer A 1002 and layer C 1006. The output of the summation operation 1010 is input to layer D 1008.
This means that the activation input of layer D 1008 is derived from the activation outputs of two preceding layers, layer A 1002 and layer C 1006. However, the output of layer A 1002 is also input to layer B 1004. Thus, performing the method described herein using example 7 to form an implementation metric for layer D 1008 that depends on a contribution representing the implementation cost of the output from layer A 1002 may affect the computations performed at layer B 1004.
Thus, in example 9, a new layer (not shown in FIG. 10A) may be added between layer A 1002 and the summation operation 1010 according to the principles described with reference to example 8. Subsequently, according to the principles described with reference to example 7, an implementation metric for layer D 1008 may be formed, the implementation metric depending on: a first contribution representing the implementation cost of the output from the layer (e.g., layer D 1008); a second contribution representing the implementation cost of the output from a first layer preceding the layer (e.g., the newly added layer, not shown in FIG. 10A); and a third contribution representing the implementation cost of the output from a second layer preceding the layer (e.g., layer C 1006).
It should be appreciated that in implementing the metrics, the implementation costs of different ones of the multiple layers need not be calculated in the same manner. For example, the implementation cost of a first layer of the plurality of layers may be calculated according to example 1, the implementation cost of a second layer of the plurality of layers may be calculated according to example 4, and so on. Returning to fig. 5, once the cost metric cm for the quantization parameter set has been determined, the method 500 proceeds to block 506.
At block 506, the derivative of the cost metric cm is back-propagated to the one or more quantization parameters to generate a gradient of the cost metric relative to each of the one or more quantization parameters.
As known to those skilled in the art, the derivative of a function at a particular point is the rate or speed at which the function changes at that point. The cost metric is differentiable, so its derivative can be back-propagated to the parameters of the NN to generate the derivative or gradient of the cost metric with respect to those parameters. As described above, back-propagation (which may also be referred to as backward propagation of errors) is a method used in training an NN to calculate the gradient of an error metric with respect to the weights of the NN. Back-propagation can also be used to determine the derivative of the cost metric cm with respect to the quantization parameters (e.g., the bit widths b and exponents exp). The back-propagation of the derivative of the cost metric cm to the quantization parameters may be performed, for example, using any suitable tool for training an NN, such as, but not limited to, TensorFlow™ or PyTorch™.
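For illustration only, a minimal sketch of this back-propagation step using PyTorch (one of the tools mentioned above) is given below. The toy error metric, the equation-(8)-style implementation metric, the cost metric weights and all names are assumptions made for the example, not details taken from this disclosure.

import torch

# Hypothetical per-channel bit widths for one layer, held as trainable tensors
# so that autograd can produce the gradient of the cost metric for each of them.
b_w = torch.full((16,), 8.0, requires_grad=True)   # weight output channels
b_x = torch.full((4,), 8.0, requires_grad=True)    # activation input channels

# Toy error metric standing in for em (in practice em comes from running the
# quantized model of the NN on the training data).
em = (8.0 - b_w).abs().mean() + (8.0 - b_x).abs().mean()

# Implementation metric in the style of equation (8), with H_w = W_w = 3.
Hw, Ww = 3, 3
sm = Hw * Ww * ((b_x > 0).float().sum() * torch.clamp(b_w, min=0).sum()
                + (b_w > 0).float().sum() * torch.clamp(b_x, min=0).sum())

cm = 1.0 * em + 1e-4 * sm   # equation (3) with illustrative weights
cm.backward()               # back-propagate the derivative of the cost metric

print(b_w.grad)             # gradient of the cost metric for each bit width
print(b_x.grad)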
The gradient of the cost metric with respect to a particular quantization parameter indicates in which direction the quantization parameter should be moved to reduce the cost metric cm. In particular, a positive gradient indicates that the cost metric cm can be reduced by decreasing the quantization parameter, and a negative gradient indicates that the cost metric cm can be reduced by increasing the quantization parameter. For example, FIG. 8 shows a graph 800 of an example cost metric cm with respect to a particular bit width b_i. Graph 800 shows that the lowest cost metric is achieved when the bit width b_i has a first value x_1. As can be seen from graph 800, when the bit width b_i is less than x_1 (e.g., when it has a second value x_2), the gradient 802 is negative and the cost metric cm can be reduced by increasing the bit width b_i. Similarly, when the bit width b_i is greater than x_1 (e.g., when it has a third value x_3), the gradient 804 is positive and the cost metric cm can be reduced by decreasing the bit width b_i. The gradient of the cost metric cm with respect to a particular quantization parameter may be referred to herein as the gradient for that quantization parameter.
Once the derivative of the cost metric is back-propagated to the one or more quantization parameters to generate a gradient of the cost metric for each of those quantization parameters, the method 500 proceeds to block 508.
At block 508, one or more of the quantization parameters (e.g., the bit widths b_i and exponents exp_i) are adjusted based on the gradients. The goal of method 500 is to identify the set of quantization parameters that will produce the 'best' cost metric. What constitutes the 'best' cost metric will depend on how the cost metric is calculated. For example, in some cases, the lower the cost metric, the better the cost metric, while in other cases, the higher the cost metric, the better the cost metric.
As described above, the sign of the gradient for the quantization parameter indicates whether the cost metric is to be reduced by increasing or decreasing the quantization parameter. In particular, if the gradient for the quantization parameter is positive, a decrease in the quantization parameter will decrease the cost metric; and if the gradient for the quantization parameter is negative, an increase in the quantization parameter will decrease the cost metric. Thus, adjusting the quantization parameter may include increasing or decreasing the quantization parameter according to the sign of the gradient in order to increase or decrease the cost metric (depending on whether it is desired to increase or decrease the cost metric). For example, if a lower cost metric is desired and the gradient to the quantization parameter is negative, the quantization parameter may be increased in an attempt to reduce the cost metric. Similarly, if a lower cost metric is desired and the gradient to the quantization parameter is positive, the quantization parameter may be reduced in an attempt to reduce the cost metric.
In some cases, the amount by which the quantization parameter is increased or decreased may be based on the magnitude of the gradient. In particular, in some cases, the quantization parameter may increase or decrease by the magnitude of the gradient. For example, if the magnitude of the gradient is 0.4, the quantization parameter may be increased or decreased by 0.4. In other cases, the quantization parameter may be increased or decreased by a factor of the magnitude of the gradient.
More generally, when the goal is to reduce the cost metric cm, the adjusted quantization parameter (qp_adj) may be generated by subtracting the gradient (g_qp) for the quantization parameter from the quantization parameter (qp), as shown in equation (22). In some cases, the rate at which different quantization parameters are adjusted can be controlled by multiplying the gradient by a learning rate l, as shown in equation (23). The higher the learning rate, the faster the quantization parameter will be adjusted. The learning rate may be different for different quantization parameters.
qp_adj = qp − g_qp    (22)
qp_adj = qp − l·g_qp    (23)
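A minimal sketch of this update step is given below for illustration; the function name and the learning rate value are assumptions.

def adjust(qp: float, grad: float, learning_rate: float = 0.1) -> float:
    # qp_adj = qp - l * g_qp, as in equations (22) and (23)
    # (equation (22) corresponds to a learning rate of 1).
    return qp - learning_rate * grad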
In general, the hardware used to implement the NN may support only integer bit widths b_i and exponents exp_i, and in some cases may only support a specific set of integer values for the bit widths and/or exponents. For example, the hardware logic for implementing the NN may support only bit widths of 4, 5, 6, 7, 8, 10, 12 and 16. Thus, before the quantization parameters are used to implement the NN in hardware, each quantization parameter is rounded to the nearest integer, or to the nearest integer in the supported set of integers. For example, if the optimum bit width determined by the method is 4.4, then the bit width may be quantized (e.g., rounded) to the nearest (RTN) integer (in this case 4) before it is used to implement the NN in hardware.
Thus, in some cases, to account for the quantization (e.g., rounding) of the quantization parameters that occurs when the NN is implemented in hardware, when identifying the 'best' quantization parameters the increased/decreased quantization parameter may be rounded to the nearest integer, or to the nearest integer in the supported set of integers, before the increased/decreased quantization parameter is used in the next iteration, as shown in equation (24), where RTN is the round-to-nearest-integer function and q̂p_adj is the increased/decreased quantization parameter after it has been rounded to the nearest integer. For example, after increasing or decreasing a bit width according to the gradient associated with that bit width, the increased or decreased bit width may be rounded to the nearest integer, or to the nearest integer in the set {4,5,6,7,8,10,12,16}, before being used in the next iteration.

q̂p_adj = RTN(qp_adj)    (24)
In other cases, rather than actually quantizing (e.g., rounding) the quantization parameter after it has been increased/decreased, the transformation represented by the quantization (e.g., rounding) of the quantization parameter may merely be simulated. For example, in some cases, instead of rounding the increased/decreased quantization parameter to the nearest integer, or the nearest integer in the set, quantization may be simulated by performing random quantization on the increased/decreased quantization parameter. Performing random quantization on the increased/decreased quantization parameter may comprise adding a random value u between −a and +a to the increased/decreased quantization parameter to generate a randomized quantization parameter, where a is half the distance between: (i) the integer in the set closest to the increased/decreased quantization parameter that is less than the increased/decreased quantization parameter, and (ii) the integer in the set closest to the increased/decreased quantization parameter that is greater than the increased/decreased quantization parameter; and then setting the randomized quantization parameter to the nearer of those two nearest integers. When random quantization is used to simulate rounding to the nearest integer, a is equal to 0.5, and the random quantization may be implemented as shown in equation (25), where RTN is the round-to-nearest-integer function and q̂p_adj is the increased/decreased quantization parameter after random quantization.

q̂p_adj = RTN(qp_adj + u), where u is a random value drawn uniformly from [−a, +a]    (25)
For example, if in a hardware implementation the bit width may be any integer in the set {4,5,6,7,8,10,12,16}, and the bit width b_i is increased/decreased to 4.4, a random value between −0.5 and +0.5 is added to the increased/decreased bit width b_i, because the distance between the nearest lower and higher integers in the set (4 and 5) is 1; the randomized bit width is then set to the nearer of those two nearest integers (4 and 5). Similarly, if the bit width b_i is increased/decreased to 10.4, a random value between −1 and +1 is added to the increased/decreased bit width b_i, because the distance between the nearest lower and higher integers in the set (10 and 12) is 2; the randomized bit width is then set to the nearer of those two nearest integers (10 and 12). In this way, the increased/decreased quantization parameter is rounded up or down to an integer with a probability proportional to its proximity to that integer. For example, 4.2 would be rounded to 4 with an 80% probability and to 5 with a 20% probability. Similarly, 7.9 would be rounded to 7 with a 10% probability and to 8 with a 90% probability. Tests have shown that in some cases quantization parameters can be identified more efficiently and effectively by adding random values to the increased/decreased quantization parameters and then rounding, rather than merely rounding the increased/decreased quantization parameters.
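For illustration, the sketch below simulates this random quantization for a non-uniform set of supported bit widths; the set, the function name and the handling of values outside the set's range are assumptions.

import bisect
import random

ALLOWED = [4, 5, 6, 7, 8, 10, 12, 16]   # example set of supported bit widths

def stochastic_round_to_set(qp: float, allowed=ALLOWED) -> int:
    # Find the nearest allowed integers below and above qp.
    hi_idx = min(max(bisect.bisect_left(allowed, qp), 1), len(allowed) - 1)
    lo, hi = allowed[hi_idx - 1], allowed[hi_idx]
    a = (hi - lo) / 2.0                    # half the distance between them
    r = qp + random.uniform(-a, a)         # add a random value in [-a, +a]
    return lo if abs(r - lo) <= abs(r - hi) else hi   # snap to the nearer integer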
In other cases, rather than rounding the increased/decreased quantization parameter to the nearest integer, or the nearest integer in the set, quantization of the quantization parameter may be simulated by performing uniform noise quantization on the increased/decreased quantization parameter. Performing uniform noise quantization on the increased/decreased quantization parameter may comprise adding a random value u between −a and +a to the increased/decreased quantization parameter, where a is half the distance between: (i) the integer in the set closest to the increased/decreased quantization parameter that is less than the increased/decreased quantization parameter, and (ii) the integer in the set closest to the increased/decreased quantization parameter that is greater than the increased/decreased quantization parameter. When uniform noise quantization is used to simulate rounding to the nearest integer, a is equal to 0.5, and the uniform noise quantization may be implemented as shown in equation (26), where q̂p_adj is the increased/decreased quantization parameter after uniform noise quantization. By merely adding a random value to the increased/decreased quantization parameter, the increased/decreased quantization parameter is distorted in a similar manner to rounding the increased/decreased quantization parameter.

q̂p_adj = qp_adj + u, where u is a random value drawn uniformly from [−a, +a]    (26)
In yet other cases, rather than rounding the increased/decreased quantization parameter to the nearest integer, or the nearest integer in the set, quantization of the quantization parameter may be simulated by performing gradient average quantization on the increased/decreased quantization parameter. Performing gradient average quantization may comprise taking the highest allowable integer that is less than or equal to the increased/decreased quantization parameter, and then adding a random value h between 0 and c, where c is the distance between: (i) the integer in the set closest to the increased/decreased quantization parameter that is less than the increased/decreased quantization parameter, and (ii) the integer in the set closest to the increased/decreased quantization parameter that is greater than the increased/decreased quantization parameter (or any operation mathematically equivalent thereto). When gradient average quantization is used to simulate rounding to the nearest integer, c is equal to 1, and the gradient average quantization may be implemented as shown in equation (27), where RTNI is the round-towards-negative-infinity function (which may also be referred to as the floor function) and q̂p_adj is the increased/decreased quantization parameter after gradient average quantization.

q̂p_adj = RTNI(qp_adj) + h, where h is a random value drawn uniformly from [0, c]    (27)
For example, if the bit width b_i may be any integer in the set {4,5,6,7,8,10,12,16} and a particular bit width b_i is increased/decreased to 4.4 according to its gradient, the highest integer in the set that is less than or equal to the increased/decreased quantization parameter (i.e. 4) is selected, and a uniform random value between 0 and 1 is added to that integer, because the distance between the nearest lower and higher integers in the set (4 and 5) is 1. Similarly, if the bit width b_i is increased/decreased to 10.4 according to its gradient, the highest integer in the set that is less than or equal to this value (i.e. 10) is selected, and a random value between 0 and 2 is added to that integer, because the distance between the nearest lower and higher integers in the set (10 and 12) is 2.
Tests have shown that the gradient average quantization method works well for problems in which the quantized parameters are largely independent, but less well when highly correlated parameters are being optimized.
In yet other cases, rather than rounding the increased/decreased quantization parameter to the nearest integer or the nearest integer in the set, quantization of the quantization parameter may be simulated by performing bimodal quantization, which is a combination of round-to-nearest quantization (e.g. equation (24)) and gradient average quantization (e.g. equation (27)). Specifically, in bimodal quantization, gradient average quantization is performed on the increased/decreased quantization parameter with probability p, and round-to-nearest quantization is performed on the increased/decreased quantization parameter otherwise. When bimodal quantization is used to simulate rounding to the nearest integer, p is twice the distance to the nearest integer, and bimodal quantization may be implemented as shown in equation (28), where the left-hand side of equation (28) denotes the increased/decreased quantization parameter after bimodal quantization.
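The four simulation methods described above can be sketched in code. The following is a minimal NumPy illustration (not taken from this disclosure) of the unit-spacing case, i.e. a = 0.5 and c = 1; the function names are chosen here purely for illustration.

```python
import numpy as np

def stochastic_round(x, rng):
    # Add uniform noise in [-0.5, 0.5) and round to nearest, so x is rounded
    # up or down with a probability that depends on its proximity to each
    # integer (e.g. 4.2 -> 4 with ~80% probability, 5 with ~20% probability).
    return np.round(x + rng.uniform(-0.5, 0.5))

def uniform_noise_quantize(x, rng):
    # Only add the noise; the value is perturbed in a similar way to rounding
    # but the result is not itself an integer (cf. equation (26)).
    return x + rng.uniform(-0.5, 0.5)

def gradient_average_quantize(x, rng):
    # Floor (round towards negative infinity), then add uniform noise in
    # [0, 1) (cf. equation (27)).
    return np.floor(x) + rng.uniform(0.0, 1.0)

def bimodal_quantize(x, rng):
    # With probability p = 2 * distance-to-nearest-integer use gradient
    # average quantization, otherwise round to nearest (cf. equation (28)).
    p = 2.0 * abs(x - np.round(x))
    if rng.uniform() < p:
        return gradient_average_quantize(x, rng)
    return np.round(x)

rng = np.random.default_rng(0)
b = 4.4  # a bit width that has been increased/decreased according to its gradient
print(stochastic_round(b, rng), uniform_noise_quantize(b, rng),
      gradient_average_quantize(b, rng), bimodal_quantize(b, rng))
```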
An ordered set of integers in which the differences between consecutive integers in the set are not constant is referred to as a non-uniform set of integers. For example, the ordered set of integers {4,5,6,7,8,10,12,16} is a non-uniform set of integers because the difference between integers 4 and 5 is one, but the difference between integers 12 and 16 is four. In contrast, the ordered set of integers {1,2,3,4,5} is a uniform set of integers, because the difference between any two consecutive integers is one.
As described above, in order to simulate rounding of the increased/decreased quantization parameter to the nearest integer in a non-uniform set of integers, the parameter of the chosen simulation method (e.g. a or c for random quantization, uniform noise quantization, gradient average quantization or bimodal quantization) may be selected based on the difference between the nearest integer in the set below the increased/decreased quantization parameter and the nearest integer in the set above the increased/decreased quantization parameter, as described above, and the increased/decreased quantization parameter is then quantized according to the desired simulation method. In other cases, rounding the increased/decreased quantization parameter to the nearest integer in a non-uniform set of integers may be simulated by: (1) scaling the increased/decreased quantization parameter based on the distance/difference between the nearest lower integer in the non-uniform set of integers and the nearest higher integer in the non-uniform set of integers (which may be described as the local "density" of values) to generate a transformed or scaled increased/decreased quantization parameter; (2) simulating rounding of the transformed increased/decreased quantization parameter to the nearest integer using one of the simulation methods described above, such as equations (25), (26), (27) or (28); and (3) inverting the transformation or scaling performed in step (1) to obtain the final quantized increased/decreased quantization parameter.
This will be further described by way of example. In this example, the non-uniform set of integers is {4,5,6,7,8,10,12,16}. In step (1), the increased/decreased quantization parameter is scaled based on the distance/difference between the nearest lower integer in the non-uniform set of integers and the nearest higher integer in the non-uniform set of integers. In particular, the transformed or scaled increased/decreased quantization parameter is equal to the increased/decreased quantization parameter divided by the distance between the nearest lower integer in the set and the nearest higher integer in the set. For example, an increased/decreased quantization parameter between 8 and 12 is scaled (multiplied) by 1/2, because the distance between the nearest lower integer in the set (i.e. 8 or 10) and the nearest higher integer in the set (i.e. 10 or 12) is 2; an increased/decreased quantization parameter between 12 and 16 is scaled by 1/4, because the distance between the nearest lower integer in the set (i.e. 12) and the nearest higher integer in the set (i.e. 16) is 4; and an increased/decreased quantization parameter between 4 and 8 is scaled by 1, because the distance between the nearest lower integer in the set (i.e. 4, 5, 6 or 7) and the nearest higher integer in the set (i.e. 5, 6, 7 or 8) is 1. For example, 13 is transformed to 3.25; 5.4 is transformed to 5.4; 8.9 is transformed to 4.45; and 11.5 is transformed to 5.75. This transformation can be represented by equation (29), in which qp_adj is the increased/decreased quantization parameter, the left-hand side of equation (29) is the transformed increased/decreased quantization parameter, and s is as shown in equation (30), which uses two indicator terms: one that is 1 when qp_adj > 8 and 0 otherwise, and one that is 1 when qp_adj > 12 and 0 otherwise, such that s = 1 for qp_adj < 8, s = 2 for 8 < qp_adj < 12, and s = 4 for qp_adj > 12.
In step (2), rounding of the transformed value to the nearest integer is simulated using one of the methods described above for simulating rounding to the nearest integer, such as equations (25), (26), (27) or (28). In step (3), the transformation performed in step (1) is inverted to generate the final quantized value. This is represented by equation (31), in which the quantized transformed value generated in step (2) is multiplied by s to give the final quantized increased/decreased quantization parameter.
For example, if the output of step (2) is 3 and s=4, then this is transformed back to 12; if the output of step (2) is 5 and s=1, then this is transformed back to 5; if the output of step (2) is 4 and s=2, then this is transformed back to 8; and if the output of step (2) is 6 and s=2, this is transformed back to 12. This is summarized in table 2.
TABLE 2

Output of step (2)    s    Final quantized parameter
3                     4    12
5                     1    5
4                     2    8
6                     2    12
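The scale/simulate/invert procedure for a non-uniform set of integers can also be sketched in code. The following is a hypothetical NumPy illustration, not part of this disclosure, using the random (noise-then-round) simulation method in step (2); the helper names are illustrative only.

```python
import numpy as np

ALLOWED = np.array([4, 5, 6, 7, 8, 10, 12, 16])

def local_spacing(qp):
    # Distance between the nearest allowed integer below qp and the nearest
    # allowed integer above qp (the local "density" of the set).
    lower = ALLOWED[ALLOWED <= qp].max() if (ALLOWED <= qp).any() else ALLOWED[0]
    upper = ALLOWED[ALLOWED >= qp].min() if (ALLOWED >= qp).any() else ALLOWED[-1]
    return max(upper - lower, 1)

def simulate_round_nonuniform(qp, rng):
    s = local_spacing(qp)
    scaled = qp / s                                      # step (1): scale by 1/s
    rounded = np.round(scaled + rng.uniform(-0.5, 0.5))  # step (2): simulate rounding
    return rounded * s                                   # step (3): invert the scaling

rng = np.random.default_rng(0)
for qp in (13.0, 5.4, 8.9, 11.5):
    print(qp, "->", simulate_round_nonuniform(qp, rng))
```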
It will be apparent to those skilled in the art that these are only examples of functions that may be used to quantize, or to simulate the quantization of, the quantization parameters, and that other functions may be used. However, in order to be able to back-propagate the derivative of the cost metric cm to the quantization parameters, the quantization function q is defined so that the derivative of the cost metric can be defined in terms of the quantization parameters. The inventors have realised that if the derivative of the quantization function q with respect to the quantization parameter it is quantizing is defined as one, then the machine learning framework may generate a useful gradient of the cost function with respect to that quantization parameter.
In some cases, quantization (e.g., rounding) of the increased/decreased quantization parameter may be performed by the associated quantization block. For example, in some cases (as described in more detail below), increasing/decreasing quantization parameters may be provided to quantization blocks, and each quantization block may be configured to quantize (e.g., round) the quantization parameters of the quantization block, or simulate the quantization (e.g., round) thereof, before using the quantization parameters to quantize the input value.
Where adjusting the quantization parameter comprises quantizing (e.g. rounding), or simulating the quantization of, the increased/decreased quantization parameter (as adjusted according to the gradient) by any of the methods described above, a higher precision (e.g. floating point) version of the quantization parameter may be maintained, and in subsequent iterations of block 508 it is the higher precision version of the quantization parameter that is increased/decreased according to the gradient. In other cases, a randomly quantized version of the increased/decreased quantization parameter may be maintained instead, and it is that version that is increased/decreased in subsequent iterations.
After one or more of the quantization parameters (e.g. bit widths b_i and exponents exp_i) have been adjusted based on the gradients, the method moves to block 509, where one or more sets of values may optionally be removed from the model of the NN. A set of values may be removed from the model of the NN in dependence on a quantization parameter (e.g. bit width) of that set of values, or of an associated set of values, having been adjusted to zero in block 508. This is because, in some cases, removing from the model of the NN a set of values that can be quantized using a zero bit width (i.e. where each value in the set can be quantized to zero) may not affect the output of the model of the NN, relative to retaining that set of values comprising zero values. Moreover, removing the set of values may reduce the inference time of the NN (and thus increase its efficiency), because removing those values reduces the number of multiplication operations to be performed in the layer (even if those multiplications are multiplications by zero).
Six specific examples of removing a set of values from the model of the NN in response to a quantization parameter being adjusted to zero are now provided. These examples refer back to Examples 1 to 6 described with reference to block 504. It is to be understood that these specific implementations are provided by way of example only and that the principles described herein may be implemented differently.
These examples can be understood with reference to fig. 9, which illustrates the interaction between two adjacent layers of an NN. Fig. 9 shows a layer 904 and the layer 902 preceding that layer. In fig. 9, layers 902 and 904 are both convolution layers. Activation data 906-1, weight data 908-1 and biases 912-1 are input to the preceding layer 902. Activation data 906-2 (i.e. the output of the preceding layer 902), weight data 908-2 and biases 912-2 are input to the layer 904. Intermediate output data 910-1 and 910-2 are shown for the preceding layer 902 and the layer 904, respectively, for ease of understanding, but it should be understood that the intermediate data need not be physically formed by those layers and may merely represent logical values that conveniently describe the processing performed by those layers between their inputs and outputs.
In Examples 1 and 2, the output channel of the weight data input to the preceding layer (and, if present, its associated bias) may be removed from the model of the NN when the adjusted bit width for the corresponding input channel of the activation data input to the layer is zero. For example, in fig. 9, the output channel 920 of the weight data input to the preceding layer 902 may be removed from the model of the NN when the adjusted bit width for the corresponding input channel 922 of the activation data input to the layer 904 is zero. The correspondence between these channels is shown using cross-hatching (as can be appreciated with reference to fig. 2B). Using the implementation metric defined with reference to Example 1 or 2, it may be determined that it is "safe" to remove the output channel 920 without affecting the output of the model of the NN (relative to retaining the output channel 920 comprising zero values). This is because the result of convolving with the output channel 920 to generate the intermediate output channel 924, and then summing with the bias 926, generates the input channel 922, which the method 500 has determined can be quantized using a zero bit width. It will therefore be appreciated that the convolution and summation need not be performed. Thus, the output channel 920 of the weight data 908-1, the bias 926 of the biases 912-1, the input channel 922 of the activation data 906-2 and the input channel 928 of the weight data 908-2 may be removed from the model of the NN without affecting the output of the model of the NN (relative to retaining the output channel 920 comprising zero values).
In Example 3, the output channel of the weight data input to the preceding layer (and, if present, its associated bias) may be removed from the model of the NN when the adjusted bit width for the corresponding input channel of the weight data input to the layer is zero. For example, in fig. 9, the output channel 920 of the weight data input to the preceding layer 902 may be removed from the model of the NN when the adjusted bit width for the corresponding input channel 928 of the weight data input to the layer 904 is zero. The correspondence between these channels is shown using cross-hatching (as can be appreciated with reference to fig. 2B). Using the implementation metric defined with reference to Example 3, it may be determined that it is "safe" to remove the output channel 920 without affecting the output of the model of the NN (relative to retaining the output channel 920 comprising zero values). This is because the result of convolving with the output channel 920 to generate the intermediate output channel 924, and then summing with the bias 926, generates the input channel 922, which is to be convolved with the input channel 928 that the method 500 has determined can be quantized using a zero bit width. It will therefore be appreciated that the convolution and summation need not be performed. Thus, the output channel 920 of the weight data 908-1, the bias 926 of the biases 912-1, the input channel 922 of the activation data 906-2 and the input channel 928 of the weight data 908-2 may be removed from the model of the NN without affecting the output of the model of the NN (relative to retaining the output channel 920 comprising zero values).
Example 5 may be used to remove an output channel of the weight data input to the preceding layer when it is known that the preceding layer does not receive a bias (this case is not shown in fig. 9). In Example 5, an output channel of the weight data input to the preceding layer may be removed from the model of the NN when the adjusted bit width for that output channel is zero. The corresponding input channel of the activation data input to the layer and the corresponding input channel of the weight data input to the layer may also be removed from the model of the NN without affecting the output of the model of the NN (relative to retaining the output channel, comprising zero values, of the weight data input to the preceding layer).
It should be appreciated that removing an output channel of the weight data input to a layer in response only to determining that that output channel can be encoded using a zero bit width is not necessarily "safe". As described herein, this is because the output channel of the weight data may be associated with a bias. Referring to fig. 9, even though the adjusted bit width for the output channel 930 of the weight data input to the layer 904 is zero, its associated bias 932 may still be non-zero (e.g. have a non-zero bit width). In this case, if the output channel 930 of the weight data 908-2 were removed from the model of the NN, the intermediate output channel 934 would not be formed, meaning that there would be no values with which the bias 932 could be summed. This is an advantage of using an implementation metric, such as those defined in Examples 1 to 3, that takes into account the interaction between two adjacent layers (e.g. by virtue of the second contribution, which represents the implementation cost of the output of the preceding layer).
In Example 4, an output channel of the weight data input to the preceding layer may be removed from the model of the NN when the adjusted bit widths for that output channel and for its associated bias are zero. For example, in fig. 9, the output channel 920 of the weight data input to the preceding layer 902 may be removed from the model of the NN when the adjusted bit widths for the output channel 920 and for its associated bias 926 are zero. The correspondence between these channels and biases is shown using cross-hatching (as can be appreciated with reference to fig. 2B). Thus, the output channel 920 of the weight data 908-1, the bias 926 of the biases 912-1, the input channel 922 of the activation data 906-2 and the input channel 928 of the weight data 908-2 may be removed from the model of the NN without affecting the output of the model of the NN (relative to retaining the output channel 920 comprising zero values). Alternatively or additionally, an output channel of the weight data input to the layer 904 may be removed from the model of the NN when the adjusted bit widths for that output channel and for its associated bias are zero.
In Example 6, an output channel of the weight data input to the preceding layer may be removed from the model of the NN when the adjusted bit width for that output channel (e.g. as adjusted during back propagation, as described with reference to fig. 11) is zero and the absolute value of its associated bias is zero. For example, in fig. 9, the output channel 920 of the weight data input to the preceding layer 902 may be removed from the model of the NN when the adjusted bit width for the output channel 920 is zero and its associated bias 926 is zero. The correspondence between these channels and biases is shown using cross-hatching (as can be appreciated with reference to fig. 2B). Thus, the output channel 920 of the weight data 908-1, the bias 926 of the biases 912-1, the input channel 922 of the activation data 906-2 and the input channel 928 of the weight data 908-2 may be removed from the model of the NN without affecting the output of the model of the NN (relative to retaining the output channel 920 comprising zero values).
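The channel removal described in the examples above can be illustrated with a short sketch. The following hypothetical NumPy illustration (not part of this disclosure) assumes two consecutive convolution layers with weight tensors laid out as [output channels, input channels, kernel height, kernel width], and shows the removal of output channel k of the preceding layer's weights and bias together with the corresponding input channel of the next layer's weights.

```python
import numpy as np

def prune_channel(w_prev, b_prev, w_next, k):
    """Remove output channel k of the preceding layer's weights (and its bias),
    together with the corresponding input channel of the next layer's weights."""
    w_prev = np.delete(w_prev, k, axis=0)   # output channel of preceding layer
    b_prev = np.delete(b_prev, k, axis=0)   # associated bias
    w_next = np.delete(w_next, k, axis=1)   # corresponding input channel of next layer
    return w_prev, b_prev, w_next

# Example: preceding layer has 8 output channels, the next layer consumes them.
w_prev = np.random.randn(8, 3, 3, 3)
b_prev = np.random.randn(8)
w_next = np.random.randn(16, 8, 3, 3)
w_prev, b_prev, w_next = prune_channel(w_prev, b_prev, w_next, k=2)
print(w_prev.shape, b_prev.shape, w_next.shape)  # (7, 3, 3, 3) (7,) (16, 7, 3, 3)
```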
An additional advantage of removing one or more sets of values in block 509 is that the training of the NN is then "sped up" in subsequent iterations of blocks 502 to 508, as described in further detail below. This is because removing one or more sets of values from the model of the NN reduces the implementation cost of the model of the NN and thus increases its inference speed. Accordingly, subsequent iterations of blocks 502 to 508 may be performed more quickly.
In many NN structures, where the output of each layer is input to only one other layer (or is the output of the NN), the removal of one or more sets of values (e.g. output weight channels) in block 509 may be performed without any further modification to the NN. However, in some NN structures, the output of a layer may be input to more than one subsequent layer, or an operation of the NN may receive input from more than one preceding layer. An example of such an NN is shown in fig. 10A, which is a schematic diagram of an NN comprising a residual layer. In fig. 10A, the output of layer A 1002 is input to layer B 1004 and to a summation operation 1010. The output of layer B 1004 is input to layer C 1006. The summation operation 1010 receives inputs from both layer A 1002 and layer C 1006. The output of the summation operation 1010 is input to layer D 1008.
In fig. 10A, the summation operation 1010 may require its two inputs to have the same structure (e.g. two sets of input activation data having the same number of data channels). Thus, for example, if an output channel of the weight data input to layer A 1002 is removed in block 509, so that the corresponding data channel of its output is not formed (as can be appreciated with reference to fig. 2B), it may be desirable to provide a replacement channel in the output of layer A 1002 prior to the summation operation 1010. Similarly, if an output channel of the weight data input to layer C 1006 is removed in block 509, so that the corresponding data channel of its output is not formed (as can be appreciated with reference to fig. 2B), it may be necessary to provide a replacement channel in the output of layer C 1006 prior to the summation operation 1010. A method 1020 of inserting replacement channels into the output data of layers of such an NN is described with reference to fig. 10B. In an example, the method 1020 of fig. 10B may be used to insert a replacement channel into the output data of a layer of a Deep Neural Network (DNN), which is one type of NN.
In block 1022, for an identified channel of the output data of the layer, the activation data input to the layer is operated on such that the output data of the layer does not include the identified channel. This may be achieved, for example, by not including the output channel of the weight data that is responsible for forming the identified channel, such that the output data of the layer does not include the identified channel. As described herein, it may be identified in the training phase of the NN that the output channel of the weight data (and optionally the corresponding bias) responsible for forming the identified channel can be quantized with a bit width of zero (e.g. that the output channel can be removed from the model of the NN, relative to retaining the output channel of the weight data input to the preceding layer comprising zero values, without affecting the output of the model of the NN, as described with reference to blocks 508 and 509). In other words, after determining in the training phase of the NN that an output channel of the weight data (and optionally the corresponding bias) can be quantized using a zero bit width, the identified channel of the output data may be identified as the channel of the output data that that output channel of the weight data (and optionally the corresponding bias) is responsible for forming. In fig. 10A, the effect of this step may be that the output data of layer A 1002 does not include the identified channel. The output of layer A 1002 (i.e. excluding the identified channel) may be operated on by layer B 1004. Similarly, the effect of this step may be that the output data of layer C 1006 does not include the identified channel.
In block 1024, a replacement channel may be inserted into the output data of the layer in place of the identified channel, prior to an operation of the NN (e.g. the summation operation 1010) that is configured to operate on the output data of the layer. For example, the replacement channel may be a channel comprising a plurality of zero values. The identified channel may be an array of data values, and the replacement channel may be an array of zeros (e.g. zero values) having the same dimensions as that array of data values. The operation of the NN (e.g. the summation operation 1010) may then be performed using the replacement channel. It should be appreciated that, if the identified channel would have comprised a plurality of zero values, inserting a replacement channel comprising a plurality of zero values as described herein does not alter the result of the operation of the NN relative to performing the operation of the NN with the identified channel, comprising the plurality of zero values, retained.
The replacement channel may be inserted according to information indicating the structure that the output data of the layer would have had if the identified channel had been included. That is, the information may indicate what the structure of the output data of the layer would be if the output data had been formed so as to include the identified channel. This information may be generated in the training phase of the NN, the information indicating the structure of the output data of the layer including the identified channel. For example, the information may comprise a bitmask. Each bit of the bitmask may represent a data channel, with a first bit value (e.g. 1 or 0) indicating a data channel that is included in the output data and a second bit value (e.g. 0 or 1) indicating a data channel that is not included in the output data. A replacement channel may be inserted into the output data of the layer at the position indicated by a second bit value of the bitmask. For example, if the bitmask includes a series of bit values … 1, 0, 1 …, a replacement channel may be inserted, at the position indicated by the bit value 0, between the two data channels included in the output data that are represented by the bit values 1. It should be understood that the method of inserting replacement channels described herein may be used to insert multiple replacement channels in place of multiple corresponding identified channels. For example, the bitmask may include a plurality of second bit values, each indicating a data channel that is not included in the output data, such that a plurality of replacement channels may be inserted into the output data of the layer at the positions indicated by those second bit values.
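The insertion of replacement channels according to a bitmask can be illustrated with a short sketch. The following hypothetical NumPy illustration is not part of this disclosure; it assumes the layer output is laid out as [channels, height, width] and that the bitmask has one entry per channel of the full structure.

```python
import numpy as np

def insert_replacement_channels(output_data, bitmask):
    """Re-expand layer output data to the structure indicated by the bitmask.
    output_data has shape [C_kept, H, W]; the bitmask has 1 where a channel is
    present and 0 where an all-zero replacement channel must be inserted."""
    c_full = len(bitmask)
    h, w = output_data.shape[1:]
    full = np.zeros((c_full, h, w), dtype=output_data.dtype)
    kept = iter(range(output_data.shape[0]))
    for i, bit in enumerate(bitmask):
        if bit:  # this channel was formed by the layer
            full[i] = output_data[next(kept)]
        # else: leave the all-zero replacement channel in place
    return full

data = np.random.randn(2, 4, 4)   # layer output with two channels formed
mask = [1, 0, 1]                  # the middle channel was removed
expanded = insert_replacement_channels(data, mask)
print(expanded.shape)             # (3, 4, 4); expanded[1] is all zeros
```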
This method of inserting replacement channels may be performed during a training phase of the NN (e.g., when subsequent iterations of blocks 502-509 are performed after it is determined that the output channels of the weight data are quantifiable with a bit width of zero in an earlier iteration, as described in further detail below) and/or when the NN is subsequently implemented to process data in a use phase (e.g., in block 514, also described in further detail below).
Once one or more of the quantization parameters have been adjusted based on the gradient in block 508 (and optionally one or more sets of values are removed in block 509), method 500 may end, or method 500 may proceed to block 510, where blocks 502-509 may be repeated.
At block 510, a determination is made as to whether blocks 502 to 509 are to be repeated. In some cases, the determination as to whether blocks 502 to 509 are to be repeated is based on whether a predetermined number of iterations of blocks 502 to 509 have been completed, or whether a predetermined amount of training time has elapsed. The predetermined number of iterations or the predetermined amount of training time may have been determined empirically to be sufficient to produce good results. In other cases, the determination as to whether blocks 502 to 509 are to be repeated may be based on whether the cost metric has converged. Any suitable criteria may be used to determine when the cost metric has converged. For example, in some cases, it may be determined that the cost metric has converged if it does not change by more than a predetermined threshold over a predetermined number of iterations.
If it is determined that blocks 502 to 509 are not to be repeated, the method 500 may end, or the method 500 may proceed to block 512. However, if it is determined that blocks 502 to 509 are to be repeated, the method 500 returns to block 502, where blocks 502 to 509 are repeated using the quantization parameters as adjusted in block 508 (and optionally excluding the sets of values removed in block 509). For example, if in a first iteration a set of values is transformed by the quantization block into a fixed point number format defined by a mantissa bit width of 6 and an exponent of 4, and the mantissa bit width is adjusted to a bit width of 5 while the exponent is not adjusted, then in the next iteration that set of values will be transformed by the quantization block into a fixed point number format defined by a bit width of 5 and an exponent of 4.
At block 512, the quantization parameter as adjusted in block 508 and optionally information indicative of the value set removed in block 509 are output for configuring the hardware logic to implement NN. In some cases, a floating point version of the quantization parameter is output. In other cases, what is output is a version of the quantization parameter that can be used by the hardware logic (i.e., a floating point version of the quantization parameter after it has been quantized to an integer or set of integers). The quantization parameter may be output in any suitable manner. Once the quantization parameters as adjusted in block 508 have been output, the method 500 may end or the method 500 may proceed to block 514.
At block 514, the NN-enabled hardware logic is configured to implement NN using the quantization parameters output in block 512. In the case where the quantization parameter output in block 512 is in a floating point format, the quantization parameter may be quantized to an integer or set of integers before the quantization parameter is used to configure the hardware logic to implement NN. Configuring the hardware logic to implement the NN may generally include configuring the hardware logic to process input for each layer of the NN in accordance with the layer and to provide output of the layer to a subsequent layer or to provide the output as output of the NN. For example, if the NN includes a first convolution layer and a second normalization layer, configuring the hardware logic to implement such NN includes configuring the hardware logic to receive an input for the NN, and to process the input as input activation data according to weight data of the convolution layer, to process an output of the convolution layer according to the normalization layer, and then to output the output of the normalization layer as an output of the NN. Configuring the hardware logic to implement the NN using the quantization parameters output in block 512 may include configuring the hardware logic to receive and process input for each layer according to the quantization parameters of that layer (i.e., according to a fixed point number format defined by the quantization parameters). For example, if the quantization parameter indicates that a fixed-point number format defined by exponent 4 and bit width 6 is to be used for an input data value of a layer of NN, the hardware logic for implementing NN may be configured to interpret the input data value of the layer based on the input data value of the layer in a fixed-point number format defined by exponent 4 and bit width 6.
When implementing the NN at block 514, the value set removed from the model of the NN at block 509 may not be included in the runtime implementation of the NN. For example, where an output channel of weight data input to a layer is removed at block 509, the weight values of that output channel may not be written to memory for use by the runtime implementation of the NN and/or the runtime implementation hardware implementing the NN may not be configured to use those weight values to perform the multiplication.
In the method 500 of fig. 5, a complete cost metric is calculated (e.g. in accordance with equation (3)), and the derivative of the cost metric is back-propagated to the quantization parameters to calculate a gradient for each quantization parameter. The gradient for a particular quantization parameter is then used to adjust that quantization parameter. However, in other examples, calculating the cost metric may comprise calculating an error metric and an implementation metric, and determining a separate gradient for each metric for each quantization parameter. In other words, a gradient of the error metric with respect to each quantization parameter is generated, and a gradient of the implementation metric with respect to each quantization parameter is generated. The gradient of the error metric with respect to a quantization parameter may be generated by back-propagating the derivative of the error metric to the quantization parameter, in the same way that the derivative of the cost metric is back-propagated to the quantization parameter. The gradient of the implementation metric with respect to a quantization parameter may be generated by back propagation, or may be generated directly from the implementation metric. The final gradient for each quantization parameter may then be generated from the two gradients, in the same way that the corresponding metrics are combined to form the cost metric. For example, the final gradient may be generated as a weighted sum of the two gradients. By varying the weights associated with the two gradients, a balance can be found between implementation cost and error. The quantization parameter may then be adjusted according to the final gradient in the same manner as described above.
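As an illustration only, the weighted combination of the two gradients might be sketched as follows, where the weighting factor (called gamma here purely for illustration) trades implementation cost against error in the same way as when the metrics themselves are combined.

```python
def combine_gradients(grad_error, grad_impl, gamma=1e-3):
    # Final gradient for a quantization parameter as a weighted sum of the
    # error-metric gradient and the implementation-metric gradient; gamma is
    # an illustrative weighting that balances implementation cost and error.
    return grad_error + gamma * grad_impl
```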
Identifying quantization parameters and weights
Although the method 500 of fig. 5 has been described as identifying the quantization parameters of an NN, in other examples the weights (and optionally the biases) of the NN may be identified at the same time as the quantization parameters. In these cases, the derivative of the cost metric may also be back-propagated to the weights (and optionally the biases) to generate gradients of the cost metric with respect to the weights (and optionally the biases), and the weights (and optionally the biases) may be adjusted based on the corresponding gradients in a similar manner to the quantization parameters.
Referring now to fig. 11, a method 1100 of identifying the quantization parameters and the weights (and optionally the biases) of an NN is shown. In an example, the method 1100 of fig. 11 can be used to identify, via back propagation, the quantization parameters and weights (and optionally biases) of a Deep Neural Network (DNN), which is one type of NN. The method 1100 may be used to retrain the network to take into account the quantization of the values of the NN (e.g. to update the weights after an initial training phase, such as an initial training phase performed on a floating point model of the NN), or it may be used to perform the initial training of the network (e.g. to train the network starting from an untrained set of weights). The method 1100 includes blocks 502 to 512 of the method 500 of fig. 5, but also includes blocks 1102 and 1104 (and optional blocks 1106 and 1108). Blocks 502 to 512 operate in the same manner as described above. When the method 1100 is used to retrain an NN, the initial set of weights in the quantized model of the NN may be a trained set of weights. However, where the method 1100 is used to train an NN, the initial set of weights in the model of the NN may be a random set of weights or another set of weights designed for training the NN.
At block 1102, after the output of the quantized model of the NN in response to the training data has been determined (block 502) and the cost metric has been determined from the output of the quantized model of the NN and the quantization parameters (block 504), the derivative of the cost metric is back-propagated to one or more of the weights (and optionally the biases) to generate a gradient of the cost metric with respect to each of those weights (and optionally biases). The gradient of the cost metric with respect to a weight is referred to herein as the gradient for that weight. As with the gradients for the quantization parameters, a positive gradient for a weight indicates that the cost metric may be reduced by decreasing that weight, and a negative gradient for a weight indicates that the cost metric may be reduced by increasing that weight. Once the gradients for the one or more weights (and optionally biases) have been generated, the method proceeds to block 1104.
At block 1104, one or more of the weights (and optionally the biases) are adjusted based on the gradients for the weights (and optionally the biases). The weights (and optionally the biases) may be adjusted in a similar manner to the quantization parameters. For example, as described above, the sign of the gradient for a weight indicates whether the cost metric will be decreased by increasing or by decreasing the weight. In particular, if the gradient for a weight is positive, a decrease in the weight will decrease the cost metric; and if the gradient for a weight is negative, an increase in the weight will decrease the cost metric. Accordingly, adjusting a weight may comprise increasing or decreasing the weight according to the sign of the gradient, so as to increase or decrease the cost metric (depending on whether it is desired to increase or decrease the cost metric). For example, if a lower cost metric is desired and the gradient for the weight is negative, the weight may be increased in an attempt to reduce the cost metric. Similarly, if a lower cost metric is desired and the gradient for the weight is positive, the weight may be decreased in an attempt to reduce the cost metric.
In some cases, the amount by which a weight is increased or decreased may be based on the magnitude of the gradient for that weight. In particular, in some cases, the weight may be increased or decreased by the magnitude of the gradient for that weight. For example, if the magnitude of the gradient is 0.6, the weight may be increased or decreased by 0.6. In other cases, the weight may be increased or decreased by a factor of the magnitude of the gradient for that weight. In particular, in some cases, the weights may converge more quickly if the adjustments are scaled by a so-called learning rate.
Once the weights (and optionally the biases) have been adjusted based on the corresponding gradients, the method 1100 may end, or the method 1100 may proceed to block 509, where one or more sets of values may optionally be removed from the model of the NN. Thereafter, blocks 502 to 508 and 1102 to 1104 may be repeated. Similar to blocks 512 and 514, the method 1100 may also comprise outputting the adjusted weights (and optionally biases) (at 1106) and/or configuring hardware logic to implement the NN using the adjusted weights (and optionally biases) (at 1108).
While in the method 1100 of fig. 11 the weights (and optionally the biases) and the quantization parameters are adjusted in every iteration, in other examples one or both of the weights (and optionally the biases) and the quantization parameters may be selected for adjustment in each iteration. For example, the quantization parameters may be adjusted for a predetermined number of iterations, and then the weights (and optionally the biases) may be adjusted for a predetermined number of iterations. In other cases, the weights (and optionally the biases) and the quantization parameters may be adjusted in alternating iterations. For example, the weight (and optionally bias) adjustment may be performed in odd iterations and the quantization parameter adjustment may be performed in even iterations. This allows the weights (and optionally the biases) to be adjusted while the quantization parameters are rounded (or their rounding is simulated), and the quantization parameters to be adjusted while the weights (and optionally the biases) are rounded.
Quantization block
An example implementation of the quantization block of the quantization model of the NN will now be described. As described above, each quantization block is configured to transform one or more sets of values input to a layer of NN into a fixed point number format defined by one or more quantization parameters. In these examples, each fixed-point format is defined by a mantissa bit length b and an exponent exp, where exponent exp is an integer shared by the set of values represented in the fixed-point format, such that the size of the set of input data values in the fixed-point format is based on the mantissa bit length b.
In order to be able to back-propagate the derivative of the cost metric to the quantization parameters, not only the quantization function performed by each quantization block but also its derivative is defined. In practice, the derivatives of these functions are defined automatically by a machine learning framework such as, but not limited to, TensorFlow™ or PyTorch™.
The process of quantizing a value x into a fixed point number format may be described as comprising two steps: (i) thresholding the value x to the range of numbers that can be represented by the fixed point number format (e.g. line 1202 of fig. 12 for exponent -1 and bit width 3); and (ii) rounding the thresholded value x to the nearest multiple of 2^exp, thereby selecting the representable number in the fixed point number format that is used to represent the value x (e.g. line 1204 of fig. 12 for exponent -1 and bit width 3).
The thresholding step of the value x to the quantization operation in fixed point format defined by the mantissa bit length b and exponent exp, i.e., thresholding the value x to a range representable by the fixed point format, may be achieved by equation (32), where clip (x, low, high) is as defined in equation (33), and low is the smallest or lowest representable number in fixed point format defined by b and exp (e.g., low = -2 b-exp-1 ) And high is the maximum or highest representable number in a fixed point number format defined by b and exp (e.g., high=2 b+exp-1 -2 exp ):
thresh(x,b,exp)=clamp(x,-2 exp+b-1 ,2 exp+b-1 -2 exp ) (32)
clamp(x,low,high)=min(max(x,low),high) (33)
In order to be able to perform back propagation through the thresholding operation, the derivative of the thresholding operation is defined. The derivative of the thresholding function defined in equation (32) with respect to x is 1 for values falling within the representable range, and 0 otherwise. However, in some cases a more useful derivative is one that is 1 for all values falling within the quantization interval, and 0 otherwise. This may be achieved by using the thresholding function set out in equation (34) instead of the thresholding function set out in equation (32):

thresh(x, b, exp) = clamp(x, -2^(exp+b-1) - 2^(exp-1), 2^(exp+b-1) - 2^(exp-1))   (34)
The rounding step of the quantization operation, i.e. rounding the value to the nearest multiple of 2^exp, can be implemented by equation (35A) or equation (35B), where RTNI denotes the round-to-negative-infinity function (also known as the floor function).
The derivative of the rounding function defined in equation (35A) or equation (35B) with respect to x is zero almost everywhere, and so is not useful in identifying NN parameters (e.g. weights and/or quantization parameters); it may therefore be set to 1.
Thus, the total quantization quant(x, b, exp) of a value x into the fixed point number format defined by the bit width b and the exponent exp can be implemented using a combination of thresholding (equation (32) or equation (34)) and rounding (equation (35A) or equation (35B)), as shown in equation (36).
quant(x,b,exp)=round(thresh(x,b,exp),exp) (36)
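A minimal sketch of the thresholding and rounding of equations (32), (33) and (36), for the case where the quantization parameters are already integers, might look as follows (a NumPy illustration, not part of this disclosure):

```python
import numpy as np

def clamp(x, low, high):
    # Equation (33)
    return np.minimum(np.maximum(x, low), high)

def thresh(x, b, exp):
    # Equation (32): threshold x into the representable range of the format.
    low = -2.0 ** (exp + b - 1)
    high = 2.0 ** (exp + b - 1) - 2.0 ** exp
    return clamp(x, low, high)

def round_to_format(x, exp):
    # Round to the nearest multiple of 2**exp (cf. equations (35A)/(35B)).
    return 2.0 ** exp * np.round(2.0 ** (-exp) * x)

def quant(x, b, exp):
    # Equation (36): thresholding followed by rounding.
    return round_to_format(thresh(x, b, exp), exp)

x = np.array([-3.7, 0.3, 1.9])
print(quant(x, b=3, exp=-1))  # representable range is [-2.0, 1.5] in steps of 0.5
```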
Where the quantization block is not configured to quantize (e.g. round) the received quantization parameters before using them to quantize the input values, the combined formula may be written as shown in equation (37A). During a training phase (e.g. as described herein with reference to blocks 502 to 510 of fig. 5), it may be advantageous for the quantization block not to be configured to quantize (e.g. round) the received quantization parameters, so that the quantization parameters used to quantize the input values are not constrained to integer values during the training phase, which may enable higher resolution (e.g. higher precision) training of those quantization parameters.
quant(x, b, exp) = 2^exp round(min(max(2^(-exp) x, -2^(b-1)), 2^(b-1) - 1))   (37A)
In an alternative example, where the quantization block is not configured to quantize (e.g., round) the received quantization parameter before quantizing the input value using the quantization parameter, the combined formula may be written as shown in equation (37B). The main difference between equation (37A) and equation (37B) is the introduction of α as a scaling factor (e.g., a shift parameter).
quant(x, b, exp, α) = 2^exp round(min(max(2^(-exp) x, (α-1) 2^(b-1)), (α+1) 2^(b-1) - 1))   (37B)
Where the quantization block is configured to receive increased/decreased quantization parameters and to quantize (e.g. round) the received quantization parameters before using them to quantize the input values, the combined formula may be written as shown in equation (37C), where q is a rounding or quantization function used to quantize, or to simulate the quantization of, the quantization parameters. Example rounding functions for quantizing the quantization parameters, or for simulating their quantization, are described above with respect to block 508. In other words, the quantization function q may implement either (i) one of the rounding methods described above for rounding to the nearest integer or to the nearest integer in a set, or (ii) one of the methods described above for simulating rounding to the nearest integer or to the nearest integer in a set (e.g. the random quantization method, the uniform noise quantization method, the gradient average quantization method or the bimodal quantization method). As described above, in order to be able to back-propagate the derivative of the cost metric cm to the quantization parameters, the quantization function q is defined such that the derivative of the cost metric can be defined in terms of the quantization parameters. During the training phase (e.g. as described herein with reference to blocks 502 to 510 of fig. 5), it may be advantageous for the quantization block to be configured to quantize (e.g. round) the received quantization parameters before using them to quantize the input values, as this may enable the training to take into account the quantization (e.g. rounding) of the quantization parameters that will occur when the NN is subsequently implemented in hardware, especially if the quantization block is configured to quantize the input activation values using those quantization parameters.
quant(x, b, exp) = 2^q(exp) round(clamp(2^(-q(exp)) x, -2^(q(b-1)), 2^(q(b-1)) - 1))   (37C)
The inventors have realised that if the derivative of the quantization function q with respect to the quantization parameter it is quantizing is defined as one, the machine learning framework can generate a useful gradient of the cost function with respect to the quantization parameter (e.g. a gradient that can be used to adjust the quantization parameter). For example, tests have shown that if the derivative of the quantization function q with respect to the quantization parameter it is quantizing is set to one, the machine learning framework can generate: (i) the derivative d_b(x) of the main quantization function quant with respect to the quantization parameter b, as shown in equation (38), where low is the smallest or lowest representable number in the fixed point number format defined by b and exp, and high is the largest or highest representable number in the fixed point number format defined by b and exp; and (ii) the derivative d_exp(x) of the main quantization function quant with respect to the quantization parameter exp, as shown in equation (39).
It can be seen that the machine learning framework can calculate, for each input value quantized by a quantization block, the derivative of the cost function for each quantization parameter (e.g. b, exp) of that quantization block. The machine learning framework may then calculate the final derivative of the cost function for each quantization parameter (e.g., b, exp) based on the respective derivatives of each quantization parameter. For example, in some cases, the machine learning framework may calculate the final derivative of the cost function for each quantization parameter of the quantization block by adding or summing the respective derivatives of that quantization parameter.
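As an illustration of this point only (and not code from this disclosure), a straight-through treatment of the rounding step can be sketched in PyTorch as follows: the rounding is applied in the forward pass, while its derivative is treated as one in the backward pass, so that the framework produces gradients with respect to b and exp. The simplified formula below follows equation (37A) and omits the parameter-quantization function q.

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Treat the derivative of round as one (straight-through).
        return grad_output

def quant(x, b, exp):
    # Simplified, differentiable form of equation (37A): gradients reach b
    # through the clamp limits and exp through the scaling, with round treated
    # as having derivative one in the backward pass.
    low = -(2.0 ** (b - 1))
    high = 2.0 ** (b - 1) - 1
    y = torch.minimum(torch.maximum(2.0 ** -exp * x, low), high)
    return 2.0 ** exp * RoundSTE.apply(y)

x = 4 * torch.randn(8)
b = torch.tensor(3.0, requires_grad=True)
exp = torch.tensor(-1.0, requires_grad=True)
quant(x, b, exp).sum().backward()
print(b.grad, exp.grad)
```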
Where the variable bit length variant of the Q8A fixed point number format is used to represent the input values for a layer of the NN, and the zero point z is 0, the quantization function performed by the quantization block may be represented by equation (40), where b, exp and α are trainable quantization parameters:
quant(x, b, exp, α) = 2^exp round(clamp(2^(-exp) x, (α-1) 2^(q(b-1)), (α+1) 2^(q(b-1)) - 1))   (40)
The main differences between equation (40) and equation (37C) are the introduction of α as a scaling factor and the fact that exp is not quantized. The quantization parameters of the variable bit length variant of the Q8A format, as shown in equation (1), may be generated from the trained quantization parameters exp, b and α as shown in equations (41), (42) and (43):
r_min = 2^exp RND(2^(RND(b)-1) (α-1))   (41)

r_max = 2^exp RND(2^(RND(b)-1) (α+1) - 1)   (42)

z = 0   (43)
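As a hypothetical sketch (not part of this disclosure) of equations (41) to (43), the Q8A-style parameters might be derived from the trained b, exp and α as follows, with RND implemented as round-to-nearest:

```python
import numpy as np

def q8a_params_from_trained(b, exp, alpha):
    # Equations (41)-(43): derive the Q8A-style quantization parameters from
    # the trained (higher precision) parameters b, exp and alpha.
    rnd_b = np.round(b)
    r_min = 2.0 ** exp * np.round(2.0 ** (rnd_b - 1) * (alpha - 1))
    r_max = 2.0 ** exp * np.round(2.0 ** (rnd_b - 1) * (alpha + 1) - 1)
    z = 0
    return r_min, r_max, z

print(q8a_params_from_trained(b=8.2, exp=-4.0, alpha=0.1))
```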
in the case of using a variable bit length variant of the Q8A fixed-point number format to represent the input value for the layer of NN, where the zero point z may not be zero, the quantization function performed by the quantization block may be represented by equation (44).
quant(x, b, exp, α) = 2^exp (round(clamp(2^(-exp) x - 2^(q(b-1)) α, -2^(q(b-1)), (α+1) 2^(q(b-1)) - 1)) + 2^(q(b-1)) α)   (44)
For equations (40) and (44), although the quantization parameters of the variable bit length variant of the Q8A fixed point number format are r_min, r_max, z and b, tests have shown that training b, exp and α, and calculating r_min, r_max and z from them, gives better training results.
In some cases, rather than the quantization block quantizing the value input to the quantization block into an output fixed point number format defined by one or more quantization parameters (e.g., according to equations (36), (37A), (37B), (37C), (40), or (44)), the quantization block may be configured to only simulate a transformation represented by quantization of the input value. It should be appreciated that where a quantization block is described herein as transforming a set of values into a fixed-point number format defined by one or more quantization parameters, the transformation may involve quantizing the set of values according to the one or more quantization parameters, or may involve simulating quantization of the set of values by the one or more quantization parameters.
For example, in some cases, rather than the quantization block being configured to threshold a weight or input/activation value to the representable range of the fixed point number format and then round the thresholded activation/weight/bias value to the nearest representable number in the fixed point number format, quantization may be simulated by thresholding the activation/weight/bias value, adding a random value u between -a and +a to the thresholded value, and then rounding, where a is half the distance between representable numbers in the fixed point number format (i.e. a = 2^(exp-1)). For example, if the exponent exp of the fixed point number format is 0, a random value between -0.5 and +0.5 is added to the thresholded activation/weight/bias value before rounding, because the distance between representable numbers is 1. Similarly, if the exponent of the fixed point number format is 1, a random value between -1 and +1 is added to the thresholded activation/weight/bias value, because the distance between representable numbers is 2. In this way, the thresholded activation/weight/bias value is rounded up or down to a representable number with a probability that depends on its distance to that representable number, nearer representable numbers being more likely. For example, with an exponent exp of 0, a thresholded activation/weight/bias value of 4.2 would be rounded to 4 with a probability of 80% and to 5 with a probability of 20%. Similarly, 7.9 would be rounded to 7 with a probability of 10% and to 8 with a probability of 90%. In other examples, the ordering of the randomization and thresholding may be reversed: rather than thresholding the activation/weight/bias value and then adding a random value to the thresholded value, the random value may be added to the activation/weight/bias value to generate a randomized value, which may then be thresholded and rounded.
In other cases, rather than the quantization block being configured to round the thresholded activation/weight/bias value to the nearest representable number, the quantization block may be configured to simulate the quantization of the activation/weight/bias value by only adding a random value u between -a and +a to the thresholded value, where a is half the distance between representable numbers in the fixed point number format, as described above. By merely adding such a random value to the thresholded activation/weight/bias value, the thresholded value is distorted in a manner similar to rounding it. In other examples, the ordering of the randomization and thresholding may be reversed: rather than thresholding the activation/weight/bias value and then adding the random value to the thresholded value, the random value may be added to the activation/weight/bias value to generate a randomized value, which may then be thresholded.
In yet other cases, rather than the quantization block rounding the thresholded activation/weight/bias value to the nearest representable number, the quantization block may be configured to simulate quantization by performing gradient average quantization on the thresholded activation/weight/bias value. Performing gradient average quantization on a thresholded activation/weight/bias value may comprise taking the floor of the thresholded value and then adding a random value h between 0 and c, where c is the distance between representable numbers in the fixed point number format. For example, if the exponent exp of the fixed point number format is 0, after taking the floor of the thresholded activation/weight/bias value a random value between 0 and 1 is added, because the distance between representable numbers in the fixed point number format is 1. Similarly, if the exponent exp of the fixed point number format is 1, after taking the floor of the thresholded activation/weight/bias value a random value between 0 and 2 is added, because the distance between representable numbers is 2.
In yet other cases, rather than the quantization block rounding the thresholded activation/weight/bias value to the nearest representable number, the quantization block may be configured to simulate quantization by performing bimodal quantization on the thresholded activation/weight/bias value, which is a combination of round-to-nearest quantization and gradient average quantization, as described above. Specifically, in bimodal quantization, gradient average quantization is performed on the thresholded activation/weight/bias value with probability p, and round-to-nearest quantization is performed on the thresholded value otherwise, where p is twice the distance to the nearest representable value divided by the distance between representable numbers in the fixed point number format. In other examples, the ordering of the bimodal quantization and thresholding may be reversed: rather than thresholding the activation/weight/bias value and performing bimodal quantization on the thresholded value, bimodal quantization may be performed on the activation/weight/bias value and thresholding may be performed on the result of the bimodal quantization.
In other words, the rounding function (round) in any of equations (36), (37A), (37B), (37C), (40) and (44) may be replaced with a function that implements any of the simulated rounding methods described above, such as the random quantization method, the uniform noise quantization method, the gradient average quantization method or the bimodal quantization method.
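As an illustration only, replacing the rounding function with the random (simulated) rounding method inside a simplified version of equation (37A) might be sketched as follows; this is not code from this disclosure and the function names are illustrative.

```python
import numpy as np

def stochastic_round(x, rng):
    # Simulate rounding to the nearest representable value by adding uniform
    # noise in [-0.5, 0.5) before rounding (the "random quantization" method).
    return np.round(x + rng.uniform(-0.5, 0.5, size=np.shape(x)))

def quant_simulated(x, b, exp, rng):
    # As equation (36)/(37A), but with the round function replaced by a
    # simulated rounding method applied to the thresholded values.
    low = -2.0 ** (b - 1)
    high = 2.0 ** (b - 1) - 1
    return 2.0 ** exp * stochastic_round(np.clip(2.0 ** -exp * x, low, high), rng)

rng = np.random.default_rng(0)
print(quant_simulated(np.array([0.3, 1.9, -3.7]), b=3, exp=-1, rng=rng))
```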
Example NN accelerator
Referring now to fig. 13, this figure illustrates example hardware logic that may be configured to implement NN using quantization parameters identified according to the method 500 of fig. 5 or the method 1100 of fig. 11. Specifically, fig. 13 shows an example NN accelerator 1300. In an example, the NN accelerator 1300 may be configured to implement a Deep Neural Network (DNN), which is a type of NN, using quantization parameters identified according to the method 500 of fig. 5 or the method 1100 of fig. 11.
The NN accelerator 1300 of fig. 13 is configured to compute the output of the NN through a series of hardware passes (which may also be referred to as processing passes), wherein during each pass the NN accelerator receives at least a portion of the input data of a layer of the NN and processes the received input data according to that layer (and optionally according to one or more subsequent layers) to produce processed data. The processed data is either output to a memory for use as input data for a subsequent hardware pass, or is output as the output of the NN. The number of layers that the NN accelerator can handle during a single hardware pass may be based on the size of the data, the configuration of the NN accelerator, and the order of the layers. For example, where the NN accelerator includes hardware logic for each of the possible layer types, then for an NN comprising a first convolution layer, a first activation layer, a second convolution layer, a second activation layer and a pooling layer, the NN accelerator may be able to receive the initial NN input data and process it according to the first convolution layer and the first activation layer in a first hardware pass, output the output of the first activation layer to memory, then receive that data from memory as input to a second hardware pass and process it according to the second convolution layer, the second activation layer and the pooling layer to produce the output data of the NN.
The example NN accelerator 1300 of fig. 13 includes an input module 1301, a convolution engine 1302, an accumulation buffer 1304, an element-by-element operation module 1306, an activation module 1308, a normalization module 1310, a pooling module 1312, an output interleaving module 1314, and an output module 1315. Each module or engine implements or processes all or a portion of one or more types of layers. In particular, the convolution engine 1302 and accumulation buffer 1304 together implement or process a convolution layer or fully-connected layer. The activation module 1308 processes or implements the activation layer. Normalization module 1310 processes or implements a normalization layer. The pooling module 1312 implements the pooling layer and the output interleaving module 1314 processes or implements the interleaving layer.
The input module 1301 is configured to receive input data to be processed and provide it to downstream modules for processing.
The convolution engine 1302 is configured to perform a convolution operation on the received input activation data using the received input weight data associated with a particular convolution layer. As shown in fig. 13, the weights for each convolution layer of the NN (which may be generated by the method 1100 of fig. 11) may be stored in a coefficient buffer 1316, and, when the convolution engine 1302 is processing a particular convolution layer, the weights for that particular convolution layer may be provided to the convolution engine 1302. Where the NN accelerator supports variable weight formats, the convolution engine 1302 may be configured to receive information indicating the format or formats of the weights of the current convolution layer being processed, to allow the convolution engine to correctly interpret and process the received weights.
The convolution engine 1302 may include a plurality of multipliers (e.g., 128) and a plurality of adders that add the results of the multipliers to produce a single sum. Although a single convolution engine 1302 is shown in fig. 13, in other examples, there may be multiple (e.g., 8) convolution engines so that multiple windows may be processed simultaneously. The output of the convolution engine 1302 is fed to an accumulation buffer 1304.
The accumulation buffer 1304 is configured to receive the output of the convolution engine and to add the output to the current content of the accumulation buffer 1304. In this manner, the accumulation buffer 1304 accumulates the results of the convolution engine 1302 over several hardware passes of the convolution engine 1302. Although a single accumulation buffer 1304 is shown in fig. 13, in other examples, there may be multiple (e.g., 8, one for each convolution engine) accumulation buffers. The accumulation buffer 1304 outputs the accumulated results to an element-by-element operation module 1306, which may or may not operate on the accumulated results depending on whether an element-by-element layer is to be processed during the current hardware pass.
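For illustration only, the accumulate-over-passes behaviour of the accumulation buffer can be sketched as follows; the class and method names are illustrative assumptions, and the sketch models only the addition of partial results into the buffer's current contents.

```python
import numpy as np

class AccumulationBuffer:
    """Toy model of the accumulation buffer: partial results produced by the
    convolution engine over several passes are added to the current contents."""

    def __init__(self, shape):
        self.acc = np.zeros(shape)

    def accumulate(self, partial_result):
        self.acc += partial_result   # add the engine output to the buffer contents
        return self.acc

buf = AccumulationBuffer((4,))
for partial in (np.ones(4), 2.0 * np.ones(4)):   # two passes of partial results
    result = buf.accumulate(partial)
print(result)   # [3. 3. 3. 3.]
```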
The element-wise operation module 1306 is configured to receive either the input data for the current hardware pass (e.g., when a convolution layer is not processed in the current hardware pass) or the accumulated result from the accumulation buffer 1304 (e.g., when a convolution layer is processed in the current hardware pass). The element-wise operation module 1306 may either process the received input data or pass the received input data to another module (e.g., the activation module 1308 and/or the normalization module 1310), depending on whether an element-wise layer is processed in the current hardware pass and/or depending on whether an activation layer is to be processed before the element-wise layer. When the element-wise operation module 1306 is configured to process the received input data, it performs an element-wise operation on the received data, optionally together with another data set (which may be obtained from external memory). The element-wise operation module 1306 may be configured to perform any suitable element-wise operation, such as, but not limited to, addition, multiplication, maximum and minimum. The result of the element-wise operation is then provided to either the activation module 1308 or the normalization module 1310, depending on whether an activation layer is to be processed after the element-wise layer.
The activation module 1308 is configured to receive one of the following as input data: the raw input for the hardware pass (via the element-wise operation module 1306), e.g., when a convolution layer is not processed in the current hardware pass; or the accumulated data (via the element-wise operation module 1306), e.g., when a convolution layer is processed in the current hardware pass and either an element-wise layer is not processed in the current hardware pass or an element-wise layer is processed in the current hardware pass but after the activation layer. The activation module 1308 is configured to apply an activation function to the input data and provide the output data back to the element-wise operation module 1306, where it is forwarded to the normalization module 1310 either directly or after the element-wise operation module 1306 has processed it. In some cases, the activation function applied to the data received by the activation module 1308 may vary per activation layer. In these cases, information specifying one or more properties of the activation function to be applied for each activation layer may be stored (e.g., in memory), and the relevant information for the activation layer processed in a particular hardware pass may be provided to the activation module 1308 during that hardware pass.
In some cases, the activation module 1308 may be configured to store data representing the activation function in entries of a look-up table. In these cases, the input data may be used to look up one or more entries in the look-up table, and an output value representing the output of the activation function may be determined from the one or more entries. For example, the activation module 1308 may be configured to calculate the output value by interpolating between two or more entries read from the look-up table.
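For illustration only, look-up-table based evaluation of an activation function with interpolation between neighbouring entries might be sketched as follows; the table contents (a sampling of tanh) and the function name are illustrative assumptions.

```python
import numpy as np

def lut_activation(x, lut_inputs, lut_outputs):
    """Evaluate an activation function by reading entries from a look-up table
    and linearly interpolating between the two neighbouring entries."""
    return np.interp(x, lut_inputs, lut_outputs)

# Hypothetical table: tanh sampled at 65 points over [-4, 4].
xs = np.linspace(-4.0, 4.0, 65)
table = np.tanh(xs)
print(lut_activation(np.array([-0.3, 0.05, 2.7]), xs, table))
```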
In some examples, the activation module 1308 may be configured to operate as a rectified linear unit (ReLU) by implementing a ReLU function. In the ReLU function, the output element y_{i,j,k} is calculated by identifying the maximum value as set out in equation (45), wherein for x values less than 0, y = 0:

y_{i,j,k} = f(x_{i,j,k}) = max{0, x_{i,j,k}}     (45)

In other examples, the activation module 1308 may be configured to operate as a parametric rectified linear unit (PReLU) by implementing a PReLU function. The PReLU function performs an operation similar to the ReLU function. In particular, where w_1, w_2, b_1 and b_2 are constants, the PReLU is configured to generate an output element y_{i,j,k} as set out in equation (46):

y_{i,j,k} = f(x_{i,j,k}; w_1, w_2, b_1, b_2) = max{(w_1 * x_{i,j,k} + b_1), (w_2 * x_{i,j,k} + b_2)}     (46)
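For illustration only, equations (45) and (46) can be rendered directly in Python as follows; the example parameter values are illustrative.

```python
import numpy as np

def relu(x):
    """Equation (45): y = max{0, x}."""
    return np.maximum(0.0, x)

def prelu(x, w1, w2, b1, b2):
    """Equation (46): y = max{(w1*x + b1), (w2*x + b2)}, with constant parameters."""
    return np.maximum(w1 * x + b1, w2 * x + b2)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))                            # [0.  0.  0.  1.5]
print(prelu(x, 1.0, 0.25, 0.0, 0.0))      # [-0.5  -0.125  0.  1.5]
```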
the normalization module 1310 is configured to receive as input data one of the following: raw input data for the hardware pass (via element-by-element arithmetic module 1306) (e.g., when the convolutional layer is not processed in the current hardware pass and the layer-by-layer element layer and the active layer are not processed in the current hardware pass); accumulating output (via the element-by-element arithmetic module 1306) (e.g., when the convolutional layer is processed in the current hardware pass and the element-by-element layer and the active layer are not processed in the current hardware pass); and output data of the element-by-element operation module and/or the activation module. Normalization module 1310 then performs a normalization function on the received input data to produce normalized data. In some cases, normalization module 1310 may be configured to perform a Local Response Normalization (LRN) function and/or a Local Contrast Normalization (LCN) function. However, it will be apparent to those skilled in the art that these are merely examples, and that normalization module 1310 may be configured to implement any suitable normalization function or functions. Different normalization layers may be configured to apply different normalization functions.
The pooling module 1312 may receive the normalized data from the normalization module 1310, or may receive the input data of the normalization module 1310 via the normalization module 1310. In some cases, data may be transferred between normalization module 1310 and pooling module 1312 via XBar (or "crossbar") 1318. In this context, the term "XBar" is used to refer to a simple hardware module that contains routing logic that connects multiple modules together in a dynamic manner. In this example, XBar may dynamically connect normalization module 1310, pooling module 1312, and/or output interleaving module 1314 depending on which layers are to be processed in the current hardware pass. Thus, XBar may receive information in each pass indicating which modules 1310, 1312, 1314 are to be connected.
The pooling module 1312 is configured to perform a pooling function, such as, but not limited to, a max function or a mean function, on the received data to produce pooled data. The purpose of a pooling layer is to reduce the spatial size of the representation, in order to reduce the number of parameters and the amount of computation in the network and hence to also control overfitting. In some examples, the pooling operation is performed over a sliding window defined by each pooling layer.
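For illustration only, a max pooling operation over a sliding window might be sketched as follows; the window size and stride values are illustrative assumptions and would in practice be defined by each pooling layer.

```python
import numpy as np

def max_pool_2d(x, window=2, stride=2):
    """Max pooling over a sliding window; the window size and stride would be
    defined by each pooling layer."""
    h, w = x.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + window,
                          j * stride:j * stride + window].max()
    return out

print(max_pool_2d(np.arange(16.0).reshape(4, 4)))
# [[ 5.  7.]
#  [13. 15.]]
```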
The output interleave module 1314 may receive normalized data from the normalization module 1310, the input data to the normalization function (via the normalization module 1310), or pooled data from the pooling module 1312. In some cases, data may be transferred between the normalization module 1310, the pooling module 1312 and the output interleave module 1314 via the XBar 1318. The output interleave module 1314 is configured to perform a rearrangement operation to produce data in a predetermined order. This may comprise sorting and/or transposing the received data. The data generated by the last layer is provided to the output module 1315, where it is converted into the desired output format for the current hardware pass.
The normalization module 1310, the pooling module 1312 and the output interleave module 1314 may each have access to a shared buffer 1320, which these modules 1310, 1312 and 1314 can use to write data to and retrieve data from. For example, the modules 1310, 1312, 1314 may use the shared buffer 1320 to rearrange the order of received data or generated data. For example, one or more of the modules 1310, 1312, 1314 may be configured to write data to the shared buffer 1320 and read the same data out in a different order. In some cases, although each of the normalization module 1310, the pooling module 1312 and the output interleave module 1314 has access to the shared buffer 1320, each of them may be allocated a portion of the shared buffer 1320 that only it can access. In these cases, each of the normalization module 1310, the pooling module 1312 and the output interleave module 1314 may only be able to read data out of the shared buffer 1320 that it has itself written into the shared buffer 1320.
Which modules of the NN accelerator 1300 are used or active during any hardware pass is based on the layers that are processed during that hardware pass. In particular, only the modules or components associated with the layers processed during the current hardware pass are used or active. As described above, the layers that are processed during a particular hardware pass are determined (typically in advance, by, for example, a software tool) based on the order of the layers in the NN and optionally one or more other factors, such as the size of the data. For example, in some cases the NN accelerator may be configured to perform the processing of a single layer per hardware pass unless multiple layers can be processed without writing data to memory between the layers. For example, if a first convolution layer is immediately followed by a second convolution layer, each of the convolution layers would have to be performed in a separate hardware pass, because the output data from the first convolution layer needs to be written out to memory before it can be used as an input to the second convolution layer. In each of these hardware passes, only the modules, components or engines relevant to a convolution layer, such as the convolution engine 1302 and the accumulation buffer 1304, may be used or active.
Although the NN accelerator 1300 of fig. 13 shows a particular order in which the modules, engines, etc. are arranged, and thus how the processing of data flows through the NN accelerator, it will be appreciated that this is merely an example, and that in other examples the modules, engines, etc. may be arranged in a different manner. Furthermore, other hardware logic (e.g., other NN accelerators) may implement additional or alternative types of NN layers, and may therefore comprise different modules, engines, etc.
Alternative cost metrics
In examples where the thresholding step described herein is implemented according to the definition of clamp(x, low, high) in equation (33), an input x that is clamped to low or high (e.g., where the input x depends on a weight value w, such as x = w or x = 2^(-exp) * w) may generate an output that does not depend on x (and thus, in examples where the input x depends on the weight value w, does not depend on w). This is the case, for example, where low is the smallest or lowest representable number in the fixed point number format defined by b and exp (e.g., low = -2^(b+exp-1)) and high is the largest or highest representable number in the fixed point number format defined by b and exp (e.g., high = 2^(b+exp-1) - 2^exp), because in this example neither low nor high depends on x (and thus nor on w). In these examples, it is not possible to back-propagate a non-zero gradient of the cost metric with respect to x (or w) to those clamped values via the equation used in the thresholding step, which means that it may not be possible to effectively adjust input weight values that are clamped during the thresholding step. In other words, in these examples, when performing the method described herein with reference to fig. 11, only the weight values that are not clamped during the thresholding step may have a non-zero gradient back-propagated to them via the equation used in the thresholding step, and thus only those weight values may be effectively adjusted in blocks 1102 and 1104, respectively.
To address this, in these and other examples where the definitions of low and high used in the thresholding step do not depend on x (and thus nor on w), meaning that it is not possible to back-propagate a non-zero gradient of the cost metric with respect to x (or w) to the clamped values via the equation used in the thresholding step, an alternative cost metric (e.g., loss function) may be used in block 504. An example of an alternative cost metric is shown in equation (47). The main difference between equation (3) and equation (47) is the introduction of a further term (γ*tm). This further term comprises a "thresholding metric" tm and a weight γ applied to the thresholding metric. That is, the cost metric may be a combination (e.g., a weighted sum) of the error metric em, the implementation metric sm and the thresholding metric tm.
cm=(α*em)+(β*sm)+(γ*tm) (47)
The purpose of the thresholding metric tm may be to assign a cost to the thresholding (clamping) of input values during quantization. This means that, when the thresholding metric tm is minimised as part of the cost metric cm, it acts to reduce the number of input values that are clamped during the thresholding step, for example by adjusting the clamped input values and/or the low and/or high thresholds used during the thresholding step. For example, the thresholding metric tm for the NN may be determined by summing a "thresholding cost" t_l for each of a plurality of layers l of the NN, where t_l may be determined in accordance with equation (48), in which x_i depends on the i-th weight w_i (for example x_i = 2^(-exp) * w_i) and N is the number of weights in the layer.
t_l = Σ_{i=1}^{N} ( max(low - x_i, 0) + max(x_i - high, 0) )     (48)
In equation (48), the contribution of a weight value w_i to the thresholding cost t_l is non-zero only for weight values that fall outside the representable range of the fixed point number format (i.e., weight values that are smaller than low or larger than high and will therefore be clamped to low or high in the thresholding step). This is because, if a weight value is within the representable range (e.g., greater than low and less than high), both of the "max" functions in equation (48) return "0". Thus, minimising the thresholding metric acts to "push" the weight values w_i that are clamped during the thresholding step towards the range of numbers representable by the fixed point number format, and to "pull" the respective low or high thresholds to which those weight values w_i are clamped towards those weight values w_i. In other words, minimising the thresholding metric drives the weight values w_i and the low and high thresholds towards values for which the "max" functions in equation (48) return "0" more often (i.e., for which more weight values w_i fall within the representable range). This means that, during back-propagation and adjustment, a weight value w_i is either influenced by the error metric em and the implementation metric sm (e.g., if the weight value w_i is within the representable range and is thus not clamped to low or high), or by the thresholding metric tm (e.g., if the weight value w_i is outside the representable range and is therefore clamped to low or high). When a weight value w_i is influenced by the thresholding metric tm, the weight value is "pushed" back towards the range of representable values, within which it may again be influenced by the error metric em and the implementation metric sm.
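For illustration only, a minimal Python sketch of the thresholding cost of equation (48) and the alternative cost metric of equation (47) is given below. The function names, the weights α, β and γ, and the example values are illustrative assumptions, and the error metric em and implementation metric sm are assumed to be computed elsewhere as described earlier in the document.

```python
import numpy as np

def thresholding_cost(x, low, high):
    """Equation (48): each value x_i contributes only when it lies outside the
    representable range [low, high] of the fixed point number format."""
    return np.sum(np.maximum(low - x, 0.0) + np.maximum(x - high, 0.0))

def cost_metric(em, sm, layer_values, lows, highs, alpha=1.0, beta=1.0, gamma=1.0):
    """Equation (47): cm = (alpha * em) + (beta * sm) + (gamma * tm), where the
    thresholding metric tm is formed by summing the per-layer thresholding costs."""
    tm = sum(thresholding_cost(x, lo, hi)
             for x, lo, hi in zip(layer_values, lows, highs))
    return alpha * em + beta * sm + gamma * tm

# Toy usage: a single layer whose values 3.5 and -2.5 lie outside [-2.0, 1.75].
print(cost_metric(em=0.1, sm=5.0,
                  layer_values=[np.array([0.5, 3.5, -2.5])],
                  lows=[-2.0], highs=[1.75]))   # 0.1 + 5.0 + 2.25 = 7.35
```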
Fig. 14 illustrates various components of an exemplary general purpose computing-based device 1400 that may be implemented as any form of computing and/or electronic device and in which the embodiments of the methods 500, 1100 of fig. 5 and 10A-10E described above may be implemented.
The computing-based device 1400 includes one or more processors 1402, which may be microprocessors, controllers, or any other suitable type of processor, for processing computer-executable instructions to control the operation of the device. In some examples, for example where a system-on-chip architecture is used, the processors 1402 may include one or more fixed function blocks (also referred to as accelerators) that implement a part of the methods described herein in hardware (rather than software or firmware). Platform software comprising an operating system 1404 or any other suitable platform software may be provided at the computing-based device to enable application software, such as computer-executable code 1405 for implementing one or more of the methods 500, 1100 of figs. 5 and 10A-10E, to be executed on the device.
Computer-executable instructions may be provided using any computer-readable medium accessible by the computing-based device 1400. Computer-readable media may include, for example, computer storage media such as memory 1406 and communication media. Computer storage media (i.e., non-transitory machine-readable media) such as memory 1406, include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures or program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computing device. Rather, the communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. As defined herein, computer storage media does not include communication media. Although a computer storage medium (i.e., a non-transitory machine-readable medium, such as memory 1406) is shown within computing-based device 1400, it should be appreciated that the memory may be distributed or located at a remote location and accessible via a network or other communication link (e.g., using communication interface 1408).
The computing-based device 1400 also includes an input/output controller 1410 that is arranged to output display information to a display device 1412 that may be separate from or integral with the computing-based device 1400. The display information may provide a graphical user interface. The input/output controller 1410 is also arranged to receive and process input from one or more devices, such as a user input device 1414 (e.g., a mouse or keyboard). In one embodiment, if the display device 1412 is a touch sensitive display device, it may also serve as a user input device 1414. The input/output controller 1410 may also output data to devices other than a display device, such as a locally connected printing device (not shown in fig. 14).
FIG. 15 illustrates a computer system in which hardware logic (e.g., NN accelerator) configurable to implement NNs described herein may be implemented. The computer system includes a CPU 1502, a GPU 1504, a memory 1506, and other devices 1514 such as a display 1516, speakers 1518, and a camera 1520. As shown in fig. 15, hardware logic (e.g., NN accelerator 1300 of fig. 13) that may be configured to implement NN 1510 may be implemented on GPU 1504. The components of the computer system can communicate with each other via a communication bus 1522. In other examples, the hardware logic configurable to implement NN 1510 may be implemented independently of a CPU or GPU and may have a separate connection with communication bus 1522. In some examples, there may be no GPU and the CPU may provide control information to hardware logic that may be configured to implement NN 1510.
The NN accelerator 1300 of fig. 13 is shown as including a plurality of functional blocks. This is merely illustrative and is not intended to limit the strict division between the different logic elements of such entities. Each of the functional blocks may be provided in any suitable manner. It should be appreciated that intermediate values described herein as being formed by the NN accelerator or processing module need not be physically generated at any time by the NN accelerator or processing module, and merely represent logical values that conveniently describe the processing performed by the NN accelerator or processing module between its inputs and outputs.
Hardware logic (e.g., NN accelerator 1300 of fig. 13) configurable to implement an NN described herein may be embodied in hardware on an integrated circuit. Generally, any of the functions, methods, techniques, or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry) or any combination thereof. The terms "module," "functionality," "component," "element," "unit," "block," and "logic" may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs specified tasks when executed on a processor. The algorithms and methods described herein may be executed by one or more processors executing code that causes the processors to perform the algorithms/methods. Examples of a computer-readable storage medium include Random Access Memory (RAM), read-only memory (ROM), optical disks, flash memory, hard disk memory, and other memory devices that can store instructions or other data using magnetic, optical, and other techniques and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for a processor, including code expressed in a machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, byte code, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in programming language code such as C, java or OpenCL. The executable code may be, for example, any kind of software, firmware, script, module, or library that, when properly executed, handled, interpreted, compiled, run in a virtual machine or other software environment, causes the processor of the computer system supporting the executable code to perform the tasks specified by the code.
The processor, computer, or computer system may be any kind of device, machine, or special purpose circuit, or a collection or portion thereof, that has processing capabilities such that instructions can be executed. The processor may be any kind of general purpose or special purpose processor such as a CPU, GPU, system on a chip, state machine, media processor, application Specific Integrated Circuit (ASIC), programmable logic array, field Programmable Gate Array (FPGA), etc. The computer or computer system may include one or more processors.
The invention is also intended to cover software defining the configuration of hardware as described herein, such as HDL (hardware description language) software, as used for designing integrated circuits, or for configuring programmable chips to carry out desired functions. That is, a computer readable storage medium may be provided having encoded thereon computer readable program code in the form of an integrated circuit definition data set that, when processed (i.e., run) in an integrated circuit manufacturing system, configures the system to manufacture hardware logic (e.g., an NN accelerator) that may be configured to implement an NN described herein. The integrated circuit definition data set may be, for example, an integrated circuit description.
Accordingly, a method of manufacturing hardware logic (e.g., NN accelerator 1300 of FIG. 13) at an integrated circuit manufacturing system that is configurable to implement NN as described herein may be provided. In addition, an integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, causes a method of manufacturing hardware logic (e.g., NN accelerator 1300 of FIG. 13) configurable to implement NN to be performed.
The integrated circuit definition data set may be in the form of computer code, for example as a netlist, as code for configuring a programmable chip, or as a hardware description language defining hardware suitable for manufacture at any level in an integrated circuit, including as register transfer level (RTL) code, as a high-level circuit representation (such as Verilog or VHDL), or as a low-level circuit representation (such as OASIS (RTM) and GDSII). Higher-level representations, such as RTL, which logically define hardware suitable for manufacture in an integrated circuit may be processed at a computer system configured to generate a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements, in order to generate the manufacturing definition of the integrated circuit so defined by the representation. As is typically the case with software executing at a computer system so as to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required in order for the computer system to execute the code defining the integrated circuit and thereby generate the manufacturing definition of that integrated circuit.
An example of processing an integrated circuit definition data set at an integrated circuit manufacturing system to configure the system to manufacture hardware logic (e.g., NN accelerator) that may be configured to implement NN will now be described with respect to fig. 16.
Fig. 16 illustrates an example of an Integrated Circuit (IC) fabrication system 1602 configured to fabricate hardware logic (e.g., an NN accelerator) that is configurable to implement an NN as described in any of the examples herein. In particular, the IC fabrication system 1602 includes a layout processing system 1604 and an integrated circuit generation system 1606. The IC fabrication system 1602 is configured to receive an IC definition data set (e.g., defining hardware logic (e.g., an NN accelerator) that is configurable to implement an NN as described in any of the examples herein), process the IC definition data set, and generate an IC from the IC definition data set (which, e.g., embodies hardware logic (e.g., an NN accelerator) that is configurable to implement an NN as described in any of the examples herein). Processing of the IC definition data set configures the IC fabrication system 1602 to fabricate an integrated circuit embodying hardware logic (e.g., an NN accelerator) that is configurable to implement an NN as described in any of the examples herein.
Layout processing system 1604 is configured to receive and process the IC definition data set to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art and may involve, for example, synthesizing RTL codes to determine a gate level representation of a circuit to be generated, for example in terms of logic components (e.g., NAND, NOR, AND, OR, MUX and FLIP-FLOP components). By determining the location information of the logic components, the circuit layout may be determined from the gate level representation of the circuit. This may be done automatically or with the participation of a user in order to optimize the circuit layout. When the layout processing system 1604 has determined the circuit layout, it may output the circuit layout definition to the IC generation system 1606. The circuit layout definition may be, for example, a circuit layout description.
As is known in the art, the IC generation system 1606 generates ICs according to a circuit layout definition. For example, IC generation system 1606 may implement a semiconductor device fabrication process that generates ICs, which may involve a multi-step sequence of photolithography and chemical processing steps during which electronic circuits are built up on wafers made of semiconductor material. The circuit layout definition may be in the form of a mask that may be used in a lithographic process to generate an IC from the circuit definition. Alternatively, the circuit layout definitions provided to IC generation system 1606 may be in the form of computer readable code that IC generation system 1606 may use to form an appropriate mask for generating an IC.
The different processes performed by the IC fabrication system 1602 may all be implemented at one location, e.g., by a party. Alternatively, IC fabrication system 1602 may be a distributed system such that some processes may be performed at different locations and by different parties. For example, some of the following phases may be performed at different locations and/or by different parties: (i) Synthesizing an RTL code representing the IC definition dataset to form a gate level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate level representation; (iii) forming a mask according to the circuit layout; and (iv) using the mask to fabricate the integrated circuit.
In other examples, processing of the integrated circuit definition data set at the integrated circuit manufacturing system may configure the system to manufacture hardware logic (e.g., NN accelerators) that may be configured to implement NN without processing the IC definition data set in order to determine a circuit layout. For example, an integrated circuit definition dataset may define a configuration of a reconfigurable processor (such as an FPGA), and processing of the dataset may configure the IC manufacturing system to generate (e.g., by loading configuration data into the FPGA) the reconfigurable processor with the defined configuration.
In some embodiments, the integrated circuit manufacturing definition data set, when processed in the integrated circuit manufacturing system, may cause the integrated circuit manufacturing system to generate an apparatus as described herein. For example, configuration of an integrated circuit manufacturing system in the manner described above with respect to fig. 16, by an integrated circuit manufacturing definition dataset, may result in the manufacture of an apparatus as described herein.
In some examples, the integrated circuit definition dataset may include software running on or in combination with hardware defined at the dataset. In the example shown in fig. 16, the IC generation system may be further configured by the integrated circuit definition data set to load firmware onto the integrated circuit in accordance with the program code defined at the integrated circuit definition data set at the time of manufacturing the integrated circuit or to otherwise provide the integrated circuit with the program code for use with the integrated circuit.
The implementation of the concepts set forth in the present application in devices, apparatuses, modules, and/or systems (and in methods implemented herein) may result in performance improvements over known implementations. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During the manufacture of such devices, apparatuses, modules and systems (e.g., in integrated circuits), a tradeoff may be made between performance improvements and physical implementation, thereby improving the manufacturing method. For example, a tradeoff may be made between performance improvement and layout area, matching the performance of known implementations, but using less silicon. This may be accomplished, for example, by reusing the functional blocks in a serial fashion or sharing the functional blocks among elements of a device, apparatus, module, and/or system. Rather, the concepts described herein that lead to improvements in the physical implementation of devices, apparatuses, modules, and systems (e.g., reduced silicon area) may be weighed against performance improvements. This may be accomplished, for example, by fabricating multiple instances of the module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (20)

1. A computer-implemented method of identifying one or more quantization parameters for transforming values to be processed by a neural network "NN" to implement the NN in hardware, the method comprising, in at least one processor:
(a) Determining an output of a model of the NN in response to training data, the model of the NN comprising one or more quantization blocks, each of the one or more quantization blocks configured to transform one or more sets of values input to a layer of the NN into a respective fixed point number format defined by one or more quantization parameters before the model processes the one or more sets of values in accordance with the layer;
(b) Determining a cost metric for the NN, the cost metric being a combination of an error metric and an implementation metric, the implementation metric representing an implementation cost of the NN based on the one or more quantization parameters, the one or more sets of values having been transformed in accordance with the one or more quantization parameters, the implementation metric depending on, for each of a plurality of layers of the NN:
a first contribution representing an implementation cost of the output from the layer; and
a second contribution representing an implementation cost of the output from a layer preceding the layer;
(c) Back-propagating a derivative of the cost metric to at least one of the one or more quantization parameters to generate a gradient of the cost metric for the at least one of the one or more quantization parameters; and
(d) Adjusting the at least one of the one or more quantization parameters based on the gradient for the at least one of the one or more quantization parameters.
2. The computer-implemented method of claim 1, further comprising: after adjusting step (d), removing a set of values from the model of the NN according to the adjusted at least one of the one or more quantization parameters.
3. The computer-implemented method of claim 1 or 2, wherein the first contribution is formed from an implementation cost of one or more output channels of weight data input to the layer, and the second contribution is formed from an implementation cost of one or more input channels of activation data input to the layer.
4. The computer-implemented method of claim 1 or 2, wherein the first contribution is formed from an implementation cost of one or more output channels of weight data input to the layer, and the second contribution is formed from an implementation cost of one or more input channels of weight data input to the layer.
5. The computer-implemented method of claim 1 or 2, wherein each of the one or more quantization parameters comprises a respective bit width, and wherein each of the one or more sets of values is a channel of values input to the layer, the method comprising determining a respective bit width for each of one or more input channels of weight data input to the layer, and determining a respective bit width for each of one or more output channels of weight data input to the layer.
6. The computer-implemented method of claim 5, wherein a first bit width and a second bit width are separately determined for each weight value input to the layer, and the method comprises transforming each weight value input to the layer in accordance with its respective first bit width and/or its respective second bit width, optionally in accordance with the smaller of the respective first bit width and the respective second bit width of that weight value.
7. The computer-implemented method of claim 5, the method comprising: after the adjusting step (d), removing an output channel of the weight data input to a preceding layer from the model of the NN when an adjusted bit width for a corresponding input channel of the weight data input to the layer is zero.
8. The computer-implemented method of claim 1 or 2, wherein:
the first contribution is formed from an implementation cost of one or more output channels of weight data input to the layer and an implementation cost of one or more biases input to the layer; and
the second contribution is formed from an implementation cost of one or more output channels of weight data input to the preceding layer and an implementation cost of one or more biases input to the preceding layer.
9. The computer-implemented method of claim 1 or 2, wherein the first contribution is formed from implementation costs of one or more output channels of weight data input to the layer, and the second contribution is formed from implementation costs of one or more output channels of weight data input to the preceding layer.
10. The computer-implemented method of claim 1 or 2, wherein each of the one or more quantization parameters comprises a respective bit width, and wherein the one or more sets of values comprise one or more output channels of weight data input to the layer and one or more output channels of weight data input to the preceding layer, the method comprising transforming each of the one or more output channels of weight data input to the layer according to the respective bit width, and transforming each of the one or more output channels of weight data input to the preceding layer according to the respective bit width.
11. The computer-implemented method of claim 10, the method comprising: after the adjusting step (d), removing an output channel of the weight data input to the preceding layer from the model of the NN when the adjusted bit width for that output channel is zero.
12. The computer-implemented method of claim 9, wherein for each of a plurality of layers of the NN, the implementation metric is further dependent on an additional contribution of implementation cost representing one or more biases input to the preceding layer.
13. The computer-implemented method of claim 12, the method comprising: after the adjusting step (d), removing an output channel of the weight data input to the preceding layer from the model of the NN when the adjusted bit width for that output channel is zero and the absolute value of the associated bias is zero.
14. The computer-implemented method of claim 1 or 2, wherein a layer of the NN receives activation input data that has been derived from activation output data of more than one preceding layer, and wherein the implementation metric for that layer depends on:
a first contribution representing implementation costs of outputs from the layers;
a second contribution representing implementation costs of outputs from a first layer preceding the layer; and
a third contribution representing implementation costs of outputs from a second layer preceding the layer.
15. The computer-implemented method of claim 1 or 2, wherein a layer of the NN outputs activation data that is input to a first subsequent layer and a second subsequent layer, wherein the method further comprises adding a new layer to the NN between the layer and the first subsequent layer, and wherein the implementation metric for the first subsequent layer depends on:
a first contribution representing implementation costs of outputs from the first subsequent layer; and
a second contribution representing implementation costs of the output from the new layer.
16. The computer-implemented method of claim 1 or 2, wherein the second contribution represents an implementation cost of output from a layer immediately preceding the layer.
17. The computer-implemented method of claim 1 or 2, further comprising outputting the adjusted at least one of the one or more quantization parameters for configuring hardware logic to implement the NN.
18. The computer-implemented method of claim 1 or 2, further comprising configuring hardware logic to implement the NN using the adjusted quantization parameter, optionally wherein the hardware logic comprises a neural network accelerator.
19. A non-transitory computer-readable storage medium having stored thereon computer-readable instructions that, when executed at a computer system, cause the computer system to perform a computer-implemented method of identifying one or more quantization parameters for transforming values to be processed by a neural network "NN" to implement the NN in hardware, the method comprising, in at least one processor:
(a) Determining an output of a model of the NN in response to training data, the model of the NN comprising one or more quantization blocks, each of the one or more quantization blocks configured to transform one or more sets of values input to a layer of the NN into a respective fixed point number format defined by one or more quantization parameters before the model processes the one or more sets of values in accordance with the layer;
(b) Determining a cost metric for the NN, the cost metric being a combination of an error metric and an implementation metric, the implementation metric representing an implementation cost of the NN based on the one or more quantization parameters, the one or more sets of values having been transformed in accordance with the one or more quantization parameters, the implementation metric depending on, for each of a plurality of layers of the NN:
a first contribution representing an implementation cost of the output from the layer; and
a second contribution representing an implementation cost of the output from a layer preceding the layer;
(c) Back-propagating a derivative of the cost metric to at least one of the one or more quantization parameters to generate a gradient of the cost metric for the at least one of the one or more quantization parameters; and
(d) Adjusting the at least one of the one or more quantization parameters based on the gradient for the at least one of the one or more quantization parameters.
20. A computing-based device configured to identify one or more quantization parameters for transforming values to be processed by a neural network "NN" to implement the NN in hardware, the computing-based device comprising:
at least one processor; and
a memory coupled to the at least one processor, the memory comprising:
computer readable code, which when executed by the at least one processor, causes the at least one processor to:
(a) Determine an output of a model of the NN in response to training data, the model of the NN comprising one or more quantization blocks, each of the one or more quantization blocks configured to transform one or more sets of values input to a layer of the NN into a respective fixed point number format defined by one or more quantization parameters before the model processes the one or more sets of values in accordance with the layer;
(b) Determine a cost metric for the NN, the cost metric being a combination of an error metric and an implementation metric, the implementation metric representing an implementation cost of the NN based on the one or more quantization parameters, the one or more sets of values having been transformed in accordance with the one or more quantization parameters, the implementation metric depending on, for each of a plurality of layers of the NN:
a first contribution representing an implementation cost of the output from the layer; and
a second contribution representing an implementation cost of the output from a layer preceding the layer;
(c) Back-propagate a derivative of the cost metric to at least one of the one or more quantization parameters to generate a gradient of the cost metric for the at least one of the one or more quantization parameters; and
(d) Adjust the at least one of the one or more quantization parameters based on the gradient for the at least one of the one or more quantization parameters.

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
GB2209616.8 2022-06-30
GB2209612.7 2022-06-30
GBGB2216947.8A GB202216947D0 (en) 2022-06-30 2022-11-14 Processing data using a neural network "nn" implemented in hardware
GB2216947.8 2022-11-14
GB2216948.6 2022-11-14

