CN116583854A - Worst case noise and margin management for RPU crossbar arrays


Info

Publication number: CN116583854A
Application number: CN202180080347.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventor: M. J. Rasch
Current and original assignee: International Business Machines Corp
Application filed by International Business Machines Corp
Legal status: Pending
Prior art keywords: input, vector value, input vector, analog

Classifications

    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled, for shifting, e.g. justifying, scaling, normalising
    • G06N3/065 Physical realisation of neural networks using analogue electronic means
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G11C11/54 Digital stores using storage elements simulating biological cells, e.g. neurons


Abstract

Techniques for noise and bound management for DNN training on RPU crossbar arrays using worst-case scaling factors are provided. A method for noise and bound management includes: obtaining input vector values x for an analog crossbar array of RPU devices, wherein a weight matrix is mapped to the analog crossbar array of RPU devices; and scaling the input vector values x based on a worst-case scenario to provide scaled input vector values x' for use as input to the analog crossbar array of RPU devices, wherein the worst-case scenario comprises an assumed maximum weight of the weight matrix multiplied by the sum of the absolute values of the input vector values x.

Description

Worst case noise and margin management for RPU crossbar arrays
Technical Field
The present invention relates to training deep neural networks (DNNs) on analog crossbar arrays of resistive processing unit (RPU) devices, and more particularly to techniques for noise and bound management for DNN training on RPU crossbar arrays using worst-case scaling factors, which provide an effective run-time improvement.
Background
A deep neural network (DNN) may be embodied in an analog crossbar array of memory devices such as resistive processing units (RPUs). DNN-based models have been used for a variety of cognition-based tasks, such as object and speech recognition and natural language processing. Neural network training is required to provide a high level of accuracy when performing these tasks. However, the operations performed on an RPU array are analog in nature and are therefore susceptible to various noise sources. When the input values to the RPU array are small (such as during a backward cycle pass), the output signal can be buried in noise, producing an incorrect result.
In addition, digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) are used to convert the digital inputs of the RPU array into analog signals and to convert the outputs of the RPU array back into digital signals, respectively. The training process is therefore also limited by the bounded ranges of the DAC and ADC converters employed by the array.
Bound management becomes especially important for DNN training on RPU arrays when the weights are set according to automatic weight scaling. With automatic weight scaling, the available resistance-state resources of the RPU devices in the array are optimally mapped onto a weight range (of resistance values) useful for DNN training by scaling the bounded weight range of the RPU devices by the array size.
The conventional approach is to identify the maximum value (m) in the input vector and scale the input values of the RPU array by that maximum value to obtain the best analog noise performance (noise management). To manage the bounds (bound management), saturation at the output of the RPU array is eliminated by scaling down the values from which the input signals of the RPU array are formed.
However, whenever the bound is exceeded, the computation must be repeated with scaled-down inputs until the output falls below the threshold. While very effective in addressing the increased test error of automatic weight scaling, this iterative scaling-down approach incurs an undesirable cost, namely variable run time.
Therefore, an effective noise and bound management technique with improved run time would be desirable.
Disclosure of Invention
Techniques for noise and bound management for deep neural network (DNN) training on resistive processing unit (RPU) crossbar arrays using worst-case scaling factors are provided, which give an effective run-time improvement. In one aspect of the invention, a method for noise and bound management is provided. The method includes the following steps: obtaining input vector values x for an analog crossbar array of RPU devices, wherein a weight matrix is mapped to the analog crossbar array of RPU devices; and scaling the input vector values x based on a worst-case scenario to provide scaled input vector values x' for use as input to the analog crossbar array of RPU devices, wherein the worst-case scenario comprises an assumed maximum weight of the weight matrix multiplied by the sum of the absolute values of the input vector values x.
For example, an absolute maximum input value x_mx may be calculated from the input vector values x, and a proposed scaling factor σ may be calculated as σ = ωs/b, where ω is the assumed maximum weight of the weight matrix, s is the total input variable, and b is the output bound of the analog crossbar array of RPU devices. The noise and bound management scaling factor α can then be set to the larger of the absolute maximum input value x_mx and the proposed scaling factor σ, and used to scale the input vector values x.
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.
Drawings
Fig. 1 is a diagram illustrating a Deep Neural Network (DNN) embodied in an analog crossbar array of a Resistive Processing Unit (RPU) device according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an exemplary method for noise and bound management according to an embodiment of the invention;
FIG. 3 is a diagram illustrating an exemplary implementation of the present noise and bound management technique in forward loop operation in accordance with an embodiment of the present invention;
FIG. 4 is a diagram illustrating an exemplary implementation of the present noise and bound management technique in a backward-cycling operation in accordance with an embodiment of the present invention;
FIG. 5 is a diagram illustrating an alternative exemplary method for noise and bound management according to an embodiment of the invention;
FIG. 6 is a diagram illustrating an exemplary apparatus for performing one or more methods presented herein, in accordance with an embodiment of the present invention;
FIG. 7 depicts a cloud computing environment according to an embodiment of the invention; and
FIG. 8 depicts an abstract model layer, according to an embodiment of the invention.
Detailed Description
As described above, existing noise and bound management techniques involve scaling the input values of the RPU array by the maximum value (m) in the input vector in order to obtain optimal analog noise performance (noise management). To manage the bounds, saturation at the output of the RPU array is eliminated by iteratively scaling down the values from which the input signals of the RPU array are formed until the output falls below the threshold (bound management). However, doing so can undesirably introduce run-time delays.
Advantageously, techniques are provided herein for noise and bound management for deep neural network (DNN) training on an analog RPU crossbar array when the dynamic range of the (noisy) analog system is limited and the run time must be minimized. That is, as will be described in detail below, the present approach scales the input signals of the RPU array relative to a worst-case scenario (i.e., the maximum possible output of any weight matrix given a particular input), which is used to estimate the scaling factor required to bring the signals within the limited dynamic range of the analog crossbar system.
As shown in fig. 1, the DNN may be embodied in an analog crossbar array of RPU devices, where each parameter (weight Wij) of the algorithm (abstract) weight matrix 102 is mapped to a single RPU device (RPUij) on hardware, i.e., the physical crossbar array 104 of RPU devices 110. The crossbar array 104 has a set (first set) of conductive row lines 106 and a set (second set) of conductive column lines 108, the set of conductive column lines 108 intersecting and orthogonal to the set of conductive row lines 106. See fig. 1. The intersections of the sets of conductive row lines 106 and conductive column lines 108 are separated by RPU devices 110, forming a crossbar array 104 of RPU devices 110. Each RPU device 110 may include an active region between two electrodes (i.e., a double-ended device). The conductive state of the active region identifies the weight value of RPU device 110, which may be updated/adjusted by applying a programming signal to the electrodes.
Each RPU device 110 (RPUij) is uniquely identified based on its location (i.e., in the ith row and jth column of the crossbar array 104). For example, from top to bottom, left to right in crossbar array 104, RPU devices 110 located at the intersections of first conductive row lines 106a and first conductive column lines 108a are designated RPU11, RPU devices 110 located at the intersections of first conductive row lines 106a and second conductive column lines 108b are designated RPU12, and so on. The mapping of the weight parameters in the weight matrix 102 to the RPU devices 110 in the crossbar array 104 follows the same convention. For example, weight Wi1 of weight matrix 102 is mapped to RPUi1 of crossbar array 104, weight Wi2 of weight matrix 102 is mapped to RPUi2 of crossbar array 104, and so on.
The RPU devices 110 of the crossbar array 104 serve as weighted connections between neurons in the DNN. The resistance of RPU device 110 may be modified by controlling the voltage applied between the individual conductors of the set of conductive row lines 106 and the set of conductive column lines 108. How to alter the resistance of the RPU device 110 is how to store data in the crossbar array 104 based on, for example, a high resistance state or a low resistance state of the RPU device 110. The resistive state (high or low) of the RPU device 110 is read by applying (reading) a voltage to the corresponding conductive lines of the set of conductive row lines 106 and the set of conductive column lines 108 and measuring the current flowing through the target RPU device 110. All weighting related operations are performed by RPU device 110 in full parallel.
In machine learning and cognitive science, DNN-based models are a family of statistical learning models inspired by the biological neural networks of animals, particularly the brain. These models can be used to estimate or approximate systems and cognitive functions that depend on a large number of inputs and on connection weights that are generally unknown. DNNs are typically embodied as so-called "neuromorphic" systems of interconnected processor elements that act as simulated "neurons" and exchange "messages" with one another in the form of electronic signals. The connections in a DNN that carry electronic messages between simulated neurons are given numeric weights corresponding to the strength of a given connection. These numeric weights can be adjusted and tuned based on experience, so that the DNN adapts to inputs and is capable of learning. For example, a DNN for image classification is defined by a set of input neurons that can be activated by the pixels of an input image. After being weighted and transformed by a function determined by the network designer, the activations of these input neurons are then passed on to other downstream neurons. This process is repeated until an output neuron is activated. The activated output neuron determines the classification of the image.
DNN training may be performed using a process such as stochastic gradient descent (SGD), in which backpropagation is used to calculate the error gradient of each parameter (weight Wij). Backpropagation is performed in three cycles, namely a forward cycle, a backward cycle and a weight update cycle, which are repeated many times until a convergence criterion is met.
A DNN-based model is composed of multiple processing layers that learn representations of data with multiple levels of abstraction. For a single processing layer in which N input neurons are connected to M output neurons, the forward cycle involves computing a vector-matrix multiplication (y = Wx), where the vector x of length N represents the activities of the input neurons and the matrix W of size M×N stores the weight values between each pair of input and output neurons. The resulting vector y of length M is further processed by applying a nonlinear activation to each of its elements, and is then passed on to the next layer.
Once the information reaches the final output layer, the backward cycle involves calculating the error signal and backpropagating it through the DNN. The backward cycle on a single layer also involves a vector-matrix multiplication with the transpose of the weight matrix (z = W^T δ), where the vector δ of length M represents the error calculated by the output neurons, and the vector z of length N is further processed using the derivative of the neuron nonlinearity and then passed down to the previous layer.
Finally, in the weight update cycle, the weight matrix W is updated by performing an outer product of the two vectors used in the forward and backward cycles. This outer-product update is typically expressed as W ← W + η(δx^T), where η is the global learning rate. All of these operations performed on the weight matrix 102 during backpropagation can be implemented with the crossbar array 104 of RPU devices 110.
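The three backpropagation cycles described above can be sketched digitally as follows (an illustrative NumPy simulation with arbitrary dimensions; all variable names and values are ours, and the analog RPU hardware is not modeled here):

```python
import numpy as np

# Illustrative dimensions: N = 4 input neurons, M = 3 output neurons.
rng = np.random.default_rng(0)
N, M, eta = 4, 3, 0.1            # eta: global learning rate
W = rng.standard_normal((M, N))  # weight matrix (maps to the crossbar array)
x = rng.standard_normal(N)       # input-neuron activities

# Forward cycle: y = W x
y = W @ x

# Backward cycle: z = W^T * delta, where delta is the output error
delta = rng.standard_normal(M)
z = W.T @ delta

# Weight update cycle: outer product, W <- W + eta * (delta x^T)
W_new = W + eta * np.outer(delta, x)
```

These three cycles are repeated until the convergence criterion is met; on the RPU hardware, all three map onto parallel analog operations of the crossbar array.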
As described above, digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) are used to convert the digital inputs of the RPU devices 110 of the crossbar array 104 into analog signals, and to convert the outputs of the RPU devices 110 of the crossbar array 104 back into digital signals, respectively. With noise and bound management for DNN training on an analog RPU crossbar array,

x′ = f_DAC(x/α)    (1)

is the input vector in analog space, and

y = α·f_ADC(y′)    (2)

is the digital output vector, where f_DAC and f_ADC represent the conversions of the DAC and ADC, respectively, and α is the noise and bound management scaling factor. The operations performed by the RPU devices 110 of the crossbar array 104 have bounded output values owing to ADC saturation. That is, the ADC is defined within a certain range -b,...,b, where values below the output bound -b or above the output bound b saturate at the respective bound. Information exceeding the bound is lost due to clipping. Conventionally, if the analog-computed output exceeds the bound, i.e., max_i |y′_i| ≥ b, the calculation is iteratively repeated with α ← 2α until the output is below the bound (bound management). However, the iterative calculation has a negative impact on run time.
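The conventional iterative bound management scheme described above can be sketched as follows (an illustrative simulation in which the analog pass is modeled simply as a matrix-vector product clipped at the bound b; all names and values are ours):

```python
import numpy as np

def clipped_mvm(W, x, b):
    """Model of the analog pass: ideal W @ x, saturated at the ADC bound b."""
    return np.clip(W @ x, -b, b)

def conventional_bound_management(W, x, b):
    """Iteratively double the scaling factor alpha until no output clips."""
    alpha = np.max(np.abs(x))            # classic noise management choice
    while True:
        y_prime = clipped_mvm(W, x / alpha, b)
        if np.max(np.abs(y_prime)) < b:  # no saturation -> done
            return alpha * y_prime       # rescale back to the digital domain
        alpha *= 2                       # bound exceeded: repeat (variable run time)

W = np.array([[0.5, 0.5], [0.5, -0.5]])
x = np.array([10.0, 6.0])
y = conventional_bound_management(W, x, b=0.5)
```

In this toy example the first pass saturates, so the loop runs a second time with a doubled α; it is exactly this data-dependent repetition that makes the run time variable.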
In contrast, according to the present techniques, the inputs to the RPU devices 110 of the crossbar array 104 are scaled based on a worst-case scenario in order to mitigate the risk of output values being clipped by the limited dynamic range of the analog crossbar array 104 of RPU devices 110. The term "worst-case scenario" as used herein refers to the largest possible output of a weight matrix, such as weight matrix 102, given a particular input vector (i.e., with all weights at their maximum). The physical conductances representing the weights in the RPU devices are physically limited. Thus, it is assumed herein that the weights in the weight matrix lie in the range -wmax to wmax, where wmax corresponds to gmax (i.e., the maximum conductance achievable by an RPU device). As will be described in detail below, with the present techniques the absolute sum of the input signals and an assumed constant maximum weight are used to calculate the noise and bound management scaling factor (α) applied to the inputs and outputs in the digital periphery (before the DAC and after the ADC, respectively) in order to bring the inputs into the dynamic range of the RPU devices 110 of the crossbar array 104. Advantageously, the present worst-case management process adds no variable run time: because the worst case is used as the reference for determining the scaling factor, the output bound is guaranteed never to be clipped, so there is no need to recompute the result.
Fig. 2 is a diagram illustrating an exemplary method 200 for noise and bound management that scales the inputs to the RPU devices 110 of the crossbar array 104 relative to a worst-case scenario. As described above, the forward and backward cycles performed on the weight matrix 102 each involve a vector-matrix multiplication operation. In the analog crossbar array 104 of RPU devices 110, this vector-matrix multiplication involves multiplying each input vector value (see below) by the corresponding weight value Wij (on the corresponding row) in the weight matrix 102 and summing the results. This process is also referred to herein as a "multiply-and-accumulate" operation of the analog crossbar array 104 of RPU devices 110. For each multiply-and-accumulate cycle, the steps of method 200 are performed digitally as a pre-computation to determine the scaling factor α for the operations on the analog crossbar array 104 of RPU devices 110. Notably, according to an exemplary embodiment, one or more steps of method 200, including the calculation of the noise and bound management scaling factor α and the scaling/rescaling of the input/output values of the crossbar array 104 of RPU devices 110 by the factor α (see below), are performed outside the RPU array hardware, e.g., by an apparatus such as apparatus 600 described below in conjunction with the description of fig. 6. Additionally, one or more elements of the present techniques may optionally be provided as a service in a cloud environment. For example, by way of illustration only, the training data for the input vector values (see below) may reside remotely on a cloud server. Further, any of the steps of method 200 may be performed on a dedicated cloud server to take advantage of high-performance CPU and GPU resources, after which the results are sent back to the local device.
In step 202, an input vector is obtained. The input vector comprises a digital value x. The digital value x in the input vector is also referred to herein as the "input vector value x". According to an exemplary embodiment, the input vector value x comprises data from a training data set. For example only, the training data set may be obtained from a database or other repository of DNN training data.
In step 204, the absolute maximum x_mx of the input vector values is calculated as:

x_mx = max_i |x_i|    (3)

Thus, x_mx may also be referred to herein as the absolute maximum input value. In step 206, the absolute maximum input value x_mx is assigned to the scaling factor α.
At this stage of the process, the weight values to be employed in the vector-matrix multiplication operations performed on the analog crossbar array 104 of RPU devices 110 (see above) are unknown a priori. However, as described above, the worst-case assumption is that all analog weights are maximally positive for all positive input vector values and maximally negative for all negative input vector values. In this case, in step 208, the sum of all absolute values of the input vector values x is calculated. In step 210, this sum is assigned to the total input variable s, i.e., s = Σ_i |x_i|.
In the previous example, the input vector value x (negative and positive) is given as input to the analog crossbar array 104 of RPU device 110 in one pass. Alternatively, in another exemplary embodiment, the negative and positive input values of the input vector value x are given as inputs to the analog crossbar array 104 of the RPU device 110 in two separate passes, with the respective other input values (negative or positive) of the input vector set to zero, and the outputs of the two passes are sign-corrected and added accordingly to obtain the final result. For example, if a negative input value is given as input to the analog crossbar array 104 of the RPU device 110 in a first pass (while a positive input value is set to zero), then a positive input value is given as input to the analog crossbar array 104 of the RPU device 110 in a second pass (while a negative input value is set to zero), and vice versa. In this case, a corresponding worst case can be applied, where all positive (or negative) input vector values are assumed to reach the maximum positive (or negative) weight, and all other weights do not contribute to the output (and thus can be assumed to be zero), since the corresponding input values will be set to zero for the corresponding passes. In this case, the larger of the positive input vector value or the negative input vector value is assigned to the total input variable s.
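The two-pass variant can be sketched as follows (a simplified digital illustration: the sign correction is implicit here because the negative inputs keep their signs, whereas hardware may instead present magnitudes and sign-correct the outputs; all names and values are ours):

```python
import numpy as np

W = np.array([[0.3, -0.2, 0.5], [0.1, 0.4, -0.6]])
x = np.array([0.8, -0.5, 0.2])

x_pos = np.where(x > 0, x, 0.0)  # first pass: positive inputs only
x_neg = np.where(x < 0, x, 0.0)  # second pass: negative inputs only
y = W @ x_pos + W @ x_neg        # recombined result of the two passes
```

By linearity the recombined result equals the single-pass product W @ x, which is why the two-pass scheme can use the tighter per-sign worst case.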
That is, according to this alternative embodiment, in step 212, the sum (s_p) of the absolute values of only the positive input vector values is calculated as:

s_p = Σ_i I(x_i > 0)·|x_i|    (4)

where I(true) = 1 and I(false) = 0 indicate whether the condition holds. In step 214, the sum (s_n) of the absolute values of only the negative input vector values is calculated as:

s_n = Σ_i I(x_i < 0)·|x_i|    (5)

In step 216, the larger of the two quantities s_p and s_n is assigned to the total input variable s. That is, in this exemplary embodiment, the total input variable s is set as follows:

s = max(s_n, s_p)    (6)
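Steps 204 to 216 above can be sketched digitally as follows (an illustrative NumPy pre-computation covering both the single-pass and the two-pass variants; the example values are ours):

```python
import numpy as np

x = np.array([0.2, -0.7, 0.5, -0.1])  # example digital input vector

x_mx = np.max(np.abs(x))              # equation (3): absolute maximum input

# Single-pass variant (step 210): s is the sum of all absolute values.
s_single = np.sum(np.abs(x))

# Two-pass variant (steps 212-216): positive and negative inputs are
# presented separately, so s is the larger of the two partial sums.
s_p = np.sum(np.abs(x[x > 0]))        # equation (4)
s_n = np.sum(np.abs(x[x < 0]))        # equation (5)
s_two = max(s_n, s_p)                 # equation (6)
```

Note that the two-pass total (here 0.8) is never larger than the single-pass total (here 1.5), which is what makes the two-pass worst case less pessimistic.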
let ω be the maximum weight of the weight matrix 102, i.e., ω is the assumed maximum weight. However, as will be described in detail below, ω can be reduced (e.g., to 50% of the assumed maximum weight) because worst case occurrence is highly unlikely. That is, for some impossible input vectors, the output may indeed be limited by the output limit, but for most cases this is not possible so that the DNN training or inference results will not change significantly.
Given that ω is the assumed maximum weight, the expected worst-case total output is ω times the total input s, i.e., ωs. As described above, this expected worst-case total output should be smaller than the output bound b. Thus, in step 218, the proposed scaling factor σ for the input vector values (under the worst case) is calculated as the product of ω and the total input s, divided by the output bound b, i.e.,

σ = ωs/b    (7)
As described above, ω is the assumed maximum weight. According to an exemplary embodiment, ω is a user-defined value based on the mapping of the "mathematical" weight values of the DNN onto the maximum conductance value of the RPU devices 110. By way of example only, a suitable value for the assumed maximum weight ω might be 0.6. Thus, based on equation 7, σ = 1 when the expected worst-case total output (ωs) is exactly as large as the output bound b, in which case the total input s does not need to be scaled. In this way, the scaling ensures that the output bound b is never exceeded, even in the worst case.
However, since the worst case is highly unlikely to occur (i.e., the actual total output of the analog crossbar array 104 of RPU devices 110 will rarely be as large as the expected worst-case total output), it may be desirable to adjust the proposed scaling factor σ by reducing the assumed maximum weight ω, e.g., to 50% of the assumed maximum weight, and then recomputing the proposed scaling factor in step 218. See fig. 2 (i.e., "adjust ω"). In that case the output remains below the output bound b in most cases, even though the expected worst-case total output would then actually reach the bound. Adjusting σ in this manner typically yields a desirably larger signal-to-noise ratio (SNR).
As described above, a digital-to-analog converter (DAC) is used to convert the scaled digital input vector values (per method 200) into analog signals for performing the vector-matrix multiplication operations on the analog crossbar array 104 of RPU devices 110. Specifically, the DAC converts each scaled digital input vector value into an analog pulse width. However, the DAC resolution may be limited. That is, the proposed scaling factor σ calculated according to equation 7 above may in fact be so large that every input vector value divided by σ is smaller than the minimum interval (bin) of the digital-to-analog conversion. In that case, the inputs to the analog crossbar array 104 of RPU devices 110 would all be zero after DAC conversion.
To avoid this, in step 220 an upper limit (cap) may be placed on the proposed scaling factor σ by taking the smaller of the value calculated from equation 7 above and a substitute value calculated as the absolute maximum input value x_mx multiplied by a variable ρ and divided by the quantization interval width r_DAC of the DAC, i.e.,

σ = min(ωs/b, ρ·x_mx/r_DAC)    (8)

For the input range (-1,...,1), the total range is 2. The DAC divides this total range into n steps, where n is the number of quantization steps (e.g., n = 256 for an 8-bit DAC), to arrive at the quantization interval width (or simply "interval width"). Thus, in this example, the quantization interval width r_DAC is 2/n. According to an exemplary embodiment, the variable ρ = 0.25. At ρ = 1, the cap essentially corresponds to least-significant-bit resolution. Thus, for a value of ρ = 0.25, only 4 different values (instead of 256) are allowed in the input range as a result of the σ scaling.
In step 222, the noise and bound management scaling factor (α) is set to the larger of the value of x_mx (per equation 3 above) and the value of the proposed scaling factor σ (per equation 8 above), i.e.,

α = max(x_mx, σ)    (9)

This prevents the maximum of the scaled input vector values (see step 224, described below) from being greater than 1 (taking the maximum input range of the DAC to be arbitrarily set to (-1,...,1)), which would cause unwanted clipping of input values.
As described above, the above procedure is employed to pre-compute the noise and bound management scaling factor α for each multiply-and-accumulate cycle performed on the analog crossbar array 104 of RPU devices 110. Thus, in step 224, each digital input vector value x is scaled by the noise and bound management scaling factor (α) calculated in step 222, i.e.,

x ← x/α    (10)

to provide the scaled digital input vector values x', which are converted into analog signals via a digital-to-analog converter (DAC). In step 226, the analog computation is then performed on the analog crossbar array 104 of RPU devices 110. As described above, the analog computation involves performing a vector-matrix multiplication operation on the analog crossbar array 104 of RPU devices 110 by multiplying each scaled input vector value x' by the corresponding weight value in the weight matrix 102.
Likewise, in step 228, each analog output vector value obtained from the analog crossbar array 104 of RPU devices 110 is converted into a digital signal via an analog-to-digital converter (ADC) to provide the digital output vector values y'. In step 230, each digital output vector value y' is rescaled by the noise and bound management scaling factor (α) calculated in step 222, i.e.,

y ← y′α    (11)

to provide the rescaled digital output vector values y.
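Putting the steps of method 200 together, an end-to-end digital simulation might look as follows (a sketch only: the DAC is modeled as uniform rounding, the analog crossbar as a clipped matrix-vector product, and all names and values are ours):

```python
import numpy as np

def worst_case_mvm(W, x, omega, b, n=256, rho=0.25):
    """Scale by the worst-case factor alpha, run the bounded 'analog' pass,
    and rescale the output (method 200, steps 202-230)."""
    r_dac = 2.0 / n                                # DAC interval width on (-1, 1)
    x_mx = np.max(np.abs(x))                       # equation (3)
    s = np.sum(np.abs(x))                          # step 210 (single-pass variant)
    sigma = min(omega * s / b, rho * x_mx / r_dac) # equations (7) and (8)
    alpha = max(x_mx, sigma)                       # equation (9)
    x_scaled = x / alpha                           # equation (10)
    x_analog = np.round(x_scaled / r_dac) * r_dac  # DAC quantization model
    y_prime = np.clip(W @ x_analog, -b, b)         # bounded analog computation
    return alpha * y_prime                         # equation (11)

# Worst case by construction: every weight sits at the assumed maximum omega.
W = np.full((3, 4), 0.6)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = worst_case_mvm(W, x, omega=0.6, b=1.0)
```

Even with all weights at ω, the pre-scaled output stays just inside the bound b, so no iteration (and hence no variable run time) is ever needed; the result differs from the ideal W @ x only by the DAC quantization error.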
An exemplary embodiment of the present techniques will now be described with reference to figs. 3 and 4. Namely, fig. 3 is a schematic diagram illustrating a forward cycle operation being performed on the analog crossbar array 304 of RPU devices 310. As shown in fig. 3, digital input vector values x (see "digital input x") are provided as input to the analog crossbar array 304 of RPU devices 310. First, however, the noise and bound management scaling factor is calculated as described in conjunction with the description of method 200 of fig. 2 above (see "noise/margin management calculation α"). Each digital input vector value x is then scaled by the noise and bound management scaling factor (α), i.e., x ← x/α, to provide the scaled digital input vector values x' (see "scaled digital RPU input x'").
The scaled digital input vector value x' is converted into an analog signal via a digital-to-analog converter (see "DA converter"). The analog signal is provided as an analog pulse width 320 to the analog crossbar array 304 of the RPU device 310, where analog calculations are performed on the analog crossbar array 304 of the RPU device 310. As described above, such analog computation involves performing a vector matrix multiplication operation on the analog crossbar array 304 of the RPU device 310 by multiplying each scaled input vector value x' with a corresponding weight value in a corresponding weight matrix (not shown). The mapping of the weight matrix to the analog crossbar array of the RPU device is described in connection with the description of fig. 1 above.
As shown in fig. 3, the analog output vector values obtained from the operation performed on the analog crossbar array 304 of the RPU device 310 are provided to an integrated circuit 322 that includes an operational amplifier 324 and an integrating capacitor (C_int) connected across the inverting input and the output (V_out) of the operational amplifier 324. The non-inverting input of the operational amplifier 324 is grounded. The output (V_out) of the op-amp 324 is also connected to the input of an analog-to-digital converter (see "AD converter").
An analog-to-digital converter (AD converter) converts each analog output vector value obtained from the analog crossbar array 304 of the RPU device 310 into a digital signal to provide a digital output vector value y' (see "digital RPU output y'"). Each digital output vector value y' is then rescaled by the noise and margin management scaling factor α (see "noise/margin management using α"), i.e., y ← y'α, to provide a rescaled digital output vector value y (see "rescaled digital output y"). As described above, processes such as the calculation of the noise and margin management scaling factor α and the scaling/rescaling of the input/output values of the crossbar array of the RPU device by the factor α may be performed external to the RPU array hardware, for example, by an apparatus such as apparatus 600 described below in connection with the description of fig. 6.
Fig. 4 is a schematic diagram illustrating a backward cycle operation performed on the analog crossbar array 404 of the RPU device 410. The process is generally the same as the forward cycle operation described above in connection with the description of fig. 3, except that the transposed analog RPU array 404 is used for the backward cycle pass. "Transposed" refers to exchanging the inputs and outputs, i.e., the previous outputs now become the inputs and the previous inputs become the outputs. This essentially computes x = W'y, where W' is the transpose of the matrix W. As shown in fig. 4, digital input vector values x (see "digital input x") are provided as input to the analog crossbar array 404 of the RPU device 410. First, however, as described in connection with the description of the method 200 of FIG. 2 above, the noise and margin management scaling factor is calculated (see "noise/margin management calculation α"). That is, as described above, for each forward and backward cycle, the steps of method 200 are performed as a digital pre-calculation to determine the scaling factor α for the operation on the analog crossbar array of the RPU device. Each digital input vector value x is then scaled by the noise and margin management scaling factor α, i.e., x' ← x/α, to provide a scaled digital input vector value x' (see "scaled digital RPU input x'").
The scaled digital input vector value x' is then converted into an analog signal via a digital-to-analog converter (see "DA converter"). The analog signal is provided as an analog pulse width 420 to the analog crossbar array 404 of the RPU device 410, wherein analog calculations are performed on the analog crossbar array 404 of the RPU device 410. As described above, such analog computation involves performing a vector matrix multiplication operation on the analog crossbar array 404 of the RPU device 410 by multiplying each scaled input vector value x' with a corresponding weight value in a corresponding weight matrix (not shown). The mapping of the weight matrix to the analog crossbar array of the RPU device is described in connection with the description of fig. 1 above.
As shown in fig. 4, the analog output vector values obtained from the operation performed on the analog crossbar array 404 of the RPU device 410 are provided to an integrated circuit 422 that includes an operational amplifier 424 and an integrating capacitor (C_int) connected across the inverting input and the output (V_out) of the operational amplifier 424. The non-inverting input of the op-amp 424 is grounded. The output (V_out) of the op-amp 424 is also connected to the input of an analog-to-digital converter (see "AD converter").
An analog-to-digital converter (AD converter) converts each analog output vector value obtained from the analog crossbar array 404 of the RPU device 410 into a digital signal to provide a digital output vector value y' (see "digital RPU output y'"). Each digital output vector value y' is then rescaled by the noise and margin management scaling factor α (see "noise/margin management using α"), i.e., y ← y'α, to provide a rescaled digital output vector value y (see "rescaled digital output y"). As described above, processes such as the calculation of the noise and margin management scaling factor α and the scaling/rescaling of the input/output values of the crossbar array of the RPU device by the factor α may be performed external to the RPU array hardware, for example, by an apparatus such as apparatus 600 described below in connection with the description of fig. 6.
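The backward cycle of fig. 4 differs from the forward cycle only in that the transposed array is used; a minimal sketch, with the analog array replaced by an exact matrix product for clarity and all names illustrative:

```python
import numpy as np

def backward_cycle(W, y, alpha):
    """Transposed pass of fig. 4: the previous outputs are now the
    inputs, so the array effectively computes x = W^T y."""
    y_scaled = y / alpha       # scale by alpha before the DA converter
    x_prime = W.T @ y_scaled   # analog compute on the transposed array
    return x_prime * alpha     # rescale the AD-converter output by alpha
```

Because the scaling and rescaling by α cancel for a linear array, the result matches W.T @ y up to analog imperfections (omitted here).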
As described above, the present technique minimizes run time by scaling the input vector value x of the analog crossbar array of RPU devices based on the worst case. With the above embodiments, taking the worst case as a reference for determining the scaling factor ensures that the output limit is never clipped.
According to an alternative embodiment, the above-described steps of method 200 (i.e., calculating the worst-case noise and margin management scaling factor α, scaling/rescaling the input/output values of the crossbar array of the RPU device by the factor α, etc.) are performed only when the output bound is actually clipped. See, for example, exemplary method 500 of fig. 5.
In step 502, an input vector of digital values x, i.e. "input vector values x", is obtained. According to an exemplary embodiment, the input vector value x comprises data from a training data set. For example only, the training data set may be obtained from a database or other repository of DNN training data.
In step 504, the absolute maximum x_mx of the input vector values (also referred to herein as the absolute maximum input value) is calculated according to equation 3 above. In step 506, the absolute maximum input value x_mx is assigned to the noise and margin management scaling factor α, i.e., α = x_mx.
In step 508, each digital input vector value x is scaled by the noise and margin management scaling factor α = x_mx, i.e.,
x'_initial ← x/α (12)
to provide a scaled digital input vector value x'_initial, which is converted to an analog signal via a digital-to-analog converter (DAC). In step 510, analog calculations are then performed on the analog crossbar array 104 of the RPU device 110. As described above, the analog computation involves performing a vector matrix multiplication operation on the analog crossbar array 104 of the RPU device 110 by multiplying each scaled input vector value x'_initial with a corresponding weight value in the weight matrix 102. Also, in step 512, each analog output vector value obtained from the analog crossbar array 104 of the RPU device 110 is converted to a digital signal via an analog-to-digital converter (ADC) to provide a digital output vector value y'_initial.
In step 514, it is determined whether any of the digital output vector values y'_initial has been clipped (bound management). For example, clipping of the digital output vector values y'_initial may be detected by sensing saturation at the output of the op amp. There are several circuit techniques for detecting clipping. However, one straightforward way is to simply check for the maximum and minimum output values of the ADC, since the output saturates to these values when the input of the ADC exceeds the ADC range. Thus, for example, if any one of the digital outputs is 255 (the highest output value of an 8-bit ADC) or 0 (the lowest output value of an 8-bit ADC), the bound has been clipped and the calculation is repeated.
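The saturation test of step 514 can be sketched as follows, assuming an 8-bit AD converter whose output codes saturate at 0 and 255; the helper names and the analog voltage range are illustrative assumptions:

```python
import numpy as np

ADC_BITS = 8
ADC_MIN, ADC_MAX = 0, 2**ADC_BITS - 1   # 0 and 255 for an 8-bit ADC

def adc_convert(v, v_min=-1.0, v_max=1.0):
    """Quantize analog voltages to 8-bit codes; values outside the
    ADC range saturate to the minimum/maximum code."""
    codes = np.round((v - v_min) / (v_max - v_min) * ADC_MAX)
    return np.clip(codes, ADC_MIN, ADC_MAX).astype(int)

def output_was_clipped(codes):
    # Step 514: a code at either extreme ADC value means the analog
    # output exceeded the ADC range, i.e., the bound was clipped.
    return bool(np.any((codes == ADC_MIN) | (codes == ADC_MAX)))
```

Note that this test cannot distinguish an output exactly at the range edge from one far beyond it, which is why the clipped case falls back to the conservative worst-case scaling factor.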
If the determination in step 514 is "no," i.e., none of the digital output vector values y'_initial has been clipped, then in step 516 each digital output vector value y'_initial is rescaled by the noise and margin management scaling factor α = x_mx, i.e.,
y ← y'_initial α (13)
to provide a rescaled digital output vector value y, and the process ends. On the other hand, if the determination in step 514 is "yes," i.e., at least one of the digital output vector values y'_initial has been clipped, then in step 518 the worst-case noise and margin management scaling factor α is calculated as α = max(x_mx, σ) (see equation 8 above), as described in connection with the description of method 200 of fig. 2 above.
In step 520, each digital input vector value x is then scaled by the worst-case noise and margin management scaling factor α = max(x_mx, σ), i.e.,
x' ← x/α (14)
to provide a scaled digital input vector value x' that is converted to an analog signal via a DAC. In step 522, analog calculations are then performed on the analog crossbar array 104 of the RPU device 110. As described above, the analog computation involves performing a vector matrix multiplication operation on the analog crossbar array 104 of the RPU device 110 by multiplying each scaled input vector value x' with a corresponding weight value in the weight matrix 102.
Also, in step 524, each analog output vector value obtained from the analog crossbar array 104 of the RPU device 110 is converted to a digital signal via an ADC to provide a digital output vector value y'. In step 526, each digital output vector value y' is rescaled by the noise and margin management scaling factor α = max(x_mx, σ), i.e.,
y ← y'α (15)
to provide a rescaled digital output vector value y. In this second iteration, the output bound is not tested for clipping, since with the worst-case scaling factor the bound is typically not clipped. Even if the assumed maximum weight ω has been reduced (see above) and some clipping does occur, the clipping is simply ignored, and no bound test is performed in the second iteration. Thus, in the worst case, the method 500 requires only two iterations: an initial iteration in which it is determined that the output bound has been clipped, followed by a single repetition using the worst-case noise and margin management scaling factor α. In this way, the impact on the run time is minimal, i.e., the run time is at most doubled.
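The at-most-two-iteration flow of method 500 can be sketched as follows. This is a numerical sketch under stated assumptions: the analog pass is simulated by a bounded matrix-vector product, the suggested scaling factor σ is taken as ωs/b, and run_cycle, omega, and bound are illustrative names for the quantities described above:

```python
import numpy as np

def run_cycle(W, x, alpha, bound):
    """One analog pass (scale -> MVM -> rescale), also reporting whether
    any output saturated at the bound (the clipping test of step 514)."""
    y_analog = W @ (x / alpha)
    clipped = bool(np.any(np.abs(y_analog) >= bound))
    y_prime = np.clip(y_analog, -bound, bound)
    return y_prime * alpha, clipped

def method_500(W, x, omega, bound):
    # Steps 504-512: first iteration with alpha = x_mx.
    x_mx = np.max(np.abs(x))
    y, clipped = run_cycle(W, x, alpha=x_mx, bound=bound)
    if not clipped:
        return y  # step 514 "no": done after a single iteration
    # Steps 518-526: second (and final) iteration with the worst-case
    # factor alpha = max(x_mx, sigma); any residual clipping is ignored.
    sigma = omega * np.sum(np.abs(x)) / bound
    y, _ = run_cycle(W, x, alpha=max(x_mx, sigma), bound=bound)
    return y
```

In the unclipped case a single pass suffices; only a clipped first pass triggers the worst-case rescale, so the run time is at most doubled, as noted above.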
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to perform aspects of the present invention.
A computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium would include the following: portable computer magnetic disk, hard disk, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM (erasable programmable read-only memory) or flash memory), static random access memory (static random access memory, SRAM), portable compact disc read-only memory (compact disc read-only memory, CD-ROM), digital versatile disc (digital versatile disk, DVD), memory stick, floppy disk, mechanical coding device (such as a punch card with instructions recorded thereon, or a bump structure in a groove), and any suitable combination of the foregoing. As used herein, a computer-readable storage medium should not be construed as being a transitory signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., an optical pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a corresponding computing/processing device or to an external computer or external storage device via a network (e.g., the internet, a local area network, a wide area network, and/or a wireless network). The network may include copper transmission cables, optical transmission fibers, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for performing the operations of the present invention may be assembly instructions, instruction-set-architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (local area network, LAN) or a wide area network (wide area network, WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, including, for example, programmable logic circuitry, field-programmable gate array (FPGA) or programmable logic array (programmable logic array, PLA), may execute computer-readable program instructions by personalizing the electronic circuitry with state information for the computer-readable program instructions in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block of the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As described above, according to an exemplary embodiment, one or more steps of method 200, including calculation of noise and margin management scaling factor α, scaling/rescaling input/output values from the crossbar array of RPU devices, etc., may be performed external to the RPU array hardware, e.g., by an apparatus such as apparatus 600 shown in fig. 6. Fig. 6 is a block diagram of an apparatus 600 for implementing one or more methods presented herein. For example only, the apparatus 600 may be configured to implement one or more steps of the method 200 of fig. 2.
The apparatus 600 includes a computer system 610 and a removable medium 650. Computer system 610 includes a processor device 620, a network interface 625, a memory 630, a media interface 635, and an optional display 640. Network interface 625 allows computer system 610 to connect to a network, while media interface 635 allows computer system 610 to interact with media, such as a hard disk drive or removable media 650.
The processor device 620 may be configured to implement the methods, steps, and functions disclosed herein. The memory 630 may be distributed or local and the processor device 620 may be distributed or singular. The memory 630 may be implemented as an electronic, magnetic or optical memory, or any combination of these or other types of storage devices. Furthermore, the term "memory" should be construed broadly enough to encompass any information capable of being read from or written to an address in the addressable space accessed by processor device 620. With this definition, information on a network accessible through network interface 625 is still within memory 630, as processor device 620 may retrieve information from the network. It should be noted that each distributed processor making up processor device 620 typically contains its own addressable memory space. It should also be noted that some or all of computer system 610 may be incorporated into an application specific or general purpose integrated circuit.
Optional display 640 is any type of display suitable for interacting with a human user of device 600. Typically, display 640 is a computer monitor or other similar display.
Referring to fig. 7 and 8, it should be appreciated that although the present disclosure includes a detailed description of cloud computing, embodiments of the teachings recited herein are not limited to a cloud computing environment. Rather, embodiments of the invention can be implemented in connection with any other type of computing environment, now known or later developed.
Cloud computing is a service delivery model for supporting convenient on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processes, memory, storage, applications, virtual machines, and services) that can be quickly deployed and released with minimal administrative effort or interaction with service providers. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
The characteristics are as follows:
on-demand self-service: cloud consumers can unilaterally allocate computing power (such as server time and network storage) automatically as needed without manual interaction with the service provider.
Broad network access: capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
And (3) resource pooling: the computing resources of the provider are pooled to serve multiple consumers using a multi-tenant model, wherein different physical and virtual resources are dynamically allocated and reallocated as needed. There is a sense of location independence because consumers often have no control or knowledge of the exact location of the provided resources, but can specify locations at a higher level of abstraction (e.g., country, state, or data center).
Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
Measured service: cloud systems automatically control and optimize resource usage by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported to provide transparency to both the provider and consumer of the utilized service.
The service model is as follows:
software as a service (Software as a Service, saaS): the capability provided to the consumer is to use the provider's application running on the cloud infrastructure. Applications may be accessed from various client devices through a thin client interface such as a web browser (e.g., web-based email). Consumers do not manage or control underlying cloud infrastructure, including networks, servers, operating systems, storage, or even individual application capabilities, except perhaps for limited user-specific application configuration settings.
Platform as a service (Platform as a Service, paaS): the capability provided to the consumer is to deploy consumer created or acquired applications onto the cloud infrastructure using provider-supported programming languages and tools. The consumer does not manage or control the underlying cloud infrastructure, including networks, servers, operating systems, or storage, but can control the deployed applications and possibly the application hosting environment configuration.
Infrastructure as a service (Infrastructure as a Service, iaaS): the capability provided to the consumer is to allocate processing, storage, networking, and other basic computing resources in which the consumer can deploy and run any software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure, but can control the operating system, storage, deployed applications, and possibly limited control of selected networking components (e.g., host firewalls).
The deployment model is as follows:
Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.
Community cloud: the cloud infrastructure is shared by several organizations and supports specific communities of common interest (e.g., tasks, security requirements, policies, and compliance considerations). It may be managed by an organization or a third party and may exist internally or externally.
Public cloud: the cloud infrastructure is available to the general public or to a large industry community and is owned by an organization selling cloud services.
Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
Cloud computing environments are service-oriented, focusing on stateless, low-coupling, modular, and semantic interoperability. At the heart of cloud computing is an infrastructure that includes a network of interconnected nodes.
Referring now to FIG. 7, an illustrative cloud computing environment 50 is depicted. As shown, cloud computing environment 50 includes one or more cloud computing nodes 10 with which local computing devices used by cloud consumers, such as personal digital assistants (personal digital assistant, PDAs) or cellular telephones 54A, desktop computers 54B, laptop computers 54C, and/or automobile computer systems 54N, may communicate. Nodes 10 may communicate with each other. They may be physically or virtually grouped (not shown) in one or more networks, such as private, community, public or hybrid clouds as described above, or a combination thereof. This allows the cloud computing environment 50 to provide infrastructure as a service, platform as a service, and/or software as a service for which cloud consumers do not need to maintain resources on local computing devices. It should be appreciated that the types of computing devices 54A-54N shown in fig. 7 are merely illustrative, and that computing node 10 and cloud computing environment 50 may communicate with any type of computerized device over any type of network and/or network-addressable connection (e.g., using a web browser).
Referring now to FIG. 8, a set of functional abstraction layers provided by cloud computing environment 50 (FIG. 7) is shown. It should be understood in advance that the components, layers, and functions shown in fig. 8 are merely illustrative, and embodiments of the present invention are not limited thereto. As shown, the following layers and corresponding functions are provided:
the hardware and software layer 60 includes hardware components and software components. Examples of hardware components include: a mainframe 61; a server 62 based on RISC (reduced instruction set computer) architecture; a server 63; blade server 64; a storage device 65; and a network and networking component 66. In some embodiments, the software components include web application server software 67 and database software 68.
The virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: a virtual server 71; virtual storage 72; a virtual network 73 including a virtual private network; virtual applications and operating systems 74; and a virtual client 75.
In one example, the management layer 80 may provide the following functionality. Resource allocation 81 provides dynamic procurement of computing resources and other resources for performing tasks within the cloud computing environment. Metering and pricing 82 provides cost tracking as resources are utilized within the cloud computing environment and billing or invoicing for consumption of these resources. In one example, the resources may include application software permissions. Security provides authentication for cloud consumers and tasks, as well as protection for data and other resources. User portal 83 provides consumers and system administrators with access to the cloud computing environment. Service level management 84 provides cloud computing resource allocation and management such that the required service level is met. Service level agreement (Service Level Agreement, SLA) planning and fulfillment 85 provides for pre-arrangement and procurement of cloud computing resources according to which future demands for the cloud computing resources are expected.
Workload layer 90 provides an example of functionality that may utilize a cloud computing environment. Examples of workloads and functions that may be provided from this workload layer include: drawing and navigating 91; software development and lifecycle management 92; virtual classroom education delivery 93; a data analysis process 94; transaction processing 95; and scaling factor calculation and input/output scaling/rescaling 96.
Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope of the invention.

Claims (20)

1. A method for noise and boundary management, the method comprising:
obtaining an input vector value x of an analog crossbar array of the resistance processing unit RPU device, wherein the weight matrix is mapped to the analog crossbar array of the RPU device; and
the input vector value x is scaled based on a worst case comprising an assumed maximum weight of the weight matrix multiplied by a sum of absolute values from the input vector value x, to provide a scaled input vector value x' for use as an input to the analog crossbar array of the RPU device.
2. The method of claim 1, further comprising:
calculating an absolute maximum input value x_mx from said input vector value x;
calculating a suggested scaling factor σ as σ = ωs/b, where ω is the assumed maximum weight of the weight matrix, s is a total input variable, and b is an output bound of the analog crossbar array of the RPU device;
setting a noise and bound management scaling factor α to said absolute maximum input value x_mx or the suggested scaling factor σ, whichever is greater; and
scaling the input vector value x using the noise and bound management scaling factor α.
3. The method of claim 2, wherein the absolute maximum input value x_mx is calculated as: x_mx = max_i |x_i|.
4. The method of claim 2, further comprising:
calculating the sum of all absolute values of the input vector value x; and
assigning the sum of all absolute values of the input vector value x to the total input variable s.
5. The method of claim 2, further comprising:
calculating a sum s_p of all absolute values of only the positive input vector values;
calculating a sum s_n of all absolute values of only the negative input vector values; and
assigning the greater of s_p and s_n to the total input variable s.
6. The method of claim 2, further comprising:
setting an upper limit of the suggested scaling factor σ to the lesser of a value calculated using a variable ρ and a value calculated using r_DAC, where r_DAC is the interval width of the digital-to-analog quantization.
7. The method of claim 2, further comprising:
reducing the assumed maximum weight ω of the weight matrix; and
recalculating the suggested scaling factor σ.
8. The method of claim 2, further comprising:
converting the scaled input vector value x' into analog signals; and
performing a vector-matrix multiplication operation on the analog crossbar array of the RPU device.
9. The method of claim 8, wherein performing the vector-matrix multiplication operation comprises:
multiplying each scaled input vector value x' by a corresponding weight value in the weight matrix.
10. The method of claim 8, further comprising:
converting analog output vector values obtained from the analog crossbar array of the RPU device into digital signals to provide digital output vector values y'; and
rescaling the digital output vector values y' using the noise and bound management scaling factor α to provide rescaled digital output vector values y.
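Claims 8–10 together describe the round trip: convert the scaled input to analog, multiply on the crossbar, digitize the bounded output, and undo the scaling. A toy end-to-end simulation, assuming a uniform DAC grid on [-1, 1] and hard clipping at the bound b; the quantization and clipping models, bit width, and names are all illustrative:

```python
def forward_pass(x, W, omega=1.0, b=2.0, n_bits=8):
    """Toy simulation of claims 8-10: scale x by alpha, quantize to
    the DAC grid, multiply on a bounded 'analog' crossbar, then
    rescale the digitized output by alpha."""
    x_mx = max(abs(v) for v in x)
    s = sum(abs(v) for v in x)
    alpha = max(x_mx, omega * s / b)             # claim 2 factor
    r_dac = 2.0 / 2 ** n_bits                    # DAC bin width
    x_analog = [round(v / alpha / r_dac) * r_dac for v in x]
    # analog vector-matrix multiply, each output clipped at bound b
    y_analog = [min(max(sum(w * v for w, v in zip(row, x_analog)), -b), b)
                for row in W]
    return [alpha * y for y in y_analog]         # rescaled output
```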
11. The method of claim 1, further comprising:
calculating an absolute maximum input value x_mx from the input vector value x;
assigning the absolute maximum input value x_mx to a scaling factor α;
scaling the input vector value x using the scaling factor α to provide a scaled input vector value x'_initial for use as an input to the analog crossbar array of the RPU device;
converting the scaled input vector value x'_initial into analog signals;
performing a vector-matrix multiplication operation on the analog crossbar array of the RPU device;
converting analog output vector values obtained from the analog crossbar array of the RPU device into digital signals to provide digital output vector values y'_initial;
determining whether at least one of the digital output vector values y'_initial has been clipped; and
scaling the input vector value x based on the worst-case scenario when at least one of the digital output vector values y'_initial has been clipped.
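The two-pass flow of claim 11 — scale by x_mx alone first, and repeat with the conservative worst-case factor only when the first output was clipped — might be sketched as follows (the bounded-crossbar model and all names are assumptions, not the patent's implementation):

```python
def two_pass_forward(x, W, omega=1.0, b=2.0):
    # Claim 11: try cheap x_mx scaling; if any digital output hit
    # the bound, redo with the worst-case scaling factor alpha.
    def bounded_mvm(xs):                        # simulated crossbar + ADC
        return [min(max(sum(w * v for w, v in zip(row, xs)), -b), b)
                for row in W]

    x_mx = max(abs(v) for v in x)
    y_initial = bounded_mvm([v / x_mx for v in x])
    if not any(abs(y) >= b for y in y_initial):
        return [x_mx * y for y in y_initial]    # no clipping: done
    alpha = max(x_mx, omega * sum(abs(v) for v in x) / b)
    return [alpha * y for y in bounded_mvm([v / alpha for v in x])]
```

The attraction of this scheme is that the aggressive first pass preserves signal-to-noise whenever the output happens to stay in bounds, and the conservative rescaling cost is paid only on the inputs that actually clip.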
12. An apparatus for noise and bound management, comprising a processor coupled to a memory, the processor configured to:
obtain an input vector value x for an analog crossbar array of an RPU device, wherein a weight matrix is mapped to the analog crossbar array of the RPU device; and
scale the input vector value x based on a worst-case scenario, comprising the assumed maximum weight of the weight matrix multiplied by a sum of the absolute values of the input vector value x, to provide a scaled input vector value x' for use as an input to the analog crossbar array of the RPU device.
13. The apparatus of claim 12, wherein the processor is further configured to:
calculate an absolute maximum input value x_mx from the input vector value x;
calculate a suggested scaling factor σ as σ = ω·s/b, where ω is the assumed maximum weight of the weight matrix, s is a total input variable, and b is the output bound of the analog crossbar array of the RPU device;
set a noise and bound management scaling factor α to the absolute maximum input value x_mx or the suggested scaling factor σ, whichever is greater; and
scale the input vector value x using the noise and bound management scaling factor α.
14. The apparatus of claim 13, wherein the processor is further configured to:
calculate a sum of the absolute values of the input vector value x; and
assign the sum of the absolute values of the input vector value x to the total input variable s.
15. The apparatus of claim 13, wherein the processor is further configured to:
calculate a sum s_p of the absolute values of only the positive input vector values;
calculate a sum s_n of the absolute values of only the negative input vector values; and
assign the larger of s_p and s_n to the total input variable s.
16. The apparatus of claim 13, wherein the processor is further configured to:
set an upper limit of the suggested scaling factor σ to a value calculated as ρ·x_mx or a value calculated as x_mx/r_DAC, whichever is lesser, where ρ is a variable and r_DAC is the bin width of the digital-to-analog quantization.
17. A non-transitory computer program product for noise and bound management, the computer program product comprising a computer-readable storage medium having program instructions embodied therein, the program instructions executable by a computer to cause the computer to:
obtain an input vector value x for an analog crossbar array of an RPU device, wherein a weight matrix is mapped to the analog crossbar array of the RPU device; and
scale the input vector value x based on a worst-case scenario, comprising the assumed maximum weight of the weight matrix multiplied by a sum of the absolute values of the input vector value x, to provide a scaled input vector value x' for use as an input to the analog crossbar array of the RPU device.
18. The non-transitory computer program product of claim 17, wherein the program instructions further cause the computer to:
calculate an absolute maximum input value x_mx from the input vector value x;
calculate a suggested scaling factor σ as σ = ω·s/b, where ω is the assumed maximum weight of the weight matrix, s is a total input variable, and b is the output bound of the analog crossbar array of the RPU device;
set a noise and bound management scaling factor α to the absolute maximum input value x_mx or the suggested scaling factor σ, whichever is greater; and
scale the input vector value x using the noise and bound management scaling factor α.
19. The non-transitory computer program product of claim 18, wherein the program instructions further cause the computer to:
calculate a sum of the absolute values of the input vector value x; and
assign the sum of the absolute values of the input vector value x to the total input variable s.
20. The non-transitory computer program product of claim 18, wherein the program instructions further cause the computer to:
calculate a sum s_p of the absolute values of only the positive input vector values;
calculate a sum s_n of the absolute values of only the negative input vector values; and
assign the larger of s_p and s_n to the total input variable s.
CN202180080347.XA 2020-12-07 2021-11-04 Worst case noise and margin management for RPU crossbar arrays Pending CN116583854A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/113,898 US20220180164A1 (en) 2020-12-07 2020-12-07 Worst Case Noise and Bound Management for RPU Crossbar Arrays
US17/113,898 2020-12-07
PCT/CN2021/128643 WO2022121569A1 (en) 2020-12-07 2021-11-04 Worst case noise and bound management for rpu crossbar arrays

Publications (1)

Publication Number Publication Date
CN116583854A true CN116583854A (en) 2023-08-11

Family

ID=81849287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180080347.XA Pending CN116583854A (en) 2020-12-07 2021-11-04 Worst case noise and margin management for RPU crossbar arrays

Country Status (6)

Country Link
US (1) US20220180164A1 (en)
JP (1) JP2023552459A (en)
CN (1) CN116583854A (en)
DE (1) DE112021005637T5 (en)
GB (1) GB2616371B (en)
WO (1) WO2022121569A1 (en)


Also Published As

Publication number Publication date
WO2022121569A1 (en) 2022-06-16
JP2023552459A (en) 2023-12-15
GB2616371A (en) 2023-09-06
DE112021005637T5 (en) 2023-08-17
GB2616371B (en) 2024-06-05
GB202308548D0 (en) 2023-07-26
US20220180164A1 (en) 2022-06-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination