CN117296062A - Method and system for multiplier sharing in neural networks - Google Patents


Info

Publication number
CN117296062A
CN117296062A (application CN202180093831.6A)
Authority
CN
China
Prior art keywords
input
weight
multipliers
determining
elements
Prior art date
Legal status
Pending
Application number
CN202180093831.6A
Other languages
Chinese (zh)
Inventor
饶朝林
郑越洋
吴旻烨
娄鑫
周平强
虞晶怡
Current Assignee
ShanghaiTech University
Original Assignee
ShanghaiTech University
Priority date
Filing date
Publication date
Application filed by ShanghaiTech University
Publication of CN117296062A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a method for performing a multiplication operation between weights and input numbers, the method being applicable to a computing device comprising a preprocessing unit and a plurality of selection units. The method comprises the following steps: decomposing the weight into a plurality of elements, each element having a preset number of bits and corresponding to one selection unit; determining, by each selection unit, a partial sum associated with each element based on a plurality of multipliers; and determining a multiplication result between the input number and the weight based on the partial sums. In this method, the multiplication operation is implemented using shift operations and addition operations, thereby significantly reducing its computational complexity. In addition, when determining the multiplication results between the input number and other weights, the plurality of multipliers for the input number can be reused, thereby further improving the calculation efficiency.

Description

Method and system for multiplier sharing in neural networks
Technical Field
The present invention relates generally to the field of computer technology, and more particularly, to a method and system for multiplier sharing in a neural network.
Background
In the past few decades, Neural Networks (NNs) have evolved into an efficient framework for various applications such as image processing, object recognition, and natural language processing. Applications using NNs typically involve many multiplications and additions between the inputs of the NN and the weights, known as multiply-accumulate (MAC) operations. Hardware accelerators are often used in place of CPUs and GPUs to perform these operations because of their fast processing speed and energy efficiency.
However, performing multiplication operations places high computational demands on the hardware accelerator. Accordingly, methods and apparatus to optimize multiplication operations are desired.
The information disclosed in this background section is only for aiding in the understanding of the background of the invention and therefore may contain information that is already known to one of ordinary skill in the art.
Disclosure of Invention
To address the limitations of the conventional computing techniques described above, the present invention proposes a method and system for multiplier sharing in neural networks.
One aspect of the invention relates to a method for performing a multiplication operation between weights and input numbers. The method may be applied to a computing device comprising a preprocessing unit and a plurality of selection units.
The method may include: decomposing the weights into a plurality of elements by a preprocessing unit, each element having a preset number of bits and corresponding to one selection unit; determining, by each selection unit, a partial sum associated with each element, wherein the partial sum is equal to a product of the corresponding element and the input number; and determining a multiplication result between the input number and the weight based on the partial sum.
In some embodiments, determining, by each selection unit, a partial sum associated with each element, respectively, includes: determining, for each element, a multiplier selected from the plurality of multipliers and a shifter for shifting the selected multiplier by a decoder of the respective selection unit; and shifting the selected multipliers by a shifter to determine the partial sums.
In some embodiments, the plurality of multipliers may be predetermined based on the input number.
In some embodiments, multiple multipliers may be read from memory.
In some embodiments, decomposing the weights into the plurality of elements may include dividing bits of the weights into a plurality of bit groups that are consecutively adjacent to each other but do not overlap each other, each bit group having a preset number of bits.
In some embodiments, the weights may be binary numbers.
In some embodiments, the weight may be a 16-bit binary number and the preset number may be 4.
In some embodiments, the selection units may each have the same number of multipliers, and the number of multipliers in each selection unit may be determined by a preset number.
In some embodiments, determining the multiplication result between the input number and the weight based on the partial sum may include: shifting each partial sum according to the position of the corresponding element in the weight; and determining a multiplication result by summing the shifted partial sums.
In some embodiments, the weight may be a weight associated with a node of the neural network, and the input number may be an input of the node.
Another aspect of the invention relates to a method for computing a multiply-accumulate (MAC) result for use in a neural network. The neural network may include an input layer and an output layer. The input layer may have a plurality of input neurons, each of the plurality of input neurons having an input number, and the output layer may have a plurality of output neurons. Each input neuron and each output neuron may form an input-output pair with an associated weight.
The method may include, for each input neuron, determining a multiplication result between an input number of the input neuron and a weight associated with an input-output pair including the input neuron and a first output neuron of the plurality of output neurons. Each multiplication result may be determined by any one of the methods for performing the multiplication operations described above.
The method may further include determining a MAC result for the first output neuron by adding the multiplication results.
Another aspect of the invention relates to an apparatus. The apparatus may include a processor and a memory configured with computer instructions executable by the processor. The computer instructions, when executed by a processor, may cause the processor to perform any of the methods for performing the multiplication operations described above.
In the disclosed method for performing multiplication operations, the multiplication operations may be implemented using shift operations and addition operations. Therefore, the computational complexity for determining the multiplication result can be significantly reduced.
In addition, the multipliers for each input neuron may be reused (i.e., shared) when determining the multiplication results for other output neurons. Therefore, the calculation efficiency can be further improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles disclosed. It is obvious that these drawings only provide some embodiments of the present invention, and that it is possible for a person skilled in the art to obtain drawings of other embodiments from these drawings without inventive effort.
Fig. 1A and 1B show schematic diagrams of calculation processes in a convolutional layer and a fully-connected layer of a neural network, respectively.
FIG. 2 illustrates a flow chart of a method for performing a multiplication operation in accordance with various embodiments of the invention.
FIG. 3 illustrates a schematic diagram of a selection unit for computing a partial sum in a method for performing multiplication operations, in accordance with various embodiments of the invention.
Fig. 4 illustrates a flow chart of a multiplier sharing method suitable for use with a neural network, in accordance with various embodiments of the present invention.
Fig. 5A, 5B, 5C, and 5D illustrate diagrams of examples of applying multiplier sharing methods on a neural network according to various embodiments of the invention.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. These exemplary embodiments may, however, be embodied in many forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to others skilled in the art.
Furthermore, the described features, structures, and characteristics may be combined in any suitable manner in one or more embodiments. In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the invention. One skilled in the relevant art will recognize, however, that the various embodiments may be practiced without one or more of the specific details, or with other methods, components, materials, and so forth. In some instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the embodiments.
Neural Networks (NNs) have become an effective framework for various applications such as image processing, object detection, and natural language processing. An NN may be composed of several convolutional layers, pooling layers, fully connected layers, and other computation layers. Applications using NNs typically include many multiplications and additions between inputs and the weights associated with the nodes (i.e., neurons) of the NN. For example, the computation performed by a convolutional layer is a high-dimensional convolution, which can be expressed as:
O[m][y][x] = φ( Σ_n Σ_i Σ_j W[m][n][i][j] · I[n][s·y+i][s·x+j] + B[m] ),
0 ≤ n ≤ C_i, 0 ≤ m ≤ C_o, 0 ≤ y ≤ H_o, 0 ≤ x ≤ W_o    (1)
where O, I, W, and B are the matrices of the output feature maps, the input feature maps, the weights (filters), and the biases, respectively, and the inner sums over i and j run over the spatial dimensions of the filter. C_i and C_o are the numbers of input and output channels. H_o and W_o denote the size of the output feature map. φ is the activation function. s refers to the stride.
The calculations performed by the fully connected layer can be expressed as:
O[v] = φ( Σ_u W[v][u] · I[u] + B[v] ), 0 ≤ v ≤ N_o    (2)
where N_i and N_o are the numbers of input neurons and output neurons, respectively, and the sum over u runs over the N_i input neurons. Equations (1) and (2) show that the main computational cost of an NN comes from multiplication and addition operations.
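For reference, the fully connected computation of equation (2) can be written as a short software model. This is only an illustrative sketch (the function name fc_layer and the NumPy-based formulation are assumptions, not part of the patent); the multiplier-sharing method described below is aimed at the inner multiplications of exactly this kind of loop.

```python
import numpy as np

def fc_layer(inputs, weights, biases, activation=lambda x: x):
    """Reference fully connected layer per equation (2).

    inputs:  vector of length N_i (one input number per input neuron)
    weights: matrix of shape (N_o, N_i); weights[v][u] links input neuron u to output neuron v
    biases:  vector of length N_o
    """
    n_out = weights.shape[0]
    outputs = np.empty(n_out)
    for v in range(n_out):
        # One multiply-accumulate (MAC) per output neuron v.
        outputs[v] = activation(np.dot(weights[v], inputs) + biases[v])
    return outputs
```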
Fig. 1A and 1B show schematic diagrams of the calculation processes in a convolutional layer and a fully-connected layer of a neural network, respectively. For a convolutional layer, as shown in FIG. 1A, one input feature map (IfMap) may be applied to different filters (i.e., kernels) to obtain output feature maps (OfMaps) on different channels. For a fully connected layer, as shown in FIG. 1B, the input number of one input neuron is multiplied by the weights associated with different output neurons.
Based on the aforementioned characteristics of the computational process of NN, the present specification proposes a method for performing multiplication to reduce the computational complexity of the multiplication operation. The method may be adapted to calculations performed in a neural network.
FIG. 2 illustrates a flow chart of a method for performing multiplication operations according to various embodiments of the present description. A multiplication operation may be performed between the weights and the input numbers. The multiplication operations may be performed by a computing device comprising a preprocessing unit and a plurality of selection units.
As shown in fig. 2, a method for performing multiplication may include the following steps 202 to 206.
In step 202, the weights may be decomposed into a plurality of elements by a preprocessing unit. Each element may have the same number of bits.
In some embodiments, the weight W may be decomposed into a plurality of elements w_i, i = 1, 2, ..., n (where n is the total number of elements), by dividing the bits of the weight W into a plurality of bit groups from the least significant bit to the most significant bit of the weight W. The bit groups may be consecutively adjacent to each other but non-overlapping with each other. Each bit group may correspond to one of the elements w_i. All bit groups may have the same number of bits (i.e., the element size).
In some examples, the total number of bits of the weight W may not be divisible by the element size. In this case, zero padding may be performed at the most significant bit of the weight W until the total number of bits is divisible by the element size. This ensures that all groups of bits have the same number of bits.
In one example, the weight W may be a binary number having a plurality of binary bits (i.e., bits having a value of 0 or 1). In one example, the weight W may be a 16-bit binary number and the element size may be 4. The element size may be predetermined according to specific needs, and is not limited in this specification.
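As an illustrative sketch of step 202 (the helper name decompose_weight and the Python formulation are assumptions, not the patent's implementation), a non-negative binary weight can be split into element_size-bit groups from the least significant bit upward:

```python
def decompose_weight(w, total_bits=16, element_size=4):
    """Step 202: split weight w into element_size-bit elements, LSB first.

    Zero padding at the most significant bits is implicit: if total_bits is
    not divisible by element_size, the last element simply carries the
    remaining high bits padded with zeros.
    """
    mask = (1 << element_size) - 1
    n_elements = -(-total_bits // element_size)  # ceiling division
    return [(w >> (k * element_size)) & mask for k in range(n_elements)]

# Example: a 16-bit weight split into four 4-bit elements w_1..w_4 (LSB group first).
elements = decompose_weight(0b1010_0110_0011_1001)
assert elements == [0b1001, 0b0011, 0b0110, 0b1010]
```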
In step 204, a partial sum associated with each element may be determined. Each element may be associated with a selection unit, and the partial sum associated with the element may be determined by the selection unit associated with that element.
FIG. 3 illustrates a schematic diagram of a selection unit for computing a partial sum in a method for performing multiplication operations, in accordance with various embodiments of the invention.
As shown in fig. 3, the selection unit 300 may include a decoder 302, a plurality of multipliers 304 (i.e., multiplier groups), and a shifter 306.
The decoder 302 may be configured to generate a selector and a shifter based on the element w_i. One multiplier may be selected from the plurality of multipliers 304 based on the selector. A shift operation may be performed on the selected multiplier by the shifter 306 to obtain the partial sum associated with the element w_i.
The plurality of multipliers 304 may be determined based on the input number in and the element size, such that the multiplication result between the input number in and any value of the element w_i can be represented by a shift operation on one of the multipliers.
In one example, the weight W is a 16-bit binary number and may be expressed as W = [a_15 a_14 ... a_3 a_2 a_1 a_0], where a_i is the i-th bit of the weight W. The element size may be 4, and the weight W may be divided into four 4-bit elements: w_1 = [a_3 a_2 a_1 a_0], w_2 = [a_7 a_6 a_5 a_4], w_3 = [a_11 a_10 a_9 a_8], and w_4 = [a_15 a_14 a_13 a_12].
In one example, the plurality of multipliers 304 may include 8 multipliers m_i, i = 1, ..., 8, which may be:
m_1 = [0001] × in (i.e., m_1 = in),
m_2 = [0011] × in (i.e., m_2 = 3·in),
m_3 = [0101] × in (i.e., m_3 = 5·in),
m_4 = [1001] × in (i.e., m_4 = 9·in),
m_5 = [0111] × in (i.e., m_5 = 7·in),
m_6 = [1011] × in (i.e., m_6 = 11·in),
m_7 = [1101] × in (i.e., m_7 = 13·in), and
m_8 = [1111] × in (i.e., m_8 = 15·in),
where in is the input number.
The multiplication between the input number in and any value of the element w_i can be represented by shifting one of the plurality of multipliers m_i. For example, assume w_1 = [0110]; the multiplication of w_1 with the input number in (i.e., w_1 × in) can be obtained by shifting m_2 left by one bit.
The decoder 302 may determine a selector and a shifter based on the element w_i. One multiplier may then be selected from the plurality of multipliers based on the selector. By shifting the selected multiplier according to the shifter, the multiplication result between the input number in and the element w_i, referred to as the partial sum p_i, can be obtained.
The above process may be repeated for each element w_i to obtain the partial sums p_i for all of the elements.
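The behavior of the selection unit can be modeled in software as follows. This is only a sketch under the assumption of the 8-multiplier group m_1 to m_8 listed above; the helper names build_multiplier_group and partial_sum are illustrative. The key observation is that every non-zero 4-bit element equals an odd value in {1, 3, ..., 15} shifted left by some number of bits, so the decoder only has to pick an odd multiplier and a shift amount.

```python
def build_multiplier_group(in_val):
    """Precompute the multiplier group for one input number:
    in, 3*in, 5*in, 7*in, 9*in, 11*in, 13*in, 15*in (m_1..m_8 above)."""
    return {odd: odd * in_val for odd in (1, 3, 5, 7, 9, 11, 13, 15)}

def partial_sum(element, multipliers):
    """Selection unit: return element * in using one lookup and one shift."""
    if element == 0:
        return 0
    shift = 0
    while element % 2 == 0:          # decoder: strip trailing zero bits
        element >>= 1
        shift += 1
    return multipliers[element] << shift   # shifter: shift the selected multiplier

# Example: w_1 = [0110] = 3 << 1, so w_1 * in is m_2 (= 3*in) shifted left by one bit.
mults = build_multiplier_group(7)
assert partial_sum(0b0110, mults) == 6 * 7
```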
Referring to FIG. 2, in step 206, after the partial sum p_i has been obtained for each element w_i, the multiplication result between the input number in and the weight W may be determined based on the partial sums p_i.
In step 206, each partial sum p_i may be shifted by 0, 1, or more bits according to its corresponding element w_i to obtain a shifted partial sum. The multiplication result between the input number in and the weight W can then be determined by summing all of the shifted partial sums.
The shift amount of each partial sum p_i may be determined based on the position of the corresponding element w_i in the weight W, or more specifically, by the position of the first bit (i.e., the least significant bit) of the element w_i in the weight W. For example, in the example described above, where the weight W is divided into four 4-bit elements w_i, i = 1, 2, 3, 4, the partial sum associated with the first element w_1 (i.e., p_1) may need no shifting, because the first bit of w_1 (i.e., a_0) is also the first bit of the weight W. The partial sums associated with the second, third, and fourth elements (i.e., p_2, p_3, and p_4) may need to be shifted by 4, 8, and 12 bits, respectively, because the first bits of w_2, w_3, and w_4 are the 4th, 8th, and 12th bits of the weight W, respectively.
In the disclosed method for performing a multiplication between an input number and a weight, the weight W may first be decomposed into a plurality of elements w_i, each element having the same number of bits, and a plurality of multipliers may be determined based on the input number and the number of bits of each element. The partial sums may then be determined by performing a shift operation on a selected one of the multipliers for each element. The multiplication result may be obtained by performing another shift operation on each partial sum and adding the shifted partial sums together.
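Putting steps 202 to 206 together, a minimal software sketch of the full multiplication (reusing the hypothetical decompose_weight, build_multiplier_group, and partial_sum helpers sketched above; again an illustration, not the hardware design) might look as follows:

```python
def multiply(in_val, w, total_bits=16, element_size=4):
    """Compute in_val * w using only table lookups, shifts, and additions."""
    multipliers = build_multiplier_group(in_val)
    elements = decompose_weight(w, total_bits, element_size)
    result = 0
    for k, element in enumerate(elements):
        p_k = partial_sum(element, multipliers)
        # Step 206: shift each partial sum by the position of its element's
        # least significant bit within the weight (0, 4, 8, 12 bits here).
        result += p_k << (k * element_size)
    return result

assert multiply(7, 0b1010_0110_0011_1001) == 7 * 0b1010_0110_0011_1001
```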
In the disclosed method for performing multiplication operations, the multiplication operations may be implemented using shift operations and addition operations. Therefore, the computational complexity of determining the multiplication result can be significantly reduced.
Based on the inventive concept disclosed in the method embodiment, the invention further provides a multiplier sharing method suitable for the neural network. Fig. 4 shows a flow chart illustrating a multiplier sharing method suitable for use in a neural network, in accordance with various embodiments of the present invention.
As shown in fig. 4, the multiplier sharing method may include the following steps 402 to 410. Multiplier sharing methods may be used to calculate a multiply-accumulate (MAC) result for each output neuron of the neural network.
In one example, the neural network may be a fully connected neural network that includes an input layer and an output layer. The input layer may include M input neurons, each input neuron being associated with an input number in_j, j = 1, 2, ..., M. The output layer may include N output neurons. Other types of neural networks are also contemplated, and the invention is not limited in this regard.
In a neural network, each input neuron and each output neuron may form an input-output pair. Each input-output pair may have an associated weight w_{i,j}, i = 1, 2, ..., N, j = 1, 2, ..., M, where i and j are the indices of the output neuron and the input neuron in the input-output pair, respectively.
The multiplier sharing method may be used to calculate the MAC result of the ith output neuron of the neural network, which may be expressed as:
MAC_i = Σ_j in_j × w_{i,j}, j = 1, 2, ..., M
where MAC_i is the MAC result of the i-th output neuron, M is the total number of input numbers (i.e., the total number of input neurons), in_j is the input number of the j-th input neuron, and w_{i,j} is the weight associated with the input-output pair comprising the j-th input neuron and the i-th output neuron.
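A direct sketch of this accumulation (reusing the hypothetical multiply helper sketched above; an illustration rather than the hardware implementation) could be:

```python
def mac_for_output_neuron(inputs, weights_row):
    """MAC_i = sum over j of in_j * w_{i,j}, each product computed by shifts and adds."""
    return sum(multiply(in_j, w_ij) for in_j, w_ij in zip(inputs, weights_row))
```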
Referring to fig. 4, in step 402, one or more non-accessed input neurons in an input layer of a neural network may be selected. An unaccessed input neuron may refer to an input neuron on which a MAC result has not been calculated.
In step 404, a plurality of multipliers may be determined for each selected input neuron based on its input number (in_j). The plurality of multipliers may be determined based on the input number (in_j) using the method described above. For details, reference is made to the relevant parts of the description above, which are not repeated here for brevity.
A plurality of multipliers may be determined for each selected input neuron, and different input neurons may correspond to different multiplier groups. In some embodiments, the multipliers may be stored in memory for reuse. In some embodiments, the plurality of multipliers for an input neuron may only need to be determined once, when determining the result of the multiplication between the input number of that input neuron and the weight associated with the first input-output pair that includes the input neuron. The determined multipliers may be stored in memory and reused when determining the results of the multiplications between the input number and other weights.
The input number (in_j) of an input neuron may be one or more inputs to the neural network. In one example, the neural network may be used for image classification, and the inputs to the neural network may be image features used to classify the images. Other types of input numbers are also contemplated, and the invention is not limited in this regard.
In step 406, the multiplication result between the input number (in_j) of the selected input neuron and the weight (w_{i,j}) associated with each input-output pair including the selected input neuron may be determined based on the plurality of multipliers. The multiplication result may be determined using the method for performing multiplication operations disclosed above. For details, reference is made to the relevant parts of the description above, which are not repeated here for brevity. When the multiplication results between the input number of the selected input neuron and the weights of all input-output pairs including the selected input neuron have been determined, the input neuron may be marked as "accessed".
In this step, since the multiplication operation is implemented using the shift operation and the addition operation, the computational complexity of determining the multiplication result can be significantly reduced.
In addition, when determining the result of the multiplication between the input number of each input neuron and the weight associated with the first input-output pair including that input neuron, the multiplier for each input neuron may only need to be determined once. When determining the multiplication result with other weights, the determined multipliers may be reused (i.e., shared). Therefore, the calculation efficiency can be further improved.
In step 408, it may be determined whether all of the input neurons have been "accessed". If all of the input neurons have been accessed, step 410 may be performed. In step 410, the MAC result on each output neuron may be determined. The MAC result on each output neuron may be determined based on the multiplication result on the output neuron. In one example, the MAC result on an output neuron may be obtained by summing all multiplication results on that output neuron. Other methods of calculating the MAC result are contemplated, which this specification does not limit.
If there are still non-accessed input neurons, steps 402 through 406 may be repeated on the non-accessed input neurons until all of the input neurons are accessed.
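The overall flow of steps 402 to 410 can be modeled in software as below. This is a sketch with assumed helper names (multiply_with_group, mac_all_outputs) rather than the hardware data path, and the two-at-a-time selection of input neurons illustrated in FIG. 5 is folded into a simple loop for brevity; the essential point is that the multiplier group for each input neuron is built once and then shared across all output neurons.

```python
def multiply_with_group(w, multipliers, total_bits=16, element_size=4):
    """Like multiply(), but reuses an already-built multiplier group."""
    result = 0
    for k, element in enumerate(decompose_weight(w, total_bits, element_size)):
        result += partial_sum(element, multipliers) << (k * element_size)
    return result

def mac_all_outputs(inputs, weights):
    """Compute the MAC result of every output neuron with multiplier sharing.

    inputs:  list of M input numbers (one per input neuron)
    weights: weights[i][j] is the weight of the pair (output neuron i, input neuron j)
    """
    mac = [0] * len(weights)
    for j, in_j in enumerate(inputs):               # visit each input neuron once
        multipliers = build_multiplier_group(in_j)  # built once per input neuron
        for i in range(len(weights)):               # shared across all output neurons
            mac[i] += multiply_with_group(weights[i][j], multipliers)
    return mac
```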
Fig. 5A, 5B, 5C, and 5D illustrate diagrams of examples of applying multiplier sharing methods on a neural network according to various embodiments of the invention. This example is described in detail below with reference to these drawings.
As shown in fig. 5A, the neural network in the example may have an input layer 110 and an output layer 120. The input layer 110 may include M input neurons, and the output layer 120 may include N output neurons.
In the example shown in fig. 5A-5D, two input neurons may be selected at a time, and the result of the multiplication between the input number of the two input neurons and the weight associated with each input-output pair including the selected input neuron may be determined.
For each selected input neuron, all output neurons may be traversed, and the multiplication result between the input number (in_j) of the selected input neuron and the weight (w_{i,j}) associated with each input-output pair including the selected input neuron may be determined, as shown in FIG. 5B.
The multiplication result between the input number (in_j) of an input neuron and the weight (w_{i,j}) associated with an input-output pair including that input neuron may be determined using the method for performing multiplication operations disclosed above.
For each selected input neuron, a plurality of multipliers may be determined based on its input number (in_j). These multipliers may only need to be determined once, for the weight associated with the first input-output pair that includes the input neuron, and may be reused (i.e., shared) when determining the multiplication results with other weights.
After the multiplication result of the first two selected input neurons has been calculated, the same process may be repeated for the next two input neurons that have not been selected until all input neurons are selected.
In one example, the input neurons may be selected sequentially according to their location in the input layer. That is, the first and second input neurons may be selected to calculate the multiplication result, followed by the third and fourth input neurons, and finally the last two input neurons (i.e., the (M-1) th and M-th input neurons) are selected, as shown in FIG. 5C.
In one example, for each selected input neuron, the output neurons may be traversed in order according to their position in the output layer to calculate the multiplication result. That is, the computation of the multiplication result will start from the first output neuron and end on the last (i.e., nth) output neuron, as shown in fig. 5D.
Other sequences for selecting input neurons and for traversing output neurons are contemplated, which is not limiting in this specification.
After all multiplication results are obtained, the MAC result for each of the output neurons may be obtained based on the multiplication results. In one example, the MAC result of an output neuron may be obtained by summing all multiplication results on that output neuron.
In the disclosed method for computing the overall MAC result, the multiplication may be implemented using shift operations and addition operations. Thus, the computational complexity of determining the multiplication result on the computing device may be significantly reduced. In addition, multipliers for the same input neuron may be reused (i.e., shared) when calculating multiplication results for other output neurons. Therefore, the calculation efficiency can be further improved.
The invention further provides equipment based on the inventive concept disclosed in the method embodiment. The device may include a processor and a memory. The memory may be configured with computer instructions executable by the processor. The computer instructions, when executed by a processor, may cause the processor to perform any of the method embodiments.
Certain embodiments are described herein as comprising logic or several components. A component may constitute a software component (e.g., code embodied on a machine-readable medium) or a hardware component (e.g., a tangible unit capable of performing certain operations that may be configured or arranged in some physical manner).
Although the present invention has been described with respect to examples and features of the disclosed principles, modifications, adaptations, and other implementations can be made without departing from the spirit and scope of the disclosed embodiments. Furthermore, the terms "comprising," "having," "containing," and "including," and other similar forms, are intended to be equivalent in meaning and open-ended, in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that, as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise.

Claims (20)

1. A method for performing a multiplication operation between weights and input numbers, the method being adapted to a computing device comprising a preprocessing unit and a plurality of selection units, the method comprising:
decomposing the weight into a plurality of elements by the preprocessing unit, each element having a preset number of bits and corresponding to one of the selecting units;
determining, by each of said selection units, a partial sum associated with each of said elements, wherein said partial sum is equal to a product of said respective element and said input number; and
a multiplication result between the input number and the weight is determined based on the partial sums.
2. The method of claim 1, wherein determining, by each of the selection units, the partial sum associated with each of the elements, respectively, comprises:
determining, for each of the elements, a multiplier selected from a plurality of multipliers and a shifter for shifting the selected multiplier by a decoder of the respective selection unit; and
the partial sums are determined by shifting the selected multipliers by the shifter.
3. The method of claim 2, wherein the plurality of multipliers are predetermined based on the input number.
4. The method of claim 2, wherein the plurality of multipliers are read from memory.
5. The method of claim 2, wherein decomposing the weights into the plurality of elements comprises:
the bits of the weight are divided into a plurality of bit groups that are consecutively adjacent to each other but do not overlap each other, each of the bit groups having the preset number of bits.
6. The method of claim 5, wherein the weight is a binary number.
7. The method of claim 6, wherein the weight is a 16-bit binary number and the preset number is 4.
8. The method of claim 5, wherein the selection units each have the same number of multipliers, and the number of multipliers in each selection unit is determined by the preset number.
9. The method of claim 5, wherein determining a result of the multiplication between the input number and the weight based on the partial sum comprises:
shifting each of the partial sums according to the position of the corresponding element in the weight; and
the multiplication result is determined by summing the shifted partial sums.
10. The method of claim 5, wherein the weight is a weight associated with a node of a neural network and the input number is an input of the node.
11. A method for computing a multiply-accumulate (MAC) result suitable for use in a neural network, the neural network comprising an input layer having a plurality of input neurons each having an input number and an output layer having a plurality of output neurons, and wherein each input neuron and each output neuron form an input-output pair having an associated weight, the method comprising:
determining, for each of the input neurons, a multiplication result between the input number of the input neuron and the weight associated with the input-output pair comprising the input neuron and a first output neuron of the plurality of output neurons, wherein the multiplication result is determined by:
decomposing the weight into a plurality of elements, each element having a predetermined number of bits,
determining a partial sum associated with each of the elements, wherein the partial sum is equal to a product of the corresponding element and the input number, and
determining a multiplication result between the input number and the weight based on the partial sums; and
the MAC result for the first output neuron is determined by adding the multiplication results.
12. The method of claim 11, wherein determining the partial sum associated with each of the elements comprises:
determining, for each of the elements, a multiplier selected from a plurality of multipliers and a shifter for shifting the selected multiplier; and
the partial sums are determined by shifting the selected multipliers according to the shifter.
13. The method of claim 12, wherein the plurality of multipliers are predetermined based on the input number.
14. The method of claim 12, wherein the plurality of multipliers are read from memory.
15. The method of claim 12, wherein decomposing the weights into the plurality of elements comprises:
the bits of the weight are divided into a plurality of bit groups that are consecutively adjacent to each other but do not overlap each other, and each bit group has the preset number of bits.
16. The method of claim 15, wherein the weight is a binary number.
17. The method of claim 16, wherein the weight is a 16-bit binary number and the preset number is 4.
18. An apparatus for performing a multiplication operation, comprising:
a processor; and
a memory configured with computer instructions executable by the processor, wherein the computer instructions, when executed by the processor, cause the processor to perform operations comprising:
decomposing the weight into a plurality of elements, each element having a preset number of bits and corresponding to one selection unit;
determining, by each of said selection units, a partial sum associated with each of said elements, wherein said partial sum is equal to the product of said respective element and an input number; and
a multiplication result between the input number and the weight is determined based on the partial sums.
19. The apparatus of claim 18, wherein determining, by each of the selection units, the partial sum associated with each of the elements, respectively, comprises:
determining, for each of the elements, a multiplier selected from a plurality of multipliers and a shifter for shifting the selected multiplier by a decoder of the respective selection unit; and
the partial sums are determined by shifting the selected multipliers according to the shifter.
20. The apparatus of claim 19, wherein the plurality of multipliers are predetermined based on the input number.
CN202180093831.6A 2021-04-21 2021-04-21 Method and system for multiplier sharing in neural networks Pending CN117296062A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/088744 WO2022222068A1 (en) 2021-04-21 2021-04-21 Methods and systems for multiplier sharing in neural networks

Publications (1)

Publication Number Publication Date
CN117296062A true CN117296062A (en) 2023-12-26

Family

ID=83723699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180093831.6A Pending CN117296062A (en) 2021-04-21 2021-04-21 Method and system for multiplier sharing in neural networks

Country Status (2)

Country Link
CN (1) CN117296062A (en)
WO (1) WO2022222068A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11188303B2 (en) * 2019-10-02 2021-11-30 Facebook, Inc. Floating point multiply hardware using decomposed component numbers
CN111222626B (en) * 2019-11-07 2021-08-10 恒烁半导体(合肥)股份有限公司 Data segmentation operation method of neural network based on NOR Flash module
CN111178519B (en) * 2019-12-27 2022-08-02 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method

Also Published As

Publication number Publication date
WO2022222068A1 (en) 2022-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination