WO2019135354A1 - Arithmetic circuit - Google Patents
Arithmetic circuit
- Publication number
- WO2019135354A1 (PCT/JP2018/046495)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- circuit
- value
- real
- lut
- values
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/4806—Computations with complex numbers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
Definitions
- the present invention relates to an arithmetic circuit in digital signal processing, and more particularly to an arithmetic circuit that performs product-sum operation.
- the main operation in digital signal processing is a product-sum operation in which data of digital signals is multiplied by coefficients and summed.
- Distributed arithmetic is known as a method for efficiently performing product-sum operations (see Non-Patent Document 1).
- FIG. 10 shows a configuration example of a product-sum operation circuit adopting distributed operation
- FIG. 11 shows a timing chart of the operation of the product-sum operation circuit of FIG.
- It is an arithmetic circuit that performs a product-sum operation in which each data x[n] (n = 1, ..., N) is multiplied by a coefficient c[n] and the products are summed, that is, Σ_{n=1,...,N} (c[n] × x[n]).
- LUT 1001 Look-Up Table, hereinafter referred to as LUT
- LUT 1001 Only Memory
- the LUT 1001 of FIG. 10 includes memory areas of 2^N addresses.
- MSB: Most Significant Bit
- LSB: Least Significant Bit
- when data x[n] is input, each of the selectors s[n, 2] to s[n, L] selects the corresponding bit value x[n][2] to x[n][L] of data x[n] according to a data valid signal indicating validity.
- flip-flop xr [n, 1] takes in the value of the bit x [n] [1]
- the flip-flops xr[n, 2] to xr[n, L] take in the values of bits x[n][2] to x[n][L] output from the selectors s[n, 2] to s[n, L], respectively. Therefore, flip-flop xr[n, 1] takes in x[n][1], which is the LSB of data x[n], and flip-flop xr[n, L] takes in x[n][L], which is the MSB of data x[n].
- each of the selectors s[n, 2] to s[n, L] selects the value output from the flip-flop xr[n, 1] to xr[n, L-1] of the previous stage. Therefore, at each input of a clock pulse, the flip-flops xr[n, 2] to xr[n, L] take in the values held by the flip-flops xr[n, 1] to xr[n, L-1] of the previous stage.
- when the read address a is input, the LUT 1001 outputs the value LUT[a] stored in the memory area of the read address a.
- the output value of the LUT 1001 and the output value of the doubling circuit 1003 that doubles the accumulated value held by the accumulated value register 1002 are input to the addition circuit 1004.
- the addition circuit 1004 adds the output value of the LUT 1001 and the output value of the doubling circuit 1003 and outputs the result as an addition value y.
- the accumulated value register 1002 holds the added value y as the updated accumulated value at each input of the clock pulse. Note that the value held by the accumulated value register 1002 is reset to 0 when the data valid signal becomes valid, so its initial value (the value at the time when the first clock pulse is input) is 0.
- the doubling circuit 1003 that doubles the accumulated value held by the accumulated value register 1002 can be realized by wiring that shifts the accumulated value, expressed in binary, to the left by one bit, so the doubling requires no logic gates. Therefore, no multiplication circuit is used in the product-sum operation circuit of FIG.
- because the accumulated value held in the accumulated value register 1002 is 0, the addition circuit 1004 adds 0 to the output value of the LUT 1001.
- the addition circuit 1004 sets a value obtained by adding the output value of the LUT 1001 and the output value of the doubling circuit 1003 as an addition value y by the input of the second clock pulse from the time when the data valid signal becomes valid.
- the product-sum operation circuit adopting the distributed operation includes (L × N) flip-flops for shifting the input value one bit at a time, and selectors for selecting whether the input value is taken into the flip-flop or shifted.
- the conventional product-sum operation circuit adopting the above distributed operation does not require a multiplication circuit. However, after the data x[n] is input, the value stored in the LUT must be searched once for each bit position of the data x[n], that is, as many times as the bit width L of the data x[n]. The throughput is therefore low (the time from one data input until the next data input becomes possible is long).
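The bit-serial behaviour described above can be illustrated with a minimal Python sketch. It is not the circuit of FIG. 10 itself: the function names `build_lut` and `bit_serial_mac` are hypothetical, the data are assumed to be unsigned, and the bits are assumed to be processed MSB first so that doubling the accumulated value at each clock pulse gives every bit position its correct weight.

```python
def build_lut(coeffs):
    """Return a table with 2**N entries (hypothetical helper).

    Entry a holds the sum of every c[n] whose address bit n is 1 in a.
    """
    n_terms = len(coeffs)
    return [sum(c for n, c in enumerate(coeffs) if (a >> n) & 1)
            for a in range(1 << n_terms)]


def bit_serial_mac(coeffs, data, width):
    """Compute sum(c[n] * x[n]) with one LUT search per clock pulse.

    Assumptions: data values are unsigned `width`-bit integers and the
    bit positions are visited MSB first.
    """
    lut = build_lut(coeffs)
    acc = 0  # accumulated value register, reset to 0 when data becomes valid
    for bit in reversed(range(width)):  # `width` iterations = `width` clock pulses
        address = 0
        for n, x in enumerate(data):
            address |= ((x >> bit) & 1) << n  # one bit of each x[n] forms the address
        acc = lut[address] + (acc << 1)  # adder: LUT output + doubled accumulated value
    return acc


coeffs = [3, -5, 7, 2]
data = [9, 4, 15, 6]  # 4-bit unsigned samples
result = bit_serial_mac(coeffs, data, width=4)
```

The loop runs once per bit of the data, which is the source of the low throughput: a new set of data cannot be accepted until all `width` searches have finished.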
- the present invention has been made to solve the above-described problems, and an object of the present invention is to provide an arithmetic circuit capable of reducing the circuit size and power consumption and improving the throughput of the arithmetic operation.
- M is an integer of 2 or more
- N is an integer of 2 or more
- the arithmetic circuit of the present invention includes a LUT generation circuit configured to output a value calculated for each pair obtained by dividing the N coefficients c[n] into groups of two, and M distributed arithmetic circuits.
- the M distributed arithmetic circuits calculate, in parallel for each of the M data sets X[m], the value y[m] of the product-sum operation, which is the result of multiplying each of the N data x[m, n] of the data set X[m] by the corresponding one of the N coefficients c[n] and summing.
- each distributed arithmetic circuit includes a plurality of binomial distributed arithmetic circuits and a binomial distributed operation result summing circuit. The binomial distributed arithmetic circuits are configured to calculate in parallel, for each pair obtained by dividing the N data x[m, n] corresponding to the own circuit into groups of two, the value of the binomial product-sum operation in which the two data x[m, n] are multiplied by their two coefficients c[n] and summed, based on the two coefficients c[n] and the value calculated by the LUT generation circuit. The binomial distributed operation result summing circuit is configured to output the sum of the values calculated by the plurality of binomial distributed arithmetic circuits as the value y[m] of the product-sum operation.
- for complex data, the arithmetic circuit includes a LUT generation circuit configured to calculate values including the value d_add of the sum of the real part value c_real and the imaginary part value c_imag of the complex coefficient C, and M distributed arithmetic circuits.
- each of the M distributed arithmetic circuits calculates the complex value Y[m], which is the result of multiplying the complex data X[m] corresponding to the own circuit by the complex coefficient C.
- the arithmetic circuit of the present invention does not repeat the LUT search while shifting the target bit position in the distributed operation of searching the LUT for each bit position of data, but processes all bit positions in parallel, so throughput can be improved. Further, the arithmetic circuit of the present invention does not use a memory circuit for the LUT. When the present invention is applied to product-sum operations in which the coefficient c[n] or the complex coefficient C fluctuates with the passage of time, the LUT can be updated at all addresses simultaneously instead of one address at a time, so throughput does not decline even if the LUT is updated frequently because of fluctuation of the coefficient c[n] or the complex coefficient C.
- the arithmetic circuit of the present invention does not gain speed by copying one LUT to a plurality of memory circuits. Instead, the circuits that generate the LUT element values are shared as one LUT generation circuit, and only the circuits that search the LUT are parallelized.
- a redundant circuit (a copy of a circuit holding the same value) is therefore unnecessary, and the circuit scale does not increase.
- this solves both the problem that the conventional product-sum operation circuit adopting distributed operation has low throughput compared with a product-sum operation circuit using multiplication circuits, and the problem that parallelizing circuits that hold the same value increases the circuit scale.
- because the distributed operation makes the multiplication operation unnecessary, the switching power during multiplication can be suppressed, so that the circuit size and the power consumption can be significantly reduced.
- FIG. 1 is a block diagram showing the configuration of an arithmetic circuit according to a first embodiment of the present invention.
- FIG. 2 is a block diagram showing the configuration of the distributed arithmetic circuit according to the first embodiment of the present invention.
- FIG. 3 is a block diagram showing the configuration of a two-term distributed arithmetic circuit according to the first embodiment of the present invention.
- FIG. 4 is a diagram for explaining the operation of the LUT index circuit in the first embodiment of the present invention.
- FIG. 5 is a block diagram showing the configuration of an arithmetic circuit according to a second embodiment of the present invention.
- FIG. 6 is a block diagram showing the configuration of an arithmetic circuit according to a third embodiment of the present invention.
- FIG. 7 is a block diagram showing the configuration of a distributed arithmetic circuit according to a third embodiment of the present invention.
- FIG. 8 is a diagram for explaining the operations of the real part calculation LUT index circuit and the imaginary part calculation LUT index circuit in the third embodiment of the present invention.
- FIG. 9 is a block diagram showing the configuration of an arithmetic circuit according to the fourth embodiment of the present invention.
- FIG. 10 is a block diagram showing a configuration example of a conventional product-sum operation circuit.
- FIG. 11 is a timing chart for explaining the operation of the conventional product-sum operation circuit.
- FIG. 12 is a diagram for explaining a look-up table of a conventional product-sum operation circuit.
- FIG. 13 is a block diagram showing a configuration example of a shift register of a conventional product-sum operation circuit.
- FIG. 1 is a block diagram showing the configuration of an arithmetic circuit according to a first embodiment of the present invention.
- the arithmetic circuit of FIG. 1 includes one LUT generation circuit 1 and M (M is an integer of 2 or more) distributed arithmetic circuits 2-1 to 2-M.
- the binomial distributed arithmetic circuit 20m-n' forms a LUT in which the values 0, c[2×n'-1], c[2×n'], d[n'] are the numerical values of the respective elements.
- it acquires the result of the product-sum operation c[2×n'-1] × x[m, 2×n'-1] + c[2×n'] × x[m, 2×n'] by distributed operation using the LUT, and outputs it as y'[m, n'].
- the above description of the distributed arithmetic circuit 2-m is for the case where N is an even number. In the case where N is an odd number, an auxiliary multiplication circuit 22m is added, which calculates c[N] × x[m, N] as shown in FIG. and outputs the result as y'[m, N'+1].
- the binomial distributed arithmetic circuit 20m-n' shown in FIG. 3 includes L LUT index circuits 200m-n'-l (selection circuits), a sign inverting circuit 202, L multiple arithmetic circuits 203m-n'-l, and a summing circuit 204.
- the LUT index circuit 200m-n'-l selects one of the four elements of the LUT based on the bits x[m, 2×n'-1][l] and x[m, 2×n'][l] at the bit position l corresponding to its own circuit among the data x[m, 2×n'-1] and x[m, 2×n'], and obtains the selected element value as LUT#m-n'-l.
- the relationship between the values of the bits x[m, 2×n'-1][l] and x[m, 2×n'][l] and the element value LUT#m-n'-l of the LUT selected at that time is shown.
- this relationship is the same as the relationship between the addresses and the stored values corresponding to the respective addresses shown in FIG. 12 for the case N = 2 (a two-term product-sum operation).
- the address a[0] in FIG. 12 corresponds to the bit x[m, 2×n'-1][l] in FIG. 4, the address a[1] in FIG. 12 corresponds to the bit x[m, 2×n'][l] in FIG. 4, c[0] in FIG. 12 corresponds to c[2×n'-1] in FIG. 4, and c[1] in FIG. 12 corresponds to c[2×n'] in FIG. 4.
- the element value LUT#m-n'-l of the LUT selected for each bit position l (l = 1, ..., L) by the LUT index circuit 200m-n'-l in the binomial distributed arithmetic circuit 20m-n' shown in FIG. is multiplied by 2^(l-1) by the multiple operation circuit 203m-n'-l.
- the summing circuit 204 sums the values calculated by the L multiple arithmetic circuits 203m-n'-l and outputs the sum as y'[m, n'], which is the result of the product-sum operation c[2×n'-1] × x[m, 2×n'-1] + c[2×n'] × x[m, 2×n'].
- in the conventional circuit, the LUT output value for bit position l was added to the doubled accumulated value at each clock pulse input, so the result of the product-sum operation was obtained only after L clock pulses had been input.
- the sign of the selected element value LUT#m-n'-L is inverted at the bit position L of the MSB.
- alternatively, at the bit position L of the MSB, as at the other bit positions, the element value LUT#m-n'-L may be multiplied by 2^(L-1) by the multiple operation circuit 203m-n'-L as it is.
- the process of multiplying the selected element value LUT#m-n'-l by 2^(l-1) for the bit position l described above can be realized by shifting the element value LUT#m-n'-l, expressed in binary, (l-1) bits to the left. Therefore, the L multiple operation circuits 203m-n'-l need no multiplication circuit and can be realized by a simple circuit.
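The two-term distributed operation described above can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the circuit of FIG. 3: the function name `two_term_da` is hypothetical, the value d[n'] is assumed to be the sum of the two coefficients, and the data are assumed to be signed two's-complement integers of `width` bits, which is why the element selected at the MSB has its sign inverted.

```python
def two_term_da(c_a, c_b, x_a, x_b, width):
    """Return c_a * x_a + c_b * x_b without using a multiplier.

    Assumptions: x_a and x_b are two's-complement integers that fit in
    `width` bits, and d = c_a + c_b plays the role of d[n'].
    """
    d = c_a + c_b                 # assumed output of the LUT generation circuit
    lut = (0, c_a, c_b, d)        # four element values shared by all bit positions
    total = 0
    for l in range(1, width + 1):             # in hardware all l are handled in parallel
        bit_a = (x_a >> (l - 1)) & 1          # bit of the first data word at position l
        bit_b = (x_b >> (l - 1)) & 1          # bit of the second data word at position l
        element = lut[bit_a | (bit_b << 1)]   # 4:1 selection (LUT index circuit)
        if l == width:
            element = -element                # sign inversion at the MSB
        total += element << (l - 1)           # multiply by 2**(l-1) as a left shift
    return total


y = two_term_da(3, -5, -7, 6, width=4)
```

The loop is only a software stand-in: each iteration is independent of the others, so the L selections and shifts can be evaluated at the same time and summed once.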
- the product-sum operation circuit of FIG. 10 realizes the multiplication of N numerical values of data and N coefficients and the product-sum operation of N terms, which is the summation thereof, by distributed arithmetic.
- the arithmetic circuit of this embodiment divides the product-sum operation of N terms into N' two-term product-sum operations and performs the two-term product-sum operations in parallel. By summing their results, the same result as the product-sum operation of N terms is obtained. The effect due to the difference in the above configuration will be described below.
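A short Python sketch of this decomposition is given below, assuming signed two's-complement data and d[n'] equal to the sum of the paired coefficients. The names `pair_da` and `mac_by_pairs` are hypothetical. For odd N the last term is handled by an ordinary multiplication, mirroring the auxiliary multiplication circuit mentioned earlier.

```python
def pair_da(c_a, c_b, x_a, x_b, width):
    """Two-term product-sum by 4-element LUT selection (two's complement assumed)."""
    lut = (0, c_a, c_b, c_a + c_b)
    total = 0
    for pos in range(width):
        element = lut[((x_a >> pos) & 1) | (((x_b >> pos) & 1) << 1)]
        # Invert the sign at the MSB, then weight by 2**pos with a shift.
        total += (-element if pos == width - 1 else element) << pos
    return total


def mac_by_pairs(coeffs, data, width):
    """N-term product-sum as the sum of N // 2 two-term results."""
    n_pairs = len(coeffs) // 2
    partial = [pair_da(coeffs[2 * k], coeffs[2 * k + 1],
                       data[2 * k], data[2 * k + 1], width)
               for k in range(n_pairs)]
    if len(coeffs) % 2:  # odd N: the leftover term uses an auxiliary multiplication
        partial.append(coeffs[-1] * data[-1])
    return sum(partial)  # role of the binomial distributed operation result summing circuit


coeffs = [3, -5, 7, 2, -1]
data = [-8, 7, 0, -3, 5]  # 4-bit two's-complement samples
y = mac_by_pairs(coeffs, data, width=4)
```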
- the product-sum operation circuit of FIG. 10 requires one LUT 1001 having 2^N elements in order to realize an N-term product-sum operation by distributed operation.
- a memory circuit having 2^N addresses is used for the LUT 1001.
- compared with a LUT circuit implemented by flip-flops and logic gates, the LUT 1001 implemented by a memory circuit generally has a smaller storage element area per bit, and it can efficiently (at high speed, with low power consumption and in a small area) read out the value stored at one designated address from among the values stored at a large number of addresses. Because of this feature, implementing the LUT 1001 of FIG. 10, which has a large number of elements, with a memory circuit is faster, consumes less power and has a smaller area than implementing it with flip-flops or logic gates.
- N' = N / 2
- the number of numerical values to be held in the LUT (LUT generation circuit 1 and LUT index circuit 200m-n'-l) is greatly reduced, and the circuit size does not pose a problem even when the LUT is formed by flip-flops or logic gates instead of memory circuits.
- the number of numerical values to be held can be reduced from 255 to 12, and therefore the LUT (LUT generation circuit 1 and LUT index circuit 200m-n'-l) can be configured without a memory circuit at a circuit size that is not a problem.
- dividing an N-term product-sum operation into binomial product-sum operations makes it necessary to add a binomial distributed operation result summing circuit 21m for summing the N' (= N / 2) binomial product-sum operation results
- the circuit scale of the binomial distributed operation result summing circuit 21m is sufficiently small, which does not cause a problem.
- dividing the N-term product-sum operation into two-term product-sum operations not only significantly reduces the number of element values to be held in the LUT (LUT generation circuit 1 and LUT index circuit 200m-n'-l), but also significantly reduces the total scale of the LUT index circuits 200m-n'-l compared with division into three-term product-sum operations. This is shown as follows.
- dividing into binomial product-sum operations requires N/2 LUT index circuits 200m-n'-l, which are 4:1 selectors.
- dividing into three-term product-sum operations requires N/3 LUT index circuits, which are 8:1 selectors.
- the 4:1 selector can be composed of three 2:1 selectors
- the 8:1 selector can be composed of seven 2:1 selectors. Therefore, dividing an N-term product-sum operation into binomial product-sum operations requires (N × 1.5) 2:1 selectors, while dividing it into three-term product-sum operations requires (N × 7/3) 2:1 selectors.
- in the binomial case, the bit width of the element value of the LUT is the bit width of the coefficients c[2×n'-1] and c[2×n'] + 1 bit.
- in the three-term case, the bit width of the element value of the LUT is the bit width of the coefficients + 2 bits, because the element value of the LUT includes the sum of three coefficients. For this reason, both the number of 2:1 selectors used in the LUT index circuits and the bit width of the 2:1 selectors are larger when the N-term product-sum operation is divided into three-term product-sum operations than when it is divided into two-term product-sum operations. As described above, the arithmetic circuit of this embodiment reduces the total size of the LUT index circuits 200m-n'-l by dividing the N-term product-sum operation into binomial product-sum operations.
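The selector comparison above is simple arithmetic and can be checked with a small Python sketch. The function name `selector_cost` and the example values (N = 12, 8-bit coefficients) are hypothetical. The growth of the element width is modelled as ceil(log2(group size)) bits, an assumption chosen because it reproduces the +1 bit and +2 bits stated above.

```python
from fractions import Fraction


def selector_cost(n_terms, group_size, coeff_bits):
    """Return (number of 2:1 selectors, element bit width) for one bit position.

    A k:1 selector is counted as k - 1 two-input selectors arranged as a tree.
    """
    lut_inputs = 2 ** group_size                  # 4:1 for pairs, 8:1 for triples
    per_index_circuit = lut_inputs - 1            # 3 or 7 two-input selectors
    index_circuits = Fraction(n_terms, group_size)
    extra_bits = (group_size - 1).bit_length()    # assumed growth: ceil(log2(group_size))
    return index_circuits * per_index_circuit, coeff_bits + extra_bits


two_term = selector_cost(12, 2, 8)    # N * 1.5 selectors, coefficient width + 1
three_term = selector_cost(12, 3, 8)  # N * 7/3 selectors, coefficient width + 2
```

With these example values the two-term split needs fewer and narrower 2:1 selectors than the three-term split, which is the point made in the text.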
- the LUT (LUT generation circuit 1 and LUT index circuit 200m-n'-l) is not a memory circuit, but consists of a circuit that generates the element values in advance and a circuit that selects among them using logic gates such as selectors. If the LUT were a memory circuit as in the prior art, increasing throughput by searching the LUT simultaneously for all bit positions of the data, or by parallelizing the product-sum operation circuit itself, would require a plurality of memory circuits holding copies of the LUT, one for every bit position of the data and every product-sum operation circuit.
- because a memory circuit is not used for the LUT, the LUT can be divided into a circuit (LUT generation circuit 1) that generates and holds each element value of the LUT in advance and a circuit (LUT index circuit 200m-n'-l) that selects the element value, and only the LUT index circuit 200m-n'-l is parallelized without parallelizing the LUT generation circuit 1.
- This makes it possible to prevent a redundant circuit configuration, that is, parallelization (copying) of a circuit that holds each element value of the LUT, and to suppress the increase in circuit scale that accompanies parallelization.
- the calculation result cannot be obtained until input of the same number of clock pulses as the bit width of the input data is completed. Therefore, the time required from the data input to the output of the result is proportional to the bit width of the input data.
- in contrast, throughput can be improved compared with the product-sum operation circuit of FIG. 10, which requires calculation time proportional to the bit width.
- the arithmetic circuit of this embodiment does not use a memory circuit for the LUT (LUT generation circuit 1 and LUT index circuit 200m-n'-l). Each element value of the LUT is distributed from the LUT generation circuit 1 and is selected from among the element values in the binomial distributed arithmetic circuit 20m-n'. Therefore, when the present embodiment is applied to a product-sum operation in which the coefficient c[n] fluctuates with the passage of time, the change of the coefficient c[n] can be immediately reflected in the LUT.
- FIG. 5 is a block diagram showing the configuration of an arithmetic circuit according to a second embodiment of the present invention.
- the same reference numerals as in FIG. 1 denote the same parts in FIG.
- This embodiment shows a configuration for improving the throughput of the arithmetic circuit shown in the first embodiment without increasing the circuit size and the power consumption.
- the arithmetic circuit of FIG. 5 includes one LUT generation circuit 1, one LUT latch circuit 3, and M (M is an integer of 2 or more) distributed arithmetic circuits 2-1 to 2-M.
- the value d [n ′] is output to the LUT latch circuit 3 together with the coefficient c [n].
- the LUT latch circuit 3 can be realized by a flip flop that holds the value of each bit of the coefficient c [n] and the value d [n '] in synchronization with the clock.
- the method of calculating the value y [m] is the same as the method described in the first embodiment.
- the upper limit of the clock frequency of the system including the arithmetic circuit is restricted by the total of the generation time Td_LUT and the distributed operation time Td, that is, (Td_LUT + Td).
- Td: the processing time to generate the numerical value y[m] of the product-sum operation
- Td_LUT: the LUT generation time
- the upper limit of the clock frequency of a system employing the arithmetic circuit of FIG. 1 is 1 / (Td_LUT + Td)
- the upper limit of the clock frequency of a system employing the arithmetic circuit of FIG. 5 is 1 / Td_LUT or 1 / Td, whichever is smaller. That is, the arithmetic circuit of FIG. 5 operates faster than the arithmetic circuit of FIG.
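The two clock-frequency bounds can be compared with a few lines of Python. The function name `max_clock_hz` and the delay values are hypothetical and serve only to illustrate the formulas 1 / (Td_LUT + Td) and min(1 / Td_LUT, 1 / Td).

```python
def max_clock_hz(td_lut, td, latched):
    """Upper limit of the clock frequency in Hz for delays given in seconds.

    latched=False: LUT generation and distributed operation in one clock period.
    latched=True:  a latch between them, so the slower stage sets the limit.
    """
    if latched:
        return min(1.0 / td_lut, 1.0 / td)
    return 1.0 / (td_lut + td)


td_lut = 0.4e-9  # hypothetical LUT generation time: 0.4 ns
td = 1.0e-9      # hypothetical distributed operation time: 1.0 ns

f_unlatched = max_clock_hz(td_lut, td, latched=False)  # about 714 MHz
f_latched = max_clock_hz(td_lut, td, latched=True)     # about 1 GHz
```

For any positive delays the latched bound is higher, because each of Td_LUT and Td is smaller than their sum.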
- the number of elements in the LUT (LUT generation circuit 1 and LUT index circuit 200m-n'-l) has been reduced so that the LUT does not have to be configured with a memory circuit.
- the number of elements of the LUT 1001 is 2^N - 1, because the N-term product-sum operation is subjected to distributed operation using one LUT 1001. In this embodiment and the first embodiment, however,
- the number of elements is reduced to (N × 1.5) by dividing the N-term product-sum operation into N / 2 two-term product-sum operations.
- N = 8
- the number of elements can be reduced from 255 to 12.
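The element counts quoted above follow directly from the two formulas, as this small Python check shows. The function names are hypothetical, and N is assumed to be even.

```python
def elements_single_lut(n_terms):
    """Element count of one LUT covering all N terms: 2**N - 1."""
    return 2 ** n_terms - 1


def elements_two_term_split(n_terms):
    """Element count after splitting into N / 2 pairs: three values per pair.

    Each pair contributes its two coefficients and their combined value d,
    which gives N * 1.5 for even N.
    """
    return (n_terms // 2) * 3


single = elements_single_lut(8)     # 255
split = elements_two_term_split(8)  # 12
```

The first count grows exponentially with N while the second grows linearly, which is why flip-flops and logic gates become practical for holding the values.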
- the number of flip-flops to be added with the pipeline configuration can be significantly reduced as compared with the case where the pipeline configuration is based on the product-sum operation circuit of FIG. Throughput can be improved without increasing circuit size and power consumption.
- the circuit size and power consumption increase significantly with the pipeline configuration, but as described in this embodiment, the distributed operation circuit in M parallel with the LUT generation circuit 1
- the arithmetic circuit of this embodiment can improve the throughput without increasing the circuit size and power consumption.
- FIG. 6 is a block diagram showing the configuration of an arithmetic circuit according to a third embodiment of the present invention.
- Each of the M complex number values Y[m] corresponds to (C × X[m]). That is, the real part value y_real[m] corresponds to c_real × x_real[m] - c_imag × x_imag[m].
- the imaginary part value y_imag[m] corresponds to c_imag × x_real[m] + c_real × x_imag[m].
- the arithmetic circuit of FIG. 6 includes one LUT generation circuit 1a and M (M is an integer of 2 or more) dispersion arithmetic circuits 2a-1 to 2a-M.
- the LUT generation circuit 1a receives the real part value c_real and the imaginary part value c_imag of the complex coefficient C, calculates the value d_sub corresponding to the difference c_real - c_imag and the value d_add corresponding to the sum c_real + c_imag, and outputs the real part value c_real and the imaginary part value c_imag to the distributed arithmetic circuits 2a-1 to 2a-M together with the values d_sub and d_add.
- the distributed arithmetic circuit 2a-m configures a real part calculation LUT having the values 0, c_real, -c_imag, d_sub as the numerical values of its elements, and an imaginary part calculation LUT having the values 0, c_imag, c_real, d_add as the numerical values of its elements. It obtains the result of the real part product-sum operation c_real × x_real[m] - c_imag × x_imag[m] by distributed operation using the real part calculation LUT and outputs it as y_real[m].
- it obtains the result of the imaginary part product-sum operation c_imag × x_real[m] + c_real × x_imag[m] by distributed operation using the imaginary part calculation LUT and outputs it as y_imag[m].
- the distributed arithmetic circuit 2a-m shown in FIG. 7 includes L real part calculation LUT index circuits 205m-l (real part calculation selection circuits), sign inverting circuits 206 and 207, L multiple operation circuits 208m-l, a summing circuit 209, L imaginary part calculation LUT index circuits 210m-l (imaginary part calculation selection circuits), a sign inverting circuit 211, L multiple operation circuits 212m-l, and a summing circuit 213.
- the real part calculation LUT index circuit 205m-l acquires one of the four element values of the real part calculation LUT, that is, 0, c_real, -c_imag, and d_sub, based on the bits x_real[m][l] and x_imag[m][l] at the bit position l corresponding to its own circuit among the data x_real[m] and x_imag[m].
- the imaginary part calculation LUT index circuit 210m-l acquires one of the four element values of the imaginary part calculation LUT, that is, 0, c_imag, c_real, and d_add, based on the bits x_real[m][l] and x_imag[m][l] at the bit position l corresponding to its own circuit among the data x_real[m] and x_imag[m].
- FIG. 8 shows the relationship between the values of the bits x_real [m] [l] and x_imag [m] [l] and the element values of the real part calculation LUT and the imaginary part calculation LUT selected at that time.
- this relationship is the same as the relationship between the addresses and the stored values corresponding to the respective addresses shown in FIG. 12 for the case N = 2 (a two-term product-sum operation).
- the address a[0] in FIG. 12 corresponds to the bit x_real[m][l] in the present embodiment
- the address a[1] in FIG. 12 corresponds to the bit x_imag[m][l] in the present embodiment.
- for the real part calculation LUT, the coefficient c[0] in FIG. 12 corresponds to c_real in the present embodiment
- the coefficient c[1] in FIG. 12 corresponds to -c_imag in the present embodiment.
- for the imaginary part calculation LUT, the coefficient c[0] in FIG. 12 corresponds to c_imag in this embodiment
- the coefficient c[1] in FIG. 12 corresponds to c_real in this embodiment.
- the element values of the real part calculation LUT selected for each bit position l (l = 1, ..., L) by the real part calculation LUT index circuits 205m-l in the distributed arithmetic circuit 2a-m shown in FIG. are each multiplied by 2^(l-1) by the multiple operation circuit 208m-l.
- the summing circuit 209 sums the values calculated by the L multiple operation circuits 208m-l, and the summing circuit 213 adds the values calculated by the L multiple operation circuits 212m-l.
- the sign inverting circuit 207 inverts the sign of the element value selected by the real part calculation LUT index circuit 205m-L. After that, the value is multiplied by 2^(L-1) by the multiple operation circuit 208m-L.
- after the sign inverting circuit 211 inverts the sign of the element value selected by the imaginary part calculation LUT index circuit 210m-L, the value is multiplied by 2^(L-1) by the multiple operation circuit 212m-L.
- the result of the summation performed by the summing circuit 209 for all bit positions is output as y_real[m], which is the real part value of the complex value Y[m] to be output by the arithmetic circuit of this embodiment. Further, the result of the summation performed by the summing circuit 213 is output as y_imag[m], which is the imaginary part value of the complex value Y[m].
- at the bit position L of the MSB, the sign of the element value selected by the real part calculation LUT index circuit 205m-L and the sign of the element value selected by the imaginary part calculation LUT index circuit 210m-L are inverted.
- alternatively, at the bit position L of the MSB, as at the other bit positions, the element value selected by the real part calculation LUT index circuit 205m-L may be multiplied by 2^(L-1) by the multiple operation circuit 208m-L as it is, and the element value selected by the imaginary part calculation LUT index circuit 210m-L may be multiplied by 2^(L-1) by the multiple operation circuit 212m-L as it is.
- the processing for multiplying the element values selected by the real part calculation LUT index circuit 205m-l and the imaginary part calculation LUT index circuit 210m-l by 2^(l-1) can be realized by shifting the element value, expressed as a binary number, to the left by (l-1) bits. Therefore, the L multiple operation circuits 208m-l and the L multiple operation circuits 212m-l do not need multiplication circuits and can be realized by simple circuits.
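The bit-position processing described above can be sketched in software as follows. This is a minimal behavioural model, not the circuit itself: it assumes the inputs are L-bit two's-complement integers, that the LUT index is formed as `xr_bit + 2 * xi_bit`, and the function names (`to_bits`, `da_accumulate`, `complex_multiply_da`) are hypothetical. The four LUT element values are the ones the text gives for the real part and imaginary part calculation LUTs.

```python
def to_bits(x, L):
    """Return the L two's-complement bits of x, least significant bit first."""
    assert -(1 << (L - 1)) <= x < (1 << (L - 1)), "x must fit in L bits"
    return [(x >> l) & 1 for l in range(L)]


def da_accumulate(lut, xr_bits, xi_bits):
    """Sum the selected LUT elements over all bit positions.

    Assumption: the LUT is indexed by xr_bit + 2 * xi_bit.
    Loop index l = 0 is the LSB, i.e. bit position 1 in the text.
    """
    L = len(xr_bits)
    total = 0
    for l in range(L):
        element = lut[xr_bits[l] + 2 * xi_bits[l]]  # LUT index step
        if l == L - 1:
            element = -element  # sign inversion at the MSB position
        total += element << l  # multiply by 2**l with a left shift
    return total


def complex_multiply_da(c_real, c_imag, x_real, x_imag, L=8):
    """Compute C * X with two 4-element LUTs instead of multipliers."""
    real_lut = (0, c_real, -c_imag, c_real - c_imag)
    imag_lut = (0, c_imag, c_real, c_real + c_imag)
    xr_bits = to_bits(x_real, L)
    xi_bits = to_bits(x_imag, L)
    y_real = da_accumulate(real_lut, xr_bits, xi_bits)
    y_imag = da_accumulate(imag_lut, xr_bits, xi_bits)
    return y_real, y_imag
```

For any in-range inputs, the pair returned by `complex_multiply_da` should equal `(c_real * x_real - c_imag * x_imag, c_imag * x_real + c_real * x_imag)`, which makes the sketch easy to check against direct multiplication.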
- distributed arithmetic can be performed using LUTs of only four element values each (LUT generation circuit 1a, real part calculation LUT index circuit 205m-l, and imaginary part calculation LUT index circuit 210m-l), without dividing an N-term product-sum operation into binomial product-sum operations.
- the present embodiment utilizes the above-described feature of multiplication between complex numbers, and each LUT for obtaining the real part and imaginary part values of the result of multiplying complex data by a coefficient is implemented not as a memory circuit but as a circuit that generates the element values of the LUT in advance and a circuit that selects an element value using logic gates such as a selector. With such a configuration, the same effect as that of the first embodiment can be obtained.
- the circuit can be divided into a circuit that generates and holds each element value of the LUT in advance (LUT generation circuit 1a) and circuits that select each element value of the LUT (real part calculation LUT index circuit 205m-l and imaginary part calculation LUT index circuit 210m-l), and only the real part calculation LUT index circuit 205m-l and the imaginary part calculation LUT index circuit 210m-l are parallelized, without parallelizing the LUT generation circuit 1a. This avoids a redundant circuit configuration, that is, parallelization (copying) of the circuit that holds each element value of the LUT, and suppresses the increase in circuit scale that accompanies parallelization.
- the arithmetic circuit of this embodiment does not generate and distribute the element values of the real part calculation LUT and the imaginary part calculation LUT separately. Instead, generation and distribution are shared for c_real, which is an element common to the real part calculation LUT and the imaginary part calculation LUT.
- for c_imag, which is an element value of the imaginary part calculation LUT, and -c_imag, which is included in the elements of the real part calculation LUT, only c_imag is distributed, and its sign is inverted by the sign inverting circuit 206 on the real part calculation LUT index circuit 205m-l side. This reduces the number of wires in the circuit used for distribution.
- the circuit size and power consumption can be reduced as compared with the configuration in which the real part calculation LUT and the imaginary part calculation LUT are generated and distributed completely independently.
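The split between one shared generation step and per-unit selection can be illustrated with a short sketch. It is only a software analogy under stated assumptions: the function names are hypothetical, the index order `xr_bit + 2 * xi_bit` is assumed, and d_sub and d_add are the difference and sum that the text attributes to the LUT generation circuit 1a.

```python
def generate_lut_values(c_real, c_imag):
    """Shared step: computed once per coefficient, then distributed."""
    d_sub = c_real - c_imag
    d_add = c_real + c_imag
    return c_real, c_imag, d_sub, d_add


def select_real(values, xr_bit, xi_bit):
    """Per-unit selection of a real part LUT element.

    -c_imag is not distributed; it is derived locally from c_imag.
    """
    c_real, c_imag, d_sub, _ = values
    neg_c_imag = -c_imag  # local sign inversion on the index side
    return (0, c_real, neg_c_imag, d_sub)[xr_bit + 2 * xi_bit]


def select_imag(values, xr_bit, xi_bit):
    """Per-unit selection of an imaginary part LUT element."""
    c_real, c_imag, _, d_add = values
    return (0, c_imag, c_real, d_add)[xr_bit + 2 * xi_bit]


# One tuple of four values serves every parallel unit.
shared_values = generate_lut_values(3, 5)
```

In this model only four values travel from the generation step to the selectors, and both four-element tables are recovered from them, which mirrors the reduction in distribution wiring described above.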
- FIG. 9 is a block diagram showing the configuration of an arithmetic circuit according to a fourth embodiment of the present invention.
- the same reference numerals as in FIG. 6 denote the same parts in FIG.
- This embodiment shows a configuration for improving the throughput of the arithmetic circuit shown in the third embodiment without increasing the circuit size and power consumption.
- Each of the M complex number values Y[m] corresponds to (C × X[m]). That is, the real part value y_real[m] corresponds to c_real × x_real[m] - c_imag × x_imag[m].
- the imaginary part value y_imag[m] corresponds to c_imag × x_real[m] + c_real × x_imag[m].
- the arithmetic circuit shown in FIG. 9 includes one LUT generation circuit 1a, one LUT latch circuit 3a, and M (M is an integer of 2 or more) distributed arithmetic circuits 2a-1 to 2a-M.
- the LUT generation circuit 1a receives the real part value c_real and the imaginary part value c_imag of the complex number coefficient C, and the value d_sub corresponding to the difference c_real-c_imag between the real part value c_real and the imaginary part value c_imag and the real part value c_real A value d_add corresponding to the sum c_real + c_imag of the imaginary part value c_imag is calculated, and the values d_sub and d_add are output to the LUT latch circuit 3a together with the real part value c_real and the imaginary part value c_imag.
- the LUT latch circuit 3a is a circuit that receives c_real, c_imag, d_sub, and d_add output from the LUT generation circuit 1a, latches each of these values at each clock pulse input, and holds them until the next clock pulse is input.
- the LUT latch circuit 3a can be realized by flip-flops that hold the value of each bit of c_real, c_imag, d_sub, and d_add in synchronization with a clock. The LUT latch circuit 3a then outputs the held c_real, c_imag, d_sub, and d_add to each of the distributed arithmetic circuits 2a-1 to 2a-M.
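The latch behaviour can be modelled as a register that is updated once per clock pulse. The sketch below is a behavioural illustration only; the class name `LutLatch`, the all-zero reset value, and the helper `run_pipeline` are assumptions made for the example and are not taken from the text.

```python
class LutLatch:
    """Clocked register holding (c_real, c_imag, d_sub, d_add)."""

    def __init__(self):
        self.held = (0, 0, 0, 0)  # reset value is an assumption

    def clock(self, next_values):
        """Capture the input at a clock pulse and hold it until the next one."""
        self.held = tuple(next_values)


def generate(c_real, c_imag):
    """Stand-in for the generation stage placed in front of the latch."""
    return c_real, c_imag, c_real - c_imag, c_real + c_imag


def run_pipeline(coefficients):
    """Return what the downstream stage sees in each clock cycle."""
    latch = LutLatch()
    seen = []
    for c_real, c_imag in coefficients:
        # Downstream stage reads the values latched at the previous pulse.
        seen.append(latch.held)
        # Generation result for the current coefficient is captured at the pulse.
        latch.clock(generate(c_real, c_imag))
    return seen
```

In this model the downstream stage always works on values generated one cycle earlier, so the two stages overlap in time rather than running back to back.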
- the real part value c_real, the imaginary part value c_imag, and the values d_sub and d_add distributed from the LUT latch circuit 3a are input, the complex coefficient C is multiplied by the data corresponding to its own circuit among the complex numbers X[m], and the resulting products are summed.
- the distributed arithmetic circuit 2a-m configures a real part calculation LUT whose element values are 0, c_real, -c_imag, and d_sub, and an imaginary part calculation LUT whose element values are 0, c_imag, c_real, and d_add. It obtains the result of the real part product-sum operation c_real × x_real[m] - c_imag × x_imag[m] by distributed arithmetic using the real part calculation LUT and outputs it as y_real[m].
- the result of the imaginary part product-sum operation c_imag × x_real[m] + c_real × x_imag[m] is obtained by distributed arithmetic using the imaginary part calculation LUT and is output as y_imag[m].
- the configuration of the distributed arithmetic circuit 2a-m is as described in the third embodiment.
- the upper limit of the clock frequency of the system including the arithmetic circuit is restricted by the total time (Td_LUT + Td) of the generation time Td_LUT, which is the processing time for calculating d_sub and d_add from the complex coefficient C in the LUT generation circuit 1a, and the distributed arithmetic time Td, which is the processing time for generating y_real[m] and y_imag[m] in the distributed arithmetic circuit 2a-m.
- the upper limit of the clock frequency of the system including the arithmetic circuit is restricted by the longer of the generation time Td_LUT and the distributed arithmetic time Td.
- the upper limit of the clock frequency of a system employing the arithmetic circuit of FIG. 6 is 1 / (Td_LUT + Td).
- the upper limit of the clock frequency of a system employing the arithmetic circuit of FIG. 9 is the smaller of 1 / Td_LUT and 1 / Td. That is, the arithmetic circuit of FIG. 9 operates faster than the arithmetic circuit of FIG.
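The two frequency limits can be compared numerically with a small sketch. The function names and the delay values used in the example are hypothetical and chosen only for illustration; the formulas are the ones stated above.

```python
def fmax_unpipelined(td_lut, td):
    """Limit when generation and distributed arithmetic share one clock cycle."""
    return 1.0 / (td_lut + td)


def fmax_pipelined(td_lut, td):
    """Limit when a latch separates the two stages: the slower stage dominates."""
    return min(1.0 / td_lut, 1.0 / td)


# Hypothetical delays in seconds, for illustration only.
TD_LUT = 2e-9
TD = 3e-9
speedup = fmax_pipelined(TD_LUT, TD) / fmax_unpipelined(TD_LUT, TD)
```

Because Td_LUT + Td is larger than either delay alone whenever both are positive, the pipelined limit is always the higher of the two in this model.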
- the only flip-flops that become an issue in the pipeline configuration are those used in the circuit of the LUT latch circuit 3a that holds c_real, c_imag, d_sub, and d_add in synchronization with a clock.
- the circuit size and power consumption increase significantly with the pipeline configuration, but as described in the present embodiment, the distributed arithmetic circuit 2a-in parallel with the LUT generation circuit 1a and M parallel
- the arithmetic circuit of this embodiment can improve the throughput without increasing the circuit size and power consumption.
- the arithmetic circuits described in the first to fourth embodiments can be realized by, for example, an FPGA (Field Programmable Gate Array).
- the present invention can be applied to arithmetic circuits.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Algebra (AREA)
- Complex Calculations (AREA)
- Nonlinear Science (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/959,968 US11360741B2 (en) | 2018-01-05 | 2018-12-18 | Arithmetic circuit |
| CN201880085302.XA CN111615700B (zh) | 2018-01-05 | 2018-12-18 | 运算电路 |
| US17/643,507 US12386591B2 (en) | 2018-01-05 | 2021-12-09 | Arithmetic circuit |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2018000451A JP6995629B2 (ja) | 2018-01-05 | 2018-01-05 | 演算回路 |
| JP2018-000451 | 2018-01-05 |
Related Child Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/959,968 A-371-Of-International US11360741B2 (en) | 2018-01-05 | 2018-12-18 | Arithmetic circuit |
| US17/643,507 Division US12386591B2 (en) | 2018-01-05 | 2021-12-09 | Arithmetic circuit |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019135354A1 (ja) | 2019-07-11 |
Family
ID=67144123
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2018/046495 Ceased WO2019135354A1 (ja) | 2018-01-05 | 2018-12-18 | 演算回路 |
Country Status (4)
| Country | Link |
|---|---|
| US (2) | US11360741B2 |
| JP (1) | JP6995629B2 |
| CN (1) | CN111615700B |
| WO (1) | WO2019135354A1 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| RU2803254C1 (ru) * | 2022-12-14 | 2023-09-11 | Федеральное государственное бюджетное военное образовательное учреждение высшего образования "Черноморское высшее военно-морское ордена Красной Звезды училище имени П.С. Нахимова" Министерства обороны Российской Федерации (г. Севастополь) | Вероятностное устройство вычисления дисперсии |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP6863907B2 (ja) * | 2018-01-05 | 2021-04-21 | 日本電信電話株式会社 | 演算回路 |
| US11403068B2 (en) * | 2020-08-24 | 2022-08-02 | Xilinx, Inc. | Efficient hardware implementation of the exponential function using hyperbolic functions |
| CN114968172B (zh) * | 2022-04-13 | 2025-10-28 | 深圳云豹智能股份有限公司 | 查找表电路、芯片及计算机设备 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2004171263A (ja) * | 2002-11-20 | 2004-06-17 | Sharp Corp | 演算装置 |
| JP2004265346A (ja) * | 2003-03-04 | 2004-09-24 | Sony Corp | 離散コサイン変換装置および逆離散コサイン変換装置 |
| US20050201457A1 (en) * | 2004-03-10 | 2005-09-15 | Allred Daniel J. | Distributed arithmetic adaptive filter and method |
| JP2012169926A (ja) * | 2011-02-15 | 2012-09-06 | Fujitsu Ltd | Crc演算回路 |
Family Cites Families (24)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4970674A (en) * | 1988-03-10 | 1990-11-13 | Rockwell International Corporation | Programmable windowing FFT device with reduced memory requirements |
| US5226002A (en) * | 1991-06-28 | 1993-07-06 | Industrial Technology Research Institute | Matrix multiplier circuit |
| JP3473647B2 (ja) * | 1994-12-28 | 2003-12-08 | Necインフロンティア株式会社 | エコーサプレッサ回路 |
| JPH0981541A (ja) * | 1995-09-12 | 1997-03-28 | Matsushita Electric Ind Co Ltd | 累算器 |
| TW302578B (en) * | 1996-04-10 | 1997-04-11 | United Microelectronics Corp | The digital filter bank structure and its application method |
| CN1227608C (zh) * | 1998-02-05 | 2005-11-16 | 英泰利克斯公司 | 基于n元组或随机存取存储器的神经网络分类系统和方法 |
| JP3139466B2 (ja) * | 1998-08-28 | 2001-02-26 | 日本電気株式会社 | 乗算器及び積和演算器 |
| JP2000132539A (ja) * | 1998-10-28 | 2000-05-12 | Matsushita Electric Ind Co Ltd | 演算装置 |
| US6477203B1 (en) * | 1998-10-30 | 2002-11-05 | Agilent Technologies, Inc. | Signal processing distributed arithmetic architecture |
| US6989843B2 (en) * | 2000-06-29 | 2006-01-24 | Sun Microsystems, Inc. | Graphics system with an improved filtering adder tree |
| JP3820144B2 (ja) * | 2001-12-12 | 2006-09-13 | シャープ株式会社 | 信号評価装置および信号評価方法 |
| US7007052B2 (en) * | 2001-10-30 | 2006-02-28 | Texas Instruments Incorporated | Efficient real-time computation |
| JP4129618B2 (ja) * | 2002-05-22 | 2008-08-06 | 日本電気株式会社 | 演算装置及び方法 |
| US6982662B2 (en) * | 2003-03-06 | 2006-01-03 | Texas Instruments Incorporated | Method and apparatus for efficient conversion of signals using look-up table |
| JP4086868B2 (ja) * | 2005-09-06 | 2008-05-14 | Necエレクトロニクス株式会社 | 表示装置、コントローラドライバ、近似演算補正回路、及び表示パネルの駆動方法 |
| US8593483B2 (en) * | 2009-10-20 | 2013-11-26 | Apple Inc. | Temporal filtering techniques for image signal processing |
| KR20120077164A (ko) * | 2010-12-30 | 2012-07-10 | 삼성전자주식회사 | Simd 구조를 사용하는 복소수 연산을 위한 사용하는 장치 및 방법 |
| US20130185345A1 (en) * | 2012-01-16 | 2013-07-18 | Designart Networks Ltd | Algebraic processor |
| US8930433B2 (en) * | 2012-04-24 | 2015-01-06 | Futurewei Technologies, Inc. | Systems and methods for a floating-point multiplication and accumulation unit using a partial-product multiplier in digital signal processors |
| CN102681815B (zh) * | 2012-05-11 | 2016-03-16 | 深圳市清友能源技术有限公司 | 用加法器树状结构的有符号乘累加算法的方法 |
| KR101551641B1 (ko) * | 2015-04-02 | 2015-09-08 | 한석진 | 비선형 데이터의 평균 계산 장치 |
| US9489482B1 (en) * | 2015-06-15 | 2016-11-08 | International Business Machines Corporation | Reliability-optimized selective voltage binning |
| KR102359265B1 (ko) * | 2015-09-18 | 2022-02-07 | 삼성전자주식회사 | 프로세싱 장치 및 프로세싱 장치에서 연산을 수행하는 방법 |
| JP6863907B2 (ja) * | 2018-01-05 | 2021-04-21 | 日本電信電話株式会社 | 演算回路 |
2018
- 2018-01-05 JP JP2018000451A patent/JP6995629B2/ja active Active
- 2018-12-18 CN CN201880085302.XA patent/CN111615700B/zh active Active
- 2018-12-18 WO PCT/JP2018/046495 patent/WO2019135354A1/ja not_active Ceased
- 2018-12-18 US US16/959,968 patent/US11360741B2/en active Active
2021
- 2021-12-09 US US17/643,507 patent/US12386591B2/en active Active
Non-Patent Citations (1)
| Title |
|---|
| YI, RU ET AL.: "Implementation Consideration of Linear-Phase Delay Digital Filter Using Distributed Arithmetic on FPGA", JOINT RESEARCH PRESENTATION OF TOCHIGI AND GUNMA BRANCHES OF THE INSTITUTE OF ELECTRICAL ENGINEERS OF JAPAN, 29 February 2012 (2012-02-29), pages 18 - 20 , 25-26, XP055619872, Retrieved from the Internet <URL:https://kobaweb.ei.st.gunma-u.ac.jp/news/pdf/2011/ETT-11-07ekijo.pdf> [retrieved on 20190319] * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP6995629B2 (ja) | 2022-01-14 |
| JP2019121171A (ja) | 2019-07-22 |
| CN111615700B (zh) | 2023-12-08 |
| US11360741B2 (en) | 2022-06-14 |
| US12386591B2 (en) | 2025-08-12 |
| US20210064342A1 (en) | 2021-03-04 |
| CN111615700A (zh) | 2020-09-01 |
| US20220100472A1 (en) | 2022-03-31 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12386591B2 (en) | Arithmetic circuit | |
| US5255216A (en) | Reduced hardware look up table multiplier | |
| JPH03156531A (ja) | 除算装置 | |
| US20040153490A1 (en) | Logic circuit and method for carry and sum generation and method of designing such a logic circuit | |
| US4293922A (en) | Device for multiplying binary numbers | |
| Balajishanmugam | High-performance computing based on residue number system: a review | |
| WO2024253854A1 (en) | Architecture for number theoretic transform and inverse number theoretic transform | |
| CN111630509B (zh) | 执行积和运算的运算电路 | |
| JP3003467B2 (ja) | 演算装置 | |
| JP2822399B2 (ja) | 対数関数演算装置 | |
| JP2019121171A5 | | |
| US5493522A (en) | Fast arithmetic modulo divider | |
| CN109379191B (zh) | 一种基于椭圆曲线基点的点乘运算电路和方法 | |
| RU2559771C2 (ru) | Устройство для основного деления модулярных чисел | |
| US12531723B2 (en) | Architecture for number theoretic transform and inverse number theoretic transform | |
| RU2477513C1 (ru) | Ячейка однородной вычислительной среды, однородная вычислительная среда и устройство для конвейерных арифметических вычислений по заданному модулю | |
| JPH056263A (ja) | 加算器およびその加算器を用いた絶対値演算回路 | |
| Ramírez | Simple and Linear Fast Adder of Multiple Inputs and Its Implementation in a Compute-In-Memory Architecture | |
| US20260037218A1 (en) | Reconfigurable butterfly architecture | |
| US20260019085A1 (en) | Approximation values in look up tables | |
| US20260010490A1 (en) | Memory conflict resolution for dilithium cryptography | |
| Bello et al. | A MRC Based RNS to binary converter using the moduli set {22n+ 1-1, 2n-1, 22n-1} | |
| KR100632928B1 (ko) | 모듈라 곱셈장치 | |
| JP2508286B2 (ja) | 平方根演算装置 | |
| Bhattacharjee et al. | FPGA Optimized Pipelined Modulo Computation Architecture Leveraging Primitive Instantiation and Placement Constraints for Efficient Logic Packing |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18898668; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18898668; Country of ref document: EP; Kind code of ref document: A1 |