WO2023120403A1 - Arithmetic unit for sparse matrix computation and merge sort using CGRA - Google Patents
- Publication number
- WO2023120403A1 (application PCT/JP2022/046353)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- register
- data
- input
- address
- index value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/22—Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
- G06F7/24—Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general
Definitions
- The present invention relates to arithmetic units in a CGRA (Coarse-Grained Reconfigurable Architecture), a reconfigurable architecture that has PEs (Processing Elements) arranged in a two-dimensional array as computational resources, and in particular to arithmetic units for sparse matrix computation and merge sort.
- the CGRA is an architecture originating from a systolic array, and is a reconfigurable architecture whose circuit can be changed any number of times like an FPGA (Field Programmable Gate Array).
- A CGRA is often composed of a memory and basic units, each consisting of an arithmetic unit and a register. The granularity of reconfiguration is the gate level in an FPGA, whereas in a CGRA it is the arithmetic unit level. Therefore, a CGRA is superior to an FPGA in operating frequency and circuit area when equivalent functions are realized with equivalent process rules. Further, because a CGRA performs pipelined parallel processing in the vertical direction, the amount of data that must be supplied to the group of arithmetic units every cycle can be kept small.
- When large-scale parameter matrix calculations such as neural network calculations are processed on a processor, the amount of calculation and the processing load increase. An accelerator that reduces the required hardware resources for such calculations is known (see Patent Literature 1).
- the accelerator of Patent Literature 1 inserts padding into each arrangement of non-zero elements of a sparse matrix according to the order in which the data loading unit accesses each element of the input vector stored in the storage unit in parallel.
- However, the sparse matrix-vector product in the accelerator of Patent Literature 1 performs a fused multiply-add of the input elements from the data buffer and the weight parameter sequence, which amounts to a dense matrix-sparse matrix product. It does not address sparse matrix-sparse matrix multiplication.
- The present invention is intended to provide an arithmetic unit for sparse matrix calculation and merge sort by CGRA that enables sparse matrix-sparse matrix multiplication and merge sort with effective use of memory space and a small amount of local memory.
- An arithmetic unit according to the present invention comprises a 3-input pipeline floating-point multiply-accumulator (sum-of-products calculator) and a local memory. The multiply-accumulator has a first input register and a second input register, each of which is a set of an index register and a data register, and a third input register and an output register, which are data registers.
- The index value A and the index value B read from the local memory are stored in the respective index registers of the first input register and the second input register of the sum-of-products calculator, and the data A and the data B read from the local memory are stored in the respective data registers.
- The index value A and the index value B are compared, and if the values match, the product of data A and data B is added to the data value of the third input register, and the sum is stored in the output register of the sum-of-products calculator and returned to the third input register.
- Because a CGRA is pipelined in the vertical direction, data cannot be compressed in the vertical direction; the data is therefore compressed by removing zero elements in the horizontal direction.
- sparse matrix calculation using a small amount of local memory can be performed without interruption.
- the index value A is the column number of the sparse matrix
- the index value B is the column number of the transposed matrix of the sparse matrix.
- The index value A and the index value B read from the local memory are stored in the respective index registers of the first input register and the second input register of the first stage of the sum-of-products calculator, and data A and data B read from the local memory are stored in the respective data registers.
- The index value A and the index value B are compared. If the values differ, the constant C is stored in the output data register of the first stage of the sum-of-products calculator; if the values match, data A and data B are multiplied and the product is stored in that output data register.
- The addition result is propagated to the output data register of the second stage of the sum-of-products calculator, and at timing n+4 of the clock signal the normalized sum-of-products result is propagated to the output data register of the third stage.
- The data is then returned from the output data register of the third stage to the data register that is the third input register of the first stage of the sum-of-products calculator.
- The arithmetic unit of the present invention comprises a first address calculation mechanism and a second address calculation mechanism, each including an input base address register and an input offset address register.
- Address information is loaded into each input base address register and input offset address register, and the index value A and the index value B read from the local memory are compared. The first address calculation mechanism adds the value required for next-element reference to its input base address register if A ≤ B, the second address calculation mechanism adds the value required for next-element reference to its input base address register if A ≥ B, and the address addition result is returned to the input base address register.
- The address information is loaded into each input base address register and input offset address register at the first timing of the clock signal, and at the second timing of the clock signal the index value A read from the local memory is compared with the index value B. The first address calculation mechanism adds the value necessary for next-element reference to its input base address register if A ≤ B, and the second address calculation mechanism does the same for its input base address register if A ≥ B. The address addition result is propagated to the data holding register at the third and fourth timings of the clock signal, and at the next timing of the clock signal the contents of the data holding register are returned to the input base address register.
- The first and second address calculation mechanisms each have an address mask mechanism that divides the local memory into a plurality of spaces. A value is set in the input offset address register at the first timing of the clock signal, the value of the input offset address register is propagated to the data holding register at the second timing, and each address addition result and the value of the input offset address register are added at the third timing.
- The addresses are then separated into different spaces by the address mask mechanism and stored in address latches for local memory reference. At the fifth timing of the clock signal, the set of index value and data read from the local memory is stored in the memory output register, and at the next timing of the clock signal it is read out from the memory output register and used as the index value.
- An arithmetic method according to the present invention uses an arithmetic unit composed of a 3-input pipeline floating-point multiply-accumulate calculator and a local memory, the calculator having a first input register and a second input register, each a set of an index register and a data register, and a third input register and an output register, which are data registers.
- Index value A and index value B read from the local memory are stored in the respective index registers of the first and second input registers of the sum-of-products calculator, and data A and data B read from the local memory are stored in the respective data registers.
- If index value A and index value B match, the product of data A and data B is added to the data value of the third input register, and the sum is stored in the output register of the sum-of-products calculator and returned to the third input register.
- a data path is added to the CGRA type accelerator to perform a sparse matrix-sparse matrix multiplication operation using a small amount of local memory.
- 64-bit data, in which the index value (the matrix element number) is stored in the upper 32 bits and the data (the matrix element value) in the lower 32 bits, is stored in the local memory, and the sparse matrix-sparse matrix multiplication is performed on it. Focusing on the fact that both inputs are sparse matrices, the zero elements are removed, the data is compressed, and the calculation is performed on the compressed data, so that sparse matrix-sparse matrix products and sparse matrix-dense matrix products are computed efficiently with the same throughput. Since a CGRA is pipelined in the vertical direction and cannot be compressed in that direction, zero elements are removed in the horizontal direction to compress the data, and a data path is added so that sparse matrix calculations can be performed continuously using a small amount of local memory.
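The 64-bit element layout described above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and treating the lower 32 bits as an IEEE 754 single-precision value and the index as a signed 32-bit integer are assumptions, since the text states only the 32/32 split.

```python
import struct
from typing import Tuple

def pack_element(index: int, value: float) -> int:
    """Pack an index (upper 32 bits) and a value (lower 32 bits) into one 64-bit word."""
    index_bits = index & 0xFFFFFFFF  # two's complement, so -1 becomes 0xFFFFFFFF
    # Assumption: the data half is an IEEE 754 single-precision float.
    (value_bits,) = struct.unpack("<I", struct.pack("<f", value))
    return (index_bits << 32) | value_bits

def unpack_element(word: int) -> Tuple[int, float]:
    """Split a 64-bit word back into its signed index and float value."""
    index_bits = (word >> 32) & 0xFFFFFFFF
    index = index_bits - (1 << 32) if index_bits & 0x80000000 else index_bits
    (value,) = struct.unpack("<f", struct.pack("<I", word & 0xFFFFFFFF))
    return index, value

word = pack_element(5, 1.5)
print(hex(word), unpack_element(word))
```

Keeping the index in the upper half means a single 64-bit read from local memory delivers both the comparison key and the operand.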
- the calculation method according to the first aspect of the present invention is a calculation method for matrix multiplication of a sparse matrix A and a sparse matrix B, and preferably further includes the following steps 1-4) and 1-5).
- Index value A is the column number of the sparse matrix
- index value B is the column number of the transposed matrix of the sparse matrix
- zero-valued elements in each sparse matrix are compressed.
- 1-5) A step in which non-zero valued elements in each sparse matrix are stored in local memory as pairs of index values and data.
- Compressing the zero-valued elements in a sparse matrix removes the zero-valued elements, leaving only the non-zero ones, and attaches the column number to each: the index value stores the original column number, and the data stores the element value at that column.
- the calculation method according to the first aspect of the present invention preferably further comprises steps 1-6) to 1-9) below.
- 1-6) Loading address information into each input base address register and input offset address register.
- 1-7) A step of comparing the index value A and the index value B read from the local memory.
- The first address calculation mechanism adds the value necessary for next-element reference to its input base address register if A ≤ B, and the second address calculation mechanism adds the value necessary for next-element reference to its input base address register if A ≥ B.
- Another arithmetic method of the present invention is a merge sort method using an arithmetic unit composed of a 3-input pipeline floating-point multiply-accumulate calculator and a local memory.
- The sum-of-products calculator has a first input register and a second input register, each a set of an index register and a data register, a third input register that is a data register, and an output register, and the method comprises steps 2-1) to 2-5) below.
- Index value A and index value B read from the local memory are stored in the respective index registers of the first and second input registers of the sum-of-products calculator, and data A and data B read from the local memory are stored in the respective data registers.
- a calculation method adds a data path to a CGRA type accelerator, uses a small amount of local memory, sorts the entire data in ascending or descending order, and performs merge sort.
- merge-sort is performed according to the magnitude relationship of the reading results of the upper 32-bit data, which are the index values of the two inputs.
- With the CGRA-based operation unit of the present invention, it is possible to perform sparse matrix-sparse matrix multiplication and merge sort using a small amount of local memory.
- With reference to FIG. 1, a case will be described in which a matrix A of 4 rows and 8 columns is multiplied by a matrix B of 8 rows and 4 columns to calculate a matrix C of 4 rows and 4 columns.
- In the matrix product, the row direction of A is multiplied by the column direction of B, as framed in FIG. 1(1). In that case, however, the memory accesses are not sequential, so the transposed matrix BT of the matrix B is used to make the memory accesses sequential (see FIG. 1(2)).
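The effect of using the transposed matrix can be illustrated with a small dense example. The sketch below is not taken from the patent; the function names are hypothetical. It shows that once B is transposed, each element of C is a dot product of one row of A with one row of BT, so both operands are read in storage order.

```python
def transpose(matrix):
    """Return the transpose of a list-of-rows matrix."""
    return [list(column) for column in zip(*matrix)]

def matmul_with_transposed_b(a, b):
    """Compute C = A x B by walking rows of A and rows of BT (sequential access)."""
    bt = transpose(b)
    return [
        [sum(x * y for x, y in zip(row_a, row_bt)) for row_bt in bt]
        for row_a in a
    ]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_with_transposed_b(a, b))
```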
- When both matrix A and matrix B are sparse matrices, i.e., matrices in which most of the elements are zero, the amount of calculation can be reduced by selecting only the non-zero elements for calculation.
- As shown in FIG. 1(3), in the calculation of the first row of the matrix A, only the non-zero values A00 and A01 need to be calculated, and the other zero-valued elements can be skipped.
- The first to fourth rows of the transposed matrix BT to be multiplied by A00 and A01 in the first row of matrix A contain B00, B10 in the first row; B01, B11, B21, B31 in the second row; B42, B52, B62, B72 in the third row; and B63, B73 in the fourth row.
- The matrix A and the transposed matrix BT shown in FIG. 1(4) are compressed.
- Compression replaces each non-zero element in matrix A and transposed matrix BT with a set of an index value (the matrix element number) and data (the element value), and removes all zero-valued elements.
- The remaining sets of index value and data are given a negative value (e.g., -1) as the index value.
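The compression step can be sketched as follows. This is a hedged illustration: `compress_row`, the `slots` parameter, and the reading that unused slots are padded with a negative index are assumptions made for the example, not details confirmed by the text.

```python
def compress_row(row, slots, filler_index=-1):
    """Keep only non-zero elements as (index, data) sets, padded to a fixed slot count.

    Assumption: slots left over after the non-zero elements are filled with
    a negative index (here -1) and a data value of 0.0.
    """
    pairs = [(column, value) for column, value in enumerate(row) if value != 0]
    if len(pairs) > slots:
        raise ValueError("row has more non-zero elements than available slots")
    pairs += [(filler_index, 0.0)] * (slots - len(pairs))
    return pairs

# First row of an 8-column matrix with two non-zero elements.
print(compress_row([3.0, 4.0, 0, 0, 0, 0, 0, 0], slots=4))
```

A negative index can never match a real column number, so a consumer can treat it as an end-of-row marker.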
- In the vertical direction, the data obtained by multiplying the element A00 of the matrix A by the elements B00, B01, B02, and B03 of the matrix B flows downward. In the subsequent stages, the data obtained by multiplying the element A01 of the matrix A by the elements B10, B11, B12, and B13 of the matrix B, and the data obtained by multiplying the elements A02 and A03 of the matrix A by the elements B20 to B23 and B30 to B33 of the matrix B, respectively, are added in turn, and the sums finally become the elements C00, C01, C02, and C03 of the matrix C.
- For the compressed data, however, the data additions cannot be executed by chaining operation units together as in the vertical direction.
- Instead, for the compressed matrix A and transposed matrix BT, the index values must be referred to inside one arithmetic unit, and the sparse matrix data must be circulated and added there.
- That is, the compressed sparse matrix A and the transposed matrix BT of the sparse matrix B are fetched from the external memory, the index values are referred to inside one operation unit, and the sparse matrix data is circulated within the unit to accumulate the sum.
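As a software analogy for accumulating inside a single unit, the sketch below walks one compressed row of A and one compressed row of BT with two cursors and accumulates products only where the index values match. The function name, the sample values, and the use of a negative index as an end marker are assumptions for illustration.

```python
def sparse_dot(row_a, row_bt):
    """Dot product of two compressed rows given as lists of (index, data) sets."""
    i = j = 0
    acc = 0.0  # plays the role of the value circulated through the third input
    while i < len(row_a) and j < len(row_bt):
        index_a, data_a = row_a[i]
        index_b, data_b = row_bt[j]
        if index_a < 0 or index_b < 0:  # assumed end-of-row marker
            break
        if index_a == index_b:
            acc += data_a * data_b
        # Advance A when A <= B and B when A >= B (both advance on a match).
        if index_a <= index_b:
            i += 1
        if index_a >= index_b:
            j += 1
    return acc

row_a = [(0, 2.0), (1, 3.0)]  # hypothetical values for A00, A01
bt_rows = [
    [(0, 1.0), (1, 4.0)],
    [(0, 5.0), (1, 6.0), (2, 7.0), (3, 8.0)],
    [(4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)],
    [(6, 9.0), (7, 9.0)],
]
print([sparse_dot(row_a, row_bt) for row_bt in bt_rows])
```

The index pattern of `bt_rows` mirrors the example rows of BT described for FIG. 1; the numeric values are made up.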
- FIG. 5 shows a schematic diagram of the configuration of the arithmetic unit of the present invention.
- The arithmetic unit of the present invention comprises a first address calculation mechanism 1, a second address calculation mechanism 2, and an arithmetic unit 3, each of which has an internal local memory (15, 25, 35) connected to an external memory 5.
- The calculator 10 of the first address calculation mechanism 1 receives the input base address A and the index values of the sparse matrix A and the transposed matrix BT, compares the index values, and if B ≥ A, increments the address of index value A so that the next element of the sparse matrix A is referenced.
- The local memory 15 stores sets of index values and data of the sparse matrix A.
- The memory output register 16 holds a set of index value and data read from the local memory 15.
- The calculator 20 of the second address calculation mechanism 2 receives the input base address B and the index values of the sparse matrix A and the transposed matrix BT, compares the index values, and if B ≤ A, increments the address of index value B so that the next element of the transposed matrix BT is referenced.
- If B ≤ A does not hold, the address of the index value B is kept as it is and is not incremented.
- The value of the input offset address B is added to the address output by the calculator 20.
- The local memory 25 stores sets of index values and data of the transposed matrix BT.
- The memory output register 26 holds a set of index value and data read from the local memory 25.
- The arithmetic unit 3 is composed of a 3-input pipeline floating-point multiply-accumulator 30 and a local memory 35.
- The calculator 30 receives three inputs: the first input register 31 and the second input register 32, which are output from the first and second address calculation mechanisms, and the output result of the previous process (third input register 33).
- Local memory 35 is connected to external memory 5 .
- Four-stage pipeline processing completes the processing of one instruction in four stages; the stages operate independently and simultaneously, so they can run in parallel.
- the index value A and the index value B read out from the local memory are stored in the first input register and the second input register of the first stage of the sum-of-products arithmetic unit.
- Data A and data B read from the memory are stored in each data register.
- the index value A and the index value B are compared.
- Data A and data B are multiplied and stored in the output data register of the first stage of the sum-of-products calculator.
- the addition result is propagated to the output data register of the second stage of the sum-of-products arithmetic unit.
- the normalized sum-of-products operation result is propagated to the output data register of the third stage of the sum-of-products calculator.
- a back-to-back data accumulation ring is constructed.
- Input base address register A and input offset address register A are loaded with address information at the first timing of the clock signal.
- The index value A and the index value B read from the local memory are compared, and either a value necessary for referring to the next element (for example, +8 bytes) is added to the input base address register A or its value is kept as it is (+0 bytes).
- The result passes through the third and fourth timings of the clock signal, and a back-to-back address accumulation ring is constructed.
- The arithmetic unit of the present invention comprises a 3-input pipeline floating-point multiply-accumulate calculator 30 and a local memory 35. The calculator 30 has a first input register 31, which is a set of an index register 312 and a data register 311; a second input register 32, which is a set of an index register 322 and a data register 321; a third input register 33, which is a data register; and an output register 34.
- The first input register 31 holds 64-bit data: the element number of the sparse matrix A is stored in the upper 32-bit index register 312, and the element value is stored in the lower 32-bit data register 311.
- Likewise, in the second input register 32, the upper 32-bit index register 322 stores the element number of the transposed matrix BT of the sparse matrix B, and the lower 32-bit data register 321 stores the element value.
- The index value A and the index value B read from the local memories for matrix A and matrix B (not shown) are stored in the index registers (312, 322) of the first input register 31 and the second input register 32 of the sum-of-products calculator 30, and data A and data B are stored in the data registers (311, 321). The index value A and the index value B are then compared. If the values match, the product of data A and data B is added to the data value of the third input register 33 (the result of the previous process), and the sum is stored in the output register 34 of the sum-of-products calculator 30 and returned to the third input register 33. If the values do not match, data A and data B are skipped without being multiplied, and the data value of the third input register 33 (the result of the previous process) is output directly to the output register 34.
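One step of the index-gated multiply-accumulate described above can be modeled as a small pure function. This is a behavioral sketch only: the name `mac_step` and the sample operands are hypothetical, and pipeline timing is ignored.

```python
def mac_step(index_a, data_a, index_b, data_b, third_input):
    """Return the output register value for one compare-and-accumulate step."""
    if index_a == index_b:
        # Indices match: multiply and add to the previous result.
        return data_a * data_b + third_input
    # Indices differ: skip the multiplication and pass the previous result through.
    return third_input

# Feed the output back as the third input, as in the accumulation ring.
acc = 0.0
operands = [((0, 2.0), (0, 1.0)), ((1, 3.0), (2, 5.0)), ((3, 4.0), (3, 0.5))]
for (index_a, data_a), (index_b, data_b) in operands:
    acc = mac_step(index_a, data_a, index_b, data_b, acc)
print(acc)
```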
- the address calculator of the present invention comprises a first address calculator and a second address calculator including input base address registers (18,28) and input offset address registers (19,29).
- the first address calculation mechanism loads address information into the input base address register A18 and the input offset address register A19 at the first timing of the clock signal.
- The index value A read from the local memory 15 is compared with the index value B, and if A ≤ B, the value necessary for referring to the next element (+8 bytes) is added to the input base address register A18.
- the address addition result is propagated to the data holding register A14, and at the next timing of the clock signal, the contents of the data holding register A14 are returned to the input base address register A18.
- the second address calculation mechanism loads address information into the input base address register B28 and the input offset address register B29 at the first timing of the clock signal.
- The index value A and the index value B read from the local memory 25 are compared, and if A ≥ B, the value necessary for referring to the next element (+8 bytes) is added to the input base address register B28.
- the address addition result is propagated to the data holding register B24, and at the next timing of the clock signal, the contents of the data holding register B24 are returned to the input base address register B28.
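The address update performed by the two address calculation mechanisms can be sketched at byte granularity. The comparison directions used here (A advances when A ≤ B, B advances when A ≥ B) are a reading of the description, and the function names are hypothetical; clock timing and the offset and mask steps are left out.

```python
ELEMENT_BYTES = 8  # one 64-bit (index, data) element, per the +8 bytes in the text

def next_addresses(addr_a, addr_b, index_a, index_b):
    """Return the updated base addresses for A and B after one comparison."""
    step_a = ELEMENT_BYTES if index_a <= index_b else 0
    step_b = ELEMENT_BYTES if index_a >= index_b else 0
    return addr_a + step_a, addr_b + step_b

def walk(indices_a, indices_b):
    """List the (addr_a, addr_b) pairs visited while both rows have elements left."""
    addr_a = addr_b = 0
    visited = []
    while (addr_a < len(indices_a) * ELEMENT_BYTES
           and addr_b < len(indices_b) * ELEMENT_BYTES):
        visited.append((addr_a, addr_b))
        index_a = indices_a[addr_a // ELEMENT_BYTES]
        index_b = indices_b[addr_b // ELEMENT_BYTES]
        addr_a, addr_b = next_addresses(addr_a, addr_b, index_a, index_b)
    return visited

print(walk([0, 1, 5], [1, 5, 6]))
```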
- the first and second address calculation mechanisms each comprise an address mask mechanism (101, 201) that divides the local memory into multiple spaces.
- a value is set in the input offset address register (19, 29) at the first timing of the clock signal.
- the values of the input offset address registers (19, 29) are propagated to the data holding registers (not shown).
- Each address addition result and the value of the input offset address register are added at the third timing of the clock signal.
- the addresses are separated into different spaces by the address mask mechanism and stored in address latches for local memory reference.
- the set of index value and data read from the local memory is stored in the memory output registers (16, 26).
- it is read out from the memory output registers (16, 26) and used as an index value.
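The text does not specify how the address mask mechanism partitions the local memory, so the following is only one plausible scheme, stated as an assumption: the memory is split into equal power-of-two spaces, the sum of base and offset is masked to stay inside one space, and a space number selects which space is used. All constants and names are hypothetical.

```python
LOCAL_MEMORY_BYTES = 1024                       # hypothetical local memory size
NUM_SPACES = 4                                  # hypothetical number of spaces
SPACE_BYTES = LOCAL_MEMORY_BYTES // NUM_SPACES  # 256 bytes per space
OFFSET_MASK = SPACE_BYTES - 1                   # keeps the bits addressing inside a space

def masked_address(base, offset, space):
    """Add base and offset, then confine the result to the selected space."""
    within_space = (base + offset) & OFFSET_MASK
    return space * SPACE_BYTES + within_space

print(masked_address(8, 16, 0), masked_address(8, 16, 2), masked_address(248, 8, 0))
```

With this scheme an address that runs past the end of its space wraps around inside the same space instead of spilling into a neighbouring one.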
- With reference to FIGS. 8 to 12, an embodiment of an arithmetic unit for sparse matrix calculation by a CGRA that operates in units of 4 clocks will be described in detail.
- Numerals are attached to the figures, and they correspond to the register numbers in the following description (the n-th register corresponds to the circled number n in the figures).
- a plurality of basic units consisting of sets of operation units, multiple address generators (address calculation mechanisms), and local memories are connected.
- The first stage of each address generator takes as inputs the value of the first register (input base address registers 18 and 28 in FIG. 7), which holds the base address; the value of the ninth register (data holding registers 14 and 24 in FIG. 7), which holds the address calculation result that the generator itself produced in its final stage; and the values of the third and fourth registers (memory output registers 16 and 26), which hold the data read from the local memory.
- The first stage of each address generator uses the value of the first register (input base address registers 18 and 28) as the first operand of addition and the constant 0 as the second operand, and stores the addition result in the second register.
- Alternatively, the first stage of each address generator uses the value of its ninth register (data holding registers 14 and 24) as the first operand and a constant determined by the comparison of the high-order bits of the third and fourth registers (memory output registers 16 and 26) as the second operand, and stores the addition result in its second register.
- The second stage of each address generator adds the value of the second register as the first operand and the value of the fifth register (input offset address registers 19 and 29), which holds the offset address, as the second operand. It stores the addition result in the sixth register and, at the same time, stores the contents of the second register unchanged in the seventh register.
- The third stage of each address generator performs a mask operation for local memory reference, using the value of the sixth register as the first operand and the mask value as the second operand, and stores the mask result in the eighth register of each address generator.
- The contents of the seventh register are stored unchanged in the ninth register (data holding registers 14 and 24) of each address generator.
- the lower 32-bit data A and B of the tenth and eleventh registers are input to the arithmetic unit 30 as arithmetic operands.
- In FIGS. 13 to 15, referred to in the following description, numerals are attached to the figures, and they correspond to the register numbers in the following description (the n-th register corresponds to the circled number n in the figures).
- Merge sort is an operation that sorts and merges the entire data in ascending or descending order. As shown in FIG. 13, the calculation proceeds in the same manner as the sparse matrix calculation of the first embodiment: the upper 32-bit data of each of the 10th and 11th registers (the first and second input registers 31 and 32) is read, and merge sort is performed by updating one of the two different addresses A and B in the one-dimensional array according to the magnitude relationship of the read results.
- Whereas the arithmetic unit for sparse matrix calculation described in the first embodiment compares the upper 32 bits as index values, the arithmetic unit for merge sort in this embodiment compares the upper 32 bits as the data to be sorted. In merge sort, the lower 32 bits of each of the tenth and eleventh registers are simply carried along as information accompanying that data.
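The merge of two sorted runs by key comparison can be sketched as follows, with each record modeled as a (key, payload) tuple standing in for the upper and lower 32 bits. The function name and sample records are hypothetical, and ascending order is assumed.

```python
def merge_runs(run_a, run_b):
    """Merge two runs already sorted by key, advancing one cursor per comparison."""
    merged = []
    i = j = 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i][0] <= run_b[j][0]:   # compare keys only (upper 32 bits)
            merged.append(run_a[i])      # payload (lower 32 bits) travels with the key
            i += 1
        else:
            merged.append(run_b[j])
            j += 1
    merged.extend(run_a[i:])             # one run is exhausted; copy the rest
    merged.extend(run_b[j:])
    return merged

print(merge_runs([(1, "a"), (4, "d")], [(2, "b"), (3, "c")]))
```

Unlike the sparse matrix case, only one of the two cursors advances per comparison, which matches the description of updating one of the two addresses.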
- Each address generator takes as inputs the value of the first register (input base address registers 18 and 28 in FIG. 7), which holds the base address; the value of the ninth register (data holding registers 14 and 24 in FIG. 7), which holds the address calculation result that the generator itself produced in its final stage; and the values of the third and fourth registers (memory output registers 16 and 26), which hold the data read from the local memory.
- The first stage of each address generator uses the value of the first register (input base address registers 18 and 28) as the first operand of addition and the constant 0 as the second operand, and stores the addition result in the second register.
- Alternatively, the first stage of each address generator uses the value of its ninth register (data holding registers 14 and 24) as the first operand and a constant determined by the comparison of the high-order bits of the third and fourth registers (memory output registers 16 and 26) as the second operand, and stores the addition result in its second register.
- The second stage of each address generator adds the value of the second register as the first operand and the value of the fifth register (input offset address registers 19 and 29), which holds the offset address, as the second operand. It stores the addition result in the sixth register and, at the same time, stores the contents of the second register unchanged in the seventh register.
- The third stage of each address generator performs a mask operation for local memory reference, using the value of the sixth register as the first operand and the mask value as the second operand, and outputs the mask result for local memory reference.
- The contents of the seventh register are stored in the ninth register (data holding registers 14 and 24) of each address generator.
- Addresses A and B and the two read data are sent to the subsequent arithmetic unit 30 according to the magnitude relationship of the upper 32-bit data read from the tenth and eleventh registers (first and second input registers 31 and 32), and one of the data is stored according to the magnitude relationship between the address and the data. In this way, the result of one stage of sorting, out of the log N stages of the entire sort, is stored in the local memory.
- The storage destination address increases monotonically.
- The succeeding arithmetic unit 30 reads the previous execution result from the local memory of the preceding stage and takes charge of the next of the log N stages, so that, as a whole, pipelined execution using the local memories as double buffers is possible.
- Thus, when this operation unit, which performs the sparse matrix multiplication shown in Embodiment 1, sends addresses A and B and the two read data to the subsequent operation unit and stores data according to the magnitude relationship between the address and the data, it can also be used for the merge sort operation.
- FIG. 14 shows the actual data flow mapped to 4 logic columns of CGRA operating in units of 4 clocks.
- FIG. 15 shows actual program code written in C to illustrate the commonality of the address calculation part in the arithmetic units of the first and second embodiments.
- The form of the program code is the same for both the sparse matrix multiplication operation (code portion A in FIG. 15) and the merge sort operation (code portion B in FIG. 15). The address calculation part is therefore common to both operations, and in both the address is updated based on the comparison result.
- The arithmetic unit used is composed of a 3-input pipeline floating-point multiply-accumulate calculator and a local memory. The calculator has a first input register and a second input register, each a set of an index register and a data register, and a third input register and an output register, which are data registers.
- First, the index values A and B are stored in the index registers of the first and second input registers of the sum-of-products calculator, and data A and data B are stored in the data registers (step S01).
- the index value A is the column number of the sparse matrix
- the index value B is the column number of the transposed matrix of the sparse matrix
- the zero-valued elements in each sparse matrix are compressed (step S02).
- a non-zero value element in each sparse matrix stores an index value and data as a set in a local memory (step S03).
- The first address calculation mechanism adds the value needed to reference the next element to the input base address register (step S07). If the index value A is greater than or equal to the index value B (A ≥ B), the second address calculation mechanism adds the value needed to reference the next element to the input base address register (step S08). After step S07 or S08, the address addition result is written back to the input base address register (step S09). These steps are repeated until the sparse matrix product is complete.
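The steps above can be sketched in C as the inner product of one compressed row of the sparse matrix with one compressed row of its transpose. This is a minimal software model, not the hardware itself. The type `sparse_elem` and the function `sparse_dot` are hypothetical names, array positions stand in for the input base address registers, and the condition A ≤ B for step S07 is assumed by symmetry with the A ≥ B condition of step S08.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One non-zero element: index value and data stored as a set (step S03). */
typedef struct {
    uint32_t index; /* column number of the non-zero element */
    float    data;  /* value of the non-zero element */
} sparse_elem;

/* Multiply-accumulate over two compressed element lists, each sorted by
 * ascending index. Returns the sum of products of elements whose index
 * values match. */
float sparse_dot(const sparse_elem *a, size_t na,
                 const sparse_elem *b, size_t nb)
{
    assert(na == 0 || a != NULL);
    assert(nb == 0 || b != NULL);

    size_t addr_a = 0, addr_b = 0; /* stand-ins for the base address registers */
    float acc = 0.0f;

    while (addr_a < na && addr_b < nb) {
        uint32_t ia = a[addr_a].index; /* index value A (step S01) */
        uint32_t ib = b[addr_b].index; /* index value B (step S01) */

        if (ia == ib)                  /* matching indices: accumulate */
            acc += a[addr_a].data * b[addr_b].data;

        if (ia <= ib) addr_a++;        /* assumed condition for step S07 */
        if (ia >= ib) addr_b++;        /* step S08: A >= B */
        /* Writing back to addr_a / addr_b corresponds to step S09. */
    }
    return acc;
}
```

When the index values match, both conditions hold and both addresses advance, so each matching pair is multiplied exactly once.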
- The arithmetic unit used is composed of a three-input pipelined floating-point multiply-accumulate unit and a local memory. It has a third input register and an output register, which are data registers.
- Index value A and index value B are stored in the index registers of the first and second input registers of the multiply-accumulate unit, and data A and data B are stored in the data registers (step S11). Next, the index value A and the index value B are compared (step S12).
- step S13 the result of sorting for one stage is stored in the local memory (step S14).
- The store destination address of the local memory is monotonically increased (step S15), and the processing of steps S11 to S15 is repeated for the log N stages of the entire merge sort to complete the merge sort operation.
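The merge sort procedure can be modeled in software as follows. This is a sketch under stated assumptions, not the patented circuit: each element is assumed to be a 64-bit word whose upper 32 bits are the compared value and whose lower 32 bits are carried along as payload, two plain arrays stand in for the local memories used as double buffers, and the names `key_of`, `merge_stage`, and `merge_sort_words` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compared value: upper 32 bits of a 64-bit word (payload layout assumed). */
static uint32_t key_of(uint64_t word) { return (uint32_t)(word >> 32); }

/* One stage: merge adjacent sorted runs of length `run` from src into dst. */
void merge_stage(const uint64_t *src, uint64_t *dst, size_t n, size_t run)
{
    size_t store = 0; /* store address: only ever incremented (step S15) */

    for (size_t base = 0; base < n; base += 2 * run) {
        size_t a     = base;
        size_t a_end = (base + run < n) ? base + run : n;
        size_t b     = a_end;
        size_t b_end = (base + 2 * run < n) ? base + 2 * run : n;

        while (a < a_end || b < b_end) {
            /* Compare the two candidates (step S12); take A on ties. */
            int take_a = (b >= b_end) ||
                         (a < a_end && key_of(src[a]) <= key_of(src[b]));
            dst[store++] = take_a ? src[a++] : src[b++]; /* step S14 */
        }
    }
}

/* Full sort: one merge stage per doubling of the run length, about
 * log2(n) stages, swapping the two buffers after every stage.
 * Returns the buffer that holds the sorted result. */
uint64_t *merge_sort_words(uint64_t *buf0, uint64_t *buf1, size_t n)
{
    assert(buf0 != NULL && buf1 != NULL);

    uint64_t *src = buf0, *dst = buf1;
    for (size_t run = 1; run < n; run *= 2) {
        merge_stage(src, dst, n, run);
        uint64_t *swap = src; /* the output of this stage feeds the next */
        src = dst;
        dst = swap;
    }
    return src;
}
```

In the hardware described above, each stage would be handled by a different arithmetic unit reading from the local memory of the preceding unit; the buffer swap in this sketch only imitates that hand-off sequentially.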
- the present invention is useful for ultra-compact AI accelerators with limited memory capacity.
- Reference signs: 1 first address calculation mechanism; 2 second address calculation mechanism; 3 arithmetic unit; 5 external memory; 31, 32, 33 input register; 34 output register; 15, 25, 35 local memory
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Computational Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Computer Hardware Design (AREA)
- Complex Calculations (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023569388A (JPWO2023120403A1) | 2021-12-23 | 2022-12-16 | |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021209979 | 2021-12-23 | | |
| JP2021-209979 | 2021-12-23 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023120403A1 (ja) | 2023-06-29 |
Family
ID=86902605
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2022/046353 (WO2023120403A1, ja; Ceased) | Cgraによる疎行列計算とマージソートに関する演算ユニット | 2021-12-23 | 2022-12-16 |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JPWO2023120403A1 |
| WO (1) | WO2023120403A1 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117851744A (zh) * | 2024-03-07 | 2024-04-09 | 北京象帝先计算技术有限公司 | 矩阵运算电路、处理器、集成电路系统、电子组件及设备 |
| CN121478225A (zh) * | 2026-01-08 | 2026-02-06 | 蓝芯算力(深圳)科技有限公司 | 用于乘加单元的矩阵分配和计算方法、系统及存储介质 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2008234076A (ja) * | 2007-03-16 | 2008-10-02 | Fujitsu Ltd | 演算処理装置 |
| US20190042538A1 (en) * | 2017-12-13 | 2019-02-07 | Intel Corporation | Accelerator for processing data |
2022
- 2022-12-16 WO PCT/JP2022/046353 patent/WO2023120403A1/ja not_active Ceased
- 2022-12-16 JP JP2023569388A patent/JPWO2023120403A1/ja active Pending
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117851744A (zh) * | 2024-03-07 | 2024-04-09 | 北京象帝先计算技术有限公司 | 矩阵运算电路、处理器、集成电路系统、电子组件及设备 |
| CN117851744B (zh) * | 2024-03-07 | 2025-03-18 | 北京象帝先计算技术有限公司 | 矩阵运算电路、处理器、集成电路系统、电子组件及设备 |
| CN121478225A (zh) * | 2026-01-08 | 2026-02-06 | 蓝芯算力(深圳)科技有限公司 | 用于乘加单元的矩阵分配和计算方法、系统及存储介质 |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2023120403A1 | 2023-06-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12340219B2 (en) | | FPGA specialist processing block for machine learning |
| CN106598545B (zh) | | 沟通共享资源的处理器与方法及非瞬时计算机可使用媒体 |
| US11809798B2 (en) | | Implementing large multipliers in tensor arrays |
| US20230259578A1 (en) | | Configurable pooling processing unit for neural network accelerator |
| CN114970807A (zh) | | Softmax和指数在硬件中的实施 |
| CN114930311B (zh) | | Fpga重复单元之间的级联通信 |
| EP3769208B1 (en) | | Stochastic rounding logic |
| WO2023120403A1 (ja) | | Cgraによる疎行列計算とマージソートに関する演算ユニット |
| US20210326111A1 (en) | | FPGA Processing Block for Machine Learning or Digital Signal Processing Operations |
| EP4231134A1 (en) | | Method and system for calculating dot products |
| WO2023065983A1 (zh) | | 计算装置、神经网络处理设备、芯片及处理数据的方法 |
| CN110858137B (zh) | | 除以整数常数的浮点除法 |
| Lu et al. | | A reconfigurable DNN training accelerator on FPGA |
| GB2614327A (en) | | Configurable pooling process unit for neural network accelerator |
| Elangovan et al. | | Ax-BxP: Approximate blocked computation for precision-reconfigurable deep neural network acceleration |
| GB2614705A (en) | | Neural network accelerator with configurable pooling processing unit |
| Park et al. | | TMA: Tera‐MACs/W neural hardware inference accelerator with a multiplier‐less massive parallel processor |
| WO2025020975A1 (zh) | | 计算装置、方法、设备、芯片及系统 |
| CN111882050A (zh) | | 基于fpga的用于提高bcpnn速度的设计方法 |
| Chidambaram et al. | | Accelerating the inference phase in ternary convolutional neural networks using configurable processors |
| Brooks et al. | | Processing data in bits and pieces |
| US20250258889A1 (en) | | Convolution processing method and electronic apparatus performing the same |
| US20250028943A1 (en) | | Activation accelerator for neural network accelerator |
| Gustafsson et al. | | Basic arithmetic circuits |
| Nie et al. | | SpFlow: Memory-Driven Data Flow Optimization for Sparse Matrix-Matrix Multiplication |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22911107; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023569388; Country of ref document: JP |
| | DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22911107; Country of ref document: EP; Kind code of ref document: A1 |