WO2021217502A1 - A computing architecture - Google Patents
A computing architecture
- Publication number: WO2021217502A1 (application PCT/CN2020/087814)
- Authority: WIPO (PCT)
- Prior art keywords: data, block, calculation, dependent, blocks
Classifications
- G06F12/0207—Addressing or allocation; Relocation with multidimensional access, e.g. row/column, matrix
- G06F12/0804—Cache addressing with main memory updating
- G06F12/0813—Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
- G06F12/0862—Cache addressing with prefetch
- G06F12/0879—Cache access modes: burst mode
- G06F15/781—System on chip: on-chip cache; off-chip memory
- G06F15/7867—Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
- G06F9/3555—Indexed addressing using scaling, e.g. multiplication of index
- G06F9/3838—Dependency mechanisms, e.g. register scoreboarding
- G06F9/3887—Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
- G06F2212/1021—Performance improvement: hit rate improvement
- G06F2212/1024—Performance improvement: latency reduction
- G06F2212/1048—Scalability
- G06F2212/454—Caching of specific data in cache memory: vector or matrix data
Definitions
- the present disclosure belongs to the technical field of processing large-scale data, and particularly relates to a computing architecture.
- computing data cannot be completely stored in the on-chip cache (such as a multi-level Cache), so data transfer between on-chip storage and off-chip storage (such as DDR memory) is required.
- the data volume is 64 MB, which is far larger than what the on-chip storage can hold.
- the characteristics of data access in solving equations and matrix operation problems are: 1) poor data locality, 2) irregular data access patterns, and 3) online random reorganization of data structures.
- the present disclosure provides a computing architecture, including: an off-chip memory, an on-chip cache unit, a transmitting unit, a pre-reorganization network, a post-reorganization network, a main computing array, a data dependency controller, and a global scheduler; among them,
- the off-chip memory is used to store all large-scale data in a block format, wherein the large-scale data is divided into multiple blocks of equal size;
- the on-chip cache unit is used to store part of the data of the block to be calculated and the dependent data required for calculation;
- the transmitting unit is used to read the data of the corresponding block from the on-chip buffer unit according to the order specified by the scheduling algorithm and send it to the pre-reorganization network;
- the main calculation array is used to complete the main calculations on the data of the blocks;
- the pre-reorganization network is used to perform arbitrary data reorganization on the data of the block before calculating the data of the block;
- the post-reorganization network is used to perform arbitrary data reorganization on the data of the block after the data of the block is calculated;
- the data dependence controller is used to process the data dependence relationship between the data of the block
- the global scheduler is used to execute preset scheduling algorithms and to control block-data prefetching, transmission, calculation, data reorganization, and data-dependency processing; the above technical solution changes the data storage mode and calculation strategy of matrix operations to improve the locality of memory access, dynamically completes data reorganization through added multi-functional data paths, reduces the impact of irregular data structures and data rearrangement on computing efficiency, and maximizes the utilization of the on-chip cache and computing units, thereby improving calculation speed.
- the computing architecture can improve data utilization and increase data processing flexibility, thereby reducing Cache Miss and reducing memory bandwidth pressure.
- the beneficial effects brought by this technical solution are embodied in the following three aspects:
- the large-scale matrix is divided into multiple tiles, and the tiles are used as the smallest granularity data for matrix operations.
- the data of each block is continuously stored in the memory, so the utilization of the cache can be effectively improved.
- multiple reuse of blocks can be realized, thereby further improving the utilization of the cache and reducing the performance bottleneck caused by the memory bandwidth.
- multiple blocks are allowed to complete flexible data reorganization and exchange in the data path, so that the data structure can be reorganized according to computing requirements, best matching the computing requirements of the computing array and the format requirements of the storage unit.
- the block data can be arranged for the deployment of the computing array, so as to maximize the efficiency of the computing array.
- any global row and column exchange in the matrix can be completed efficiently, and this operation is completed during data transmission without consuming additional storage space or incurring additional delay, thus effectively improving the efficiency of random row and column exchanges in the matrix.
- any global matrix reorganization can be completed through a limited number of data reorganizations within and between blocks. This greatly improves the scalability and adaptability of the computing system to irregular matrix operations.
- High reuse rate is the key to improving computing performance.
- in iterative matrix algorithms, the locality of data is usually weak, because there is generally a global data dependency between iterations, making localization difficult to achieve. Repeated, iterative use of data then directly makes on-chip/off-chip data transfer the key bottleneck.
- This technical solution analyzes the dependency relationship between each block in different iterations, and realizes the maximum reuse rate that conforms to the dependency relationship by means of block grouping, and ensures that the matrix operation after block division has good data locality.
- FIG. 1 is a schematic structural diagram of a computing architecture provided in an embodiment of the present disclosure
- FIGS. 2(a) to 2(c) show the block division and block grouping of the original matrix and the distribution of each block's data in off-chip storage in an embodiment of the present disclosure;
- FIG. 3 is a diagram of changes produced by multiple blocks after passing through a pre-reorganization network in an embodiment of the present disclosure
- FIG. 4 is a diagram of operand input and result output of the main calculation array in an embodiment of the present disclosure
- Figures 5(a) to 5(d) are diagrams illustrating examples of data dependence in an embodiment of the present disclosure
- FIG. 6 is a dependency relationship diagram between block groups in an embodiment of the present disclosure.
- FIG. 7 is a schematic structural diagram of another computing architecture provided in an embodiment of the present disclosure.
- Fig. 8 is a schematic flow chart of the overall calculation process of a block in an embodiment of the present disclosure.
- Figure 9 is a schematic diagram of producer-consumer block groups divided according to block dependency in an embodiment of the present disclosure.
- FIG. 10 is a schematic diagram of the work flow of a data-dependent controller in an embodiment of the present disclosure.
- Figure 11 is a schematic diagram of the BENES data exchange network structure in an embodiment of the present disclosure.
- FIG. 12 is an example diagram of a work flow of a data reorganization network module in an embodiment of the present disclosure
- FIG. 13 is a schematic diagram of matrix global data reorganization in an embodiment of the present disclosure.
- FIG. 14 is a schematic diagram of the block dependency relationship in the GJE-based matrix inversion calculation in an embodiment of the present disclosure
- Fig. 15 is a complete calculation flow chart of matrix inversion in an embodiment of the present disclosure.
- FIG. 16 is a comparison diagram of the speedup ratio of matrix inversion operation of this architecture compared with other computing platforms in an embodiment of the present disclosure
- FIG. 17 is a comparison diagram of the calculation speedup ratio of solving linear equations of the present architecture compared with other computing platforms in an embodiment of the present disclosure.
- a computing architecture is provided, including: an off-chip memory, an on-chip cache unit, a transmitting unit, a pre-reorganization network, a post-reorganization network, a main computing array, a data dependency controller, and a global scheduler; among them,
- the off-chip memory is used to store all large-scale data in a block format, wherein the large-scale data is divided into multiple blocks of equal size;
- the on-chip cache unit is used to store part of the data of the block to be calculated and the dependent data required for calculation;
- the transmitting unit is used to read the data of the corresponding block from the on-chip buffer unit according to the order specified by the scheduling algorithm and send it to the pre-reorganization network;
- the main calculation array is used to complete the main calculations on the data of the blocks;
- the pre-reorganization network is used to perform arbitrary data reorganization on the data of the block before calculating the data of the block;
- the post-reorganization network is used to perform arbitrary data reorganization on the data of the block after the data of the block is calculated;
- the data dependence controller is used to process the data dependence relationship between the data of the block
- the global scheduler is used to execute preset scheduling algorithms and to control block-data prefetching, transmission, calculation, data reorganization, and data-dependency processing; the above technical solution changes the data storage mode and calculation strategy of matrix operations to improve the locality of memory access, dynamically completes data reorganization through added multi-functional data paths, reduces the impact of irregular data structures and data rearrangement on computing efficiency, and maximizes the utilization of the on-chip cache and computing units, thereby improving calculation speed.
- the off-chip memory is used to store all large-scale data in a block format.
- the off-chip storage device is a large-capacity storage device, such as DDR, which is characterized by slower access speed and larger storage capacity.
- all large-scale matrix data are stored in off-chip storage.
- the large-scale matrix is divided into multiple equal-sized tiles in advance and stored in the off-chip memory.
- Block is the smallest granularity data of matrix operation, and it is also the smallest unit of transmission, operation and control.
- Each block is a partial M*N sub-matrix of the original data, and the element data inside each block is continuously stored in the memory.
- the data of different blocks is usually stored continuously in a block group, that is, a group of blocks composed of multiple blocks are stored in a continuous storage address space.
- if the matrix edge does not align with the block size, the edge is extended (zero-padded) to meet the M*N sub-block division.
- Figures 2(a) to 2(c) show the block division and block grouping of the original matrix, and the distribution of each block's data in off-chip storage. In the examples of Figs. 2(a) to 2(c), each block is a sub-matrix of size 3*2.
- the original matrix is divided according to the size 3*2. If the size of the original matrix is not an integer multiple of M*N, zeros are added at the edge (as shown in Figure 2(b)). It can be seen that the elements within each block are stored contiguously in memory, and different blocks are stored contiguously by block group. In addition, vectors that need to be operated on together with the matrix are also stored in M*N blocks and managed in a unified manner with the matrix blocks, as shown in Figure 2(c).
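- as a concrete illustration of this storage layout, the following NumPy sketch zero-pads a matrix to a multiple of the block size and reshapes it so that each M*N block occupies a contiguous span of memory (the function name and the 7*5 / 3*2 example sizes are illustrative, not taken from the patent):

```python
import numpy as np

def divide_into_blocks(a: np.ndarray, m: int, n: int) -> np.ndarray:
    """Zero-pad a matrix to a multiple of (m, n) and return an array of
    shape (rows//m, cols//n, m, n) whose last two axes are the blocks;
    each block's elements end up contiguous in memory."""
    rows = -(-a.shape[0] // m) * m   # ceil to a multiple of m
    cols = -(-a.shape[1] // n) * n
    padded = np.zeros((rows, cols), dtype=a.dtype)
    padded[:a.shape[0], :a.shape[1]] = a
    blocks = (padded.reshape(rows // m, m, cols // n, n)
                    .swapaxes(1, 2))           # axes: (block_i, block_j, m, n)
    return np.ascontiguousarray(blocks)        # block elements stored contiguously

# A 7x5 matrix divided into 3x2 blocks is padded to 9x6, i.e. a 3x3 grid of blocks.
blocks = divide_into_blocks(np.arange(35.0).reshape(7, 5), 3, 2)
assert blocks.shape == (3, 3, 3, 2)
```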
- although the present disclosure is designed for large-scale matrix operations, it can handle matrices of any size when computing and storage resources are sufficient.
- the block size values M and N should match the scale of the computing array. Given the scale of current mainstream computing architectures and storage devices, reasonable values of M and N lie between 4 and 32, and the dimension of the matrix to be processed can be between 4 and 50000.
- the block refers to a sub-matrix at a specific position in the matrix, and the block is a concept relative to the matrix.
- a matrix is divided into multiple blocks, that is, the range of the sub-matrix area corresponding to each block is determined.
- the data of a block refers to all the elements in the sub-matrix area contained in a block. Therefore, the entity involved in the calculation is the block data instead of the block. After the block data is calculated, the value of this part of the data may be changed. Therefore, in the matrix calculation, the block data is constantly updated.
- the block (as the range of a sub-matrix) is constant.
- the on-chip cache unit is an embedded on-chip storage device that provides faster read and write access speed, but has a lower storage capacity.
- the on-chip cache is used to store part of the blocks to be calculated and the dependent data required for calculation. Among them, some of the blocks to be calculated refer to the complete data of several blocks. If the on-chip cache unit is large enough, all the blocks of the original matrix can be stored. If the on-chip cache unit is not large enough, the blocks stored therein are only part of the multiple blocks divided by the matrix to be calculated.
- the block is read from the off-chip storage unit to the on-chip cache unit and the calculation is completed, and then written back to the off-chip storage unit.
- the data that the calculation depends on refers to the information and values other than the block element itself that the block in the on-chip storage unit needs when performing the calculation. There is a detailed explanation about the dependent data later.
- the transmitting unit is used to read the data of the corresponding block from the on-chip cache unit and send it to the pre-reorganization network according to the order specified by the global scheduler module.
- the transmitting unit can read multiple blocks of data from the on-chip cache unit at a time, usually 2-4.
- the transmitting unit is also used to add a corresponding tag bit to each block when it is transmitted. These tag bits follow the block data packet to flow through all subsequent processing procedures. With the help of the tag bit, the transmitting unit can accurately control the behavior of the transmitted block in the entire calculation process. There is a detailed explanation about the tag bit in the following text.
- the pre-reorganization network is a non-blocking data exchange network with a data width of k*N*N. This network is used to process the k blocks sent by the transmitting unit, and is responsible for the processing of the blocks before these blocks enter the main computing array.
- in the pre-reorganization network, the data undergoes data reorganization. Data reorganization can occur within a single block or between multiple blocks, and its form can be arbitrary row exchange, column exchange, data rearrangement in any order, data multicast, etc.
- Figure 3 illustrates several types of changes that occur after multiple blocks pass through the pre-reorganization network. As shown in Figure 3, the network input is a collection of elements of one or more blocks, expanded into a one-dimensional vector and sent to the pre-reorganization network.
- the output of the pre-reorganization network is also a one-dimensional vector of the same length as the input, and this vector holds the elements of each output block.
- Data reorganization can be completed between the various elements within the block, and the elements of multiple blocks can be exchanged and rearranged.
- the operations that the network can perform on the input data are not limited to the examples listed in Figure 3.
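- to see why a flattened one-dimensional vector suffices, here is a small sketch in which a row swap between two blocks is expressed as a permutation of the concatenated, flattened vector (the sizes and the particular swap are illustrative assumptions):

```python
import numpy as np

# Two 3x2 blocks flattened row-major and concatenated: a k*M*N = 12-lane vector.
A = np.arange(6).reshape(3, 2)
B = np.arange(6, 12).reshape(3, 2)
vec = np.concatenate([A.ravel(), B.ravel()])

# Illustrative reorganization: exchange row 0 of A with row 2 of B.
perm = np.arange(12)
perm[[0, 1]] = [10, 11]   # lanes 10-11 hold row 2 of B
perm[[10, 11]] = [0, 1]   # lanes 0-1 hold row 0 of A
out = vec[perm]
A2, B2 = out[:6].reshape(3, 2), out[6:].reshape(3, 2)
assert (A2[0] == B[2]).all() and (B2[2] == A[0]).all()
```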
- the pre-reorganization network can be realized by selecting different data exchange networks according to specific reorganization requirements.
- the BENES network is used as the pre-switching network, and its structure and specific introduction are shown below.
- the main calculation array is used to complete the calculation of the data of the main block and generate the calculation result.
- the main computing array includes a parallel calculation array, which can perform calculations on the input block data in parallel.
- the operand of the calculation array also includes the dependent data required for the calculation.
- the dependent data will be described in detail later.
- after the main computing array performs operations on the input blocks, it uses the calculation results to update the values of the corresponding elements in the blocks, and for some algorithms it also generates other calculation results. Therefore, the final output data of the main computing array includes the updated block data.
- the example of Fig. 4 shows the operand input and result output of the main calculation array. Note that Fig. 4 is only a possible scale and calculation mode of the main calculation array.
- the post-reorganization network is used to perform arbitrary data reorganization on the calculation results generated by the main calculation array after the data calculation of the block, that is, the updated block data; its reorganization function is similar to the pre-reorganization network.
- the data dependence controller is used to process the data dependence relationship between the data of the block.
- the data dependence relationship is generated by the calculations and operations required by the block. In many cases, the calculations required by the block cannot be done solely by the elements of the block itself, but other information and values are needed. These extra elements besides the block itself are the dependent data for the calculation of the block.
- the dependent data can be the values of all the elements of other blocks, the values of some elements, or the intermediate values calculated from other block elements.
- the existence of dependent data means that there is a dependency relationship between different blocks.
- the dependency is divided into direct dependency and indirect dependency. If a certain operation requires all elements of multiple blocks to participate simultaneously, then these blocks are directly dependent on each other, because they must all directly participate in the operation.
- if the dependent data of a certain block is part of the elements of one or several other blocks, or an intermediate value calculated from other blocks' elements, then this dependence is an indirect dependence.
- the block that generates the dependent data is the "producer block";
- the block that uses the dependent data is the "consumer block".
- Figures 5(a) to 5(d) list several examples of data dependence: Figure 5(a) shows the addition of block A and block B, so A and B form a direct dependency. Figure 5(b) shows block A and block B exchanging arbitrary rows, so A and B form a direct dependency. Figure 5(c) shows each row of block A subtracting a row of elements of block B; A and B form an indirect dependency, where B is the "producer block" and A is the "consumer block". Figure 5(d) shows block C multiplied by a row of elements after the addition of blocks A and B; block A and blocks B/C form an indirect dependency, blocks B/C are the "producer blocks", and A is the "consumer block".
- the block groups can be further defined, as well as the dependencies between multiple block groups.
- a block group refers to a collection of multiple blocks. There may be a dependency relationship between multiple blocks in the same group. This kind of dependency data between different blocks in the same group is called “local dependency data”. In addition, some blocks in one block group may form a dependency relationship with some blocks in another block group. This kind of cross-block group dependency data is called “global dependency data”. The block group that generates the “global dependency data” is called the “producer block group”, and the block group that uses the “global dependency data” is called the “consumer block group”. This constitutes a dependency relationship between block groups.
- Figure 6 shows an example.
- blocks A, B, C, and D are divided into block group 1, and E, F, and G are divided into block group 2.
- A is the producer block
- B, C, and D are consumer blocks
- the dependency data between them is the local dependency data of block group 1.
- block E generates local dependency data in block group 2.
- the A block also generates the dependent data required in the block group 2. Since the data crosses the block group, it is the global dependent data. Since the global dependency data is generated by block group 1, a dependency relationship is formed between block group 2 and block group 1.
- block group 1 is the "producer block group”
- block group 2 is the "consumer block group”.
- the global scheduler is the core control module of this architecture. It is used to execute preset scheduling algorithms and to control block-data prefetching, transmission, calculation, data reorganization, and data-dependency processing. Specifically, the global scheduler instructs the transmitting module to read and transmit the blocks in the on-chip cache according to a certain scheduling sequence, and the transmitting module sets different tag bits for different blocks according to the global scheduler's instructions.
- the tag bits of each block indicate the required processing and operations in each subsequent module, such as the pre-reorganization network, main computing array, post-reorganization network, and data dependency controller.
- the global scheduler determines the transmission sequence of the blocks and the operations that the blocks need to complete based on the dependencies between each block and between each block group.
- the scheduling principle is that the producer block precedes the consumer block, and the producer block group precedes the consumer block group.
- a possible scheduling sequence is: A->B->C->D->E->F->G.
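- this order can be derived mechanically from the producer-before-consumer principle. The following sketch applies Kahn's topological sort to a hypothetical dependency table reconstructed from the Fig. 6 description (the table itself is an illustrative assumption):

```python
from collections import deque

# Hypothetical dependency table for the Fig. 6 example: each block lists the
# producer blocks whose dependent data it consumes.
deps = {
    "A": [], "B": ["A"], "C": ["A"], "D": ["A"],   # block group 1 (A: producer)
    "E": ["A"], "F": ["A", "E"], "G": ["A", "E"],  # block group 2 (E: producer)
}

def schedule(deps):
    """Producer-before-consumer transmission order via Kahn's algorithm."""
    indeg = {b: len(p) for b, p in deps.items()}
    consumers = {b: [] for b in deps}
    for b, producers in deps.items():
        for p in producers:
            consumers[p].append(b)
    ready = deque(b for b, d in indeg.items() if d == 0)
    order = []
    while ready:
        b = ready.popleft()
        order.append(b)
        for c in consumers[b]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

assert schedule(deps) == ["A", "B", "C", "D", "E", "F", "G"]
```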
- the global scheduler can be implemented in many forms, including state machines, dynamic lookup tables, MCU processors, and so on.
- the global scheduler is also responsible for notifying the prefetch module in advance to carry out the block transfer between the off-chip storage unit and the on-chip storage unit according to the processing sequence of the blocks.
- the global scheduler is responsible for block prefetching, calculation, data reorganization, and dependency processing according to a preset scheduling algorithm.
- the global scheduler reads data blocks into the on-chip cache by prefetching, and performs calculations in units of blocks.
- the transmitting module is responsible for reading the corresponding data block from the on-chip cache and sending it to the subsequent processing flow according to the order specified by the global scheduler.
- the module reads and sends k blocks at a time (k>1); the k blocks can pass through all arithmetic processing stages in parallel.
- a block switching network is used to reorganize the data structure.
- the pre-reorganization network and the post-reorganization network are both non-blocking BENES data exchange networks with a data width of k*N*N. These two networks can perform arbitrary data reorganization on k blocks before and after calculation.
- the main calculation array is a set of parallel fixed-point/floating-point arithmetic units, and the operation types are common fixed-point/floating-point operations.
- the main computing array is pipelined; k*N*N elements can be input per cycle, and addition (add), multiplication (multiply), or multiply-accumulate (mac) operations can be completed.
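- as a software stand-in for one pipeline beat of this array, the sketch below evaluates a mac across all lanes at once; the lane count 2*8*8 = 128 is an assumption chosen to match the 128-lane FP MAC embodiment described later:

```python
import numpy as np

# One pipeline beat: every lane performs res = a*b + c on 32-bit floats.
k, N = 2, 8                        # assumed: k=2 blocks of 8x8 -> 128 lanes
a, b, c = (np.random.rand(k * N * N).astype(np.float32) for _ in range(3))
res = a * b + c                    # 'mac'; 'add' and 'multiply' are special cases
```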
- the data dependency module is responsible for handling the data dependency relationships that may exist between different blocks.
- the data dependency module manages dependent data, and it can call an auxiliary calculation array to perform calculations on the dependent data.
- the auxiliary calculation array is a set of parallel fixed-point/floating-point arithmetic units, and its array size and operation type depend on the specific matrix algorithm.
- the utilization rate of the on-chip cache is very high.
- the dependency-based block grouping and scheduling algorithm used in this embodiment, together with the management module for dependent data, can minimize the coupling between blocks, increase the reuse rate of the blocks, and reduce the access pressure on off-chip storage devices, greatly alleviating the performance bottleneck caused by memory access delays and thereby providing high-performance, low-latency matrix calculation.
- the disclosed computing architecture further includes:
- the prefetch unit is used to complete the transfer of block data between off-chip storage and on-chip cache;
- the write-back cache unit is used to write the data of the block back to the on-chip cache unit after the data of the block is calculated;
- the auxiliary calculation array is used to assist the data dependent controller in the extraction, preprocessing and calculation of dependent data.
- the prefetch unit is used to complete the transfer of block data between the off-chip storage and the on-chip cache according to the order specified by the global scheduler module.
- This module performs simple data transfer between two storage devices.
- the address and length of the transferred data are specified by the global scheduler module.
- the existing data handling technology can be used to realize the function of this module.
- the data of the block is continuously stored in the memory.
- the data of each block is continuously stored in the memory, so the utilization rate of the cache can be effectively improved.
- the elements of each part of each block are always stored in a continuous address.
- the data of different blocks is usually stored contiguously by block group, that is, a block group composed of multiple blocks is stored in a continuous storage address space. There can be multiple block groups.
- the transmitting unit is also used to add corresponding tag bits to each block when transmitting.
- these tag bits follow the block data packet to flow through all subsequent processing procedures.
- the transmitting unit can accurately control the behavior of the transmitted block in the entire calculation process.
- the processing flow of a block is shown in Figure 8. It can be seen from Fig. 8 that the block carries different types of tag bits when it is transmitted. These tag bits indicate the processing mode of the block in different modules, and are discarded after the corresponding operation is completed.
- the tag bits indicate the calculation tasks that the block needs to perform, data dependency information, and block data reorganization information.
- Table 2 is only a case of tag bit setting, and the specific tag bit content and setting method need to be determined according to the actual calculation task.
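- to make the role of the tag bits concrete, here is a hypothetical sketch of one possible tag record; all field names are invented for illustration, since the actual fields follow Table 2 and depend on the calculation task:

```python
from dataclasses import dataclass

@dataclass
class BlockTag:
    """Hypothetical tag record; real fields follow Table 2 and the task."""
    compute_op: str          # operation for the main array ('mac', 'add', ...)
    pre_cw_id: int           # control-word ID for the pre-reorganization network
    post_cw_id: int          # control-word ID for the post-reorganization network
    needs_dep_data: bool     # calculation reads previously stored dependent data
    produces_dep_data: bool  # block yields dependent data to extract and save
```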
- the data dependence relationship includes direct dependence and indirect dependence;
- the direct dependence means that multiple blocks are required to directly participate in the calculation, and the obtained calculation result is directly used to update the block, or as intermediate dependent data;
- the indirect dependence means that the calculation of a certain block needs to be completed with the help of data of other blocks.
- the block scheduling algorithm aims to analyze the dependency relationship between different blocks and optimize the reuse efficiency of the blocks. Specifically, the scheduling sequence and scheduling strategy of each block depends on the dependencies between the blocks.
- Indirect dependence means that the calculation of a certain block needs to be completed by the data information of other blocks.
- the block used is called the leading block, and the data information used is called the dependent data.
- the dependent data is used as the intermediate data of the calculation, which can be stored in the on-chip cache and read during the calculation of the relevant block.
- Direct dependence refers to the need for multiple blocks to directly participate in the calculation, and the obtained calculation result is directly used to update the block, or as intermediate dependent data.
- the various blocks involved constitute a direct dependency on each other. For example, for data exchange between multiple blocks, these blocks will form a direct dependency. For another example, when searching for the maximum value of a certain column of the matrix, the block to which this column belongs will form a direct dependency.
- the producer block group will generate two types of dependent data during the calculation process: one type is "local dependent data", which is only used for the calculation of the blocks in the group and is not shared with other block groups.
- the other type is "global dependency data”. This type of data is not only used for the calculation of blocks in this group, but also needs to be provided to the corresponding "consumer block group" for use.
- note that "global dependent data" at a lower level of grouping may serve as "local dependent data" at a higher level.
- the producer block and the consumer block can be effectively decoupled.
- in the iterative process of matrix calculation, it is no longer necessary to repeatedly load the producer blocks and consumer blocks multiple times, which can greatly improve the on-chip reuse rate of cached blocks.
- the producer block can continuously complete multiple calculation iterations on the chip and store the corresponding global cache data. Consumer blocks that are subsequently loaded can also complete multiple iterations continuously on the chip.
- the block scheduling algorithm is based on the following principles (a sketch follows): (1) Starting from the bottom-level "producer-consumer" dependency relationships, the blocks in the producer block group are selected and launched first. (2) All blocks with direct dependencies are transmitted consecutively. (3) Blocks already in the on-chip cache are repeatedly transmitted and calculated until their dependence conditions are no longer satisfied. (4) The block groups required subsequently are predicted and prefetched into the on-chip cache in advance.
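- a minimal, self-contained sketch of how these four principles translate into a launch order; `groups`, `reuse`, and the two-group example are hypothetical stand-ins, and prefetching is assumed to overlap with computation rather than being modeled:

```python
def transmission_order(groups, reuse):
    """Flatten block groups into a launch order obeying the four principles:
    groups listed producer-first (1); each direct-dependency set launched as
    one unit (2); a resident group re-launched reuse[g] times before moving
    on (3); prefetch of the next group (4) is assumed to overlap with the
    computation and is not modeled here."""
    order = []
    for g, dep_sets in groups.items():
        for _ in range(reuse[g]):
            for dep_set in dep_sets:
                order.append((g, dep_set))
    return order

# Hypothetical two-group example: the producer group is reused twice on chip.
launches = transmission_order(
    {"producer": [("A",), ("B", "C")], "consumer": [("D",), ("E",)]},
    reuse={"producer": 2, "consumer": 1},
)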
- the global scheduler is set as a state machine, used to control block prefetching, transmission, and calculation at each moment, and determines the data-dependent operations that need to be performed. These behaviors are completed through the control interface between the global scheduler and the prefetch module, the transmission module, and the data-dependent controller module.
- the data dependency controller is also used to: 1) determine whether the current block contains dependent data that subsequent blocks depend on; if so, extract, calculate, and save the dependent data, where the calculation of dependent data relies on the auxiliary calculation array; 2) determine whether the current block's operation depends on previously stored block data; if so, read the relevant dependent data and provide it to the main calculation array to perform calculations on the current block.
- the specific functions of the data-dependent controller are as follows: (1) Manage the storage, reading and clearing of all global dependent data and local dependent data. (2) For each block currently transmitted, if its calculation requires dependent data, the data dependent controller reads the corresponding dependent data from the on-chip cache and sends it to the main computing array. (3) For each block currently transmitted, if the block needs to generate dependent data, the data dependence controller is responsible for caching the corresponding block data and extracting the required dependent data. The extraction of dependent data can be done with the aid of an auxiliary calculation array.
- the workflow of the data dependent controller is shown in Figure 10.
- after receiving the tag bits carried by a transmitted block, the data dependency controller first judges: (1) whether the block corresponding to the tag needs dependent data to complete its calculation; (2) whether the block will produce dependent data that needs to be saved. Note that the two cases may exist at the same time. Therefore, the data dependency controller implements two sets of parallel logic that handle data reading and data storage respectively. For the former, the controller calculates the read address of the dependent data, reads it from the on-chip cache, and sends it to the main computing array for calculation. For the latter, the controller further determines whether the dependent data can be obtained directly from the current block data, for example the value of a certain row/column or a certain element in the block.
- otherwise, the controller calls the auxiliary calculation array to complete the corresponding calculations and saves the results to the on-chip cache.
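- the two decision paths of Fig. 10 can be sketched as below (parallel in hardware, sequential here for clarity); `dep_cache` is a dict standing in for the dependent-data partition of the on-chip cache, `aux_array` is a callable standing in for the auxiliary calculation array, and the `dep_*` fields are invented names extending the hypothetical BlockTag above:

```python
def handle_block(tag, block, dep_cache, aux_array, main_operands):
    """Sketch of the data dependency controller's two paths (Fig. 10)."""
    # Path 1: the block's own calculation needs stored dependent data.
    if tag.needs_dep_data:
        main_operands.append(dep_cache[tag.dep_read_key])
    # Path 2: the block carries dependent data that later blocks will need.
    if tag.produces_dep_data:
        if tag.dep_is_direct_slice:          # e.g. a row/column of the block
            dep_cache[tag.dep_write_key] = block[tag.dep_slice]
        else:                                # derived values need computation
            dep_cache[tag.dep_write_key] = aux_array(block)
```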
- the dependent data includes local dependent data and global dependent data;
- the local dependent data refers to intermediate data generated by a certain block group and used only in the calculation of the block group;
- the global dependent data refers to intermediate data that is generated by a certain block group and needs to be used in the calculation of this block group and other block groups.
- this type of data does not need to be shared with other block groups. Therefore, the local dependent data is only saved in the calculation phase of the corresponding block group, and is discarded after the calculation is completed.
- Global dependent data refers to the intermediate data generated by a certain block group and need to be used in the calculation of this block group and other block groups (ie, the corresponding "consumer block group"). This type of data needs to be stored in the on-chip cache for a long time, and the global dependent data can not be discarded until all related dependent blocks have been calculated.
- the data dependent controller cooperates with the global scheduler to manage the above two types of dependent data. Specifically, the global scheduler determines the data dependence relationship between the blocks, and indicates the data dependence operation that the block needs to complete through the tag when the corresponding block is transmitted. After the data-dependent controller receives the flag bit carried by the block, it completes the operation on the dependent data according to the instruction of the flag bit. An example of the flow of this process can be seen in Figure 10.
- the pre-recombination network and the post-recombination network are data exchange networks.
- the network can be a BENES network or other networks with data exchange functions, such as a Batcher-Banyan network.
- the architecture deploys a pre-data-reorganization network and a post-data-reorganization network before and after the main computing array, respectively.
- These two networks are responsible for completing complex data reorganization tasks within each block or between multiple blocks, including row exchange, column exchange, transposition, and other necessary data rearrangements.
- the data reorganization network adopts the BENES network with k*N*N input.
- the schematic diagram of the BENES network is shown in Figure 11.
- the BENES network consists of several levels of switching units, each of which can complete the direct connection or exchange of two input signals.
- by applying control signals to the BENES network, arbitrary data rearrangement from input ports to output ports can be realized. These control signals are called "control words".
- an N-input BENES network can be used as two independent N/2-input BENES networks; for example, an 8-input BENES network can be used as two independent 4-input BENES networks.
- the k*N*N-input BENES network can therefore not only complete arbitrary data reorganization across k blocks, but can also be split to reorganize only one or several blocks independently.
- control words are stored in the on-chip ROM, and can be read by the pre-data reorganization network and the post-data reorganization network.
- the tag bits of the block respectively record the control word ID corresponding to the pre-rearrangement and post-rearrangement operations required by the block.
- the data reorganization of a block can be completed only within a single block, or it can be completed between multiple blocks that are transmitted in parallel (up to k). For complex data reorganization that requires multiple blocks to complete together, the involved blocks need to be cached in the write-back cache module first, and then the post-data reorganization network processes them in a specified order.
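- the control-word mechanism can be abstracted as follows: instead of per-stage switch settings, each ROM entry is collapsed here into the end-to-end permutation it realizes (a simplifying assumption; the sizes and entries are illustrative):

```python
import numpy as np

K, N = 2, 4                 # illustrative sizes; lanes = K*N*N = 32
LANES = K * N * N

# Each ROM entry stands for the overall input-to-output rearrangement that a
# set of BENES switch settings would produce.
CONTROL_ROM = {
    0: np.arange(LANES),                 # identity (pass-through)
    1: np.roll(np.arange(LANES), N),     # illustrative rotation by one row
}

def reorganize(vec, cw_id):
    """Apply the rearrangement selected by a control-word ID from the ROM."""
    return vec[CONTROL_ROM[cw_id]]
```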
- Figure 12 shows an example.
- the blocks that need to exchange data form a direct dependence on each other.
- the blocks (9, 10, 13, 14) need to be exchanged at the same time, so they constitute a direct dependency of the four blocks.
- (1, 2) and (5, 6) need to complete column exchange, (11, 12) and (15, 16) need to complete row exchange, and these blocks all constitute a direct dependency relationship.
- the global scheduler sets the transmission sequence as shown in FIG. 13 according to its dependency relationship.
- the blocks launched at the same time complete row/column exchange in the data reorganization network.
- arbitrary data reorganization includes: row swap, column swap, transpose, and data rearrangement.
- the on-chip cache unit is partitioned into block data, local dependent data, and global dependent data.
- the size of the partition is preset according to resource constraints and algorithm requirements during system design.
- the data-dependent controller manages all read and write operations on locally dependent data and global dependent data.
- the computing architecture can efficiently complete matrix inversion and linear equation system solving algorithms based on Gauss-Jordan elimination (hereinafter referred to as the GJE algorithm).
- the GJE algorithm is a classic algorithm in linear algebra and one of the algorithms often used in scientific computing.
- the GJE algorithm is selected by many parallel computing systems as the basic algorithm for computing linear equations, matrix inversion, and LU decomposition due to its good computing parallelism and relatively simple computing operations.
- the purpose of the GJE algorithm is to transform any square matrix into an identity matrix through a series of iterative elementary row transformations. For a matrix A of size N*N, the GJE algorithm requires a total of N iterations. At the i-th iteration, GJE converts the i-th column of matrix A into the corresponding column of the identity matrix. For the i-th iteration, the process is as follows:
- Pivot row exchange: swap the positions of the pivot row (that is, the k-th row) and the i-th row of matrix A. The pivot row then becomes the i-th row of A.
- A and an identity matrix I of the same size can be combined into an augmented matrix [A|I]; as A is eliminated into the identity matrix, I is transformed into the inverse matrix A^-1.
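- for reference, here is a dense (non-blocked) sketch of the GJE inversion just described, with partial pivoting on the augmented matrix [A|I]; it assumes A is invertible and ignores the blocking, scheduling, and reorganization machinery of the architecture:

```python
import numpy as np

def gje_inverse(a):
    """Gauss-Jordan elimination on [A | I]: A is reduced to I while the
    right half becomes A^-1. Assumes A is square and invertible."""
    n = a.shape[0]
    aug = np.hstack([a.astype(float), np.eye(n)])
    for i in range(n):
        k = i + np.argmax(np.abs(aug[i:, i]))  # pivot selection in column i
        aug[[i, k]] = aug[[k, i]]              # pivot row exchange
        aug[i] /= aug[i, i]                    # normalization of the pivot row
        for r in range(n):                     # elimination of column i
            if r != i:
                aug[r] -= aug[r, i] * aug[i]   # aug[r, i] is the elimination coefficient
    return aug[:, n:]

a = np.array([[4.0, 7.0], [2.0, 6.0]])
assert np.allclose(gje_inverse(a) @ a, np.eye(2))
```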
- the matrix is divided into 8×8 blocks.
- each column of blocks serves as a block group.
- the following types of dependent data are involved in the calculation process: pivot row elements, pivot elements, and pivot column elements.
- the pivot column elements are used to calculate the elimination coefficient of each row of the matrix.
- we divide all the blocks in each block column into a block group.
- the dependency relationship between blocks can be obtained, as shown on the right side in FIG. 14.
- a direct dependency and two kinds of indirect dependency are identified: local data dependency and global data dependency.
- the local data inside each block group depends on the pivot row element from the block group in this column.
- each block group needs to use the elimination coefficient calculated by the pivot element and pivot column to complete the elimination operation. Therefore, the block group where the pivot element is located assumes the role of the "producer" block group, and the elimination coefficient generated in the calculation is saved as global dependent data and used by other block groups.
- the global scheduler will follow the scheduling principle to determine the transmission order of each block. That is, the producer block takes precedence over the consumer block, and blocks with direct dependencies are continuously launched. Therefore, the final transmission sequence of the blocks in Figure 14 is: (11,12)->(3,7)->(12,16)->(4,8)->(9,10,13,14) ->(1,2)->(5,6).
- the producer block group <3, 7, 11, 15> can continuously complete the elimination iterations of columns 9-12, and then other block groups can continuously complete multiple elimination iterations.
- the number of multiplexing for each block is 4.
- This block group can complete 8 consecutive elimination iterations.
- the multiplexing factor of each block rises to 8.
- the optimal block multiplexing times can be set according to factors such as on-chip computing power and off-chip main memory bandwidth, and then the size of the block group can be set.
- the access time to the off-chip main memory can be completely covered within the on-chip calculation time, and theoretically, it can reach close to 100% of the computing array utilization.
- the main calculation process is block transmission-elimination-data reorganization-write-back cache.
- the block transmission module can transmit up to two blocks in each cycle. According to the scheduling strategy, the same block group can be transmitted multiple times, thereby realizing the multiplexing of block calculations.
- the main control process includes data dependence control and global scheduling control.
- Dependent data control mainly focuses on the elimination coefficient corresponding to the pivot row data and pivot column.
- the pivot row data is local dependent data, which is extracted and saved at the beginning of each block group's calculation and discarded after the block group's calculation ends.
- the elimination coefficient is globally dependent data and needs to be stored in the cache for a long time.
- the calculation of the elimination coefficient depends on the value of the pivot element column and the value of the pivot element, and needs to be pre-calculated in the iterative process. That is, during the iteration of the elimination of the kth column, the pivot element and elimination coefficient of the k+1th column are pre-calculated.
- the data dependency controller needs to determine whether the block contains the pivot column corresponding to the next iteration (that is, the (k+1)-th column, shown as the next pivot column in the figure). If it does, the next pivot column is cached and its largest element is found as the pivot element. After that, the data dependency controller calls the auxiliary calculation array to calculate the elimination coefficients corresponding to the next iteration. Finally, the elimination coefficients are stored in the cache as global dependent data. It should be noted that this dependent-data extraction and calculation proceeds in parallel with the main calculation process and does not block it.
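- a minimal sketch of this look-ahead step: given the cached next pivot column, select the pivot and derive each row's elimination coefficient (the divisions are what the auxiliary division units would perform); the function and variable names are illustrative:

```python
import numpy as np

def next_iteration_dependents(next_pivot_col):
    """Select the largest-magnitude element of the cached next pivot column
    as the pivot, then compute each row's elimination coefficient
    coeff[r] = col[r] / pivot for storage as global dependent data."""
    pivot_row = int(np.argmax(np.abs(next_pivot_col)))
    pivot = next_pivot_col[pivot_row]
    coeffs = next_pivot_col / pivot      # stored as global dependent data
    return pivot_row, pivot, coeffs

pivot_row, pivot, coeffs = next_iteration_dependents(np.array([1.0, -4.0, 2.5]))
assert pivot_row == 1 and pivot == -4.0
```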
- the workflow of Figure 15 also describes the workflow of the global scheduler.
- the global scheduler is responsible for generating the transmission order of the blocks and the prefetching order. As described above, in this embodiment, the blocks in each column are divided into a block group.
- the scheduling strategy of the global controller mainly includes the following factors:
- (1) and (2) only depend on the matrix size and system resource constraints, and are set offline.
- (3) and (4) are generated by online dynamic calculation.
- (3) and (4) both depend on the process of local principal element selection, that is, the row exchange situation of the A matrix. Therefore, the global scheduler needs to obtain the row exchange information of the A matrix in time, and determine the column exchange order of the inverse matrix A -1 to be completed subsequently based on this information.
- the global scheduler will integrate row switching and column switching requirements to generate block transmission and prefetch sequences. This process can be seen in the flowchart in Figure 15.
- the performance test in this embodiment is completed by simulation.
- the simulation experiment is based on RTL code, IP simulation model of DDR/SRAM, and IP model of floating point arithmetic unit.
- the system parameters of this embodiment are as follows: operating frequency: 800MHz; block size: 8x8; main computing array size: 128x 32-bit FP MAC Unit; auxiliary computing array size: 8x 32-bit FP Division Unit; on-chip cache size: 776KB; BENES network size: 128x32-bit input.
- the working frequency was obtained by synthesizing the RTL code, together with synthesizable IP simulation models of the DDR/SRAM and IP models of the floating-point arithmetic units, using the Synopsys Design Compiler (DC) tool, and can be regarded as a practical working frequency.
- the test set is a matrix of random floating-point numbers of different sizes.
- the matrix inversion and linear equation system solving operations are respectively performed on the test set matrix, and the operation delay is recorded.
- the control group of the test is the current mainstream and commonly used high-performance large-scale matrix computing libraries: MKL, LAPACK and CUBLAS.
- MKL version 3.8.0
- LAPACK version 3.8.0
- CUBLAS version 10.1
- the parameters of the different platforms in this experiment are shown in Table 3.
- the test set covers matrices ranging in size from 32 to 2048.
- in the linear-equation tests, the size of Y also affects overall performance, so we tested the impact of different Y sizes on performance; the sizes of Y are N*8, N*32, and N*64 respectively.
- Table 4 lists the delay (unit: second) for completing the matrix inversion operation on different platforms on the matrix of each size
- Figure 16 shows the speedup ratio of this computing architecture over the other control groups.
- The ordinate of Figure 16 is the speedup of this computing architecture relative to the other platforms; in other words, it is the ratio of another platform's calculation time to this computing architecture's calculation time.
- The calculation times of MKL, LAPACK, and CUBLAS are, respectively, 47.8 times, 128 times, and 69 times the calculation time of this computing architecture.
- Table 5 lists the delays (unit: seconds) for completing the solution of linear equation systems on each platform for matrices of each size.
- Figure 17 shows the speedup ratio of this computing architecture over the other control groups.
- This embodiment significantly outperforms the other computing platforms at every matrix scale, and it maintains a high speedup ratio even in large-scale matrix calculations.
- MKL is currently the best-performing high-performance scientific computing library, and this computing architecture still stably achieves about a 2× speedup over MKL in large-scale matrix operations.
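- This 2× figure can be read directly off Table 4: for the 2048×2048 matrix, the MKL delay of 195.91 s divided by this architecture's 92.274 s gives a speedup of about 2.12×.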
- The resource consumption of this embodiment is also much lower than that of the other computing platforms. Its on-chip cache is only about 1/30 of that of the Intel CPU, and its DDR bandwidth is likewise far below that of the other platforms. This comparison further shows that the architecture uses on-chip cache resources efficiently, achieving performance far beyond traditional computing methods with far fewer resources.
- Any matrix calculation can be deployed on this computing architecture by analyzing the dependencies between its blocks and designing a corresponding scheduling strategy. It should be noted that different matrix algorithms may require very different dependent-data calculations and block calculations, so the corresponding calculation modules and pipelines need to be customized per algorithm. However, the overall structure, the calculation flow, the scheduling strategy algorithm, and the functions of each module of the architecture remain unchanged.
- The architecture's support for large-scale matrices depends on the amount of on-chip storage resources and the scale of the computing arrays. In an actual deployment, suitable storage resources and computing arrays can be customized according to the actual algorithm and matrix size.
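- To make the previous two points concrete, the following schematic sketch (all names are hypothetical, not from this embodiment) shows the fixed control loop into which per-algorithm block kernels and dependency rules are plugged:

```python
def run_architecture(blocks: dict, schedule: list, block_kernel, dependency_fn, cache: dict):
    """Fixed control flow: only block_kernel and dependency_fn change per algorithm."""
    for blk_id in schedule:                          # order produced by the global scheduler
        deps = cache.get(blk_id)                     # data-dependent controller: fetch deps
        result, new_deps = block_kernel(blocks[blk_id], deps)
        blocks[blk_id] = result                      # write the updated block back
        cache.update(dependency_fn(blk_id, new_deps))  # keep deps for later blocks

# Deploying LU versus QR would mean passing lu_kernel/lu_deps or qr_kernel/qr_deps
# (hypothetical per-algorithm functions) into the same loop; the loop never changes.
```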
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
Description
Table 4 Matrix inversion delay on different platforms (unit: seconds)
Matrix Order | This computing architecture | LAPACK | MKL | CUBLAS |
---|---|---|---|---|
32x 32 | 0.0007 | 0.093 | 0.034 | 0.050 |
64x 64 | 0.0043 | 0.319 | 0.061 | 0.217 |
128x 128 | 0.0286 | 1.244 | 0.144 | 1.018 |
256x 256 | 0.2034 | 8.281 | 0.615 | 4.75 |
512x 512 | 1.4878 | 61.91 | 3.267 | 32.64 |
1024x 1024 | 11.534 | 497.21 | 26.375 | 268.40 |
2048x 2048 | 92.274 | 3920.8 | 195.91 | 2213.90 |
Claims (10)
- A computing architecture, comprising: an off-chip memory, an on-chip cache unit, a transmitting unit, a pre-reorganization network, a post-reorganization network, a main computing array, a data-dependent controller, and a global scheduler; wherein the off-chip memory is configured to store all of the large-scale data in block format, the large-scale data being divided into a plurality of blocks of equal size; the on-chip cache unit is configured to store part of the data of the blocks to be calculated and the dependent data required for calculation; the transmitting unit is configured to read the data of the corresponding blocks from the on-chip cache unit in the order specified by the scheduling algorithm and send it to the pre-reorganization network; the main computing array is configured to complete the main calculation on the data of the blocks; the pre-reorganization network is configured to perform arbitrary data reorganization on the data of a block before calculation; the post-reorganization network is configured to perform arbitrary data reorganization on the data of a block after calculation; the data-dependent controller is configured to process the data dependency relationships between the data of the blocks; and the global scheduler is configured to execute a preset scheduling algorithm to control the prefetching, transmission, calculation, data reorganization, and data dependency processing of the data of the blocks.
- The computing architecture according to claim 1, further comprising: a prefetch unit configured to transfer the data of the blocks between the off-chip memory and the on-chip cache; a write-back cache unit configured to write the data of a block back to the on-chip cache unit after the data of the block has been calculated; and an auxiliary computing array configured to assist the data-dependent controller in the extraction, preprocessing, and calculation of dependent data.
- The computing architecture according to claim 1, wherein the data of the blocks is stored in memory.
- The computing architecture according to claim 1, wherein the transmitting unit is further configured to add a corresponding tag bit to each block when the block is transmitted.
- The computing architecture according to claim 4, wherein the tag bits indicate the calculation task that the block needs to perform, data dependency information, and data reorganization information of the block.
- The computing architecture according to claim 1, wherein the data dependency relationships include direct dependency and indirect dependency; the direct dependency means that the data of multiple blocks directly participates in an operation, and the obtained operation result is directly used to update the data of a block or serves as intermediate dependent data; the indirect dependency means that the calculation of the data of a certain block needs to be completed with the help of the data of other blocks.
- The computing architecture according to claim 1, wherein the data-dependent controller is further configured to: determine whether the current block operation depends on the data of previously stored blocks, and if so, read the relevant dependent data and provide it to the main computing array for the operation on the data of the current block.
- The computing architecture according to claim 2, wherein the data-dependent controller is further configured to: determine whether the current block contains dependent data on which subsequent blocks depend, and if so, extract, calculate, and save that dependent data, wherein the calculation of the dependent data is completed by the auxiliary computing array.
- The computing architecture according to claim 1, wherein the dependent data includes local dependent data and global dependent data; the local dependent data refers to intermediate data that is generated by a block group composed of multiple blocks and is needed only in the operations of this block group; the global dependent data refers to intermediate data that is generated by a block group composed of multiple blocks and is needed in the operations of both this block group and other block groups.
- The computing architecture according to claim 1, wherein the on-chip cache unit is partitioned into regions for the data of the blocks, local dependent data, and global dependent data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/864,014 US11886347B2 (en) | 2020-04-27 | 2022-07-13 | Large-scale data processing computer architecture |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010343215.9 | 2020-04-27 | ||
CN202010343215.9A CN111522776B (zh) | 2020-04-27 | 2020-04-27 | 一种计算架构 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/864,014 Continuation US11886347B2 (en) | 2020-04-27 | 2022-07-13 | Large-scale data processing computer architecture |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021217502A1 (zh) | 2021-11-04 |
Family
ID=71910852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/087814 WO2021217502A1 (zh) | 2020-04-27 | 2020-04-29 | 一种计算架构 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11886347B2 (zh) |
CN (1) | CN111522776B (zh) |
WO (1) | WO2021217502A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023088119A1 (en) * | 2021-11-18 | 2023-05-25 | International Business Machines Corporation | Automatic data domain identification |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118409874B (zh) * | 2024-07-02 | 2024-10-18 | 支付宝(杭州)信息技术有限公司 | Data processing method, device and system based on GPU on-chip memory |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126372A (zh) * | 2016-06-16 | 2016-11-16 | 上海天玑科技股份有限公司 | Heterogeneous disaster-recovery device and method for the Oracle Exadata all-in-one machine |
CN108596331A (zh) * | 2018-04-16 | 2018-09-28 | 浙江大学 | Optimization method for a cellular neural network hardware architecture |
CN108769684A (zh) * | 2018-06-06 | 2018-11-06 | 郑州云海信息技术有限公司 | Image processing method and device based on the WebP image compression algorithm |
CN109447241A (zh) * | 2018-09-29 | 2019-03-08 | 西安交通大学 | Dynamically reconfigurable convolutional neural network accelerator architecture for the Internet of Things |
US20190392297A1 (en) * | 2016-12-30 | 2019-12-26 | Intel Corporation | Deep learning hardware |
CN110727911A (zh) * | 2018-07-17 | 2020-01-24 | 展讯通信(上海)有限公司 | Matrix operation method and device, storage medium, and terminal |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7451297B2 (en) * | 2005-06-01 | 2008-11-11 | Microsoft Corporation | Computing system and method that determines current configuration dependent on operand input from another configuration |
CN100557581C (zh) * | 2008-05-15 | 2009-11-04 | 中国人民解放军国防科学技术大学 | Data-flow-oriented cache management method |
US10795815B2 (en) * | 2016-05-27 | 2020-10-06 | Arm Limited | Method and apparatus for maintaining data coherence in a non-uniform compute device |
US10146738B2 (en) * | 2016-12-31 | 2018-12-04 | Intel Corporation | Hardware accelerator architecture for processing very-sparse and hyper-sparse matrix data |
CN107729990B (zh) * | 2017-07-20 | 2021-06-08 | 上海寒武纪信息科技有限公司 | Device and method for performing forward operations supporting discrete data representation |
CN108958801B (zh) * | 2017-10-30 | 2021-06-25 | 上海寒武纪信息科技有限公司 | Neural network processor and method for executing a vector maximum instruction using the processor |
US11636327B2 (en) * | 2017-12-29 | 2023-04-25 | Intel Corporation | Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism |
2020
- 2020-04-27 CN CN202010343215.9A patent/CN111522776B/zh active Active
- 2020-04-29 WO PCT/CN2020/087814 patent/WO2021217502A1/zh active Application Filing
2022
- 2022-07-13 US US17/864,014 patent/US11886347B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111522776A (zh) | 2020-08-11 |
CN111522776B (zh) | 2022-04-05 |
US11886347B2 (en) | 2024-01-30 |
US20220350745A1 (en) | 2022-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liang et al. | Evaluating fast algorithms for convolutional neural networks on FPGAs | |
CN111291859B (zh) | General matrix-matrix multiplication dataflow accelerator semiconductor circuit | |
WO2019128404A1 (zh) | Matrix multiplier | |
Kim et al. | FPGA-based CNN inference accelerator synthesized from multi-threaded C software | |
CN106846235B (zh) | Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instructions | |
US11886347B2 (en) | Large-scale data processing computer architecture | |
CN114970294B (zh) | PCG parallel optimization method and system for three-dimensional strain simulation based on the Sunway architecture | |
Yamazaki et al. | One-sided dense matrix factorizations on a multicore with multiple GPU accelerators | |
CN111859277B (zh) | Vectorized implementation method for sparse matrix-vector multiplication | |
CN110086602A (zh) | Fast GPU-based implementation method for the SM3 cryptographic hash algorithm | |
CN116710912A (zh) | Matrix multiplier and control method for a matrix multiplier | |
CN117539546A (zh) | Sparse matrix-vector multiplication acceleration method and device based on non-empty column storage | |
CN117992396B (zh) | Streaming tensor processor | |
Shahbahrami et al. | FPGA implementation of parallel histogram computation | |
CN115965067B (zh) | Neural network accelerator for ReRAM | |
CN109948787B (zh) | Operation device, chip and method for neural network convolutional layers | |
CN111475205A (zh) | Coarse-grained reconfigurable array structure design method based on dataflow decoupling | |
CN116431562A (zh) | Multi-head attention fused-computation allocation method based on an acceleration processor | |
CN116227615A (zh) | Quantum search simulation method and system for supercomputing | |
CN112052941B (zh) | Efficient storage-and-computation system applied to CNN convolutional layers and its operation method | |
CN115170381A (zh) | Visual SLAM acceleration system and method based on deep learning | |
CN111340224B (zh) | Accelerated design method for CNN networks on low-resource embedded chips | |
CN113177877B (zh) | Schur elimination accelerator for SLAM back-end optimization | |
CN118171710B (zh) | NPU acceleration method for sparse matrix multiplication | |
US11714649B2 (en) | RISC-V-based 3D interconnected multi-core processor architecture and working method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20933399 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20933399 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.05.2023) |
|