WO2023197438A1 - Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator - Google Patents

Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator

Info

Publication number
WO2023197438A1
WO2023197438A1 (PCT/CN2022/099082)
Authority
WO
WIPO (PCT)
Prior art keywords
interconnection
space
storage
hardware accelerator
multicast
Prior art date
Application number
PCT/CN2022/099082
Other languages
English (en)
French (fr)
Inventor
梁云
贾连成
Original Assignee
北京大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学
Publication of WO2023197438A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • G06F30/3308Design verification, e.g. functional simulation or model checking using simulation
    • G06F30/331Design verification, e.g. functional simulation or model checking using simulation with hardware acceleration, e.g. by using field programmable gate array [FPGA] or emulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/34Circuit design for reconfigurable circuits, e.g. field programmable gate arrays [FPGA] or programmable logic devices [PLD]
    • G06F30/343Logical level

Definitions

  • the present invention relates to hardware accelerator design technology, and in particular, to an automatic design method for a storage-computing module interconnection circuit of a hardware accelerator for tensor applications.
  • Tensor algebra is a common application in computer programs and is suitable for a wide range of fields such as machine learning and data analysis.
  • users need to rely on dedicated hardware accelerators on different platforms, such as application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), coarse-grained reconfigurable arrays (CGRA), and other embedded devices; such accelerators are collectively referred to as tensor application accelerators.
  • ASIC: Application-Specific Integrated Circuit
  • FPGA: Field-Programmable Gate Array
  • CGRA: Coarse-Grained Reconfigurable Array
  • Most tensor application accelerators are mainly composed of an array of homogeneous computing units (Processing Element, PE), an on-chip interconnection network (network-on-chip), and an on-chip storage system.
  • Tensor application accelerators can provide enormous parallelism because a large number of PEs can work simultaneously at high frequency. At the same time, PEs can communicate with each other at low cost, enabling efficient data reuse.
  • researchers have proposed a large number of tensor application accelerator designs that can adopt various hardware structures, including systolic arrays, multicast networks, and tree-based interconnect structures.
  • the on-chip storage system is an essential component of the tensor application accelerator. An efficient on-chip memory design can save energy, reduce space and bandwidth requirements, and provide better performance for the accelerator.
  • On-chip storage systems are usually implemented using Scratchpad Memory (SPM) and interconnected with PE arrays.
  • SPM: Scratchpad Memory
  • the hardware structure of the SPM also has a large design space, including data shapes, data mapping, multi-array partitioning, and the interconnection networks between modules. For example, Eyeriss and ShiDianNao use a multicast interconnection topology, and MAERI uses a tree-based interconnection structure.
  • the design of on-chip memory systems is complex and can significantly impact accelerator performance.
  • HLS: high-level synthesis
  • TensorLib uses space-time transformation analysis to generate PE arrays, but does not generate the complete SPM hardware structure. AutoSA and SuSy support automatic generation of memory hierarchies, but cannot fully explore data reuse within the SPM, resulting in unnecessary data copying. Parashar, Angshuman, et al. "Timeloop: A systematic approach to DNN accelerator evaluation." In ISPASS 2019 discusses only memory-level data reuse based on the storage hierarchy, without an actual hardware code implementation.
  • the present invention provides an efficient storage interconnection circuit design method for hardware accelerators.
  • FPGA: Field-Programmable Gate Array
  • PE: Processing Element
  • An automatic design method for the storage-computing module interconnection circuit of a hardware accelerator: the space-time transformation (STT) method is used to analyze the expected behavior of data in the storage module of the hardware accelerator and to classify the data reuse within the storage module; then, based on the results of the space-time transformation analysis, the optimal storage-computing module interconnection circuit scheme is automatically selected and implemented; the method includes the following steps:
  • the contents of the configuration file include: calculation codes for tensor operations and space-time transformation matrices;
  • RS is a subspace of the space-time space of the hardware accelerator. At all points in the subspace, the storage array coordinates accessed by the hardware accelerator at each point's space-time coordinates are zero; RS is expressed as the solution space of the matrix equation AT⁻¹x = 0, where A is the operand's access matrix, T is the space-time transformation matrix, and x is a point in the space-time space.
  • the interconnection methods include multicast interconnection and rotating interconnection; specifically, for each basis vector v:
  • circuits are designed separately for implementation.
  • the output data of the SPM memory are not directly interconnected with the PEs. Instead, the array formed by all SPM output data is shifted by a length of R, the overflowing part is appended to the end of the array, and the rotated result is then connected from the SPM's output ports to the PEs' input ports.
  • the present invention adopts two circuit implementation modes: the combinational logic mode and the cascade mode.
  • the combinational logic mode completes the variable-length rotation directly within one cycle, which consumes fewer cycles, but the combinational logic for the variable-length rotation is relatively complex.
  • the cascade mode realizes rotations of different lengths over multiple cycles and then selects one of the results for output according to the input rotation-length signal; the combinational logic is simple, but more register resources are consumed.
  • the present invention provides a design method for the storage-computing module interconnection structure of hardware accelerators oriented to tensor algebra. Compared with the existing technology, which uses only a fixed multicast interconnection method, the present invention can support both multicast interconnection and rotating interconnection, provides different implementations of rotating interconnection, automatically selects the appropriate interconnection method according to the hardware execution mode, and realizes combinations of different interconnection methods. Implementation shows that the present invention can effectively improve the interconnection efficiency of hardware storage-computing modules and reduce the consumption of storage resources.
  • Figure 1 is a flow chart of the method for determining the interconnection type from the basis vectors.
  • Figure 2 is a schematic diagram of the two ways of implementing the rotating interconnection circuit, including the combinational logic mode and the cascade mode.
  • Figure 3 is a schematic diagram of the multiple ways of generating a complete interconnection structure from combinations of different interconnection types.
  • Figure 4 is a flow chart of the method for automatically designing and generating the storage-interconnection circuit of a hardware accelerator provided by the present invention.
  • the present invention provides an automatic design method for the storage-computing module interconnection circuit of a hardware accelerator. It uses space-time transformation (STT) to analyze the expected behavior of data in the storage module of the hardware accelerator, computes the reuse of data within the storage module, and classifies the data reuse. Then, based on the results of the space-time transformation analysis, the optimal storage-computing module interconnection circuit scheme is automatically selected and implemented.
  • STT: Space-Time Transformation
  • the present invention uses the Chisel high-level language described in the literature (Bachrach, Jonathan, et al. "Chisel: constructing hardware in a Scala embedded language." In DAC 2012). This language is used for hardware design and combines register-level performance optimization with the development efficiency of a high-level language.
  • Figure 4 shows the method flow of automatically designing and generating a hardware accelerator storage-interconnect circuit provided by the present invention.
  • the automatic design method of the storage-computing module interconnection circuit of the hardware accelerator of the present invention includes the following steps:
  • the configuration file includes calculation codes for tensor operations and spatio-temporal transformation matrices; the calculation codes define the input operands and output operands of the hardware accelerator, as well as the algorithm for calculating the output operands from the input operands;
  • circuits are designed separately for implementation.
  • the contents of the configuration file include: calculation codes for tensor operations and space-time transformation matrices;
  • the calculation code of the tensor operation is specifically the calculation code of the tensor algorithm corresponding to the intelligent application in the user input file; it defines the input operands and output operands of the hardware accelerator, as well as the algorithm that computes the output operands from the input operands, expressed as multi-level loops.
  • the execution of the hardware accelerator happens at different times at the positions of different computing units (PE), forming a high-dimensional logical space composed of multi-dimensional physical space and multi-dimensional time.
  • the physical space refers to the position coordinates of the accelerator's computing units (PE).
  • time refers to the different moments of the accelerator's execution. In such a high-dimensional logical space, every distinct point is a space-time vector and can be assigned a different computing task.
  • the space-time transformation matrix in the user input file is obtained from the one-to-one mapping of the calculation loop subscript vector to the space-time vector during the execution of the hardware accelerator; the mapping can be expressed as a matrix-vector multiplication.
  • the operation code is expressed as a multi-level loop, and the loop variables of each level of loop constitute the calculation loop subscript vector I.
  • the access matrix A maps the calculation loop subscript vector I to a memory address (i.e., a location in the SPM storage unit), representing the multi-dimensional array coordinate vector of the data storage; A[i,j] represents the scaling factor that the j-th loop subscript contributes to the i-th dimension address subscript of A; the access matrix A can be obtained directly from the vector operation expression in the user input file.
  • This mapping can be expressed as the matrix-vector multiplication AI = D, where D is the multi-dimensional coordinate vector of the storage location in memory; according to the calculation expression, each operand participating in the operation is distinguished as an input operand or an output operand.
  • RS is a subspace of the space-time space of the hardware accelerator. At all points in the subspace, the storage array coordinates accessed by the hardware accelerator at the space-time coordinates are zero; RS is expressed as the solution space of the matrix equation AT⁻¹x = 0.
  • the number of reuses is, for a given basis vector v and any initial point x, the number of points on the line x + kv in the space-time space at which valid computation is performed;
  • each storage unit is interconnected with specific computing units (PE). If the data stored in the SPM are the data of an input operand, the output port of the SPM is connected to the input port of the PE; otherwise, the output port of the PE is connected to the input port of the SPM.
  • the output data of the SPM are not directly interconnected with the PEs. Instead, the array formed by all SPM output data is shifted by R units (ranging from 0 to the PE array length); R units of data then overflow the PE array, the overflowing part is appended to the end of the PE array, and the rotated result after the shift-and-append is connected from the output ports of the SPM to the input ports of the PEs.
  • the present invention proposes two circuit implementations of the rotating interconnection structure: the combinational logic mode and the cascade mode. As shown in Figure 2, the combinational logic mode contains a variable-length rotation module and completes the rotation directly within one cycle.
  • the variable-length rotation consumes fewer cycles, but its combinational logic is relatively complex.
  • the cascade mode implements rotations of different lengths over multiple cycles, using multiple modules that each rotate by one unit of data length. Each register holds the result of a rotation of a different length, and one of the results is then selected for output by the input rotation-length signal; the combinational logic is simple, but more register resources are consumed. The user can choose either mode according to hardware implementation efficiency.
  • according to the interconnection method of each basis vector in the basis V, step 5) is performed, and the overall storage-computing module interconnection circuit structure is finally generated automatically.
  • in step 5), the interconnection method corresponding to each basis vector is obtained.
  • according to the number of basis vectors in each basis (obtained in step 4)) and their interconnection methods, the structures are divided into the following 4 types, as shown in Figure 3. The present invention only considers the cases where the number of basis vectors is 1 or 2, which covers most tensor computation requirements.
  • Rotating interconnection: there is only one basis vector, and a rotating interconnection structure is used.
  • Rotation + multicast interconnection: there are two basis vectors, one of which takes rotating interconnection and the other takes multicast interconnection.
  • Multicast interconnection: there is only one basis vector, and a multicast interconnection structure is used.
  • Multicast + multicast interconnection: there are two basis vectors, and both adopt multicast interconnection.
  • the flow of step 5) ensures that no more than one rotating interconnection appears. At this point, the present invention has completely realized the storage-computing module interconnection circuit design of the hardware accelerator.
  • the present invention is used to design and generate the storage-computing module interconnection circuit of a hardware accelerator, which can be used in hardware accelerators for various intelligent applications (including image processing, object detection, decision analysis, recommendation systems, natural language processing, and scientific data analysis).
  • the invention automatically designs the storage-computing module interconnection circuit according to the hardware accelerator computation mode specified by the user, supports different interconnection methods, optimizes memory utilization efficiency, and reduces the resource waste caused by redundant data storage.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Logic Circuits (AREA)
  • Design And Manufacture Of Integrated Circuits (AREA)

Abstract

An automatic design method for the storage-computing module interconnection circuit of a hardware accelerator: the expected behavior of data in the storage module of the hardware accelerator is analyzed by means of the space-time transformation (STT), the data reuse in the storage module is computed and classified, and the optimal storage-computing module interconnection circuit scheme is then automatically selected and implemented as multicast interconnection or rotating interconnection. The method can effectively improve the interconnection efficiency of hardware storage-computing modules and reduce the consumption of storage resources.

Description

Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator
Technical Field
The present invention relates to hardware accelerator design technology, and in particular to an automatic design method for the storage-computing module interconnection circuit of a hardware accelerator for tensor applications.
Background Art
Tensor algebra is a common type of application in computer programs and is applicable to a wide range of fields such as machine learning and data analysis. To implement large-scale tensor algebra applications efficiently, users need to rely on dedicated hardware accelerators on different platforms, such as application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), coarse-grained reconfigurable arrays (Coarse-Grained Reconfigurable Array, CGRA), and other embedded devices; such accelerators are collectively referred to as tensor application accelerators. Most tensor application accelerators are mainly composed of a series of homogeneous computing units (Processing Element, PE), an on-chip interconnection network (Network-on-chip), and an on-chip storage system. Tensor application accelerators can provide enormous parallelism because a large number of PEs can work simultaneously at high frequency. At the same time, PEs can communicate with each other at low cost, which enables efficient data reuse. Researchers have proposed a large number of tensor application accelerator designs, which can adopt various hardware structures, including systolic arrays, multicast networks, and tree-based interconnection structures.
The on-chip storage system is a fundamental component of a tensor application accelerator. An efficient on-chip storage design can save energy, reduce space and bandwidth requirements, and provide better performance for the accelerator. On-chip storage systems are usually implemented with scratchpad memory (Scratchpad Memory, SPM) and interconnected with the PE array. The hardware structure of the SPM likewise has a large design space, including data shapes, data mapping, multi-array partitioning, and the interconnection networks between modules. For example, Eyeriss and ShiDianNao use a multicast interconnection topology, and MAERI uses a tree-based interconnection structure. The design of the on-chip storage system is complex and significantly affects accelerator performance.
Considering the variety of tensor applications and the need to develop accelerators efficiently, a considerable amount of recent research has explored automatic design methods for tensor application accelerators, for example: using the polyhedral model to analyze tensor dataflow and adopting high-level synthesis (HLS) to generate hardware architectures (Wang, Jie, Licheng Guo, and Jason Cong. "AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA." In FPGA 2021; Jia, Liancheng, et al. "TensorLib: A spatial accelerator generation framework for tensor algebra." In DAC 2021; Lu, Liqiang, et al. "TENET: A framework for modeling tensor dataflow based on relation-centric notation." In ISCA 2021); and creating domain-specific languages (DSL) and compilers to design hardware automatically with high-level languages (Lai, Yi-Hsiang, et al. "SuSy: A programming model for productive construction of high-performance systolic arrays on FPGAs." In ICCAD 2020; Lai, Yi-Hsiang, et al. "HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing." In FPGA 2019). Users do not need to write complex hardware code; they only give a high-level description of the hardware behavior, and the compiler can automatically generate the underlying hardware code. TensorLib uses space-time transformation analysis to generate PE arrays, but does not generate the complete SPM hardware structure. AutoSA and SuSy support automatic generation of memory hierarchies, but cannot fully explore data reuse inside the SPM, which leads to unnecessary data copying. Parashar, Angshuman, et al. "Timeloop: A systematic approach to DNN accelerator evaluation." In ISPASS 2019 discusses only memory-level data reuse based on the storage hierarchy, without an actual hardware code implementation.
However, prior work has focused mainly on analyzing the architecture of the PE array and its internal interconnection, while the interconnection circuit between the SPM and the PE array lacks design and optimization; usually only a one-to-one interconnection between the SPM and the PE array is supported. This interconnection scheme leads to the following problem: although data reuse exists between different PEs, the data are forced to be stored redundantly in different SPM modules because there is no corresponding SPM-PE data path, which wastes storage space. The present invention proposes an efficient solution to this problem.
Summary of the Invention
To overcome the above deficiencies of the prior art, the present invention provides an efficient storage interconnection circuit design method for hardware accelerators.
For convenience, the present invention adopts the following definitions of terms:
FPGA (Field-Programmable Gate Array)
SPM (Scratchpad Memory)
STT (Space-Time Transformation)
RS (Reuse Space)
PE (Processing Element)
IO (Input-Output)
RTL (Register Transfer Level)
The technical solution of the present invention is as follows:
An automatic design method for the storage-computing module interconnection circuit of a hardware accelerator, which uses the space-time transformation (Space-Time Transformation, STT) method to analyze the expected behavior of data in the storage module of the hardware accelerator and to classify the data reuse in the storage module; then, according to the results of the space-time transformation analysis, the optimal storage-computing module interconnection circuit scheme is automatically selected and implemented; the method comprises the following steps:
1) Read the user-input configuration file representing the behavior of the accelerator;
The contents of the configuration file include: the calculation code of the tensor operation and the space-time transformation matrix;
2) Generate the access matrix of each operand according to the calculation code of the tensor operation in the user-input configuration file;
3) Compute the reuse space RS of each operand according to the access matrix and the space-time transformation matrix in the configuration file, and obtain the basis V of this space;
RS is a subspace of the space-time space of the hardware accelerator; at all points of the subspace, the storage array coordinates accessed by the hardware accelerator at each point's space-time coordinates are all zero; RS is expressed as the solution space of the following matrix equation:
AT⁻¹x = 0
where A is the access matrix of the operand; T is the space-time transformation matrix; x is a point in the space-time space of the hardware accelerator, which can be expressed as <s₁, s₂, …, sₘ, t₁, t₂, …, tₙ>; s₁, s₂, …, sₘ are the spatial components of the basis vector v, and t₁, t₂, …, tₙ are its temporal components. All such x constitute the reuse space RS of the operand.
4) For each basis vector v in the basis V of the reuse space RS, determine whether the basis vector is handled by an intra-module implementation, or set the interconnection method adopted by the basis vector; the interconnection methods include: multicast interconnection and rotating interconnection; specifically, for each basis vector v:
(a) If the temporal components t₂~tₙ of the basis vector v contain a non-zero element and s₁~sₘ are all zero, the "intra-module implementation" is adopted: no SPM-PE (storage unit-computing unit) interconnection structure needs to be set up, the vector is not counted in the number of basis vectors in subsequent steps, and the number of basis vectors of this reuse space is reduced by 1; go to step 5);
(b) If t₁~tₙ are all zero, multicast interconnection is adopted; go to step 5);
(c) If another basis vector in the basis V already adopts rotating interconnection, the current basis vector v adopts multicast interconnection; go to step 5);
(d) If the number of reuses of the basis vector v is smaller than the PE array length, rotating interconnection is adopted; go to step 5);
Given a basis vector v and an arbitrary initial point x, translating x along the direction v forms a line in the space-time space, expressed as x + kv, where k is an arbitrary multiple; the number of reuses is the number of points on this line at which valid computation is performed;
(e) If t₂~tₙ are all zero, multicast interconnection is adopted; otherwise rotating interconnection is adopted;
5) For the multicast interconnection and the rotating interconnection, design circuits separately for implementation.
For the rotating interconnection structure, the output data of the SPM memories are not directly interconnected with the PEs; instead, the array formed by the output data of all SPMs is shifted by a length of R, the overflowing part is appended to the end of the array, and the rotated result is then connected from the output ports of the SPM to the input ports of the PEs.
For the rotating interconnection structure, the present invention adopts two circuit implementation modes: the combinational logic mode and the cascade mode. The combinational logic mode completes the variable-length rotation directly within one cycle, which consumes fewer cycles, but the combinational logic for the variable-length rotation is relatively complex. The cascade mode implements rotations of different lengths over multiple cycles, and one of the results is then selected for output by the input rotation-length signal; the combinational logic is simple, but more register resources are consumed.
6) Generate the final overall interconnection structure of the hardware accelerator according to the interconnection method of each basis in the space-time space of the hardware accelerator.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention provides a design method for the storage-computing module interconnection structure of hardware accelerators oriented to tensor algebra. Compared with the existing technology, which uses only a fixed multicast interconnection method, the present invention can support both multicast interconnection and rotating interconnection, provides different implementations of rotating interconnection, automatically selects the appropriate interconnection method according to the hardware execution mode, and realizes combinations of different interconnection methods. Implementation shows that the present invention can effectively improve the interconnection efficiency of hardware storage-computing modules and reduce the consumption of storage resources.
Brief Description of the Drawings
Figure 1 is a flow chart of the method for determining the interconnection type from the basis vectors.
Figure 2 is a schematic diagram of the two ways of implementing the rotating interconnection circuit, including the combinational logic mode and the cascade mode.
Figure 3 is a schematic diagram of the multiple ways of generating a complete interconnection structure from combinations of different interconnection types.
Figure 4 is a flow chart of the method for automatically designing and generating the storage-interconnection circuit of a hardware accelerator provided by the present invention.
Detailed Description of the Embodiments
The present invention is further described below through embodiments in conjunction with the drawings, without limiting the scope of the present invention in any way.
The present invention provides an automatic design method for the storage-computing module interconnection circuit of a hardware accelerator. It uses the space-time transformation (Space-Time Transformation, STT) to analyze the expected behavior of data in the storage module of the hardware accelerator, computes the reuse of the data within the storage module, and classifies the data reuse. Then, according to the results of the space-time transformation analysis, the optimal storage-computing module interconnection circuit scheme is automatically selected and implemented.
In the specific implementation, the present invention uses the Chisel high-level language described in the literature (Bachrach, Jonathan, et al. "Chisel: constructing hardware in a Scala embedded language." In DAC 2012). This language is used for hardware design and combines register-level performance optimization with the development efficiency of a high-level language.
Figure 4 shows the flow of the method for automatically designing and generating the storage-interconnection circuit of a hardware accelerator provided by the present invention. Specifically, the automatic design method of the present invention for the storage-computing module interconnection circuit of a hardware accelerator comprises the following steps:
1) Read the user-input configuration file representing the behavior of the accelerator. The configuration file includes the calculation code of the tensor operation and the space-time transformation matrix; the calculation code defines the input operands and output operands of the hardware accelerator, as well as the algorithm that computes the output operands from the input operands;
2) Generate the access matrix A of each input operand of the hardware accelerator according to the calculation code of the tensor operation in the user-input configuration file;
3) Compute the reuse space RS of each operand according to the access matrix and the space-time transformation matrix in the configuration file, and obtain the basis V of this space.
4) For each basis vector v in the basis V, determine its interconnection method.
5) For the multicast interconnection and the rotating interconnection, design circuits separately for implementation.
6) Generate the final overall interconnection structure according to the interconnection method of each basis.
The specific implementation includes the following process:
1) Read the user-input configuration file representing the behavior of the accelerator.
The contents of the configuration file include: the calculation code of the tensor operation and the space-time transformation matrix;
The calculation code of the tensor operation is specifically the calculation code of the tensor algorithm corresponding to the intelligent application in the user input file. It defines the input operands and output operands of the hardware accelerator, as well as the algorithm that computes the output operands from the input operands, expressed as multi-level loops. The execution of the hardware accelerator happens at different times at the positions of different computing units (PE), forming a high-dimensional logical space composed of multi-dimensional physical space and multi-dimensional time. The physical space refers to the position coordinates of the accelerator's computing units (PE). Time refers to the different moments of the accelerator's execution. In such a high-dimensional logical space, every distinct point is a space-time vector and can be assigned a different computing task.
The space-time transformation matrix in the user input file is obtained from the one-to-one mapping of the calculation loop subscript vector to the space-time vector during the execution of the hardware accelerator; the mapping can be expressed as a matrix-vector multiplication.
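As an illustration of this mapping, the following sketch (written in Python for readability rather than the Chisel used in the actual implementation) builds a hypothetical space-time transformation matrix T for the matrix multiplication C[i,j] += A[i,k] * B[k,j] scheduled on a one-dimensional PE array; the systolic skew t₂ = i + k is an assumption made for this example, not a schedule prescribed by the invention.
```python
import numpy as np

# Loop index vector I = (i, j, k); space-time vector x = (s1, t1, t2).
# This hypothetical T places iteration i on PE s1 = i, runs j as the
# outer time loop (t1 = j), and skews k against the PE index
# (t2 = i + k), as in a systolic schedule.
T = np.array([[1, 0, 0],   # s1 = i
              [0, 1, 0],   # t1 = j
              [1, 0, 1]])  # t2 = i + k

I = np.array([2, 1, 3])    # the iteration (i=2, j=1, k=3)
x = T @ I                  # its space-time point <s1, t1, t2>
print(x)                   # -> [2 1 5]: PE 2 executes it at time (1, 5)
```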
2) Generate the access matrix A of each operand according to the calculation code of the tensor operation in the user input file.
The operation code is expressed as multi-level loops, and the loop variables of each loop level constitute the calculation loop subscript vector I. The access matrix A maps the calculation loop subscript vector I to a memory address (i.e., a location in the SPM storage unit), representing the multi-dimensional array coordinate vector of the data storage; A[i,j] represents the scaling factor that the j-th loop subscript contributes to the i-th dimension address subscript of A; the access matrix A can be obtained directly from the vector operation expression in the user input file. This mapping can be expressed as the following matrix-vector multiplication:
AI = D
where D is the multi-dimensional coordinate vector representing the storage location in memory; according to the calculation expression, each operand participating in the operation is distinguished as an input operand or an output operand.
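Continuing the same hypothetical matrix multiplication, the sketch below forms the access matrix of the operand B[k, j] and applies the mapping AI = D; the choice of operand and loop order are assumptions made for illustration only.
```python
import numpy as np

# Operand B is addressed as B[k, j], so its access matrix selects k and
# j out of I = (i, j, k); entry A[i, j] is the factor that the j-th
# loop index contributes to the i-th address dimension.
A_B = np.array([[0, 0, 1],   # first address dimension:  k
                [0, 1, 0]])  # second address dimension: j

I = np.array([2, 1, 3])      # the iteration (i=2, j=1, k=3)
D = A_B @ I                  # multi-dimensional storage coordinate
print(D)                     # -> [3 1]: this iteration reads B[3, 1]
```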
3) Compute the reuse space RS of each operand according to the access matrix and the space-time transformation matrix in the configuration file, and obtain the basis V of this space.
RS is a subspace of the space-time space of the hardware accelerator; at all points of the subspace, the storage array coordinates accessed by the hardware accelerator at the space-time coordinates are all zero; RS is expressed as the solution space of the following matrix equation:
AT⁻¹x = 0
where A is the access matrix of the operand; T is the space-time transformation matrix; x is a point in the space-time space of the hardware accelerator, which can be expressed as <s₁, s₂, …, sₘ, t₁, t₂, …, tₙ>; s₁, s₂, …, sₘ are the spatial components of the basis vector v, and t₁, t₂, …, tₙ are its temporal components. All such x constitute the reuse space RS of the operand.
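A minimal sketch of this step, reusing the hypothetical A and T from the examples above: the basis V of the reuse space can be obtained as the null space of AT⁻¹, for example with sympy.
```python
from sympy import Matrix

A_B = Matrix([[0, 0, 1],
              [0, 1, 0]])     # access matrix of B[k, j] (hypothetical)
T = Matrix([[1, 0, 0],
            [0, 1, 0],
            [1, 0, 1]])       # space-time transformation (hypothetical)

M = A_B * T.inv()             # matrix of the equation A T^(-1) x = 0
V = M.nullspace()             # basis of the reuse space RS
print(V)                      # -> [Matrix([[1], [0], [1]])]
```
For this hypothetical schedule the basis contains the single vector <s₁, t₁, t₂> = <1, 0, 1>: the same element of B is visited by adjacent PEs at successive steps of t₂, which the rules of step 4) below classify as rotating interconnection.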
4) For each basis vector v in the basis V of the reuse space RS, determine its interconnection method according to the flow shown in Figure 1 (a Python sketch of these rules follows this list).
(a) If the temporal components t₂~tₙ of the basis vector v contain a non-zero element and s₁~sₘ are all zero, the "intra-module implementation" is adopted: no SPM-PE (storage unit-computing unit) interconnection structure needs to be set up, the vector is not counted in the number of basis vectors in subsequent steps, and the number of basis vectors of this reuse space is reduced by 1; go to step 5);
(b) If t₁~tₙ are all zero, multicast interconnection is adopted; go to step 5);
(c) If another basis vector in the basis V already adopts rotating interconnection, the current basis vector v adopts multicast interconnection; go to step 5);
(d) If the number of reuses of the basis vector v is smaller than the PE array length, rotating interconnection is adopted; go to step 5);
Here the number of reuses is, for a given basis vector v and an arbitrary initial point x, the number of points in the space-time space defined by x + kv at which valid computation is performed;
(e) If t₂~tₙ are all zero, multicast interconnection is adopted; otherwise rotating interconnection is adopted;
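The decision rules (a)-(e) can be condensed into the following Python sketch; the names are illustrative, and the reuse count and PE array length are assumed to be known from the schedule.
```python
def choose_interconnect(v, m, reuse_count, pe_len, basis_has_rotation):
    # v is the space-time basis vector <s1..sm, t1..tn>, m the number of
    # spatial dimensions; a sketch of rules (a)-(e), not the actual
    # circuit generator of the invention.
    s, t = v[:m], v[m:]
    if any(t[1:]) and not any(s):       # (a) reuse only in later time dims:
        return "internal"               #     handled inside the module
    if not any(t):                      # (b) purely spatial reuse
        return "multicast"
    if basis_has_rotation:              # (c) at most one rotation per basis
        return "multicast"
    if reuse_count < pe_len:            # (d) reuse chain shorter than array
        return "rotation"
    return "multicast" if not any(t[1:]) else "rotation"    # (e)

print(choose_interconnect([1, 0, 1], 1, 4, 8, False))       # -> rotation
```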
5) For the multicast interconnection and the rotating interconnection, design circuits separately for implementation, including the multicast interconnection circuit and the rotating interconnection circuit.
For the multicast interconnection structure, each storage unit (SPM) is interconnected with specific computing units (PE). If the data stored in the SPM are the data of an input operand, the output port of the SPM is connected to the input port of the PE; otherwise, the output port of the PE is connected to the input port of the SPM.
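As a purely behavioral illustration (the invention describes a circuit, not software), the sketch below groups the PEs of a one-dimensional array into multicast classes along the spatial step of a reuse vector; the helper name and the modulo grouping are assumptions made for this example.
```python
def multicast_groups(pe_len, s_step):
    # PEs whose indices differ by the spatial step s_step of the reuse
    # vector read the same datum, so one SPM output port can fan out to
    # the whole group (illustrative model of the multicast wiring).
    if s_step == 0:                  # no spatial reuse: one-to-one wiring
        return {pe: [pe] for pe in range(pe_len)}
    groups = {}
    for pe in range(pe_len):
        groups.setdefault(pe % s_step, []).append(pe)
    return groups

print(multicast_groups(4, 1))   # -> {0: [0, 1, 2, 3]}: one port feeds all PEs
```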
For the rotating interconnection structure, the output data of the SPMs are not directly interconnected with the PEs; instead, the array formed by all the SPM output data is shifted by R units of data (R ranging from 0 to the PE array length), whereupon R units of data overflow the PE array; the overflowing part is appended to the end of the PE array, and the rotated result after the shift-and-append is connected from the output ports of the SPM to the input ports of the PEs. For the rotating interconnection structure, the present invention proposes two circuit implementations: the combinational logic mode and the cascade mode. As shown in Figure 2, the combinational logic mode contains a variable-length rotation module and completes the variable-length rotation directly within one cycle; it consumes fewer cycles, but the combinational logic for the variable-length rotation is relatively complex. The cascade mode implements rotations of different lengths over multiple cycles, using multiple modules that each rotate by one unit of data length; each register holds the result of a rotation of a different length, and one of the results is selected for output by the input rotation-length signal; the combinational logic is simple, but more register resources are consumed. The user can choose either mode according to hardware implementation efficiency.
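Both modes compute the same cyclic rotation; the behavioral models below (not RTL) show the shift-and-append semantics and the stage selection of the cascade mode.
```python
def rotate_combinational(data, r):
    # Combinational mode: one variable-length rotation in a single step.
    # Shift by r and append the overflowing prefix to the end.
    r %= len(data)
    return data[r:] + data[:r]

def rotate_cascade(data, r):
    # Cascade mode: chained rotate-by-1 stages; a register after each
    # stage would hold one intermediate result, and the rotation-length
    # signal selects which stage's output goes to the PEs.
    stages = [list(data)]
    for _ in range(len(data) - 1):          # fixed rotate-by-1 modules
        prev = stages[-1]
        stages.append(prev[1:] + prev[:1])
    return stages[r % len(data)]

spm_out = ["d0", "d1", "d2", "d3"]
assert rotate_combinational(spm_out, 1) == rotate_cascade(spm_out, 1) \
       == ["d1", "d2", "d3", "d0"]
```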
6) For the interconnection method of each basis vector in the basis V, perform step 5); the overall storage-computing module interconnection circuit structure is finally generated automatically.
The interconnection method corresponding to each basis vector was obtained in step 5). According to the number of basis vectors in each basis (obtained in step 4)) and their interconnection methods, the structures are divided into the following 4 types, as shown in Figure 3 (a composition sketch in Python follows the list). The present invention only considers the cases where the number of basis vectors is 1 or 2, which covers most tensor computation requirements.
6a) Rotating interconnection. There is only one basis vector, and the rotating interconnection structure is adopted.
6b) Rotation + multicast interconnection. There are two basis vectors, one of which adopts rotating interconnection and the other multicast interconnection.
6c) Multicast interconnection. There is only one basis vector, and the multicast interconnection structure is adopted.
6d) Multicast + multicast interconnection. There are two basis vectors, and both adopt multicast interconnection.
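A sketch of this composition step, assuming the per-basis-vector methods chosen in step 4) are collected in a list; the generator is reduced to returning labels, whereas the real flow would instantiate the corresponding Chisel circuits.
```python
def compose_interconnect(methods):
    # Pick one of the four overall structures (6a-6d) from the methods
    # chosen for the basis vectors; vectors handled by the intra-module
    # implementation of rule (a) are assumed to be dropped already.
    assert 1 <= len(methods) <= 2, "only 1 or 2 basis vectors are considered"
    assert methods.count("rotation") <= 1, "at most one rotating interconnection"
    if methods == ["rotation"]:
        return "6a) rotating interconnection"
    if sorted(methods) == ["multicast", "rotation"]:
        return "6b) rotation + multicast interconnection"
    if methods == ["multicast"]:
        return "6c) multicast interconnection"
    return "6d) multicast + multicast interconnection"

print(compose_interconnect(["rotation", "multicast"]))   # -> 6b) ...
```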
The flow of step 5) guarantees that no more than one rotating interconnection appears. At this point, the present invention has completely realized the storage-computing module interconnection circuit design of the hardware accelerator.
The hardware accelerator storage-computing module interconnection circuits designed and generated with the present invention can be used in hardware accelerators for various intelligent applications (including image processing, object detection, decision analysis, recommendation systems, natural language processing, and scientific data analysis). According to the hardware accelerator computation mode specified by the user, the present invention automatically designs the storage-computing module interconnection circuit, supports different interconnection methods, optimizes memory utilization efficiency, and reduces the resource waste caused by redundant data storage.
It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the scope of the present invention and the appended claims. Therefore, the present invention should not be limited to the contents disclosed in the embodiments, and the scope of protection claimed by the present invention is defined by the claims.

Claims (8)

  1. An automatic design method for the storage-computing module interconnection circuit of a hardware accelerator, wherein the expected behavior of data in the storage module of the hardware accelerator is analyzed through the space-time transformation STT, the data reuse in the storage module is computed and classified, and the optimal storage-computing SPM-PE module interconnection circuit scheme is further automatically selected and implemented; comprising the following steps:
    1) reading a user-input configuration file representing the behavior of the accelerator; the configuration file includes the calculation code of the tensor operation and the space-time transformation matrix T; the calculation code defines the input operands and output operands of the hardware accelerator, as well as the algorithm that computes the output operands from the input operands;
    2) generating the access matrix A of each input operand of the hardware accelerator according to the calculation code of the tensor operation in the user-input configuration file;
    3) computing the reuse space RS of each operand according to the access matrix and the space-time transformation matrix in the configuration file, and obtaining the basis V of the reuse space;
    RS is expressed as the solution space of the following matrix equation:
    AT⁻¹x = 0
    where x is a point in the space-time space of the hardware accelerator, expressed as <s₁, s₂, …, sₘ, t₁, t₂, …, tₙ>; s₁, s₂, …, sₘ are the spatial components of the basis vector v, and t₁, t₂, …, tₙ are the temporal components of the basis vector v; all such x constitute the reuse space RS of the operand;
    4) for each basis vector v in the basis V of the reuse space RS, determining whether the basis vector adopts the intra-module implementation, or setting the interconnection method adopted by the basis vector; the interconnection methods include: multicast interconnection and rotating interconnection, with no more than one rotating interconnection;
    specifically, for each basis vector v:
    a) if the temporal components t₂~tₙ of the basis vector v contain a non-zero element and s₁~sₘ are all zero, the intra-module implementation is adopted: no storage unit-computing unit SPM-PE interconnection structure needs to be set up, the vector is not counted in the number of basis vectors in subsequent steps, and the number of basis vectors of the reuse space is reduced by 1; go to step 5);
    b) if t₁~tₙ are all zero, multicast interconnection is adopted; go to step 5);
    c) if another basis vector in the basis V already adopts rotating interconnection, the current basis vector v adopts multicast interconnection;
    d) if the number of reuses of the basis vector v is smaller than the PE array length, rotating interconnection is adopted; go to step 5);
    wherein the number of reuses is: given the basis vector v and an arbitrary initial point x, the number of values of k for which the points in the space-time space defined by x + kv perform valid computation; the PE array length refers to the number of PEs of the PE array along the direction defined by s₁~sₘ;
    e) if t₂~tₙ are all zero, multicast interconnection is adopted; otherwise rotating interconnection is adopted;
    5) designing circuits separately to implement the multicast interconnection and the rotating interconnection;
    the rotating interconnection is specifically: shifting the array formed by the output data of all SPM memories by R units of data, whereupon R units of data overflow the PE array; appending the overflowing part to the end of the PE array, with R ranging from 0 to the PE array length; and then sending the rotated result after the shift and append from the output ports of the SPM to the input ports of the PEs;
    6) generating the interconnection structure of the hardware accelerator according to the interconnection method of each basis in the space-time space of the hardware accelerator.
  2. The automatic design method for the storage-computing module interconnection circuit of a hardware accelerator according to claim 1, wherein the calculation code of the tensor operation in the user input file is specifically the calculation code of the tensor algorithm corresponding to the intelligent application in the user input file, which defines the input operands and output operands of the hardware accelerator, as well as the algorithm that computes the output operands from the input operands, expressed as multi-level loops; the space-time transformation matrix in the user input file is specifically obtained from the one-to-one mapping of the calculation loop subscript vector to the space-time vector during the execution of the hardware accelerator; the mapping adopts matrix-vector multiplication.
  3. The automatic design method for the storage-computing module interconnection circuit of a hardware accelerator according to claim 1, wherein the access matrix A specifically maps the calculation loop subscript vector I to a memory address of the SPM storage unit, representing the multi-dimensional array coordinate vector of the data storage; A[i,j] represents the scaling factor that the j-th loop subscript contributes to the i-th dimension address subscript of A, and is obtained directly from the vector operation expression in the user input file.
  4. The automatic design method for the storage-computing module interconnection circuit of a hardware accelerator according to claim 1, wherein the reuse space RS is specifically a subspace of the space-time space of the hardware accelerator; for all points in the subspace, the storage array coordinates accessed by the hardware accelerator at the space-time coordinates of each point are all zero.
  5. The automatic design method for the storage-computing module interconnection circuit of a hardware accelerator according to claim 1, wherein the multicast interconnection structure specifically interconnects each storage unit SPM with specific computing units PE; if the data stored in the SPM are the data of an input operand, the output port of the SPM is connected to the input port of the PE; otherwise, the output port of the PE is connected to the input port of the SPM.
  6. The automatic design method for the storage-computing module interconnection circuit of a hardware accelerator according to claim 1, wherein, in step 5), the rotating interconnection structure is implemented in the combinational logic mode or the cascade mode, wherein:
    the combinational logic mode contains only one variable-length rotation module and completes the variable-length rotation directly within one cycle; the cascade mode implements rotations of different lengths over multiple cycles, each register holds the result of a rotation of a different length, and one of the results is selected for output by the input rotation-length signal.
  7. The automatic design method for the storage-computing module interconnection circuit of a hardware accelerator according to claim 6, wherein the interconnection method corresponding to each basis vector is divided into the following 4 types:
    6a) rotating interconnection type: there is only one basis vector, and rotating interconnection is adopted;
    6b) rotation + multicast interconnection type: there are two basis vectors, one adopting rotating interconnection and the other adopting multicast interconnection;
    6c) multicast interconnection type: there is only one basis vector, and the multicast interconnection structure is adopted;
    6d) multicast + multicast interconnection type: there are two basis vectors, and both adopt multicast interconnection.
  8. The automatic design method for the storage-computing module interconnection circuit of a hardware accelerator according to claim 1, wherein the method is implemented using the Chisel high-level language.
PCT/CN2022/099082 2022-04-12 2022-06-16 Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator WO2023197438A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210379803.7A CN114462340B (zh) 2022-04-12 2022-04-12 Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator
CN202210379803.7 2022-04-12

Publications (1)

Publication Number Publication Date
WO2023197438A1 (zh)

Family

ID=81418579

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099082 WO2023197438A1 (zh) 2022-04-12 2022-06-16 Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator

Country Status (2)

Country Link
CN (1) CN114462340B (zh)
WO (1) WO2023197438A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462340B (zh) * 2022-04-12 2022-07-01 北京大学 Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090560A (zh) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 Design method for an FPGA-based LSTM recurrent neural network hardware accelerator
US20180174036A1 (en) * 2016-12-15 2018-06-21 DeePhi Technology Co., Ltd. Hardware Accelerator for Compressed LSTM
CN108596331A (zh) * 2018-04-16 2018-09-28 浙江大学 Optimization method for a cellular neural network hardware architecture
CN113220630A (zh) * 2021-05-19 2021-08-06 西安交通大学 Reconfigurable array optimization method and automatic tuning method for a hardware accelerator
CN113901746A (zh) * 2021-10-09 2022-01-07 北京大学 Design method for a hardware accelerator for vector algebra
CN114462340A (zh) * 2022-04-12 2022-05-10 北京大学 Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11175957B1 (en) * 2020-09-22 2021-11-16 International Business Machines Corporation Hardware accelerator for executing a computation task

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174036A1 (en) * 2016-12-15 2018-06-21 DeePhi Technology Co., Ltd. Hardware Accelerator for Compressed LSTM
CN108090560A (zh) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 Design method for an FPGA-based LSTM recurrent neural network hardware accelerator
CN108596331A (zh) * 2018-04-16 2018-09-28 浙江大学 Optimization method for a cellular neural network hardware architecture
CN113220630A (zh) * 2021-05-19 2021-08-06 西安交通大学 Reconfigurable array optimization method and automatic tuning method for a hardware accelerator
CN113901746A (zh) * 2021-10-09 2022-01-07 北京大学 Design method for a hardware accelerator for vector algebra
CN114462340A (zh) * 2022-04-12 2022-05-10 北京大学 Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAI JIANG; LIU ZHI-ZHE; XIU YU-JIE; TIAN YING-HUI; ZHAO CHEN-XU: "Design of processing element aiming to accelerate convolutional neural networks", COMPUTER ENGINEERING AND DESIGN, vol. 40, no. 12, 16 December 2019 (2019-12-16), pages 3620-3624, XP093098738 *

Also Published As

Publication number Publication date
CN114462340B (zh) 2022-07-01
CN114462340A (zh) 2022-05-10

Similar Documents

Publication Publication Date Title
Pedram et al. Codesign tradeoffs for high-performance, low-power linear algebra architectures
Gu et al. DLUX: A LUT-based near-bank accelerator for data center deep learning training workloads
Garofalo et al. A heterogeneous in-memory computing cluster for flexible end-to-end inference of real-world deep neural networks
Liang et al. An efficient hardware design for accelerating sparse CNNs with NAS-based models
Xiao et al. Plasticity-on-chip design: Exploiting self-similarity for data communications
Muñoz-Martínez et al. STONNE: A detailed architectural simulator for flexible neural network accelerators
Moon et al. Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication
WO2023197438A1 (zh) Automatic design method for the storage-computing module interconnection circuit of a hardware accelerator
Chang et al. DASM: Data-streaming-based computing in nonvolatile memory architecture for embedded system
Jia et al. EMS: efficient memory subsystem synthesis for spatial accelerators
Zhang et al. Towards automatic and agile AI/ML accelerator design with end-to-end synthesis
Lu Paving the way for China exascale computing
Qin et al. Enabling flexibility for sparse tensor acceleration via heterogeneity
CN113901746A (zh) Design method for a hardware accelerator for vector algebra
Roychowdhury Derivation, extensions and parallel implementation of regular iterative algorithms
Huang et al. Ready: A ReRAM-based processing-in-memory accelerator for dynamic graph convolutional networks
Li et al. Heterogeneous systems with reconfigurable neuromorphic computing accelerators
Luo et al. Rubick: A synthesis framework for spatial architectures via dataflow decomposition
Wu et al. Implementing DSP algorithms with on-chip networks
Esmaeilzadeh et al. Physically accurate learning-based performance prediction of hardware-accelerated ml algorithms
Chen et al. Exploiting on-chip heterogeneity of versal architecture for gnn inference acceleration
Gomony et al. CONVOLVE: Smart and seamless design of smart edge processors
Ma et al. Darwin3: A large-scale neuromorphic chip with a Novel ISA and On-Chip Learning
Sharma et al. A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
Chen et al. Graph-OPU: A Highly Integrated FPGA-Based Overlay Processor for Graph Neural Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22937076

Country of ref document: EP

Kind code of ref document: A1