WO2022068205A1 - Data storage and reading method and system (数据存储和读取方法及系统) - Google Patents

数据存储和读取方法及系统 (Data storage and reading method and system)

Info

Publication number
WO2022068205A1
WO2022068205A1 · PCT/CN2021/092516 · CN2021092516W
Authority
WO
WIPO (PCT)
Prior art keywords
memory
memory bank
data
transformation vector
storage
Prior art date
Application number
PCT/CN2021/092516
Other languages
English (en)
French (fr)
Inventor
李程
欧阳鹏
张振
Original Assignee
北京清微智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京清微智能科技有限公司 filed Critical 北京清微智能科技有限公司
Priority to US17/484,626 priority Critical patent/US11740832B2/en
Publication of WO2022068205A1 publication Critical patent/WO2022068205A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the fields of data storage and memory management.
  • the present application specifically relates to a data storage method and system.
  • Coarse-grained Reconfigurable Architectures (CGRAs)
  • a main controller (usually a central processing unit, CPU)
  • a processing element (PE) array
  • a main memory and a local memory (generally a multi-bank memory structure)
  • the CGRA includes a host controller, an instruction memory (context memory), a data memory, a local data memory (PEA shared memory, SM), a parallel-access data bus, a global register file (PEA global regfile), an output register, a local register file (local regfile), an instruction cache (context buffer), a multi-bank PEA shared memory, a multiplexer (MUX), an arithmetic and logic unit / load and store unit (ALU/LSU), a global register file (global regfile, GR), and the like.
  • the execution flow of the CGRA computing system is as follows: the main controller initializes the CGRA instruction and input data and stores them in the main memory.
  • Before the CGRA accelerates an application, input data is transferred from the main memory to the local memory, and instructions are loaded into the CGRA's configuration memory. When the CGRA completes the computation, the output data is transferred from the local memory back to the main memory.
  • instructions are usually mapped into different computational units (or referred to as processing units PE) to execute computational processes in parallel.
  • To meet the demands of parallel memory access, a CGRA usually adopts a multi-bank memory architecture. In a multi-bank memory, each bank has independent read and write ports, and single-cycle parallel memory access can be achieved by distributing data across different banks.
  • The high energy efficiency of a CGRA comes from the large amount of computing resources distributed over its computing array, its complex interconnection schemes, and its multi-level storage system. However, achieving better performance and energy efficiency in practice requires these resources to cooperate; if they are not well scheduled and coordinated, a CGRA used as an accelerator may actually drag down system performance.
  • Because the hardware architecture of a CGRA differs greatly from that of popular general-purpose processors, existing compilation techniques for general-purpose processors cannot be fully ported to CGRAs. Research and development of CGRA compilation technology is therefore essential. Such compilation technology must be able to mine the parallelism in an application and reduce data-read latency, and then produce configuration information tailored to the CGRA's hardware architecture characteristics, so as to achieve high performance and high energy efficiency.
  • A CGRA compiler needs to provide the control code that runs in the reconfigurable controller as well as the configuration information for the corresponding data path. Because the hardware structure of a CGRA differs greatly from that of a general-purpose processor, the compilation techniques and flow of its compiler also differ from those of a traditional compiler.
  • the core work of the CGRA compiler is to analyze the application program and divide the program into hardware execution part code and software operation part code, and then compile and process the two parts of the code separately to generate the controller operation code and the configuration information of the reconfigurable data path.
  • the compilation technology of the reconfigurable processor includes task division and scheduling, operation mapping, memory mapping optimization and so on.
  • the technique proposed in this application belongs to memory mapping optimization; because it directly affects program performance, it has received considerable attention.
  • the program in the example consists of a two-dimensional nested loop, and each iteration accesses five elements of array A. To pipeline the loop, the five elements are accessed simultaneously in each clock cycle. Different strategies are used to obtain new address spaces, and the numbers in the figures represent the offset addresses of the elements.
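As a sketch of the conflict in the direct-copy scheme of FIG. 3(a), the five accesses of the first iteration can be checked programmatically; the bank count of 8 and the `offset % 8` placement are assumptions inferred from the figures, not values stated in the text:

```python
# Bank-conflict check for the example loop under a direct copy of the
# original 9x9 array. The 8-bank memory and the "bank = offset % 8"
# placement are illustrative assumptions inferred from the figures.
ROW, COL, NUM_BANKS = 9, 9, 8

def offset(i, j):
    """Row-major offset address of A[i][j] in the original array."""
    return i * COL + j

def banks_accessed(i, j):
    """Banks touched by the five accesses of one loop iteration."""
    points = [(i - 1, j), (i + 1, j), (i, j), (i, j - 1), (i, j + 1)]
    return [offset(r, c) % NUM_BANKS for r, c in points]

banks = banks_accessed(1, 1)         # offsets 1, 19, 10, 9, 11
print(banks)                         # [1, 3, 2, 1, 3] -> two collisions
print(len(set(banks)) < len(banks))  # True: a bank conflict every cycle
```

Under this layout A[i-1][j] and A[i][j-1] always share a bank (as do A[i+1][j] and A[i][j+1]), which matches the per-cycle conflict described above.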
  • the present application provides a data storage method that uses a linear transformation to distribute elements appropriately across different memory banks, and divides an application into multiple small tasks by subtask cutting, greatly reducing the application's demand for memory resources.
  • Another object of the present application is to provide a data reading method, which improves the data reading efficiency by dividing subtasks.
  • Another object of the present application is to provide a data storage system that reduces memory requirements and saves system costs.
  • a data storage method for a plurality of memory banks in a CGRA coarse-grained reconfigurable chip.
  • the access modes of the plurality of memory banks include an idle mode and an occupied mode. Operation data with a preset number of storage bits is stored in the plurality of memory banks.
  • Data storage methods include:
  • using a greedy algorithm, the set transformation vector that yields the fewest conflicts and the fewest padding elements is taken as the current transformation vector
  • the operation data is stored in the plurality of memory banks according to the plurality of current memory bank numbers and the plurality of internal offset addresses.
  • sequentially scanning the plurality of memory banks according to the set transformation vector, and sequentially generating the pending memory bank numbers of the plurality of memory banks includes:
  • the storage dimensions of the operation data are scanned in a set order to obtain the pending memory bank numbers; if the pending memory bank numbers of two adjacent dimensions are not consecutive, padding elements are inserted to make the pending memory bank numbers consecutive. The padding includes:
  • using a greedy algorithm to use the set transformation vector with the least conflict and the least padding elements as the current transformation vector includes:
  • the data storage method further includes:
  • the read window is slid along the arrangement order of the memory banks according to a set sliding distance, and the memory data of the plurality of memory banks covered by the read window at its initial position, together with the memory data of the plurality of memory banks covered by the slid read window, is read as the current read data.
  • a data storage system for a plurality of memory banks in a coarse-grained reconfigurable chip, where the access modes of the plurality of memory banks include an idle mode and an occupied mode, and operation data with a preset number of storage bits is stored in the plurality of memory banks.
  • the data storage device includes:
  • a scanner configured to scan multiple storage dimensions of the operation data according to the pre-stored number of bits, match the storage bank of the idle mode, and obtain the current storage bank of each operation data;
  • a generator configured to sequentially scan the plurality of memory banks according to a set transformation vector, and sequentially generate pending memory bank numbers of the plurality of memory banks, and the set transformation vector corresponds to the storage dimension of the operation data;
  • a filler configured to scan the storage dimensions of the operation data in a set order to obtain the pending memory bank numbers, and, if the pending memory bank numbers of two adjacent dimensions are not consecutive, to insert padding elements so that the pending memory bank numbers are consecutive;
  • an acquirer configured to use a greedy algorithm to take the set transformation vector that yields the fewest conflicts and the fewest padding elements as the current transformation vector, wherein the generator generates the current memory bank numbers according to the current transformation vector, and the acquirer is further configured to convert each current memory bank number into the physical memory bank address corresponding to that current memory bank number through an offset function and obtain the corresponding internal offset address;
  • a memory configured to store the operation data in the plurality of memory banks according to the current memory bank number and the internal offset address.
  • the generator is configured to:
  • set the set transformation vector α = (α0, α1, ..., α(d-1)), sequentially scan the plurality of memory banks, and generate the pending memory bank numbers as B(x) = (α·x) % N, where N is the number of physical storage rows of any one of the multiple memory banks and the elements of x are the physical address numbers of that memory bank.
  • the filler is further configured to obtain one or more discontinuity positions among the plurality of pending memory bank numbers; if the dimensions corresponding to the one or more discontinuity positions are adjacent, padding elements are inserted at the one or more discontinuity positions to make the pending memory bank numbers consecutive.
  • the acquirer is further configured to:
  • the data storage device includes:
  • the read window is slid along the arrangement order of the memory banks according to a set sliding distance, and the memory data of the plurality of memory banks covered by the read window at its initial position, together with the memory data of the plurality of memory banks covered by the slid read window, is read as the current read data.
  • FIG. 1 is a schematic diagram for explaining the architecture of a CGRA system in the prior art.
  • FIG. 2 is a schematic diagram for explaining the compilation system framework of CGRA in the prior art.
  • FIG. 3(a) is a schematic diagram for explaining memory division in the prior art.
  • FIG. 3(b) is another schematic diagram for explaining memory division in the prior art.
  • FIG. 3(c) is a schematic diagram for explaining memory division in the present application.
  • FIG. 4 is a schematic flowchart for illustrating a method for storing data in a multi-bank memory in an embodiment of the present application.
  • FIG. 5 is a schematic flowchart for illustrating a method for storing data in a multi-bank memory in an embodiment of the present application.
  • FIG. 6 is a schematic diagram for illustrating the composition of a data storage system with multi-bank memory in an embodiment of the present application.
  • FIG. 7 is a schematic diagram for explaining an example of filling elements in a data storage method of a multi-bank memory in an embodiment of the present application.
  • FIG. 8 is a schematic diagram for explaining an example of a method for dividing subtasks in a data storage method of a multi-bank memory in an embodiment of the present application.
  • In order to avoid the overhead of data preprocessing and reduce memory requirements, the purpose of this method is to find a strategy that can generate an approximately contiguous address space while optimizing the computational efficiency of the CGRA.
  • the address space obtained by this method is shown in Figure 3(c). Some padding elements are regularly added to the original array, which can eliminate conflicts in existing schemes. Since the new address space generated by this application is approximately continuous, data preprocessing is not required, which is also beneficial to the realization of subtask cutting.
  • a data storage method of a multi-bank memory where the multi-bank memory is a plurality of storage banks in a CGRA coarse-grained reconfigurable chip.
  • the access modes of the memory banks include: idle mode and occupied mode.
  • the data storage method of multi-bank memory can store operation data.
  • the operation data has pre-stored bits.
  • the operation data is the data to be calculated, and it is also the data migrated from the outside of the CGRA to the inside of the CGRA.
  • FIG. 4 illustrates a data storage method for a plurality of memory banks in a coarse-grained reconfigurable chip according to an embodiment of the present application.
  • the data storage method includes:
  • Step S101: acquiring multiple memory banks for the operation data.
  • For example, according to the preset number of storage bits, multiple storage dimensions of the operation data are scanned, memory banks in the idle mode are matched, and multiple memory banks for the operation data are obtained. Specifically, for operation data with multiple storage dimensions, the operation data of each storage dimension is matched with a memory bank in the idle mode according to the preset number of storage bits, thereby obtaining multiple memory banks corresponding to the operation data.
  • the multiple storage dimensions of the operation data should be understood as follows: for example, one-dimensional data refers to sequence-type data, two-dimensional data refers to matrix-type data, and three-dimensional data refers to stored data such as color data. The storage or memory space occupied by operation data of different dimensions therefore differs, as do its memory-space requirements.
  • Step S102 generating pending memory bank numbers of a plurality of memory banks.
  • a plurality of memory banks are sequentially scanned according to a preset transformation vector, and the pending memory bank numbers of the plurality of memory banks are sequentially generated, wherein the dimension of the preset transformation vector corresponds to the storage dimension of the operation data.
  • Step S103 generating filling elements.
  • the pending memory bank numbers corresponding to the storage dimensions of the operation data are determined in the set order. If the pending memory bank numbers corresponding to two adjacent storage dimensions are not consecutive, padding elements are inserted between the two pending memory bank numbers to make the pending memory bank numbers consecutive.
  • Step S104 obtaining the current transformation vector.
  • using a greedy algorithm, the set transformation vector that yields the fewest conflicts and the fewest padding elements is taken as the current transformation vector.
  • the dimension of the current transformation vector is the same as the dimension of the operation data.
  • Step S105 generating the current memory bank number.
  • multiple current bank numbers for multiple banks are generated from the current transform vector.
  • Step S106 acquiring the internal offset address.
  • each current memory bank number is converted into a physical memory bank address corresponding to the current memory bank number through the offset function F(x) and the corresponding internal offset address is obtained.
  • step S107 the operation data is stored in a plurality of storage banks.
  • the operation data is stored in a plurality of memory banks according to a plurality of current memory bank numbers and a plurality of internal offset addresses.
  • the pending memory bank numbers are generated as B(x) = (α·x) % N, where N is the number of physical storage rows of any one of the multiple memory banks and the elements of x are the physical address numbers of that memory bank.
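As a compact illustration of the bank function B(x) = (α·x) % N used in steps S102 and S105, the sketch below assigns pending bank numbers for the five-point access pattern of the earlier example; the transformation vector (1, 2) and the bank count of 8 are illustrative choices, not values fixed by the application:

```python
# Sketch of the bank function B(x) = (alpha . x) % N. The vector alpha = (1, 2)
# and N = 8 banks below are illustrative, not values fixed by the application.
def bank_number(alpha, x, n_banks):
    """Map a d-dimensional address vector x to a pending memory bank number."""
    assert len(alpha) == len(x), "alpha must match the data dimensionality"
    return sum(a * xi for a, xi in zip(alpha, x)) % n_banks

# The five accesses of iteration (i, j) = (1, 1) in the example program:
points = [(0, 1), (2, 1), (1, 1), (1, 0), (1, 2)]
banks = [bank_number((1, 2), p, 8) for p in points]
print(banks)   # [2, 4, 3, 1, 5]: five distinct banks, no conflict
```

With α = (1, 2) the five stencil accesses of every iteration differ by -2, -1, 0, +1, +2 in α·x, so they land in five distinct banks modulo 8.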
  • step S103 further includes the following.
  • Step S201: acquiring the discontinuity positions.
  • Step S202 filling elements.
  • step S104 includes the following.
  • Step S301 setting an initialization cycle start interval.
  • Step S302 obtaining an intermediate transformation vector.
  • a plurality of set transformation vectors that do not require padding of the memory bank numbers are obtained as intermediate transformation vectors α.
  • Step S303 acquiring the target intermediate transformation vector.
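Steps S301 to S303 outline a greedy search over candidate transformation vectors; the minimal sketch below fills in the candidate range, the conflict count, and a padding-cost model `sum(alpha)` as illustrative assumptions, since the application gives only the search outline:

```python
import itertools

# Greedy search for a transformation vector alpha, sketched from steps
# S301-S303. The candidate range, conflict count, and padding-cost model
# sum(alpha) are illustrative assumptions, not the application's exact rules.
def conflicts(alpha, accesses, n_banks):
    """Number of same-cycle bank collisions for one iteration's accesses."""
    banks = [sum(a * xi for a, xi in zip(alpha, x)) % n_banks for x in accesses]
    return len(banks) - len(set(banks))

def search_alpha(accesses, n_banks, d, max_coeff=4):
    """Pick the candidate with the fewest conflicts, then the least padding."""
    def cost(alpha):
        return (conflicts(alpha, accesses, n_banks), sum(alpha))
    return min(itertools.product(range(max_coeff), repeat=d), key=cost)

# Five-point access pattern of iteration (1, 1) in the example program:
accesses = [(0, 1), (2, 1), (1, 1), (1, 0), (1, 2)]
alpha = search_alpha(accesses, n_banks=8, d=2)
print(alpha, conflicts(alpha, accesses, 8))   # (1, 2) 0
```

Ranking candidates by (conflicts, padding cost) mirrors the stated goal of taking the vector with the fewest conflicts and, among those, the fewest padding elements.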
  • a method for reading data from a multi-bank memory including the above-mentioned data storage method for a multi-bank memory of the present application.
  • the data reading method of the multi-bank memory further includes: setting a read window that covers a set number of the multiple memory banks. According to a set sliding distance, the read window is slid along the arrangement order of the memory banks, and the memory data of the multiple memory banks covered by the read window at its initial position, together with the memory data of the multiple memory banks covered by the slid read window, is read as the current read data.
  • a data storage device with multi-bank memory is provided, and the multi-bank memory is a plurality of storage banks in a CGRA coarse-grained reconfigurable chip.
  • the access modes of the memory banks include: idle mode and occupied mode.
  • the multi-bank memory data storage device can store operation data.
  • the operation data has pre-stored bits.
  • the multi-bank memory data storage device includes: a scanner, a generator, a filler, an acquirer, and a memory.
  • the generator may include a first generator and a second generator
  • the acquirer may include a first acquirer and a second acquirer.
  • the apparatus may include a scanner 101 , a first generator 201 , a filler 301 , a first acquirer 401 , a second generator 501 , a second acquirer 601 and a memory 701 .
  • the scanner 101 is configured to scan multiple storage dimensions of the operation data according to the pre-stored number of bits, match the storage banks in the idle mode, and obtain multiple storage banks of the operation data. Specifically, the scanner 101 is configured to match the operation data of each storage dimension with a memory bank in an idle mode according to the pre-stored number of bits for operation data of multiple storage dimensions, thereby obtaining multiple storage banks corresponding to the operation data.
  • the first generator 201 is configured to sequentially scan a plurality of memory banks according to a preset transformation vector, and sequentially generate pending memory bank numbers of the plurality of memory banks; wherein the dimension of the preset transformation vector corresponds to the storage dimension of the operation data.
  • the filler 301 is configured to determine the pending memory bank numbers corresponding to the storage dimensions of the operation data in a set order. If the pending memory bank numbers corresponding to two adjacent storage dimensions are not consecutive, padding elements are inserted between the two pending memory bank numbers to make the pending memory bank numbers consecutive.
  • the first acquirer 401 is configured to use the greedy algorithm to use the set transformation vector with the least conflict and the least padding elements as the current transformation vector.
  • the second generator 501 is configured to generate a plurality of current memory bank numbers of the plurality of memory banks according to the current transformation vector.
  • the second acquirer 601 is configured to convert each current memory bank number into the physical memory bank address corresponding to that current memory bank number through the offset function F(x) and acquire the corresponding internal offset address.
  • the memory 701 is configured to store operational data in a plurality of memory banks according to a plurality of current memory bank numbers and a plurality of internal offset addresses.
  • the first generator 201 is further configured to:
  • the pending memory bank numbers are generated as B(x) = (α·x) % N, where N is the number of physical storage rows of any one of the multiple memory banks and the elements of x are the physical address numbers of that memory bank.
  • the filler 301 is further configured to acquire one or more discontinuous positions in the plurality of pending memory bank numbers. If the storage dimensions corresponding to one or more discontinuous locations are adjacent, elements are filled in one or more discontinuous locations to make the pending memory bank numbers consecutive.
  • the device further includes a reader 801 .
  • the reader 801 is configured to set a read window covering a set number of the multiple memory banks. According to a set sliding distance, the read window is slid along the arrangement order of the memory banks, and the memory data of the multiple memory banks covered by the read window at its initial position, together with the memory data of the multiple memory banks covered by the slid read window, is read as the current read data.
  • the multi-bank memory partitioning process for CGRA includes address mapping and subtask generation.
  • the address mapping process maps the original array to a new address space (that is, the target memory address space) by constructing the bank function B(x) and the internal offset function F(x).
  • the subtask generation process divides the kernel into different subtasks, making it feasible to perform computations on CGRA under low memory conditions.
  • a method based on linear transformation is used for address mapping.
  • x = (x0, x1, ..., x(d-1)) denotes the address vector of a data element, and α = (α0, α1, ..., α(d-1)) denotes the transformation vector.
  • the scanning order is first determined, which can determine the arrangement of data in the new address space. Then, to determine the storage function, a transform vector ⁇ that minimizes the padding overhead is searched. After the transformation vector ⁇ is obtained, an overall memory partitioning scheme is obtained with the subtask cutting method based on the sliding window.
  • the method proposed in this application specifically includes the following:
  • the method in this application employs a memory filling method designed for a linear transformation-based address mapping scheme to determine the internal offset function F(x).
  • the memory fill method scans each dimension in a set order for pending memory bank numbers. If the pending memory bank numbers of two adjacent storage dimensions are not consecutive, the corresponding padding elements are inserted so that the computation of the new address follows a regular pattern.
  • the scan order of the algorithm is sequential order, which is the same as the original storage order. Once the scan order is determined, the form of the internal offset function F(x) is determined, and the data arrangement is also determined.
  • the essence of the strategy is to avoid memory conflicts by periodically adding padding elements to the original array. The specific internal offset calculation process will be given later.
  • a transform vector ⁇ can be found that minimizes the loop initiation interval (AII) and padding element overhead.
  • each increment of α(i) in storage dimension i means adding a padding element between storage dimensions i and (i+1).
  • the offset function F(x) needs to be found to convert the address x to the corresponding internal offset within a memory bank. It is assumed that the scan order is the same as the original sort order and that the memory uses a linear address space. In this case, d-dimensional data needs to be linearized before being stored in the storage system.
  • W = (w0, w1, ..., w(d-1), 1), where wi represents the width of the array in the i-th dimension.
  • the width refers to the number of data along the dimension. For example, for a 4*8 matrix, the width of dimension 0 is 4, and the width of dimension 1 is 8.
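Before the internal offset is computed, a d-dimensional index must be linearized using the widths; the Horner-style row-major evaluation below is the usual convention and is an assumption about the exact form of Equation 2:

```python
# Row-major linearization of a d-dimensional index vector, sketched from the
# weight vector W = (w0, ..., w(d-1), 1). The Horner-style evaluation below is
# the usual row-major convention, assumed here as the form of Equation 2.
def linearize(x, widths):
    """Linear address of index vector x, widths[i] being dimension i's width."""
    addr = 0
    for xi, wi in zip(x, widths):
        assert 0 <= xi < wi, "index out of range for its dimension"
        addr = addr * wi + xi
    return addr

# For the 4x8 matrix of the example (width 4 in dimension 0, width 8 in
# dimension 1), element (1, 2) lands at linear address 1*8 + 2 = 10:
print(linearize((1, 2), (4, 8)))   # 10
print(linearize((3, 7), (4, 8)))   # 31, the last element
```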
  • Equation 2:
  • the internal offset function can be determined as:
  • FIG. 8 illustrates a sliding-window-based method according to an embodiment of the present application.
  • the target memory consists of 8 memory banks with a maximum depth of 8.
  • a sliding window with a size of 8*8 is first initialized at the beginning of the new address space (as shown by the dotted box).
  • the first subtask consists of the window and the corresponding iteration vector. Determine the first iteration that accesses elements outside the sliding window, then move the window down as far as possible, making sure that the window contains all the elements accessed by that iteration. This way, the second subtask is obtained (as shown by the solid box), and so on.
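The sliding-window subtask cutting just described can be sketched as follows; the window geometry and the per-iteration row sets below are simplified assumptions for illustration:

```python
# Sketch of sliding-window subtask cutting: start the window at the top of the
# new address space; when an iteration first touches a row outside it, close
# the current subtask and slide the window down as far as possible while still
# covering that iteration. Window geometry and the iteration -> rows mapping
# are simplified assumptions.
def cut_subtasks(iteration_rows, window_rows):
    """Return (first_iter, end_iter, window_top) triples; iteration_rows[k]
    lists the address-space rows iteration k accesses."""
    subtasks, start, top = [], 0, 0
    for k, rows in enumerate(iteration_rows):
        assert max(rows) - min(rows) < window_rows, "iteration exceeds window"
        if max(rows) >= top + window_rows:
            subtasks.append((start, k, top))
            top = min(rows)   # slide down as far as the iteration allows
            start = k
    subtasks.append((start, len(iteration_rows), top))
    return subtasks

# Toy kernel: iteration k touches rows k..k+2; with an 8-row window the first
# subtask covers iterations 0..5, then the window slides down to row 6:
rows = [[k, k + 1, k + 2] for k in range(12)]
print(cut_subtasks(rows, 8))   # [(0, 6, 0), (6, 12, 6)]
```

Sliding the window to the smallest row the escaping iteration needs keeps every access of that iteration inside the window while moving down as far as possible, as the text describes.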
  • the algorithm is simple and easy to implement: the core validity check is a mathematical formula, and the overall algorithm is a greedy algorithm, making it easy to understand and implement.
  • the memory partitioning scheme obtained by using the linear conversion method can effectively avoid conflicts.
  • the application's memory requirements can be greatly reduced:
  • the subtask division method based on the sliding window is adopted to divide the application into smaller subtasks. Compared with executing the complete application, the memory resources required to execute the subtasks are greatly reduced.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data storage method, including: obtaining multiple memory banks for operation data; sequentially generating pending memory bank numbers for the multiple memory banks; scanning the storage dimensions of the operation data in a set order to obtain the pending memory bank numbers and, if the pending memory bank numbers of two adjacent dimensions are not consecutive, inserting padding elements to make them consecutive; using a greedy algorithm, taking the set transformation vector that yields the fewest conflicts and the fewest padding elements as the current transformation vector; generating multiple current memory bank numbers for the multiple memory banks according to the current transformation vector; converting each current memory bank number into the physical memory bank address corresponding to that current memory bank number through an offset function and obtaining the corresponding internal offset address; and storing the operation data into the multiple memory banks according to the multiple current memory bank numbers and the multiple internal offset addresses.

Description

Data storage and reading method and system
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to Chinese patent application No. 202011063236.1, filed on September 30, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to the fields of data storage and memory management, and in particular to a data storage method and system.
BACKGROUND
As a promising option for domain-specific accelerators, Coarse-grained Reconfigurable Architectures (CGRAs) have attracted increasing attention because they offer energy efficiency close to that of Application Specific Integrated Circuits (ASICs) together with high software programmability. A CGRA typically consists of a main controller (usually a central processing unit, CPU), a processing element (PE) array, a main memory, and a local memory (generally a multi-bank memory structure), as shown in FIG. 1. The CGRA includes a host controller, an instruction memory (context memory), a data memory, a local data memory (PEA shared memory, SM), a parallel-access data bus, a global register file (PEA global regfile), an output register, a local register file (local regfile), an instruction cache (context buffer), a multi-bank PEA shared memory, a multiplexer (MUX), an arithmetic and logic unit / load and store unit (ALU/LSU), a global register file (global regfile, GR), and the like. The execution flow of the CGRA computing system is as follows: the main controller initializes the CGRA instructions and input data and stores them in the main memory.
Before the CGRA accelerates an application, input data is transferred from the main memory to the local memory, and instructions are loaded into the CGRA's configuration memory. When the CGRA completes the computation, the output data is transferred from the local memory back to the main memory. In a CGRA, for compute-intensive applications, instructions are usually mapped onto different computing units (also called processing elements, PEs) to execute the computation in parallel. To meet the demands of parallel memory access, a CGRA usually adopts a multi-bank memory architecture. In a multi-bank memory, each bank has independent read and write ports, and single-cycle parallel memory access can be achieved by distributing data across different banks.
The high energy efficiency of a CGRA comes from the large amount of computing resources distributed over its computing array, its complex interconnection schemes, and its multi-level storage system. However, achieving better performance and energy efficiency in practice requires these resources to cooperate; if they are not well scheduled and coordinated, a CGRA used as an accelerator may actually drag down system performance. Moreover, because the hardware architecture of a CGRA differs greatly from that of popular general-purpose processors, existing compilation techniques for general-purpose processors cannot be fully ported to CGRAs. Research and development of CGRA compilation technology is therefore essential. Such compilation technology must be able to mine the parallelism in an application and reduce data-read latency, and then produce configuration information tailored to the CGRA's hardware architecture characteristics, so as to achieve high performance and high energy efficiency.
To enable a CGRA to efficiently complete different types of computing tasks, corresponding target programs must be generated for the CGRA's main controller and data path. A CGRA compiler therefore needs to provide the control code that runs in the reconfigurable controller as well as the configuration information for the corresponding data path. Because the hardware structure of a CGRA differs greatly from that of a general-purpose processor, the compilation techniques and flow of its compiler also differ from those of a traditional compiler. The core work of a CGRA compiler is to analyze the application program, divide it into hardware-executed code and software-executed code, and then compile the two parts separately to generate the controller code and the configuration information of the reconfigurable data path. As shown in FIG. 2, the compilation technology of a reconfigurable processor includes task division and scheduling, operation mapping, memory mapping optimization, and so on. The technique proposed herein belongs to memory mapping optimization, which directly affects program performance and has therefore received considerable attention.
The following code shows an example program, assuming A is a 9×9 matrix.
for (int i = 1; i < Row - 1; i++)
    for (int j = 1; j < Col - 1; j++)
        access(A[i-1][j], A[i+1][j], A[i][j],
               A[i][j-1], A[i][j+1]);
The program in the example consists of a two-dimensional nested loop, and each iteration accesses five elements of the array A. To pipeline the loop, the five elements must be accessed simultaneously in each clock cycle. Different strategies are used to obtain the new address space; the numbers in the figures denote the offset addresses of the elements.
First, the original array is copied directly to the target memory, yielding the address space shown in FIG. 3(a). In this scheme a conflict occurs every cycle, because for each iteration A[i-1][j] and A[i][j-1] reside in the same bank (as do A[i+1][j] and A[i][j+1]). For example, in the first iteration the program simultaneously accesses the elements at addresses 1, 9, 10, 11, and 19; since the elements at addresses 1 and 9 reside in the same memory bank, a conflict occurs every cycle, which severely degrades performance.
Next, the linear transformation based (LTB) algorithm generates the address space shown in FIG. 3(b). The conflicts in each iteration are avoided, but because the data are no longer contiguous, extra data preprocessing is required. Moreover, this strategy assumes that the data fit entirely into the target memory, which is unfavorable for subtask splitting.
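To make the conflict in the direct-copy layout concrete, the first iteration can be checked numerically (a minimal sketch; the 8-bank configuration and the `bank`/`conflicts` helper names are illustrative, not part of the patent):

```python
# Direct-copy layout: element (i, j) of the 9x9 array A gets linear
# address i*9 + j and lands in bank (address % 8) with 8 banks.
N_BANKS = 8
COLS = 9

def bank(i, j):
    # Bank of element A[i][j] under the direct-copy mapping.
    return (i * COLS + j) % N_BANKS

def conflicts(i, j):
    # Count same-cycle bank collisions among the five stencil accesses.
    accesses = [(i - 1, j), (i + 1, j), (i, j), (i, j - 1), (i, j + 1)]
    banks = [bank(a, b) for a, b in accesses]
    return len(banks) - len(set(banks))

# First iteration (i=1, j=1): A[0][1] and A[1][0] are addresses 1 and 9,
# both in bank 1, so every iteration suffers bank conflicts.
print(bank(0, 1), bank(1, 0), conflicts(1, 1))
```

Running this shows the collision the text describes: addresses 1 and 9 share bank 1, and addresses 11 and 19 share bank 3, so each iteration loses cycles to serialized bank accesses.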
SUMMARY
This application provides a data storage method that uses a linear transformation to distribute elements appropriately across different memory banks and divides the application into multiple small tasks by subtask splitting, greatly reducing the application's demand for memory resources.
Another object of this application is to provide a data reading method that improves data reading efficiency through subtask splitting.
A further object of this application is to provide a data storage system that reduces memory requirements and saves system cost.
According to this application, a data storage method is provided for a plurality of memory banks in a CGRA coarse-grained reconfigurable chip. The access modes of the plurality of memory banks include an idle mode and an occupied mode. The plurality of memory banks store operation data having a preset number of storage bits.
The data storage method includes:
scanning a plurality of storage dimensions of the operation data according to the preset number of storage bits, matching memory banks in the idle mode, and obtaining a plurality of memory banks for the operation data;
sequentially scanning the plurality of memory banks according to a preset transformation vector, and sequentially generating pending bank numbers for the plurality of memory banks, where the dimension of the preset transformation vector corresponds to the storage dimensions of the operation data;
scanning the storage dimensions of the operation data in a set order to obtain the pending bank numbers, and, if the pending bank numbers of two adjacent dimensions are not contiguous, padding elements so that the pending bank numbers become contiguous;
taking, via a greedy algorithm, the set transformation vector that yields the fewest conflicts and the fewest padding elements as the current transformation vector;
generating a plurality of current bank numbers for the plurality of memory banks according to the current transformation vector;
converting each current bank number, via an offset function, into the physical bank address corresponding to that current bank number and obtaining the corresponding internal offset address; and
storing the operation data into the plurality of memory banks according to the plurality of current bank numbers and the plurality of internal offset addresses.
In some embodiments, sequentially scanning the plurality of memory banks according to the set transformation vector and sequentially generating the pending bank numbers of the plurality of memory banks includes:
setting the set transformation vector α=(α0, α1, ..., α(d-1)), sequentially scanning the plurality of memory banks, and generating the pending bank numbers of the plurality of memory banks according to the following formula, where N is the number of memory banks and the elements of x are the physical address numbers of any one of the memory banks:
B(x)=(α·x)%N.
In some embodiments, scanning the storage dimensions of the operation data in a set order to obtain the pending bank numbers and, if the pending bank numbers of two adjacent dimensions are not contiguous, padding elements so that the pending bank numbers become contiguous includes:
obtaining one or more discontinuity positions among the plurality of pending bank numbers; and
if the dimensions corresponding to the one or more discontinuity positions are adjacent, padding elements at the one or more discontinuity positions so that the pending bank numbers become contiguous.
In some embodiments, taking, via a greedy algorithm, the set transformation vector yielding the fewest conflicts and the fewest padding elements as the current transformation vector includes:
initializing the loop initiation interval AII=0 and Nb=N, where N is the number of memory banks;
obtaining, as intermediate transformation vectors α, a plurality of set transformation vectors that involve no bank-number padding; and
determining whether there exists, among the intermediate transformation vectors α with ∀i: 0≤αi<N, an optimal intermediate transformation vector computed by minimizing the total padding cost function Padding(α) under the conflict-free constraint; if it exists, outputting the intermediate transformation vector α as the current transformation vector; if not, setting AII=AII+1 and Nb=Nb+N and repeating the computation until the current transformation vector is obtained.
In some embodiments, the data storage method further includes:
setting a read window that covers a set number of memory banks; and
sliding the read window along the arrangement order of the memory banks by a set sliding distance, and reading, as the current read data, the memory data of the banks covered by the read window at its initial position and the memory data of the banks covered by the read window after sliding.
According to this application, a data storage system is provided for a plurality of memory banks in a coarse-grained reconfigurable chip. The access modes of the plurality of memory banks include an idle mode and an occupied mode, and the plurality of memory banks store operation data having a preset number of storage bits. The data storage apparatus includes:
a scanner configured to scan a plurality of storage dimensions of the operation data according to the preset number of storage bits, match memory banks in the idle mode, and obtain the current memory bank of each piece of operation data;
a generator configured to sequentially scan the plurality of memory banks according to a set transformation vector and sequentially generate pending bank numbers of the plurality of memory banks, where the set transformation vector corresponds to the storage dimensions of the operation data;
a padder configured to scan the storage dimensions of the operation data in a set order to obtain the pending bank numbers and, if the pending bank numbers of two adjacent dimensions are not contiguous, pad elements so that the pending bank numbers become contiguous;
an obtainer configured to take, via a greedy algorithm, the set transformation vector yielding the fewest conflicts and the fewest padding elements as the current transformation vector, where the generator generates the current bank number of the current memory bank according to the current transformation vector, and the obtainer is further configured to convert the current bank number, via an offset function, into the physical bank address corresponding to that current bank number and obtain the corresponding internal offset address; and
a storage unit configured to store the operation data into the plurality of memory banks according to the current bank numbers and the internal offset addresses.
In some embodiments, the generator is configured to:
set the set transformation vector α=(α0, α1, ..., α(d-1)), sequentially scan the plurality of memory banks, and generate the pending bank numbers of the plurality of memory banks according to the following formula, where N is the number of memory banks and the elements of x are the physical address numbers of any one of the memory banks:
B(x)=(α·x)%N.
In some embodiments, the padder is further configured to obtain one or more discontinuity positions among the plurality of pending bank numbers and, if the dimensions corresponding to the one or more discontinuity positions are adjacent, pad elements at the one or more discontinuity positions so that the pending bank numbers become contiguous.
In some embodiments, the obtainer is further configured to:
initialize the loop initiation interval AII=0 and Nb=N, where N is the number of memory banks;
obtain, as intermediate transformation vectors α, a plurality of set transformation vectors that involve no bank-number padding; and
determine whether there exists, among the intermediate transformation vectors α with ∀i: 0≤αi<N, an optimal intermediate transformation vector computed by minimizing the total padding cost function Padding(α) under the conflict-free constraint; if it exists, output the intermediate transformation vector α as the current transformation vector; if not, set AII=AII+1 and Nb=Nb+N and repeat the computation until the current transformation vector is obtained.
In some embodiments, the data storage apparatus includes:
a reader configured to:
set a read window that covers a set number of memory banks; and
slide the read window along the arrangement order of the memory banks by a set sliding distance, and read, as the current read data, the memory data of the banks covered by the read window at its initial position and the memory data of the banks covered by the read window after sliding.
The characteristics, technical features, advantages, and implementations of the data storage method for multi-bank memory will be further described below in a clear and easily understandable manner with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram illustrating a CGRA system architecture in the prior art.
FIG. 2 is a schematic diagram illustrating the compilation framework of a CGRA in the prior art.
FIG. 3(a) is a schematic diagram illustrating memory partitioning in the prior art.
FIG. 3(b) is another schematic diagram illustrating memory partitioning in the prior art.
FIG. 3(c) is a schematic diagram illustrating memory partitioning in this application.
FIG. 4 is a schematic flowchart illustrating a data storage method for multi-bank memory according to an embodiment of this application.
FIG. 5 is another schematic flowchart illustrating a data storage method for multi-bank memory according to an embodiment of this application.
FIG. 6 is a schematic diagram illustrating the composition of a data storage system for multi-bank memory according to an embodiment of this application.
FIG. 7 is a schematic diagram illustrating an example of padding elements in the data storage method for multi-bank memory according to an embodiment of this application.
FIG. 8 is a schematic diagram illustrating an example of the subtask splitting method in the data storage method for multi-bank memory according to an embodiment of this application.
DETAILED DESCRIPTION
For a clearer understanding of the technical features, objects, and effects of this application, specific embodiments are now described with reference to the accompanying drawings, in which the same reference numerals denote components that are identical in structure, or similar in structure but identical in function.
Herein, "schematic" means "serving as an instance, example, or illustration," and no illustration or embodiment described herein as "schematic" should be construed as a more preferred or more advantageous technical solution. To keep the drawings concise, only the parts related to the exemplary embodiments are schematically shown in each figure; they do not represent the actual structure or true scale of the product.
To avoid the overhead of data preprocessing and reduce memory requirements, the goal of this method is to find a strategy that can generate an approximately contiguous address space while optimizing CGRA computing efficiency. The address space obtained by this method is shown in FIG. 3(c). Some padding elements are regularly added to the original array, which eliminates the conflicts of existing schemes. Since the new address space generated by this application is approximately contiguous, no data preprocessing is needed, which is also favorable for implementing subtask splitting.
In this application, the terms "memory bank" and "storage bank" are used interchangeably.
According to an embodiment of this application, a data storage method for multi-bank memory is provided, where the multi-bank memory is a plurality of memory banks in a CGRA coarse-grained reconfigurable chip. The access modes of a memory bank include an idle mode and an occupied mode. The data storage method for multi-bank memory can store operation data. The operation data has a preset number of storage bits. The operation data is the data to be computed, that is, the data migrated from outside the CGRA into the CGRA.
In some embodiments, FIG. 4 shows a data storage method according to an embodiment of this application for a plurality of memory banks in a coarse-grained reconfigurable chip.
As shown in FIG. 4, the data storage method includes:
Step S101: obtain a plurality of memory banks.
For example, the plurality of storage dimensions of the operation data are scanned according to the preset number of storage bits, memory banks in the idle mode are matched, and a plurality of memory banks for the operation data are obtained. Specifically, for operation data with multiple storage dimensions, a memory bank in the idle mode is matched for the operation data of each storage dimension according to the preset number of storage bits, thereby obtaining the plurality of memory banks corresponding to the operation data.
The multiple storage dimensions of the operation data should be understood as follows: for example, one-dimensional data refers to sequence-like data, two-dimensional data refers to matrix-like data, and three-dimensional data refers to data such as stored color data. Operation data of different dimensions therefore occupy different amounts of storage or memory space and place different requirements on the memory space.
Step S102: generate the pending bank numbers of the plurality of memory banks.
For example, the plurality of memory banks are scanned sequentially according to a preset transformation vector, and the pending bank numbers of the plurality of memory banks are generated in turn, where the dimension of the preset transformation vector corresponds to the storage dimensions of the operation data.
Step S103: generate padding elements.
For example, the pending bank numbers corresponding to the storage dimensions of the operation data are determined in a set order; if the pending bank numbers corresponding to two adjacent storage dimensions are not contiguous, elements are padded between the two pending bank numbers so that the pending bank numbers become contiguous.
Step S104: obtain the current transformation vector.
For example, using a greedy algorithm, the set transformation vector that yields the fewest conflicts and the fewest padding elements is taken as the current transformation vector. The dimension of the current transformation vector is the same as that of the operation data.
Step S105: generate the current bank numbers.
For example, a plurality of current bank numbers of the plurality of memory banks are generated according to the current transformation vector.
Step S106: obtain the internal offset addresses.
For example, each current bank number is converted, via an offset function F(x), into the physical bank address corresponding to that current bank number, and the corresponding internal offset address is obtained.
Step S107: store the operation data into the plurality of memory banks.
For example, the operation data is stored into the plurality of memory banks according to the plurality of current bank numbers and the plurality of internal offset addresses.
In an embodiment of this application, step S102 includes: sequentially scanning the plurality of memory banks according to the preset transformation vector α=(α0, α1, ..., α(d-1)), and generating the pending bank numbers B(x) of the plurality of memory banks according to Formula 1 below, where the symbol "·" denotes the dot product of vectors, N is the number of memory banks, and the elements of x are the physical address numbers of any one of the memory banks:
B(x)=(α·x)%N  (Formula 1).
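Formula 1 can be sketched directly (illustrative helper names; α=(2,1) and N=8 are the example values used later in the FIG. 7 padding example):

```python
def bank_number(alpha, x, n_banks):
    # B(x) = (alpha · x) % N : pending bank number of address vector x.
    assert len(alpha) == len(x)
    dot = sum(a * xi for a, xi in zip(alpha, x))
    return dot % n_banks

# Example: alpha = (2, 1), N = 8 banks.
print(bank_number((2, 1), (0, 0), 8))   # first element of row 0
print(bank_number((2, 1), (1, 0), 8))   # first element of row 1
print(bank_number((2, 1), (0, 63), 8))  # last element of row 0
```

With α=(2,1), consecutive elements of a row walk through the banks one by one, while moving to the next row shifts the starting bank by 2, which is what spreads simultaneous stencil accesses across different banks.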
In an embodiment of this application, as shown in FIG. 5, step S103 further includes the following.
Step S201: obtain discontinuity positions.
For example, one or more discontinuity positions among the plurality of pending bank numbers are obtained.
Step S202: pad elements.
For example, if the storage dimensions corresponding to the one or more discontinuity positions are adjacent, elements are padded at the one or more discontinuity positions so that the pending bank numbers become contiguous.
In an embodiment of this application, as shown in FIG. 5, step S104 includes the following.
Step S301: set the initial loop initiation interval.
For example, the loop initiation interval is initialized as AII=0 and Nb=N, where N is the number of memory banks and Nb is an intermediate variable.
Step S302: obtain intermediate transformation vectors.
For example, a plurality of preset transformation vectors that involve no bank-number padding are obtained as intermediate transformation vectors α.
Step S303: obtain the target intermediate transformation vector.
For example, it is determined whether there exists, among the intermediate transformation vectors α with ∀i: 0≤αi<N, a target intermediate transformation vector obtained by minimizing the total padding cost function Padding(α) under the conflict-free constraint. If the target intermediate transformation vector exists, it is output as the current transformation vector; if not, the loop initiation interval is updated as AII=AII+1 and Nb=Nb+N, and the process returns to step S303 to recompute until the current transformation vector is obtained.
According to an embodiment of this application, a data reading method for multi-bank memory is provided, which includes the data storage method for multi-bank memory described above.
The data reading method for multi-bank memory further includes: setting a read window that covers a set number of memory banks; sliding the read window along the arrangement order of the memory banks by a set sliding distance; and reading, as the current read data, the memory data of the banks covered by the read window at its initial position and the memory data of the banks covered by the read window after sliding.
According to an embodiment of this application, a data storage apparatus for multi-bank memory is provided, where the multi-bank memory is a plurality of memory banks in a CGRA coarse-grained reconfigurable chip. The access modes of a memory bank include an idle mode and an occupied mode. The data storage apparatus for multi-bank memory can store operation data, and the operation data has a preset number of storage bits.
Specifically, the data storage apparatus for multi-bank memory includes a scanner, a generator, a padder, an obtainer, and a storage unit. The generator may include a first generator and a second generator, and the obtainer may include a first obtainer and a second obtainer.
As shown in FIG. 6, the apparatus may include a scanner 101, a first generator 201, a padder 301, a first obtainer 401, a second generator 501, a second obtainer 601, and a storage unit 701.
The scanner 101 is configured to scan the plurality of storage dimensions of the operation data according to the preset number of storage bits, match memory banks in the idle mode, and obtain a plurality of memory banks for the operation data. Specifically, the scanner 101 is configured to, for operation data with multiple storage dimensions, match a memory bank in the idle mode for the operation data of each storage dimension according to the preset number of storage bits, thereby obtaining the plurality of memory banks corresponding to the operation data.
The first generator 201 is configured to sequentially scan the plurality of memory banks according to a preset transformation vector and sequentially generate the pending bank numbers of the plurality of memory banks, where the dimension of the preset transformation vector corresponds to the storage dimensions of the operation data.
The padder 301 is configured to determine, in a set order, the pending bank numbers corresponding to the storage dimensions of the operation data and, if the pending bank numbers corresponding to two adjacent storage dimensions are not contiguous, pad elements between the two pending bank numbers so that the pending bank numbers become contiguous.
The first obtainer 401 is configured to take, via a greedy algorithm, the set transformation vector yielding the fewest conflicts and the fewest padding elements as the current transformation vector.
The second generator 501 is configured to generate a plurality of current bank numbers of the plurality of memory banks according to the current transformation vector.
The second obtainer 601 is configured to convert each current bank number, via an offset function F(x), into the physical bank address corresponding to that current bank number and obtain the corresponding internal offset address.
The storage unit 701 is configured to store the operation data into the plurality of memory banks according to the plurality of current bank numbers and the plurality of internal offset addresses.
In an embodiment of this application, the first generator 201 is further configured to:
sequentially scan the plurality of memory banks according to the preset transformation vector α=(α0, α1, ..., α(d-1)) and generate the pending bank numbers B(x) of the plurality of memory banks according to the following formula, where N is the number of memory banks and the elements of x are the physical address numbers of any one of the memory banks:
B(x)=(α·x)%N.
In an embodiment of this application, the padder 301 is further configured to obtain one or more discontinuity positions among the plurality of pending bank numbers and, if the storage dimensions corresponding to the one or more discontinuity positions are adjacent, pad elements at the one or more discontinuity positions so that the pending bank numbers become contiguous.
In an embodiment of this application, the first obtainer 401 is further configured to initialize the loop initiation interval AII=0 and Nb=N, where N is the number of memory banks; obtain, as intermediate transformation vectors α, a plurality of preset transformation vectors involving no bank-number padding; and determine whether there exists, among the intermediate transformation vectors α with ∀i: 0≤αi<N, a target intermediate transformation vector obtained by minimizing the total padding cost function Padding(α) under the conflict-free constraint. If the target intermediate transformation vector exists, it is output as the current transformation vector; if not, AII=AII+1 and Nb=Nb+N are set, and the first obtainer 401 repeats the process until the current transformation vector is obtained.
In an embodiment of this application, as shown in FIG. 6, the apparatus further includes a reader 801.
The reader 801 is configured to set a read window that covers a set number of memory banks, slide the read window along the arrangement order of the memory banks by a set sliding distance, and read, as the current read data, the memory data of the banks covered by the read window at its initial position and the memory data of the banks covered by the read window after sliding.
According to an embodiment of this application, the multi-bank memory partitioning process for a CGRA includes address mapping and subtask generation. The address mapping process maps the original array to a new address space (i.e., the target memory address space) by constructing a bank function B(x) and an internal offset function F(x). The subtask generation process divides the kernel into different subtasks, making it feasible to perform the computation on the CGRA when memory is insufficient.
A linear transformation based method is used for address mapping. To map a d-dimensional address x=(x0, x1, ..., x(d-1)) to a new location in a physical memory with N banks, a transformation vector α=(α0, α1, ..., α(d-1)) must be determined, and the bank number is then computed as B(x)=(α·x)%N.
According to the method provided by this application, the scan order is first determined, which decides the data arrangement in the new address space. Then, to determine the storage function, a transformation vector α that minimizes the padding cost is searched for. After α is obtained, combined with the sliding-window subtask splitting method, an overall memory partitioning scheme is obtained.
In one example, the method proposed in this application specifically includes the following.
1. Scan order
The method in this application adopts a memory padding method designed for the linear-transformation-based address mapping scheme to determine the internal offset function F(x). To generate new addresses, the memory padding method scans each dimension in a set order to obtain the pending bank numbers. If the pending bank numbers of two adjacent storage dimensions are not contiguous, the corresponding padding elements are inserted so that the computation of the new addresses follows a regular pattern.
As shown in FIG. 7, when a storage row contains 8 storage elements, B(x)=(2*x0+x1)%8 (i.e., B(x)=(2*x0+1*x1)%8) gives a memory padding example, where α=(2,1) and x=(x0,x1). According to the above formula, the bank number of the last element of the current row (the first row) is 7 (i.e., 7=(2*0+63)%8, with α=(2,1), x=(0,63)), and the bank number of the first element of the next row (the second row) is 2 (i.e., 2=(2*1+0)%8, with α=(2,1), x=(1,0)). Therefore, two padding elements need to be added (e.g., padding elements are added in the banks numbered 0 and 1) to generate the new address space.
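The two padding elements of FIG. 7 can be reproduced numerically (a sketch; the padding count is computed as the circular distance between the bank following a row's last element and the bank of the next row's first element, which matches the figure but is an inferred rule):

```python
N = 8            # number of banks
ALPHA = (2, 1)   # transformation vector of the FIG. 7 example

def B(x):
    # Bank mapping function B(x) = (alpha · x) % N.
    return (ALPHA[0] * x[0] + ALPHA[1] * x[1]) % N

last_bank = B((0, 63))  # last element of the first row  -> bank 7
next_bank = B((1, 0))   # first element of the second row -> bank 2

# The bank after 7 is (7 + 1) % 8 = 0; banks 0 and 1 must be filled with
# padding elements so that the second row starts regularly at bank 2.
padding = (next_bank - (last_bank + 1)) % N
print(last_bank, next_bank, padding)
```

The printed padding count of 2 corresponds to the two padding elements inserted between the rows in the figure.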
With this internal offset generation method, for the same transformation vector α there are at most d! possible address arrangements, corresponding to the different scan orders (e.g., reverse order).
As mentioned above, to reduce the total amount of data transferred and improve transfer efficiency, the data in the new address space should be placed contiguously. The scan order of the algorithm is sequential, the same as the original storage order. Once the scan order is determined, the form of the internal offset function F(x) is determined, and so is the data arrangement. The geometric meaning of the strategy is to avoid memory conflicts by regularly adding padding elements to the original array. The specific internal offset computation is given later.
2. Bank mapping function
According to this application, a transformation vector α can be found that minimizes the loop initiation interval (AII) and the padding-element cost. To reduce the AII, a conflict-free constraint is considered. Based on this constraint, candidate linear transformation vectors satisfying the requirement can be searched for from the generated bank-number sequences. Given m distinct memory accesses x(0), x(1), ..., x(m-1), a conflict-free bank mapping function should satisfy:
∀ i, j with 0 ≤ i < j < m:
B(x(i)) != B(x(j)).
After a set of candidate linear transformation vectors satisfying the conflict-free condition is obtained, the transformation vector with the smallest padding cost must be found. Suppose α(f) is the transformation vector that applies no padding to the original array and thus incurs no padding cost. For storage dimension i, each increment of α over α(f) in that dimension means adding one padding element between storage dimensions i and (i+1). The concept of a padding vector q=(q0, q1, ..., q(d-1)) is introduced, where qi denotes the size increase of storage dimension i (the i-th dimension): qi=(αi-α(f)i)%N.
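The padding vector and its cost can be sketched as follows (α(f)=(0,1), the 9×9 widths, and N=8 are assumed example values for illustration, not fixed by the method):

```python
def padding_vector(alpha, alpha_f, n_banks):
    # q_i = (alpha_i - alpha_f_i) % N : extra elements inserted between
    # storage dimensions i and i+1.
    return tuple((a - af) % n_banks for a, af in zip(alpha, alpha_f))

def padding_cost(q, widths):
    # Total padding cost as the inner product of the padding vector with
    # the storage-dimension widths of the operation data.
    return sum(qi * wi for qi, wi in zip(q, widths))

# Example values: alpha = (2, 1), assumed no-padding vector alpha_f = (0, 1).
q = padding_vector((2, 1), (0, 1), 8)
print(q, padding_cost(q, (9, 9)))
```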
After the padding vector is introduced, the total padding cost is obtained as the inner product of the padding vector and the storage-dimension widths of the operation data. Finally, if a conflict-free solution over K×N memory banks is ultimately obtained, then clearly all data elements can be obtained without conflict by accessing the N memory banks at most K times. For example, for a given number of banks N, the framework of the algorithm is described as follows:
1. Initialize the loop initiation interval AII=0 and Nb=N.
2. Obtain the transformation vector α(f) that applies no padding to the bank numbers.
3. Obtain all possible transformation vectors α with ∀i: 0≤αi<N, and search for the target transformation vector that minimizes the total padding cost function Padding(α) under the conflict-free constraint.
4. If no target transformation vector is found, set AII=AII+1 and Nb=Nb+N, obtain all possible transformation vectors α again, and search for the target transformation vector that minimizes the total padding cost function Padding(α) under the conflict-free constraint.
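The search at the core of these steps can be sketched as a brute-force enumeration over candidate vectors (a simplified illustration: conflict-freedom is checked on the relative access offsets of the example stencil loop, and α(f)=(0,1) with 9×9 widths are assumed example values):

```python
from itertools import product

def conflict_free(alpha, n_banks, accesses):
    # A candidate alpha is conflict-free if the m simultaneous accesses of
    # one iteration all land in distinct banks: B(x(i)) != B(x(j)).
    banks = [sum(a * xi for a, xi in zip(alpha, x)) % n_banks
             for x in accesses]
    return len(set(banks)) == len(banks)

def padding_cost(alpha, alpha_f, n_banks, widths):
    # Padding(alpha): inner product of padding vector q with the widths.
    q = [(a - af) % n_banks for a, af in zip(alpha, alpha_f)]
    return sum(qi * wi for qi, wi in zip(q, widths))

def search_alpha(n_banks, accesses, alpha_f, widths, d=2):
    # Enumerate all alpha with 0 <= alpha_i < N and keep the conflict-free
    # candidate with the smallest total padding cost.
    best, best_cost = None, None
    for alpha in product(range(n_banks), repeat=d):
        if not conflict_free(alpha, n_banks, accesses):
            continue
        cost = padding_cost(alpha, alpha_f, n_banks, widths)
        if best_cost is None or cost < best_cost:
            best, best_cost = alpha, cost
    return best, best_cost

# Five accesses of the example stencil loop, relative to iteration (i, j):
STENCIL = [(-1, 0), (1, 0), (0, 0), (0, -1), (0, 1)]
alpha, cost = search_alpha(8, STENCIL, (0, 1), (9, 9))
print(alpha, cost)
```

If no conflict-free vector is found, steps 1 and 4 grow the search by incrementing AII and enlarging Nb by N, which the sketch omits for brevity.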
3. Internal offset computation
After the bank function B(x) is generated, an offset function F(x) is needed to convert an address x into the corresponding internal offset within its bank. Assume the scan order is the same as the original arrangement order and that the memory uses a linear address space. In this case, the d-dimensional data must be linearized before being stored in the storage system. Suppose W=(w0, w1, ..., w(d-1), 1), where wi denotes the width of the array in dimension i. Here, width means the number of data elements along a dimension; for example, for a 4×8 matrix, the width of dimension 0 is 4 and the width of dimension 1 is 8. Without loss of generality, the vector coordinates can be linearized into a scalar L(x), as shown in Formula 2:
L(x) = Σ_{i=0}^{d-1} x_i · Π_{k=i+1}^{d-1} w_k  (Formula 2).
Consider the linearization of the address space after padding elements are added. After padding, the width of each storage dimension changes, and the padding size Pi in the i-th dimension can be determined by Formula 3:
P_i = q_i = (α_i - α(f)_i) % N  (Formula 3).
Therefore, the address after the blank elements are padded can be determined by Formula 4:
L'(x) = Σ_{i=0}^{d-1} x_i · Π_{k=i+1}^{d-1} (w_k + P_k)  (Formula 4).
The internal offset function can then be determined as:
F(x) = ⌊L'(x) / N⌋.
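Under a row-major reading of Formulas 2 to 4 (each width enlarged by its padding size before linearizing; this interpretation and the example values are assumptions of the sketch, not the patent's verbatim formulas), the internal offset can be sketched as:

```python
def linearize(x, widths):
    # L(x) = sum_i x_i * prod_{k > i} widths[k]  (row-major linearization).
    strides, acc = [], 1
    for w in reversed(widths):
        strides.append(acc)
        acc *= w
    strides.reverse()
    return sum(xi * s for xi, s in zip(x, strides))

def internal_offset(x, widths, padding, n_banks):
    # F(x) = floor(L'(x) / N), where L'(x) uses the padded widths w_k + P_k.
    padded = [w + p for w, p in zip(widths, padding)]
    return linearize(x, padded) // n_banks

# Example: a 9x9 array, assumed padding P = (0, 2), and N = 8 banks.
print(internal_offset((1, 0), (9, 9), (0, 2), 8))
```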
4. Subtask splitting
An intuitive way to partition a kernel is to let each subtask contain as many iteration vectors as possible until the data size exceeds the memory size. However, this has several problems. First, there may be overlapping data between subtasks, and transferring that data repeatedly may cause extra overhead. Second, addresses need to be recomputed, which brings extra computation overhead. Finally, splitting the original array arbitrarily may introduce extra direct memory access (DMA) initialization overhead. Therefore, a sliding-window-based method is proposed.
FIG. 8 illustrates the sliding-window-based method according to an embodiment of this application. For example, the target memory consists of 8 banks with a maximum depth of 8.
According to the method proposed by this application, an 8×8 sliding window (shown as the dashed box) is first initialized at the beginning of the new address space. The first subtask consists of the window and the corresponding iteration vectors. The first iteration that accesses an element outside the sliding window is determined, and then the window is moved down as far as possible while ensuring that it contains all elements accessed by that iteration. In this way the second subtask is obtained (shown as the solid box), and so on.
Note that there is an overlapping region between two adjacent subtasks. When subtasks are switched, this part of the data does not need to be transferred; only the non-overlapping data needs to be transferred. After all subtasks are executed, the result should be consistent with that of the original task.
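The sliding-window splitting can be sketched with a one-dimensional model over bank rows (a simplified illustration; the window size, the stencil access pattern, and the helper names are assumptions):

```python
def split_subtasks(window_rows, rows_of_iteration):
    # rows_of_iteration(t) -> set of rows accessed by iteration t, or None
    # when all iterations are consumed. Returns [start, end) row windows.
    windows, start, t = [], 0, 0
    while True:
        end = start + window_rows          # window covers rows [start, end)
        while True:
            rows = rows_of_iteration(t)
            if rows is None:               # all iterations consumed
                windows.append((start, end))
                return windows
            if max(rows) >= end:           # first iteration escaping window
                break
            t += 1
        windows.append((start, end))
        # Slide down as far as possible while still covering every row of
        # the escaping iteration; the overlap with the previous window is
        # the data that need not be transferred again.
        start = min(rows)

def stencil(t):
    # Example access pattern: iteration t touches rows t, t+1, t+2.
    return {t, t + 1, t + 2} if t <= 13 else None

print(split_subtasks(8, stencil))
```

For this example the windows are (0, 8), (6, 14), and (12, 20): adjacent windows overlap by two rows, mirroring the overlapping region between subtasks in FIG. 8.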
The method and apparatus according to the embodiments of this application can provide the following beneficial effects:
The algorithm is simple and easy to implement: the core validity-checking part consists of mathematical formulas, and the overall algorithm is a greedy algorithm; it is easy to understand and implement.
Memory access conflicts are effectively reduced: the memory partitioning scheme obtained with the linear transformation method can effectively avoid conflicts.
No data preprocessing overhead: by restricting the scan order, an approximately contiguous address space is obtained; the resulting data has good locality, and no extra preprocessing step is needed.
The application's memory requirement can be greatly reduced: by adopting the sliding-window-based subtask partitioning, the application is divided into smaller subtasks; compared with executing the complete application, the memory resources required to execute the subtasks are greatly reduced.
It should be understood that, although this specification is described in terms of various embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may also be appropriately combined to form other embodiments understandable to those skilled in the art.
The detailed descriptions listed above are merely specific illustrations of feasible embodiments of this application and are not intended to limit the scope of protection of this application; all equivalent embodiments or modifications made without departing from the technical spirit of this application shall fall within the scope of protection of this application.

Claims (10)

  1. A data storage method, characterized in that it is used for a plurality of memory banks in a coarse-grained reconfigurable chip, the access modes of the plurality of memory banks include an idle mode and an occupied mode, and the plurality of memory banks store operation data having a preset number of storage bits,
    the data storage method comprising:
    scanning a plurality of storage dimensions of the operation data according to the preset number of storage bits, matching memory banks in the idle mode, and obtaining a plurality of memory banks for the operation data;
    sequentially scanning the plurality of memory banks according to a preset transformation vector, and sequentially generating pending bank numbers of the plurality of memory banks, wherein the dimension of the set transformation vector corresponds to the storage dimensions of the operation data;
    scanning the storage dimensions of the operation data in a set order to obtain the pending bank numbers, and, if the pending bank numbers of two adjacent dimensions are not contiguous, padding elements so that the pending bank numbers become contiguous;
    taking, via a greedy algorithm, the set transformation vector yielding the fewest conflicts and the fewest padding elements as the current transformation vector;
    generating a plurality of current bank numbers of the plurality of memory banks according to the current transformation vector;
    converting each current bank number, via an offset function, into the physical bank address corresponding to the current bank number and obtaining the corresponding internal offset address; and
    storing the operation data into the plurality of memory banks according to the plurality of current bank numbers and the plurality of internal offset addresses.
  2. The data storage method according to claim 1, characterized in that sequentially scanning the plurality of memory banks according to the set transformation vector and sequentially generating the pending bank numbers of the plurality of memory banks comprises:
    setting the set transformation vector α=(α0, α1, ..., α(d-1)), sequentially scanning the plurality of memory banks, and generating the pending bank numbers of the plurality of memory banks according to the following formula, wherein N is the number of memory banks and the elements of x are the physical address numbers of any one of the memory banks:
    B(x)=(α·x)%N.
  3. The data storage method according to claim 1, characterized in that scanning the storage dimensions of the operation data in a set order to obtain the pending bank numbers and, if the pending bank numbers of two adjacent dimensions are not contiguous, padding elements so that the pending bank numbers become contiguous comprises:
    obtaining one or more discontinuity positions among the plurality of pending bank numbers; and
    if the dimensions corresponding to the one or more discontinuity positions are adjacent, padding elements at the one or more discontinuity positions so that the pending bank numbers become contiguous.
  4. The data storage method according to claim 1, characterized in that taking, via a greedy algorithm, the set transformation vector yielding the fewest conflicts and the fewest padding elements as the current transformation vector comprises:
    initializing the loop initiation interval AII=0 and Nb=N, wherein N is the number of memory banks;
    obtaining, as intermediate transformation vectors α, a plurality of set transformation vectors involving no bank-number padding; and
    determining whether there exists, among the intermediate transformation vectors α with
    ∀i: 0≤αi<N,
    an optimal intermediate transformation vector computed by minimizing the total padding cost function Padding(α) under the conflict-free constraint; if it exists, outputting the intermediate transformation vector α as the current transformation vector; if not, setting AII=AII+1 and Nb=Nb+N and repeating the computation until the current transformation vector is obtained.
  5. The data storage method according to claim 1, characterized by further comprising:
    setting a read window that covers a set number of memory banks; and
    sliding the read window along the arrangement order of the memory banks by a set sliding distance, and reading, as the current read data, the memory data of the banks covered by the read window at its initial position and the memory data of the banks covered by the read window after sliding.
  6. A data storage apparatus, characterized in that it is used for a plurality of memory banks in a coarse-grained reconfigurable chip, the access modes of the plurality of memory banks include an idle mode and an occupied mode, and the plurality of memory banks store operation data having a preset number of storage bits;
    the data storage apparatus comprising:
    a scanner configured to scan a plurality of storage dimensions of the operation data according to the preset number of storage bits, match memory banks in the idle mode, and obtain the current memory bank of each piece of operation data;
    a generator configured to sequentially scan the plurality of memory banks according to a set transformation vector and sequentially generate pending bank numbers of the plurality of memory banks, wherein the set transformation vector corresponds to the storage dimensions of the operation data;
    a padder configured to scan the storage dimensions of the operation data in a set order to obtain the pending bank numbers and, if the pending bank numbers of two adjacent dimensions are not contiguous, pad elements so that the pending bank numbers become contiguous;
    an obtainer configured to take, via a greedy algorithm, the set transformation vector yielding the fewest conflicts and the fewest padding elements as the current transformation vector, wherein the generator generates the current bank number of the current memory bank according to the current transformation vector, and the obtainer is further configured to convert the current bank number, via an offset function, into the physical bank address corresponding to the current bank number and obtain the corresponding internal offset address; and
    a storage unit configured to store the operation data into the plurality of memory banks according to the current bank numbers and the internal offset addresses.
  7. The data storage apparatus according to claim 6, characterized in that the generator is configured to:
    set the set transformation vector α=(α0, α1, ..., α(d-1)), sequentially scan the plurality of memory banks, and generate the pending bank numbers of the plurality of memory banks according to the following formula, wherein N is the number of memory banks and the elements of x are the physical address numbers of any one of the memory banks:
    B(x)=(α·x)%N.
  8. The data storage apparatus according to claim 6, characterized in that the padder is further configured to obtain one or more discontinuity positions among the plurality of pending bank numbers and, if the dimensions corresponding to the one or more discontinuity positions are adjacent, pad elements at the one or more discontinuity positions so that the pending bank numbers become contiguous.
  9. The data storage apparatus according to claim 6, characterized in that the obtainer is further configured to:
    initialize the loop initiation interval AII=0 and Nb=N, wherein N is the number of memory banks;
    obtain, as intermediate transformation vectors α, a plurality of set transformation vectors involving no bank-number padding; and
    determine whether there exists, among the intermediate transformation vectors α with
    ∀i: 0≤αi<N,
    an optimal intermediate transformation vector computed by minimizing the total padding cost function Padding(α) under the conflict-free constraint; if it exists, output the intermediate transformation vector α as the current transformation vector; if not, set AII=AII+1 and Nb=Nb+N and repeat the computation until the current transformation vector is obtained.
  10. The data storage apparatus according to claim 6, characterized by further comprising:
    a reader configured to:
    set a read window that covers a set number of memory banks; and
    slide the read window along the arrangement order of the memory banks by a set sliding distance, and read, as the current read data, the memory data of the banks covered by the read window at its initial position and the memory data of the banks covered by the read window after sliding.
PCT/CN2021/092516 2020-09-30 2021-05-08 Data storage and reading method and system WO2022068205A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/484,626 US11740832B2 (en) 2020-09-30 2021-09-24 Data storage and reading method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011063236.1 2020-09-30
CN202011063236.1A CN111930319B (zh) 2020-09-30 2020-09-30 Data storage and reading method and system for multi-bank memory

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/484,626 Continuation US11740832B2 (en) 2020-09-30 2021-09-24 Data storage and reading method and device

Publications (1)

Publication Number Publication Date
WO2022068205A1 true WO2022068205A1 (zh) 2022-04-07

Family

ID=73334836

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/092516 WO2022068205A1 (zh) 2020-09-30 2021-05-08 Data storage and reading method and system

Country Status (2)

Country Link
CN (1) CN111930319B (zh)
WO (1) WO2022068205A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930319B (zh) * 2020-09-30 2021-09-03 北京清微智能科技有限公司 Data storage and reading method and system for multi-bank memory
CN114328298B (zh) * 2022-03-14 2022-06-21 南京芯驰半导体科技有限公司 On-chip memory address mapping system and method for vector access

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927270A (zh) * 2014-02-24 2014-07-16 东南大学 Shared data cache device and control method for multiple coarse-grained dynamically reconfigurable arrays
CN105630736A (zh) * 2015-12-29 2016-06-01 东南大学—无锡集成电路技术研究所 Data macroblock predictive access device and method for a reconfigurable system storage system
US20190227838A1 (en) * 2018-01-24 2019-07-25 Alibaba Group Holding Limited System and method for batch accessing
CN111930319A (zh) * 2020-09-30 2020-11-13 北京清微智能科技有限公司 Data storage and reading method and system for multi-bank memory

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105487838B (zh) * 2015-11-23 2018-01-26 上海交通大学 Task-level parallel scheduling method and system for a dynamically reconfigurable processor


Also Published As

Publication number Publication date
CN111930319A (zh) 2020-11-13
CN111930319B (zh) 2021-09-03

Similar Documents

Publication Publication Date Title
US20230251861A1 (en) Accelerating linear algebra kernels for any processor architecture
Baskaran et al. Optimizing sparse matrix-vector multiplication on GPUs
Xygkis et al. Efficient winograd-based convolution kernel implementation on edge devices
US9477465B2 (en) Arithmetic processing apparatus, control method of arithmetic processing apparatus, and a computer-readable storage medium storing a control program for controlling an arithmetic processing apparatus
US11500621B2 (en) Methods and apparatus for data transfer optimization
Bošnački et al. Parallel probabilistic model checking on general purpose graphics processors
WO2022068205A1 (zh) 数据存储和读取方法及系统
Huang et al. Implementing Strassen's algorithm with CUTLASS on NVIDIA Volta GPUs
Huang et al. Strassen’s algorithm reloaded on GPUs
Yin et al. Conflict-free loop mapping for coarse-grained reconfigurable architecture with multi-bank memory
Man et al. The approximate string matching on the hierarchical memory machine, with performance evaluation
WO2016024508A1 (ja) マルチプロセッサ装置
Ben Abdelhamid et al. A block-based systolic array on an HBM2 FPGA for DNA sequence alignment
Liu et al. Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA
Ishii et al. Fast modular arithmetic on the Kalray MPPA-256 processor for an energy-efficient implementation of ECM
Rucci et al. Smith-Waterman protein search with OpenCL on an FPGA
Li et al. Automatic FFT performance tuning on OpenCL GPUs
Kyo et al. An integrated memory array processor for embedded image recognition systems
EP3108358B1 (en) Execution engine for executing single assignment programs with affine dependencies
Bistaffa et al. Optimising memory management for belief propagation in junction trees using GPGPUs
US11740832B2 (en) Data storage and reading method and device
Li et al. Combining memory partitioning and subtask generation for parallel data access on cgras
Shen et al. Memory partition for simd in streaming dataflow architectures
Nakano A time optimal parallel algorithm for the dynamic programming on the hierarchical memory machine
Kessler et al. Optimized mapping of pipelined task graphs on the Cell BE

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21873868

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21873868

Country of ref document: EP

Kind code of ref document: A1