CN114218136A - Area-friendly storage address mapping method for systolic arrays - Google Patents

Area-friendly storage address mapping method for systolic arrays

Info

Publication number
CN114218136A
Authority
CN
China
Prior art keywords
cache
data
bank
offset
systolic array
Prior art date
Legal status
Pending
Application number
CN202111552673.4A
Other languages
Chinese (zh)
Inventor
文梅
杨韧禹
沈俊忠
曹亚松
张洋
刘胜
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202111552673.4A
Publication of CN114218136A
Pending legal-status Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10 - Address translation
    • G06F12/1009 - Address translation using page tables, e.g. page table structures
    • G06F12/1027 - Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
    • G06F12/1045 - Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB] associated with a data cache

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a systolic-array-oriented area-friendly memory address mapping method, which comprises: dividing the data in off-chip storage that is to be written into a single-port-SP C cache into data blocks; and traversing that data row by row, mapping each data block of the current row into the C cache to obtain the bank offset cj_i_0 and intra-bank offset cj_i_1 in the C cache of each column data block cj_i of the current row Cj. Exploiting the strengths and weaknesses of single-port SP memory, the invention realizes efficient address translation based on single-port SP banks and, without affecting the intermediate-result access behavior of the systolic accelerator at run time, changes the address mapping used when data written from off-chip enters the intermediate-result store, thereby reducing the area of the memory banks.

Description

Area-friendly storage address mapping method for systolic arrays
Technical Field
The invention relates to data storage technology, and in particular to a systolic-array-oriented area-friendly memory address mapping method.
Background
Since the advent of deep learning, people's lives have changed dramatically, and the idea of building special-purpose chips for neural networks has drawn wide attention. By 2013, many products and services provided by Google, such as Google image search, Google Photos, the Google Cloud Vision API, and Google Translate, all required deep neural networks. At that enormous application scale, Google realized internally that the computational demand on its millions of servers running around the clock was growing rapidly, and that the number of data centers would have to double to satisfy it. Whether from a cost or a computational standpoint, the internal data centers could no longer be sustained with GPUs and CPUs alone, and this drove the emergence of specialized deep learning accelerators.
On-chip storage is one of the core components of many domain-specific architectures (e.g., deep learning accelerators). In general, the area overhead of on-chip memory accounts for more than 25% of the total chip area. As the complexity of deep learning algorithms continues to increase, integrating ever larger on-chip memories to meet algorithm requirements becomes a challenge for accelerators. To reduce the overhead of off-chip memory access and thereby increase computational efficiency, state-of-the-art deep learning accelerators integrate a large amount of on-chip memory to hold inter-layer features or weights.
The systolic array accelerator, the most widely used accelerator in the deep learning field, is structured as shown in fig. 2. The systolic array (SA) is a square with side length Pe, containing Pe × Pe processing elements, each with a corresponding buffer to store weights, partial sums, and activation data. Matrix A represents the input image matrix and matrix B is the weight matrix. The data of the A and B matrices are read from on-chip buffers; A and B have two buffers, each of size 1 MB. Meanwhile, the C matrix stores the initial bias and the intermediate results of the computation. Because of the size limits of the systolic array and of the network parameters, a large number of intermediate results are generated during computation for use in subsequent computations. By placing two buffers on chip to store the C matrix, data transfer between off-chip and on-chip is reduced.
During computation, once the data in the SA caches is ready, the A-matrix data is streamed into the SA from left to right while the C-matrix data is streamed in from top to bottom. Note that both the A matrix and the C matrix change from rectangles into parallelograms: in a systolic array, the time taken by the computation differs from the time needed to transfer data. This skewed input pattern suits the systolic array: it ensures that the vertical propagation of partial sums and the horizontal propagation of the original data happen simultaneously during computation, which keeps the systolic array running efficiently and the calculation results correct.
Generally, memory IP cores fall into three types: single port (SP), dual port (DP), and pseudo-dual port (TP); their main component is RAM. A single-port RAM corresponds to the SP IP core: it has only one set of control signal lines, address lines, and data lines, cannot read and write at the same time, and at any moment can serve only data input or data output, as selected by the control signal. A dual-port RAM corresponds to the DP IP core: it has two independent sets of control signal lines, address lines, and data lines that do not interfere with each other, allowing two independent systems to access it randomly at the same time; that is, as a shared multi-port memory it can read and write simultaneously (both ports support read and write operations, with arbitration when both access the same address at once). A pseudo-dual-port RAM corresponds to the TP IP core: one port is read-only and the other write-only. In practical applications, all bank IPs in a systolic array accelerator that must read and write simultaneously are built from TP or DP.
A memory is formed by tiling several identical IP cores, each of which is called a bank. Although SP memory has the lowest area overhead, it can support only one read or write access at a time, whereas both pseudo-dual-port TP and DP based memories can serve a read and a write simultaneously (for TP-based memories the read and write addresses must differ). Changing the size (depth) of the banks changes the space consumed by the C cache differently. To compute the C cache space overhead, a memory compiler was used to generate the area occupied by single-port SP and pseudo-dual-port TP banks at different depths. As shown in Table 1, the space occupied by single-port SP relative to pseudo-dual-port TP improves by different amounts at different depths.
Table 1 shows the area occupied by SP and TP banks of different depths, as generated by a memory compiler.
Depth       SP     TP     Ratio
64W144B     1359   1654   17.8%
128W144B    1743   2437   28.5%
256W144B    2673   3952   32.4%
448W144B    3607   5764   37.4%
512W144B    4528   6559   31.0%
In Table 1, ratio = (TP - SP)/TP × 100%. As the table shows, the space saved by single-port SP relative to pseudo-dual-port TP varies with depth and is not linear. The main reason is that, in the hardware design, the greater the bank depth, the more complex the address lines required by pseudo-dual-port TP storage, which keeps widening the area gap between pseudo-dual-port TP storage and single-port SP storage. Once the depth reaches a certain value, however, the ratio drops even though single-port SP still saves more absolute area than pseudo-dual-port TP. Selecting an appropriate depth for the best result is therefore also a factor to consider.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: exploiting the strengths and weaknesses of single-port SP memory, the invention realizes efficient address translation based on single-port SP banks and, without affecting the intermediate-result access behavior of the systolic accelerator at run time, changes the address mapping used when data written from off-chip enters the intermediate-result store, thereby reducing the area of the memory banks.
In order to solve the technical problems, the invention adopts the technical scheme that:
a systolic array-oriented area-friendly memory address mapping method comprises the following steps:
1) dividing data to be written into a C cache of a single-port SP in off-chip storage into data blocks;
2) for the first row of the data in off-chip storage to be written into the single-port-SP C cache, map the data block corresponding to the first-row head address a0 into the C cache: the bank offset c0_0_0 and the intra-bank offset c0_0_1 of the first data block c0_0 of the first row C0 in the C cache are both 0; then, from the bank offset c0_0_0 and intra-bank offset c0_0_1 of the first data block c0_0 of the first row, determine the bank offset c0_i_0 and intra-bank offset c0_i_1 in the C cache of each subsequent column data block c0_i of that row, and jump to the next step;
3) judge whether all the data in off-chip storage to be written into the single-port-SP C cache has been processed; if so, end and exit; if not, add the number of data columns n to the head address a(j-1) of the previous row to obtain the head address aj of the current row, and jump to the next step;
4) map each data block of the current row into the C cache to obtain the bank offset cj_i_0 and intra-bank offset cj_i_1 in the C cache of each column data block cj_i of the current row Cj; jump to step 3).
Optionally, when the data in off-chip storage is divided into data blocks in step 1), every 16 data elements form one data block, so that the n data columns of each row are divided into ⌈n/16⌉ data blocks.
Optionally, in step 2) the bank offset c0_0_0 and the intra-bank offset c0_0_1 of the first data block c0_0 of the first row C0 in the C cache are both determined to be 0.
Optionally, the functional expressions used in step 2) to determine, from the bank offset c0_0_0 and intra-bank offset c0_0_1 of the first data block c0_0 of the first row, the bank offset c0_i_0 and intra-bank offset c0_i_1 in the C cache of each subsequent column data block c0_i of that row are:
c0_i_0 = c0_0_0 + i % (c/z),
c0_i_1 = c0_0_1 + i / (c/z),
where i is the column index of the data block, c is the total depth of the C cache, z is the depth of each bank in the C cache, and the division is integer division.
Optionally, the functional expressions used in step 4) to obtain the bank offset cj_i_0 and intra-bank offset cj_i_1 in the C cache of each column data block cj_i of the current row Cj are:
cj_i_0 = (c(j-1)_i_0 + cnt) % (c/z),
cj_i_1 = c(j-1)_i_1 + (c(j-1)_i_0 + cnt) / (c/z),
where c(j-1)_i_0 is the bank offset in the C cache of the i-th column data block c(j-1)_i of the previous row, cnt is the number of data blocks in one row, c is the total depth of the C cache, z is the depth of each bank in the C cache, and c(j-1)_i_1 is the intra-bank offset in the C cache of the i-th column data block c(j-1)_i of the previous row.
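To make the two recurrences above concrete, the following Python sketch models the mapping. It is illustrative only: the function names map_first_row and map_next_row are ours, not from the patent, and it assumes that c/z is the number of banks and that the divisions above are integer divisions.

    # Sketch of the address-mapping recurrences above (illustrative only).
    # c: total depth of the C cache; z: depth of one bank, giving c // z banks.

    def map_first_row(num_blocks, c, z):
        """(bank offset, intra-bank offset) of each block c0_i of row 0,
        starting from c0_0_0 = c0_0_1 = 0."""
        banks = c // z
        return [(i % banks, i // banks) for i in range(num_blocks)]

    def map_next_row(prev_row, cnt, c, z):
        """Offsets of row j derived from those of row j-1 (prev_row);
        cnt is the number of data blocks per row."""
        banks = c // z
        return [((b + cnt) % banks, o + (b + cnt) // banks)
                for (b, o) in prev_row]

Only additions, a modulo, and an integer division by the bank count appear; in a typical configuration the bank count is a power of two, so both reduce to masking and shifting, consistent with the stated aim of keeping multiplication and division out of the hardware address computation.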
Optionally, after step 4) the method further comprises: when the systolic array SA reads from and writes to the C cache, saving a conflicting write through a first register Reg0 and delaying the write data by one beat, so that a write operation W0 and a read operation R0 acting on the same bank of the C cache form a stable pipeline: the write operation W0 of the previous beat is performed in the current beat, while the register holding the write data is overwritten with the current beat's write data; this resolves the read-write conflict caused by a write operation W0 and a read operation R0 hitting one bank of the C cache.
Optionally, step 4) is further followed by the following steps for the systolic array SA reading and writing the C cache (a toy model of this procedure is sketched after the list):
S1) the systolic array SA reads the data required by the C matrix from the C cache in sequence; if the data required by the C matrix has not been fully read, repeat step S1); otherwise, go to the next step;
S2) judge whether the last block of the B matrix has been used up, so that the next block must be brought into the systolic array SA and the C matrix must be read again; if so, jump to step S3); otherwise, jump to step S1);
S3) judge whether the last datum required by the C matrix is placed in the first bank (bank0) of the C cache and whether a write operation W0 and a read operation R0 hit bank0 of the C cache simultaneously at the previous moment; if so, jump to step S4); otherwise, jump to step S1);
S4) if the write operation W1 at the current moment goes to the second bank (bank1) of the C cache while the read operation R1 at the current moment goes to the first bank (bank0), then, because the conflicting write data was saved in the first register Reg0 and its write delayed by one beat, a read-write conflict arises between Reg0 and the read operation R1; in this case the value held in Reg0 is stored into a second register Reg1 and written one beat later, and execution jumps to step S1); if instead the write operation W1 at the current moment goes to the first bank (bank0) while the read operation R1 goes to the second bank (bank1), then, for the same reason, a read-write conflict arises between Reg0 and the write operation W1; in this case the value held in Reg0 is stored into the second register Reg1 and written one beat later, the data W1_data and address W1_addr of the current write operation W1 are written into the C cache, and execution jumps to step S1).
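The Reg0/Reg1 scheme of steps S1) to S4) can be summarized in a small beat-by-beat model. The sketch below is our own simplification for illustration, not the patent's hardware: reads have priority, a colliding write waits one beat in Reg0, and a second collision promotes it to Reg1 for one beat more. It assumes, as the patent's analysis does, that no deeper conflicts occur, so Reg1 is never overwritten.

    # Toy beat-by-beat model of the delayed-write conflict handling.
    def simulate(accesses):
        """accesses: one (read_bank, write) pair per beat, where read_bank is
        a bank index or None and write is a (bank, addr, data) tuple or None.
        Returns [(beat, write)] in the order writes actually reach the banks."""
        reg0 = reg1 = None          # deferred writes (Reg0: 1 beat, Reg1: 2 beats)
        landed = []
        for t, (rd, wr) in enumerate(accesses):
            busy = set() if rd is None else {rd}    # reads win over writes
            if reg1 is not None and reg1[0] not in busy:
                landed.append((t, reg1))            # drain the oldest deferral first
                busy.add(reg1[0])
                reg1 = None
            if reg0 is not None:
                if reg0[0] in busy or (wr is not None and wr[0] == reg0[0]):
                    reg1, reg0 = reg0, None         # second conflict: promote to Reg1
                else:
                    landed.append((t, reg0))        # normal one-beat-late write
                    busy.add(reg0[0])
                    reg0 = None
            if wr is not None:
                if wr[0] in busy:
                    reg0 = wr                       # first conflict: park in Reg0
                else:
                    landed.append((t, wr))
                    busy.add(wr[0])
        return landed

Giving reads priority over writes matches the note in the description that writes are assigned lower priority than reads by design.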
In addition, the invention also provides a deep learning accelerator, which comprises an accelerator unit and an on-chip memory connected with each other; the on-chip memory comprises a systolic array SA and a C cache that supplies the C matrix to the systolic array SA, the C cache being a single-port-SP C cache, with a memory address mapping component connected between the C cache and the external memory interface of the on-chip memory, the memory address mapping component being programmed or configured to execute the steps of the aforementioned area-friendly memory address mapping method for systolic arrays.
In addition, the invention also provides a computer device comprising a microprocessor and a memory connected to each other; the computer device further comprises the aforementioned deep learning accelerator, with the microprocessor and the deep learning accelerator communicatively connected.
Furthermore, the present invention also provides a computer-readable storage medium having stored therein a program for execution by a computer device to implement the steps of the aforementioned area-friendly memory address mapping method for systolic arrays.
Compared with the prior art, the invention mainly has the following advantages. Compared with pseudo-dual-port TP, a single-port SP bank cannot perform simultaneous read and write operations on the same bank (if the addresses of the read signal and the write signal fall in the same bank, accessing it with both at once causes an error). Suppose a systolic accelerator occupies memory space S, needs computation time T for a matrix input of a given scale, and achieves accuracy K. The invention aims to optimize the systolic accelerator's memory banks, reducing the occupied space S while leaving the accuracy K and the computation time T unaffected. The core technical problem solved by the invention is how to arrange data in the banks so that reads and writes that might hit the same bank are staggered as far as possible. The design of the address mapping module determines whether the resulting SP memory suffers read-write conflicts, and it guarantees the correctness of the operation results. The SP memory targeted by the invention (the single-port-SP C cache) mainly serves as the intermediate-result store for the matrix computation of the systolic array accelerator. To achieve the same operational effect with a smaller memory footprint, the invention designs a dedicated address mapping scheme for the memory so that, when the arithmetic unit reads and writes it, the destination addresses of the read and the write are not in the same bank; and, to keep the hardware efficient during computation, multiplications and divisions in the address computation are reduced as much as possible, making the mapping hardware-friendly.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a diagram of a systolic array accelerator architecture according to an embodiment of the present invention.
Fig. 3 is a diagram illustrating the main structures of the off-chip memory bank and the C cache, and the storage mapping manner of data according to an embodiment of the present invention.
Fig. 4 is a diagram illustrating a processing operation of a first conflict situation according to an embodiment of the present invention.
Fig. 5 is a diagram illustrating the processing operation of the second conflict scenario in the embodiment of the present invention.
Fig. 6 is a diagram illustrating the processing operation of the third conflict scenario in the embodiment of the present invention.
Detailed Description
The invention discloses a systolic-array-oriented area-friendly memory address mapping method, which aims to optimize the on-chip area of a systolic accelerator: on the premise of guaranteeing operating efficiency and calculation accuracy, the area cost of the on-chip memory is reduced as much as possible. The method is described in further detail below with reference to an embodiment.
As shown in fig. 1, the area-friendly memory address mapping method for the systolic array of this embodiment includes:
1) dividing data to be written into a C cache of a single-port SP in off-chip storage into data blocks (blocks);
2) for the first row of the data in off-chip storage to be written into the single-port-SP C cache, map the data block corresponding to the first-row head address a0 into the C cache: the bank offset c0_0_0 and the intra-bank offset c0_0_1 of the first data block c0_0 of the first row C0 in the C cache are both 0; then, from the bank offset c0_0_0 and intra-bank offset c0_0_1 of the first data block c0_0 of the first row, determine the bank offset c0_i_0 and intra-bank offset c0_i_1 in the C cache of each subsequent column data block c0_i of that row, and jump to the next step;
3) judge whether all the data in off-chip storage to be written into the single-port-SP C cache has been processed; if so, end and exit; if not, add the number of data columns n to the head address a(j-1) of the previous row to obtain the head address aj of the current row, and jump to the next step;
4) map each data block of the current row into the C cache to obtain the bank offset cj_i_0 and intra-bank offset cj_i_1 in the C cache of each column data block cj_i of the current row Cj; jump to step 3).
It should be noted that the block size used in step 1) may be chosen as needed. As an optional implementation, in this embodiment every 16 data elements form one data block, so that the n data columns of each row are divided into ⌈n/16⌉ data blocks.
In this embodiment, in step 2) the bank offset c0_0_0 and the intra-bank offset c0_0_1 of the first data block c0_0 of the first row C0 in the C cache are both determined to be 0.
In this embodiment, the functional expressions used in step 2) to determine, from the bank offset c0_0_0 and intra-bank offset c0_0_1 of the first data block c0_0 of the first row, the bank offset c0_i_0 and intra-bank offset c0_i_1 in the C cache of each subsequent column data block c0_i of that row are:
c0_i_0 = c0_0_0 + i % (c/z),
c0_i_1 = c0_0_1 + i / (c/z),
where i is the column index of the data block, c is the total depth of the C cache, z is the depth of each bank in the C cache, and the division is integer division.
In this embodiment, the functional expressions used in step 4) to obtain the bank offset cj_i_0 and intra-bank offset cj_i_1 in the C cache of each column data block cj_i of the current row Cj are:
cj_i_0 = (c(j-1)_i_0 + cnt) % (c/z),
cj_i_1 = c(j-1)_i_1 + (c(j-1)_i_0 + cnt) / (c/z),
where c(j-1)_i_0 is the bank offset in the C cache of the i-th column data block c(j-1)_i of the previous row, cnt is the number of data blocks in one row, c is the total depth of the C cache, z is the depth of each bank in the C cache, and c(j-1)_i_1 is the intra-bank offset in the C cache of the i-th column data block c(j-1)_i of the previous row.
Although the address mapping method of this embodiment avoids conflicts as far as possible, extreme cases still arise in the actual operation of the hardware. To guarantee correct results, this embodiment analyzes all read and write behaviors during operation of the systolic accelerator and handles every conflict situation. Each conflict form is detected and resolved according to the analyzed read-write behavior, so that the correctness of the operation results is guaranteed while the efficiency of the whole computation is not significantly affected.
As shown in fig. 2, the area-friendly memory address mapping method of this embodiment mainly involves two modules: an address mapping module and a conflict handling module. The address mapping module executes steps 1) to 4) above with a dedicated address mapping scheme, so that when the arithmetic unit performs read and write operations their destination addresses are not in the same bank; to keep the hardware efficient during computation, multiplications and divisions in the address computation are reduced as much as possible, making the mapping hardware-friendly. Fig. 3 shows the main structure of the off-chip memory and the C cache and the way data is stored, where m denotes the number of data rows of the off-chip memory, n denotes the number of data columns, and each small box represents 16 data elements (1024 bits). Since neither the A cache nor the B cache faces simultaneous read and write accesses during actual operation, this embodiment mainly optimizes the C cache; the address mapping module of this embodiment therefore sits in the dashed box between off-chip storage and the C cache in fig. 2. Analysis of the model designed in this embodiment yields:
[Formula image in the original: Conflict_Probability expressed as a function of c, k, and z.]
Here Conflict_Probability is the probability that a conflict occurs, c is the total depth of the C cache, k denotes the number of banks occupied by one data block (in this embodiment, every 16 data elements in off-chip storage form one data block), and z denotes the depth of each bank making up the C cache. This embodiment therefore reduces k as far as possible; k = 1 is the best choice.
Off-chip data is read into the C cache row by row, as shown in fig. 3. Hence, once the head address of a row in off-chip storage is known, the addresses of that row are known, and the head address of the next row can be computed from that of the previous row, and so on: in off-chip storage, the head address of the next row equals the head address of the previous row plus n. For example, as shown in fig. 3, the off-chip data is first divided into small data blocks in the manner described above. For row 0, the head address (the address where the first datum of the row is stored) is a0 = 0, corresponding to bank offset c0_0_0 = 0 and intra-bank offset c0_0_1 = 0 in the C cache (block 0 of row 0 in external memory corresponds to this address in the C cache); the address of the next block is a1 = 16, corresponding to bank offset c0_1_0 = (c0_0_0 + m) % 8 and intra-bank offset c0_1_1 = c0_0_1 + m/(c/z) in the C cache (where m/(c/z) is a fixed value), with c the total depth of the C cache, z the bank depth in the C cache, m the number of data rows of the off-chip memory, and n the number of data columns. The subsequent blocks of the same row are computed by the same formula. For row crossings, the head address of the previous row is known, so the head address of the next row equals that of the previous row plus n, i.e. a(n/16) = n (where j = n/16 is the block sequence number in external memory); the bank offset in the C cache is c1_0_0 = c0_0_0 + 1 (if c1_0_0 = c/z, then c1_0_0 = 0), and the intra-bank offset is c1_0_1 = c0_0_1 (if c0_0_0 = c/z, then c1_0_1 = c0_0_1 + 1), where c/z is the number of banks in the C cache.
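For intuition, the staggering across banks can be traced with the map_first_row/map_next_row sketch given after step 4) above, under assumed example parameters (8 banks via c = 512 and z = 64, and cnt = 4 blocks per row; these numbers are ours, not the patent's):

    row = map_first_row(num_blocks=4, c=512, z=64)
    print(row)                          # row 0: [(0, 0), (1, 0), (2, 0), (3, 0)]
    row = map_next_row(row, cnt=4, c=512, z=64)
    print(row)                          # row 1: [(4, 0), (5, 0), (6, 0), (7, 0)]
    row = map_next_row(row, cnt=4, c=512, z=64)
    print(row)                          # row 2: [(0, 1), (1, 1), (2, 1), (3, 1)]

Consecutive rows occupy disjoint bank sets and wrap to the next intra-bank line, which is what lets the row being written from off-chip and the row being read by the SA stay out of each other's banks most of the time.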
For conflict simulation and avoidance, this embodiment analyzes the behavior of the SA reading and writing the C cache and denotes the read and write operations symbolically: the first read and write are R0 and W0, the second R1 and W1, and the third R2 and W2. Three conflict forms result, corresponding to figs. 4 to 6 respectively.
In addition, after step 4) this embodiment saves a conflicting write through the first register Reg0 when the systolic array SA reads from and writes to the C cache, delaying the write data by one beat so that a write operation W0 and a read operation R0 acting on one bank of the C cache form a stable pipeline: the write of the previous beat is performed in the current beat, and the register holding the write data is overwritten with the current beat's write data, resolving the read-write conflict caused by W0 and R0 hitting one bank of the C cache. This treatment corresponds to the first conflict form, shown in fig. 4: W0 and R0 act on one bank simultaneously, which would cause a read-write conflict; and since, as stated earlier, data in the C cache is read and written across banks (each next beat skips down one bank), once W0 conflicts with R0 the subsequent W1 conflicts with R1 as well (the conflict persists as long as the read and write streams are not stalled). To solve this, this embodiment provides a register (the first register Reg0) that stores the conflicting write data (writes are given lower priority than reads in the design) and delays the write by one beat, guaranteeing a stable pipeline in which the previous beat's write is performed in the current beat and the register is overwritten with the current beat's write data. In fig. 4, t0 is the first beat and t1 the second: the conflicting write data of the first beat (the data W0_data and address W0_addr of write operation W0) is saved in Reg0 and written into the C cache one beat later, at t1, so that the writes and reads acting on the banks of the C cache form a stable pipeline.
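Traced through the simulate sketch given after step S4) above, this first conflict form produces exactly the stable one-beat-late pipeline (bank numbers and addresses here are purely illustrative):

    w0 = (0, 0x00, 'W0_data')   # beat t0: W0 and R0 both target bank 0
    w1 = (1, 0x10, 'W1_data')   # beat t1: W1 and R1 both target bank 1
    print(simulate([(0, w0), (1, w1), (None, None)]))
    # [(1, (0, 0, 'W0_data')), (2, (1, 16, 'W1_data'))]: each write lands one beat late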
In this embodiment, step 4) is further followed by the following steps for the systolic array SA reading and writing the C cache:
S1) the systolic array SA reads the data required by the C matrix from the C cache in sequence; if the data required by the C matrix has not been fully read, repeat step S1); otherwise, go to the next step;
S2) judge whether the last block of the B matrix has been used up, so that the next block must be brought into the systolic array SA and the C matrix must be read again; if so, jump to step S3); otherwise, jump to step S1);
S3) judge whether the last datum required by the C matrix is placed in the first bank (bank0) of the C cache and whether a write operation W0 and a read operation R0 hit bank0 of the C cache simultaneously at the previous moment; if so, jump to step S4); otherwise, jump to step S1);
S4) if the write operation W1 at the current moment goes to the second bank (bank1) of the C cache while the read operation R1 at the current moment goes to the first bank (bank0), then, because the conflicting write data was saved in the first register Reg0 and its write delayed by one beat, a read-write conflict arises between Reg0 and the read operation R1; in this case the value held in Reg0 is stored into the second register Reg1 and written one beat later, and execution jumps to step S1). This treatment corresponds to the second conflict form, shown in fig. 5. The C cache is used more than once in the matrix operation: the A matrix is fixed, and the B cache can never hold the next entire B matrix at once, so the B matrix is blocked, and every switch of a B-matrix block causes the C matrix to be reused, which can happen in the first bank (and only in the first bank). R0 and W0 hit the same bank, so Reg0 saves all the information of W0 for writing in the next beat; but if a new round of reads involving the C cache starts in that next beat, two consecutive read signals fall in the same bank, and since W0 is about to be written, the read signal would be flushed, the data could not be read correctly, and the calculation would be wrong. To solve this, this embodiment adds the second register Reg1 alongside the previously designed Reg0, together with a judgment control signal that checks whether a read-write conflict exists in the beat; if so, a flag bit is set to 1, Reg1 is enabled to take over all the information of Reg0, Reg0 is cleared, and the data held in Reg1 is written into the bank when R2 and W2 are performed. In fig. 5, t0 is the first beat, t1 the second, and t2 the third: the first beat saves the conflicting write data in the first register Reg0 and delays the write by one beat; in the second beat a read-write conflict arises between Reg0 and the read operation R1, so the value held in Reg0 is stored into the second register Reg1 and written one beat later still, in the third beat.
If the write operation W1 at the current moment goes to the first bank (bank0) of the C cache while the read operation R1 at the current moment goes to the second bank (bank1), then, because the conflicting write data was saved in the first register Reg0 and its write delayed by one beat, a read-write conflict arises between Reg0 and the write operation W1; in this case the value held in Reg0 is stored into the second register Reg1 and written one beat later, the data W1_data and address W1_addr of the current write operation W1 are written into the C cache, and execution jumps to step S1). This treatment corresponds to the third conflict form, shown in fig. 6, which resembles the second case except that the read and write roles are interchanged: the read operation R0 and write operation W0 hit the same bank at the same moment, and in the next beat the write operation W1 targets that bank as well, producing a conflict between two writes. The handling is the same as in case two; only the condition of the judgment control signal changes, to enabling the second register Reg1 when two writes fall in the same bank. In fig. 6, t0 is the first beat, t1 the second, and t2 the third: in the second beat the data W1_data and address W1_addr of the write operation W1 are written into the C cache, and in the third beat the data W0_data and address W0_addr of the first beat's write operation W0, by then held in the second register Reg1, are written into the C cache.
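The second and third conflict forms can be traced with the same simulate sketch (again with purely illustrative bank indices and addresses):

    # Conflict form 2 (fig. 5): W0/R0 hit bank 0, then a new read round hits bank 0 again.
    w0 = (0, 0x00, 'W0_data')
    print(simulate([(0, w0), (0, None), (None, None)]))
    # [(2, (0, 0, 'W0_data'))]: W0 is promoted to Reg1 and lands two beats late

    # Conflict form 3 (fig. 6): the next beat's W1 targets the same bank as the deferred W0.
    w1 = (0, 0x10, 'W1_data')
    print(simulate([(0, w0), (1, w1), (None, None)]))
    # [(1, (0, 16, 'W1_data')), (2, (0, 0, 'W0_data'))]: W1 writes at once, W0 one beat later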
To verify the systolic-array-oriented area-friendly memory address mapping method of this embodiment, the single-port-SP C cache was tested before and after optimization with the method, taking TP storage as the reference; the comparison results are shown in tables 1 and 2.
Table 1: and comparing the test results of the test matrix scales of different C matrixes.
Figure BDA0003417590280000091
Figure BDA0003417590280000101
Table 2: and comparing the test results on the scale of the test matrix at different depths.
Depth of field S(SP) S(TP) Saving area overhead
64W144B 392,192 467,712 16%
128W144B 245,248 334,080 27%
256W144B 191929 271739 29%
448W144B 151,200 237,480 36%
512W144B 150,432 215,424 30%
As tables 1 and 2 show, the address mapping scheme and the conflict handling scheme proposed in this embodiment effectively reduce the space overhead of the memory banks while having no significant influence on the operation results or the computational efficiency of the systolic accelerator, and they can guide the design of on-chip storage capacity.
In addition, this embodiment further provides a deep learning accelerator, which includes an accelerator unit and an on-chip memory connected to each other; the on-chip memory includes a systolic array SA and a C cache that supplies the C matrix to the systolic array SA, the C cache being a single-port-SP C cache, with a memory address mapping component connected between the C cache and the external memory interface of the on-chip memory, the memory address mapping component being programmed or configured to execute the steps of the aforementioned area-friendly memory address mapping method for systolic arrays. The on-chip memory shown in fig. 2 further includes an A cache that supplies the A matrix to the systolic array SA, a B cache that supplies the B matrix, and a systolic array buffer (SA buffer); however, the cache structure supplying the A and B matrices is outside the scope of this embodiment's optimization, and its implementation is not limited to the specific embodiment shown in fig. 2. The on-chip memory of this embodiment can be applied not only to the deep learning accelerator structure shown in fig. 2 but also to various other accelerator structures capable of systolic array acceleration.
In addition, the embodiment further provides a computer device, which includes a microprocessor and a memory that are connected to each other, the computer device further includes the deep learning accelerator, and the microprocessor and the deep learning accelerator are connected in communication.
In addition, this embodiment also provides a computer-readable storage medium having stored therein a program for execution by a computer device to implement the aforementioned area-friendly memory address mapping method for systolic arrays.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A systolic array-oriented area-friendly memory address mapping method is characterized by comprising the following steps:
1) dividing data to be written into a C cache of a single-port SP in off-chip storage into data blocks;
2) for the first row of the data in off-chip storage to be written into the single-port-SP C cache, map the data block corresponding to the first-row head address a0 into the C cache: the bank offset c0_0_0 and the intra-bank offset c0_0_1 of the first data block c0_0 of the first row C0 in the C cache are both 0; then, from the bank offset c0_0_0 and intra-bank offset c0_0_1 of the first data block c0_0 of the first row, determine the bank offset c0_i_0 and intra-bank offset c0_i_1 in the C cache of each subsequent column data block c0_i of that row, and jump to the next step;
3) judge whether all the data in off-chip storage to be written into the single-port-SP C cache has been processed; if so, end and exit; if not, add the number of data columns n to the head address a(j-1) of the previous row to obtain the head address aj of the current row, and jump to the next step;
4) map each data block of the current row into the C cache to obtain the bank offset cj_i_0 and intra-bank offset cj_i_1 in the C cache of each column data block cj_i of the current row Cj; jump to step 3).
2. The systolic array-oriented area-friendly memory address mapping method of claim 1, characterized in that, when the data in off-chip storage is divided into data blocks in step 1), every 16 data elements form one data block, so that the n data columns of each row are divided into ⌈n/16⌉ data blocks.
3. The systolic array-oriented area-friendly memory address mapping method of claim 1, characterized in that in step 2) the bank offset c0_0_0 and the intra-bank offset c0_0_1 of the first data block c0_0 of the first row C0 in the C cache are both determined to be 0.
4. The systolic array-oriented area-friendly memory address mapping method of claim 3, characterized in that the functional expressions used in step 2) to determine, from the bank offset c0_0_0 and intra-bank offset c0_0_1 of the first data block c0_0 of the first row, the bank offset c0_i_0 and intra-bank offset c0_i_1 in the C cache of each subsequent column data block c0_i of that row are:
c0_i_0 = c0_0_0 + i % (c/z),
c0_i_1 = c0_0_1 + i / (c/z),
where i is the column index of the data block, c is the total depth of the C cache, z is the depth of each bank in the C cache, and the division is integer division.
5. The systolic array-oriented area-friendly memory address mapping method of claim 4, characterized in that the functional expressions used in step 4) to obtain the bank offset cj_i_0 and intra-bank offset cj_i_1 in the C cache of each column data block cj_i of the current row Cj are:
cj_i_0 = (c(j-1)_i_0 + cnt) % (c/z),
cj_i_1 = c(j-1)_i_1 + (c(j-1)_i_0 + cnt) / (c/z),
where c(j-1)_i_0 is the bank offset in the C cache of the i-th column data block c(j-1)_i of the previous row, cnt is the number of data blocks in one row, c is the total depth of the C cache, z is the depth of each bank in the C cache, and c(j-1)_i_1 is the intra-bank offset in the C cache of the i-th column data block c(j-1)_i of the previous row.
6. The systolic array-oriented area-friendly memory address mapping method of claim 5, characterized in that after step 4), when the systolic array SA reads from and writes to the C cache, a conflicting write is saved through a first register Reg0 and the write data is delayed by one beat, so that a write operation W0 and a read operation R0 acting on one bank of the C cache form a stable pipeline: the write operation W0 of the previous beat is performed in the current beat, while the register holding the write data is overwritten with the current beat's write data, thereby resolving the read-write conflict caused by a write operation W0 and a read operation R0 hitting one bank of the C cache.
7. The systolic array-oriented area-friendly memory address mapping method of claim 6, characterized in that step 4) is followed by the following steps for the systolic array SA reading and writing the C cache:
S1) the systolic array SA reads the data required by the C matrix from the C cache in sequence; if the data required by the C matrix has not been fully read, repeat step S1); otherwise, go to the next step;
S2) judge whether the last block of the B matrix has been used up, so that the next block must be brought into the systolic array SA and the C matrix must be read again; if so, jump to step S3); otherwise, jump to step S1);
S3) judge whether the last datum required by the C matrix is placed in the first bank (bank0) of the C cache and whether a write operation W0 and a read operation R0 hit bank0 of the C cache simultaneously at the previous moment; if so, jump to step S4); otherwise, jump to step S1);
S4) if the write operation W1 at the current moment goes to the second bank (bank1) of the C cache while the read operation R1 at the current moment goes to the first bank (bank0), then, because the conflicting write data was saved in the first register Reg0 and its write delayed by one beat, a read-write conflict arises between Reg0 and the read operation R1; in this case the value held in Reg0 is stored into a second register Reg1 and written one beat later, and execution jumps to step S1); if instead the write operation W1 at the current moment goes to the first bank (bank0) while the read operation R1 goes to the second bank (bank1), then, for the same reason, a read-write conflict arises between Reg0 and the write operation W1; in this case the value held in Reg0 is stored into the second register Reg1 and written one beat later, the data W1_data and address W1_addr of the current write operation W1 are written into the C cache, and execution jumps to step S1).
8. A deep learning accelerator comprising an accelerator unit and an on-chip memory connected to each other, the on-chip memory comprising a systolic array SA and a C-cache for providing a C matrix for the systolic array SA, characterized in that the C-cache is a C-cache of a single port SP, and a memory address mapping means is connected between the C-cache and an external memory interface of the on-chip memory, the memory address mapping means being programmed or configured to perform the steps of the systolic array oriented area friendly memory address mapping method of any of claims 1 to 7.
9. A computer device comprising a microprocessor and a memory connected to each other, wherein the computer device further comprises the deep learning accelerator of claim 8, and the microprocessor and the deep learning accelerator are communicatively connected to each other.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program for execution by a computer device to implement the systolic array-oriented area-friendly memory address mapping method of any one of claims 1 to 7.
CN202111552673.4A 2021-12-17 2021-12-17 Area-friendly storage address mapping method facing systolic array Pending CN114218136A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111552673.4A CN114218136A (en) 2021-12-17 2021-12-17 Area-friendly storage address mapping method facing systolic array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111552673.4A CN114218136A (en) 2021-12-17 2021-12-17 Area-friendly storage address mapping method facing systolic array

Publications (1)

Publication Number Publication Date
CN114218136A 2022-03-22

Family

ID=80703727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111552673.4A Pending CN114218136A (en) 2021-12-17 2021-12-17 Area-friendly storage address mapping method facing systolic array

Country Status (1)

Country Link
CN (1) CN114218136A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050328A (en) * 2022-12-30 2023-05-02 成都电科星拓科技有限公司 Chip memory splitting method, device, equipment and medium


Similar Documents

Publication Publication Date Title
CN107657581B (en) Convolutional neural network CNN hardware accelerator and acceleration method
TWI554883B (en) Systems and methods for segmenting data structures in a memory system
CN110059020B (en) Access method, equipment and system for extended memory
US11704538B2 (en) Data processing method and device
CN108733415B (en) Method and device for supporting vector random access
US20090240895A1 (en) Systems and methods for coalescing memory accesses of parallel threads
CN110738308B (en) Neural network accelerator
EP3657337B1 (en) Method, apparatus, device and storage medium for accessing static random access memory
US20180293163A1 (en) Optimizing storage of application data in memory
JP7008983B2 (en) Methods and equipment for accessing tensor data
US11579921B2 (en) Method and system for performing parallel computations to generate multiple output feature maps
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
CN111656339A (en) Memory device and control method thereof
CN109117415B (en) Data sharing system and data sharing method thereof
CN114218136A (en) Area-friendly storage address mapping method facing systolic array
CN114564434A (en) Universal multi-core brain processor, accelerator card and computer equipment
US11544189B2 (en) System and method for memory management
US11663454B2 (en) Digital integrated circuit with embedded memory for neural network inferring
JP7410961B2 (en) arithmetic processing unit
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
CN108920097B (en) Three-dimensional data processing method based on interleaving storage
US11886347B2 (en) Large-scale data processing computer architecture
US8539207B1 (en) Lattice-based computations on a parallel processor
TWI749331B (en) Memory with processing in memory architecture and operating method thereof
CN113392959A (en) Method for reconstructing architecture in computing system and computing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination