TWI740725B - Method of data transmission and merging - Google Patents

Method of data transmission and merging

Info

Publication number
TWI740725B
Authority
TW
Taiwan
Prior art keywords
block data
data
block
receiving end
present
Prior art date
2020-11-20
Application number
TW109140782A
Other languages
Chinese (zh)
Other versions
TW202221574A (en)
Inventor
林裕盛
陳維超
陳佩君
Original Assignee
英業達股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2020-11-20
Publication date
2021-09-21
Application filed by 英業達股份有限公司
Priority to TW109140782A
Application granted
Publication of TWI740725B
Publication of TW202221574A

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method of data transmission and merging is adapted to a sending end and a receiving end that are in communication with each other. The method includes a sending-end stage and a receiving-end stage. The sending-end stage includes: transmitting the first block data, the second block data, and the third block data to the receiving end; obtaining the fourth block data and the fifth block data; and transmitting the third block data, the fourth block data, and the fifth block data to the receiving end. The receiving-end stage includes: receiving the first block data, the second block data, and the third block data; merging the first block data, the second block data, and the third block data to perform a convolution operation; receiving the fourth block data and the fifth block data; and merging the third block data, the fourth block data, and the fifth block data to perform another convolution operation.

Description

Method of data transmission and merging

The present invention relates to convolutional neural network accelerators, and in particular to a method for partitioning, transferring, and merging data in tiled convolution processing.

The convolutional neural network (CNN) is currently considered one of the most widely used machine-learning techniques in computer vision and image processing. The main operation of a CNN is the convolution between a kernel and a feature map, which consumes a large amount of power through multiply-accumulate (MAC) operations.

Compared with the energy wasted on redundant computation, improving data-access capability and reducing data-transmission bandwidth matter more in future accelerator designs. First, memory bandwidth grows more slowly than the computing speed of processing units, which means the same algorithm may be limited by the memory and its architecture. Second, current neural networks mostly use small kernels in deeper networks, which reduces MAC operations but increases memory usage. Statistics show that, as neural-network models evolve, the power consumed by accessing feature maps in dynamic random access memory (DRAM) has become more significant than the power consumed by the computation itself.

Current CNNs usually adopt tiled processing; that is, the processing unit loads one block at a time from external storage for computation. For example, a data block stored in external DRAM may be loaded, uncompressed, directly into a static random access memory (SRAM) near the processing unit as a cache. However, this approach consumes a large amount of power and occupies a large amount of memory bandwidth every time the processing block is switched and the DRAM is accessed. Alternatively, the data stored in the DRAM may be partitioned into multiple sub-tensors of the same size and compressed; each compressed sub-tensor is then sent to the SRAM and decompressed, and the processing unit fetches the required block data from the SRAM for computation. Although compressing block data saves the power and bandwidth consumed during transmission, if the sub-tensor partition size is set too large, the SRAM may end up storing data that will not be used in the current pass, wasting SRAM space, or time is spent decompressing a large file of which only a small portion is usable. On the other hand, if the sub-tensor partition size is set too small, then in order to decompress and restore the complete block data in the correct order, extra bandwidth is required to load a large number of pointers recording the location of every compressed file.

In view of this, the present invention proposes an efficient, hardware-oriented data-access scheme for CNN feature maps. The present invention partitions data into sub-tensors (subtensors) of different sizes and, using a small number of pointers, stores these sub-tensors in a compressed yet randomly accessible format. This design enables current CNN accelerators to fetch and decompress sub-tensors on the fly in a tiled-processing manner. The present invention suits data-access architectures that require alignment and merging, and only minor modifications to an existing architecture are needed to apply it.

According to an embodiment of the present invention, a method of data transmission and merging is applicable to a sending end and a receiving end in communication with each other, and includes a sending-end stage and a receiving-end stage. The sending-end stage includes: transmitting the first block data, the second block data, and the third block data to the receiving end; obtaining the fourth block data and the fifth block data; and transmitting the third block data, the fourth block data, and the fifth block data to the receiving end. The receiving-end stage includes: receiving the first block data, the second block data, and the third block data; merging them to perform a convolution operation; receiving the fourth block data and the fifth block data; and merging the third block data, the fourth block data, and the fifth block data to perform another convolution operation.
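To make the two stages concrete, here is a minimal Python sketch of the five-block flow described above; it is an illustration only, not the patent's implementation, and the names send_stage, receive_stage, and convolve are hypothetical:

```python
# Minimal sketch of the five-block transfer described above.
# B3 is transmitted twice: it is the overlap shared by two consecutive passes.

def convolve(data: bytes) -> int:
    # Stand-in for the accelerator's convolution over the merged block data.
    return len(data)

def send_stage(channel: list, b1, b2, b3, b4, b5) -> None:
    channel.append([b1, b2, b3])   # pass i:   blocks of feature map F_i
    channel.append([b3, b4, b5])   # pass i+1: B3 reused for feature map F_{i+1}

def receive_stage(channel: list) -> list:
    results = []
    for blocks in channel:
        merged = b"".join(blocks)  # merge the block data, then convolve
        results.append(convolve(merged))
    return results

channel = []
send_stage(channel, b"B1", b"B2", b"B3", b"B4", b"B5")
print(receive_stage(channel))      # one result per convolution pass
```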

In summary, the present invention proposes an effective storage scheme for input feature maps that reduces external-memory bandwidth and conforms to the memory-access patterns of existing CNN accelerator architectures. Given a specific CNN layer and accelerator configuration, the present invention partitions tensor data into multiple sub-tensors of specific sizes. Existing CNN accelerators can integrate the present invention with a small amount of hardware modification to improve overall performance.

The above description of the disclosure and the following description of the embodiments are intended to demonstrate and explain the spirit and principles of the present invention, and to provide further explanation of the scope of the claims.

The detailed features and characteristics of the present invention are described in the following embodiments in sufficient detail to enable anyone skilled in the relevant art to understand the technical content of the present invention and implement it accordingly; based on the content disclosed in this specification, the claims, and the drawings, anyone skilled in the relevant art can easily understand the related concepts and features of the present invention. The following embodiments further illustrate aspects of the present invention in detail without limiting its scope from any viewpoint.

The present invention is applicable to any field involving convolution operations. The present invention proposes a method of data transmission and merging that includes a way of partitioning an input feature map so as to avoid accessing partially compressed sub-tensors while minimizing the number of sub-tensors to avoid data fragmentation.

FIG. 1 is a flowchart of a method of data transmission and merging according to an embodiment of the present invention. The method is applicable to a sending end and a receiving end in communication with each other. The sending end includes, for example, an external DRAM and a control circuit that partitions the block data; the receiving end is, for example, a CNN accelerator including a processing unit and an SRAM cache.

FIG. 2 is a schematic diagram of an input feature map partitioned into multiple data blocks in the horizontal direction. Assume the CNN accelerator processes one input feature map F per pass: it processes input feature map F_{i−1} in the (i−1)-th pass, input feature map F_i in the i-th pass, and input feature map F_{i+1} in the (i+1)-th pass.

As shown in FIG. 1, the method of data transmission and merging of an embodiment of the present invention is divided into a sending-end stage P1 and a receiving-end stage P2. The sending-end stage P1 includes steps S1, S2, and S3. The receiving-end stage P2 includes steps S4, S5, S6, and S7.

Step S1 is "the sending end transmits a first pointer, the first block data B1, the second block data B2, and the third block data B3 to the receiving end." Step S1 corresponds to the aforementioned processing of input feature map F_i in the i-th pass. In practice, before the sending end transmits the first block data B1, the second block data B2, and the third block data B3, the method further includes compressing each of them with a compressor, thereby reducing the bandwidth occupied during transmission to the receiving end. The first pointer indicates the start address of the first block data B1, the data size of the first block data B1, the data size of the second block data B2, and the data size of the third block data B3.

Step S2 is "the sending end obtains the fourth block data B4 and the fifth block data B5." Specifically, the control circuit of the sending end partitions three spatially adjacent, consecutive input feature maps F_{i−1}, F_i, and F_{i+1} according to the configuration described later, obtaining the first to fifth block data B1-B5.

Step S3 is "the sending end transmits a second pointer, the third block data B3, the fourth block data B4, and the fifth block data B5 to the receiving end." Step S3 corresponds to the aforementioned processing of input feature map F_{i+1} in the (i+1)-th pass. In practice, before the sending end transmits the third block data B3, the fourth block data B4, and the fifth block data B5, the method further includes compressing each of them with the compressor, thereby reducing the bandwidth occupied during transmission to the receiving end. The second pointer indicates the start address of the third block data B3, the data size of the third block data B3, the data size of the fourth block data B4, and the data size of the fifth block data B5.

Step S4 is "the receiving end receives the first pointer, the first block data, the second block data, and the third block data." As shown in FIG. 1, step S4 is executed after step S1 is completed.

Step S5 is "the receiving end merges the first block data, the second block data, and the third block data according to the first pointer to perform a convolution operation." In practice, before the convolution operation, the method further includes decompressing the first block data B1, the second block data B2, and the third block data B3 with a decompressor. After decompression, the three block data B1-B3 are stored in the SRAM. The processing unit obtains the first start address of the first block data B1 in the SRAM from the first pointer, computes the second start address of the second block data B2 in the SRAM from the first start address and the data size of B1, and then computes the third start address of the third block data B3 in the SRAM from the second start address and the data size of B2, as the sketch below illustrates.
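A minimal sketch of that address computation, assuming the pointer layout sketched earlier (one start address plus per-block compressed sizes); the function name and the size values are made-up examples:

```python
from itertools import accumulate
from typing import List

def block_addresses(start_addr: int, sizes: List[int]) -> List[int]:
    """SRAM start address of each block in a group: the pointer stores only
    the first address; each later one is a running sum of the block sizes."""
    offsets = [0] + list(accumulate(sizes[:-1]))
    return [start_addr + off for off in offsets]

# First pointer covers B1, B2, B3:
print(block_addresses(0x0000, [412, 388, 405]))  # [0, 412, 800]
```

Step S7 performs the same computation with the second pointer over B3, B4, and B5.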

Step S6 is "the receiving end receives the second pointer, the fourth block data, and the fifth block data." As shown in FIG. 1, step S6 is executed after step S3 is completed.

Step S7 is "the receiving end merges the third block data, the fourth block data, and the fifth block data according to the second pointer to perform another convolution operation." In practice, before the convolution operation, the method further includes decompressing the third block data B3, the fourth block data B4, and the fifth block data B5 with the decompressor. After decompression, the three block data B3-B5 are stored in the SRAM. The processing unit obtains the third start address of the third block data B3 in the SRAM from the second pointer, computes the fourth start address of the fourth block data B4 in the SRAM from the third start address and the data size of B3, and then computes the fifth start address of the fifth block data B5 in the SRAM from the fourth start address and the data size of B4.

As shown in FIG. 2, the present invention proposes a way of partitioning the input feature maps F_{i−1}, F_i, and F_{i+1}. Two embodiments below further detail how the partition configuration is realized: the first embodiment illustrates the operation of the present invention with actual numbers, and the second embodiment describes a generalized realization in algebraic form.

FIG. 3 is a schematic diagram of the first embodiment. Assume the following CNN architecture: the kernel size is 3×3, the output feature map is produced in 8×8 blocks, and zero-padding is used so that the output feature map stays the same size as the input feature map.

In the first pass, a 10×10 input block is extracted from the top-left corner of the input feature map. As shown in FIG. 3, the horizontal left boundary of this block is −1 and its right boundary is 8.

In the second pass, the next input block is extracted by stepping 8 units to the right from the right boundary of the first input block.

Since the step length is constant within one layer of CNN processing, the left boundaries and the right boundaries of the successively extracted input blocks form two arithmetic sequences, denoted L for the left boundaries and R for the right boundaries. The configuration proposed by the present invention is the partition formed by these two sets of boundaries; in other words, the proposed configuration is the union L ∪ R. In this example, L = {8a − 1} and R = {8b + 8}, so the cuts fall at positions of the form 8k ± 1.

Because each period of this partition contains one narrow segment (width 2) and one wide segment (width 6), and the same configuration applies to cuts in both the horizontal and vertical directions, each input feature map can be partitioned into sub-tensors of the following four shapes: 6×6, 6×2, 2×6, and 2×2. FIG. 4 is a schematic diagram of the partitioning of the input data.

A 10×10 input block is accordingly composed of one 6×6 sub-tensor, two 6×2 sub-tensors, two 2×6 sub-tensors, and four 2×2 sub-tensors, as the check below illustrates.
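A quick check of that count, using the segment widths 2 and 6 implied by the cuts above (again, reconstructed numbers):

```python
from collections import Counter
from itertools import product

widths = [2, 6, 2]  # a 10x10 block crosses the cuts into 2 + 6 + 2 per axis

shapes = Counter((h, w) for h, w in product(widths, widths))
print(sorted(shapes.items()))
# [((2, 2), 4), ((2, 6), 2), ((6, 2), 2), ((6, 6), 1)] -- 1+2+2+4 = 9 pieces
```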

In addition, since halo data is accessed only within the two-dimensional plane, the partition configuration proposed by the present invention does not need to be applied along the channel dimension.

The second embodiment is described below. In this embodiment, the computation of each CNN layer is generalized and expressed with the following parameters:

The kernel size is denoted 2k + 1, since kernel sizes are usually odd.

The stride is denoted S. The stride is the distance between the two input elements that are convolved with the same position of two adjacent convolution windows. When S > 1, the output feature map is smaller than the input feature map and the computational cost decreases.

The dilation is denoted d. In a dilated convolution, two adjacent elements of the original kernel are spaced d apart when the kernel is expanded into the convolution window.

The size of the output feature map (the output tile processed per pass) is denoted T.

Based on the above parameters, FIG. 5 shows an example of a configuration of the present invention applied to ordinary convolution.

When computing the leftmost output element of the output feature map, the convolution window starts from the far left of the input feature map. With zero-padding, the leftmost boundary of this input feature map is −k and the rightmost boundary is S(T − 1) + k. The offset between two adjacent blocks is ST. The configuration defined by the present invention is then as follows:

left boundaries: L = { ST·a − k | a = 0, 1, 2, … }
right boundaries: R = { ST·b + S(T − 1) + k | b = 0, 1, 2, … }
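As a sketch only, the two sequences can be generated for any layer; k, S, and T follow the parameter definitions above, and extending the half-kernel reach to d·k for dilation is this sketch's assumption, consistent with the dilated case below:

```python
def boundaries(num_blocks: int, k: int, S: int, T: int, d: int = 1):
    """Left/right boundaries of successive input blocks for a (2k+1) kernel,
    stride S, output tile size T, dilation d (half-kernel reach d*k)."""
    reach = d * k
    lefts = [S * T * a - reach for a in range(num_blocks)]
    rights = [S * T * b + S * (T - 1) + reach for b in range(num_blocks)]
    return lefts, rights

# An ordinary convolution with a 3x3 kernel (k=1), stride 1, and 8-wide tiles
# reproduces the boundaries of the first embodiment:
print(boundaries(3, k=1, S=1, T=8))  # ([-1, 7, 15], [8, 16, 24])
```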

FIG. 6 shows an example of another configuration of the present invention applied to dilated convolution.

For dilated convolution, following the same reasoning as above, another configuration is obtained by replacing the half-kernel reach k with d·k in the boundary expressions above.

From the configurations proposed above, it follows that if one configuration period N divides another period N′, a partition that conforms to the configuration with period N also conforms to the configuration with period N′.

Take AlexNet CONV1 as an example: its kernel is 11×11 (k = 5) with a stride of 4, which corresponds to one configuration of the present invention. By the divisibility property above, AlexNet CONV1 is therefore also compatible with another configuration of the present invention.

As stated above, choosing a single value of N for all convolutional layers simplifies the hardware implementation. In an embodiment of the present invention, N = 8 is a setting suitable for most cases.

The multiple sub-tensors produced by a given partition configuration of the present invention need to be stored in a data structure that satisfies memory-alignment requirements in order to maximize the benefit of compression. Because the compressed sizes of the sub-tensors differ, the present invention additionally stores pointers that represent these sub-tensors.

FIG. 7 shows how the present invention stores sub-tensors and pointers. When setting the pointers for adjacent sub-tensors, such as sub-tensors 1, 2, 3, and 4 in FIG. 7, only pointer A1 is used to indicate the start address of sub-tensor 1, and pointers SZ1-SZ4 indicate the compressed data size of each of these four sub-tensors. Accessing the sub-tensors therefore takes two steps: first the start address is obtained from pointer A1, and then the size of each sub-tensor is read from SZ1-SZ8 and accumulated onto the start address to obtain the actual offset of each sub-tensor.
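A sketch of that two-step lookup, using the reference numerals of FIG. 7 (anchors A1 and A5 for sub-tensor groups 1-4 and 5-8, per-sub-tensor compressed sizes SZ1-SZ8); the group size of four follows the figure, while the code and its size values are assumptions:

```python
from typing import List

def subtensor_offset(idx: int, anchors: List[int], sizes: List[int],
                     group: int = 4) -> int:
    """Step 1: read the group's anchor address (A1, A5, ...).
    Step 2: accumulate the compressed sizes of the preceding sub-tensors."""
    base = anchors[idx // group]
    first_in_group = (idx // group) * group
    return base + sum(sizes[first_in_group:idx])

anchors = [0x0000, 0x2000]                     # A1, A5
sz = [300, 180, 310, 200, 290, 175, 305, 210]  # SZ1..SZ8 (example values)
print(hex(subtensor_offset(6, anchors, sz)))   # sub-tensor 7: 0x2000 + SZ5 + SZ6
```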

Since the pointers of the present invention do not need to correspond one-to-one with every sub-tensor, the total size of the pointers is effectively reduced.

The hardware-friendly data-partitioning scheme proposed by the present invention can be used for the storage, access, and compression of input feature maps. The present invention partitions the input feature map into pieces of multiple different sizes, preventing the receiving end from fetching unused sub-tensors and wasting cache space. Only a small number of pointers is needed to record the location of each compressed sub-tensor. Applying the invention requires only minor modification of an existing CNN accelerator architecture, because the invention is compatible with almost all compression algorithms and only the existing feature-map partitioning has to be changed. The present invention can save a large amount of memory-transfer bandwidth.

In summary, the present invention proposes an effective storage scheme for sparse feature maps that reduces external-memory bandwidth and conforms to the memory-access patterns of modern CNN accelerator architectures. Given a specific CNN layer and accelerator configuration, the present invention partitions tensor data into multiple sub-tensors of specific sizes. Existing CNN accelerators can integrate the present invention with a small amount of hardware modification to improve overall performance.

Although the present invention is disclosed in the foregoing embodiments, they are not intended to limit it. Changes and refinements made without departing from the spirit and scope of the present invention all fall within the scope of its patent protection. For the scope of protection defined by the present invention, please refer to the appended claims.

P1: sending-end stage
P2: receiving-end stage
S1-S7: steps
A1: start address of sub-tensor 1
A5: start address of sub-tensor 5
SZ1-SZ8: data sizes of sub-tensors 1-8

FIG. 1 is a flowchart of a method of data transmission and merging according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature map partitioned into multiple data blocks in the horizontal direction;
FIG. 3 is a schematic diagram of the first embodiment;
FIG. 4 is a schematic diagram of the partitioning of input data;
FIG. 5 shows an example of a configuration of the present invention applied to ordinary convolution;
FIG. 6 shows an example of another configuration of the present invention applied to dilated convolution; and
FIG. 7 shows an example of sub-tensor and pointer storage in the present invention.

P1: sending-end stage

P2: receiving-end stage

S1-S7: steps

Claims (3)

1. A method of data transmission and merging, applicable to a sending end and a receiving end of a communication connection, the method comprising:
a sending-end stage, comprising:
transmitting, by the sending end, a first block data, a second block data, and a third block data to the receiving end;
obtaining, by the sending end, a fourth block data and a fifth block data; and
transmitting, by the sending end, the third block data, the fourth block data, and the fifth block data to the receiving end; and
a receiving-end stage, comprising:
receiving, by the receiving end, the first block data, the second block data, and the third block data;
merging, by the receiving end, the first block data, the second block data, and the third block data to perform a convolution operation;
receiving, by the receiving end, the fourth block data and the fifth block data; and
merging, by the receiving end, the third block data, the fourth block data, and the fifth block data to perform another convolution operation.

2. The method of data transmission and merging of claim 1, wherein transmitting the first block data, the second block data, and the third block data to the receiving end further comprises: transmitting, by the sending end, a first pointer to the receiving end, wherein the first pointer indicates a start address of the first block data, a data size of the first block data, a data size of the second block data, and a data size of the third block data; and
wherein transmitting the third block data, the fourth block data, and the fifth block data to the receiving end further comprises: transmitting, by the sending end, a second pointer to the receiving end, wherein the second pointer indicates a start address of the third block data, a data size of the third block data, a data size of the fourth block data, and a data size of the fifth block data.

3. The method of data transmission and merging of claim 1, wherein:
before the sending end transmits the first block data, the second block data, and the third block data to the receiving end, the method further comprises compressing the first block data, the second block data, and the third block data with a compressor;
before the sending end transmits the third block data, the fourth block data, and the fifth block data to the receiving end, the method further comprises compressing the third block data, the fourth block data, and the fifth block data with the compressor;
before the receiving end merges the first block data, the second block data, and the third block data to perform the convolution operation, the method further comprises decompressing the first block data, the second block data, and the third block data with a decompressor; and
before the receiving end merges the third block data, the fourth block data, and the fifth block data to perform the other convolution operation, the method further comprises decompressing the third block data, the fourth block data, and the fifth block data with the decompressor.
TW109140782A 2020-11-20 2020-11-20 Method of data transmission and merging TWI740725B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109140782A TWI740725B (en) 2020-11-20 2020-11-20 Method of data transmission and merging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109140782A TWI740725B (en) 2020-11-20 2020-11-20 Method of data transmission and merging

Publications (2)

Publication Number Publication Date
TWI740725B true TWI740725B (en) 2021-09-21
TW202221574A TW202221574A (en) 2022-06-01

Family

ID=78777795

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109140782A TWI740725B (en) 2020-11-20 2020-11-20 Method of data transmission and merging

Country Status (1)

Country Link
TW (1) TWI740725B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240036739A1 (en) * 2022-07-26 2024-02-01 Silicon Motion, Inc. Method and apparatus for performing data fragmentation reduction control of memory device in predetermined communications architecture with aid of fragmentation information detection, and associated computer-readable medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646243B1 (en) * 2016-09-12 2017-05-09 International Business Machines Corporation Convolutional neural networks using resistive processing unit array
WO2019109795A1 (en) * 2017-12-06 2019-06-13 腾讯科技(深圳)有限公司 Convolution operation processing method and related product
TWI708193B (en) * 2018-08-21 2020-10-21 創鑫智慧股份有限公司 Memory-adaptive processing method for convolutional neural network and system thereof
TWI708209B (en) * 2018-12-11 2020-10-21 財團法人工業技術研究院 Object detection method using cnn model and object detection apparatus using the same
TWI687063B (en) * 2019-01-04 2020-03-01 財團法人工業技術研究院 A communication system and codec method based on deep learning and channel state information

Also Published As

Publication number Publication date
TW202221574A (en) 2022-06-01

Similar Documents

Publication Publication Date Title
KR101687081B1 (en) Processing method and apparatus for single-channel convolution layer, and processing method and apparatus for multi-channel convolution layer
WO2021109699A1 (en) Artificial intelligence accelerator, device, chip and data processing method
US20210203477A1 (en) Method and device for blockchain full sharding based on a p2p storage network and a multi-layer architecture
CN109871510B (en) Two-dimensional convolution operation processing method, system, equipment and computer storage medium
CN109886395B (en) Data reading method for multi-core image processing convolutional neural network
US10817178B2 (en) Compressing and compacting memory on a memory device wherein compressed memory pages are organized by size
DE102020133861A1 (en) NORMALIZED LIKELIHOOD DETERMINATION FOR CHARACTER ENCODING
US20170228373A1 (en) Dynamic Hash Table Size Estimation During Database Aggregation Processing
US11615607B2 (en) Convolution calculation method, convolution calculation apparatus, and terminal device
WO2020177250A1 (en) Data reading system and method
US7920150B2 (en) Image scaling system capable of saving memory
TWI493446B (en) Method and apparatus for managing memory
TWI740725B (en) Method of data transmission and merging
CN112966807B (en) Convolutional neural network implementation method based on storage resource limited FPGA
CN109086879B (en) Method for realizing dense connection neural network based on FPGA
CN104952088A (en) Method for compressing and decompressing display data
US20220253668A1 (en) Data processing method and device, storage medium and electronic device
CN114125463A (en) Video compression method, system, storage medium and equipment
CN112783807A (en) Model calculation method and system
CN107423425B (en) Method for quickly storing and inquiring data in K/V format
CN111914285A (en) Geographical distributed graph calculation method and system based on differential privacy
CN111241204B (en) Gradient data synchronization method, device, equipment and storage medium
CN103106144B (en) A kind of internal memory index compression method and apparatus
CN113497627A (en) Data compression and decompression method, device and system
CN116107959A (en) Caching method, image transmission method, electronic device and storage medium