TWI727643B - Artificial intelligence accelerator and operation thereof - Google Patents

Artificial intelligence accelerator and operation thereof

Info

Publication number
TWI727643B
Authority
TW
Taiwan
Prior art keywords
weight
data group
artificial intelligence
sub
input data
Prior art date
Application number
TW109103471A
Other languages
Chinese (zh)
Other versions
TW202131316A (en)
Inventor
呂函庭
葉騰豪
許柏凱
魏旻良
Original Assignee
Macronix International Co., Ltd. (旺宏電子股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Macronix International Co., Ltd. (旺宏電子股份有限公司)
Priority to TW109103471A priority Critical patent/TWI727643B/en
Application granted granted Critical
Publication of TWI727643B publication Critical patent/TWI727643B/en
Publication of TW202131316A publication Critical patent/TW202131316A/en

Abstract

An artificial intelligence accelerator receives a binary input data set and a selected layer among multiple overall weight pattern layers. The artificial intelligence accelerator includes multiple processing tiles and a summation output circuit. Each processing tile receives one of the input data subsets of the input data set and performs convolution operations on the weight blocks of its sub weight pattern of the overall weight pattern to obtain weight operation values; through multiple stages of shifting and adding applied to those values, it then obtains the weight output value expected from direct convolution of the input data subset with the sub weight pattern. The summation output circuit, by applying further stages of shifting and adding to the weight output values of the processing tiles, obtains the summation value expected from direct convolution of the input data set with the overall weight pattern.

Description

Artificial intelligence accelerator and its processing method

The present invention relates to artificial intelligence accelerator technology, and in particular to an artificial intelligence accelerator built from partitioned input bits and partitioned weight blocks.

Applications of artificial intelligence accelerators include, for example, filter-like functions that determine how well the pattern represented by input data matches a known pattern. One application, for example, uses an artificial intelligence accelerator to determine whether a captured image contains features such as eyes, a nose, or a face.

The data an artificial intelligence accelerator processes is, for example, the data of all the pixels of an image; that is, the input data comprises a very large number of bits. After being input in parallel, this data is compared, by computation, against the various patterns stored in the accelerator. These patterns are stored as weights in a large number of memory cells. The memory cell architecture is a 3D architecture composed of multiple 2D memory cell layers; each layer represents one feature pattern, stored as weight values in that layer's memory cell array, and word line control selects, in sequence, the layer of memory cells to be processed. The input data enters on bit lines. A convolution operation between the input data and the memory cell array yields the degree of match with the feature pattern of that layer's array.

An artificial intelligence accelerator must handle a very large amount of computation. If the multi-layer memory cell arrays were gathered into one unit operating bit by bit, the overall circuit would be very large, making operation slow and energy-hungry. Given that an accelerator is required to filter and identify the content of input images quickly, the design of a single circuit chip needs continual improvement, for example in operating speed.

Embodiments of the present invention provide an artificial intelligence accelerator composed of partitioned input bits and partitioned weight blocks; shift and add operations then combine the separately parallel-computed values to recover the operation result expected from a single chip. In this way, the processing speed of the artificial intelligence accelerator can be effectively increased and energy consumption reduced.

In one embodiment, the invention provides an artificial intelligence accelerator that receives a binary input data group and performs a convolution operation with a selected one of multiple layers of overall weight patterns. The input data group is divided into multiple sub data groups. The artificial intelligence accelerator includes multiple processing tiles and a summation output circuit. Each processing tile includes: a receiving element that receives one of the sub data groups; a weight storage section that stores a partial weight pattern of the overall weight pattern, the weight storage section comprising multiple weight blocks, each of which stores, in bit order, one block portion of the partial weight pattern, where the memory cell array structure of the weight storage section is arranged, relative to the corresponding sub data group, so that convolving the sub data group with each block portion yields multiple weight operation values in sequence; and a block-wise output circuit comprising multiple shifters and multiple adders, which through multiple stages of shift and add operations sums the weight operation values into the weight output value that direct convolution of the sub data group with the partial weight pattern would yield. The summation output circuit comprises multiple shifters and multiple adders and, through multiple stages of shift and add operations, sums the weight output values into the summation value that direct convolution of the input data group with the overall weight pattern would yield.

In one embodiment of the artificial intelligence accelerator, the input data group contains i bits and is divided into p sub data groups, where i and p are integers and each sub data group contains i/p bits.

In one embodiment of the artificial intelligence accelerator, the input data group contains i bits, the number of processing tiles is p, and the input data group is divided into p sub data groups, where i and p are integers greater than or equal to 2, i is greater than p, and each sub data group contains i/p bits.

In one embodiment of the artificial intelligence accelerator, the weight storage section contains q weight blocks and j bits, where j and q are integers greater than or equal to 2, j is greater than q, and each weight block contains j/q memory cells.

In one embodiment of the artificial intelligence accelerator, the total number of memory cells across the weight storage sections of all processing tiles is p·q·2^(i/p+j/q).
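As a quick numeric check of this cell-count formula, a short Python computation (the sizes below are hypothetical, chosen only for illustration):

```python
i, p = 64, 8   # input bits and number of processing tiles (hypothetical sizes)
j, q = 64, 8   # weight bits per column and weight blocks per tile (hypothetical)

per_block = 2 ** (i // p + j // q)  # storage used per weight block: 2^(i/p + j/q)
total = p * q * per_block           # total over all tiles and blocks
print(per_block, total)             # 65536 4194304
```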

In one embodiment of the artificial intelligence accelerator, for the block-wise output circuit, each stage of the shift and add operation includes at least one shifter and at least one adder. The multiple input values of each stage are processed in adjacent pairs: the input value belonging to the higher-order bits passes through the shifter and is then added by the adder to the input value of the lower-order bits, and the result is output to the next stage; the single value output by the last stage serves as the weight output value of the corresponding processing tile.

In one embodiment of the artificial intelligence accelerator, the shift amount of the shifter in the first stage is j/q memory cells, and the shift amount of the shifter in each later stage is twice that of the previous stage.

In one embodiment of the artificial intelligence accelerator, for the summation output circuit, each stage of the shift and add operation includes at least one shifter and at least one adder. The multiple input values of each stage are processed in adjacent pairs: the input value belonging to the higher-order bits passes through the shifter and is then added by the adder to the input value of the lower-order bits, and the result is output to the next stage; the single value output by the last stage serves as the summation value.

In one embodiment of the artificial intelligence accelerator, the shift amount of the shifter in the first stage is i/p bits, and the shift amount of the shifter in each later stage is twice that of the previous stage.

In one embodiment, the artificial intelligence accelerator further includes a normalization processing circuit that normalizes the summation value to obtain a normalized summation value, and a quantization processing circuit that quantizes the normalized summation value by a base into an integer value.

In one embodiment of the artificial intelligence accelerator, the processing circuit includes multiple sense amplifiers that each sense the convolution result of one block portion, producing multiple sensed values that serve as the weight operation values.

In one embodiment, the invention further provides a processing method for an artificial intelligence accelerator. The accelerator receives a binary input data group and performs a convolution operation with a selected one of multiple layers of overall weight patterns; the input data group is divided into multiple sub data groups. The method uses multiple processing tiles, each of which performs the following: a receiving element receives one of the sub data groups; a weight storage section stores a partial weight pattern of the overall weight pattern, the weight storage section comprising multiple weight blocks, each of which stores, in bit order, one block portion of the partial weight pattern, where the memory cell array structure of the weight storage section is arranged, relative to the corresponding sub data group, so that convolving the sub data group with each block portion yields multiple weight operation values in sequence; and a block-wise output circuit comprising multiple shifters and multiple adders sums the weight operation values, through multiple stages of shift and add operations, into the weight output value that direct convolution of the sub data group with the partial weight pattern would yield. A summation output circuit comprising multiple shifters and multiple adders then sums the weight output values, through multiple stages of shift and add operations, into the summation value that direct convolution of the input data group with the overall weight pattern would yield.

In one embodiment of the processing method, the input data group contains i bits and is divided into p sub data groups, where i and p are integers and each sub data group contains i/p bits.

In one embodiment of the processing method, the input data group contains i bits, the number of processing tiles is p, and the input data group is divided into p sub data groups, where i and p are integers greater than or equal to 2, i is greater than p, and each sub data group contains i/p bits.

In one embodiment of the processing method, the weight storage section contains q weight blocks and j bits, where j and q are integers greater than or equal to 2, j is greater than q, and each weight block contains j/q memory cells; the total number of memory cells across the weight storage sections of all processing tiles is p·q·2^(i/p+j/q).

In one embodiment of the processing method, the operation of the block-wise output circuit uses, in each stage of the shift and add operation, at least one shifter and at least one adder. The multiple input values of each stage are processed in adjacent pairs: the input value belonging to the higher-order bits passes through the shifter and is then added by the adder to the input value of the lower-order bits, and the result is output to the next stage; the single value output by the last stage serves as the weight output value of the corresponding processing tile.

In one embodiment of the processing method, the shift amount of the shifter in the first stage is j/q memory cells, and the shift amount of the shifter in each later stage is twice that of the previous stage.

In one embodiment of the processing method, the operation of the summation output circuit includes, in each stage of the shift and add operation, at least one shifter and at least one adder. The multiple input values of each stage are processed in adjacent pairs: the input value belonging to the higher-order bits passes through the shifter and is then added by the adder to the input value of the lower-order bits, and the result is output to the next stage; the single value output by the last stage serves as the summation value.

In one embodiment of the processing method, the shift amount of the shifter in the first stage is i/p bits, and the shift amount of the shifter in each later stage is twice that of the previous stage.

In one embodiment, the processing method further includes: using a normalization processing circuit to normalize the summation value into a normalized summation value; and using a quantization processing circuit to quantize the normalized summation value by a base into an integer value.

In one embodiment of the processing method, the processing circuit includes multiple sense amplifiers, which each sense the convolution result of one block portion to obtain multiple sensed values that serve as the weight operation values.

To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

20: artificial intelligence accelerator
50: input data
52: receiving element
54: storage unit
56: memory cell array structure
58: output data
60: sense amplifier
100_1, 100_2, 100_(p-1), 100_p: processing tiles
102_1, 102_2, 102_p: sub input data groups
104_1, 104_2, 104_p: output data
300: storage unit
302: weight block
308, 314, 316: shifters
312, 318: adders
350, 354, 356: shifters
352, 358: adders
400: processing circuit
402: multiplier
404: constant
406: adder
408: offset
500: quantization circuit
502: divider
504: base
600: overall system
602: artificial intelligence accelerator
604: control unit
700: memory
S100, S102, S104, S106, S108: steps

FIG. 1 is a schematic diagram of the basic architecture of an artificial intelligence accelerator according to an embodiment of the invention.

FIG. 2 is a schematic diagram of the operating mechanism of an artificial intelligence accelerator according to an embodiment of the invention.

FIG. 3 is a schematic planning diagram of an artificial intelligence accelerator according to an embodiment of the invention.

FIG. 4 is a schematic planning diagram of an artificial intelligence accelerator according to an embodiment of the invention.

FIG. 5 is a schematic diagram of the memory cell architecture of a storage unit according to an embodiment of the invention.

FIG. 6 is a schematic diagram of the mechanism for summing multiple weight blocks within one processing tile according to an embodiment of the invention.

FIG. 7 is a schematic diagram of the operating mechanism of the summation circuit between multiple processing tiles according to an embodiment of the invention.

FIG. 8 is a schematic diagram of an overall application configuration of an artificial intelligence accelerator according to an embodiment of the invention.

FIG. 9 is a schematic flow diagram of a processing method for an artificial intelligence accelerator according to an embodiment of the invention.

Embodiments of the present invention provide an artificial intelligence accelerator composed of partitioned input bits and partitioned weight blocks. The partitioned input bits operate in parallel with the partitioned weight blocks, and shift and add operations then combine the separately parallel-computed values to recover the operation result expected from a single chip. In this way, the processing speed of the artificial intelligence accelerator can be effectively increased and energy consumption reduced.

Several embodiments are provided below to illustrate the invention, but the invention is not limited to them.

FIG. 1 is a schematic diagram of the basic architecture of an artificial intelligence accelerator according to an embodiment of the invention. Referring to FIG. 1, the artificial intelligence accelerator 20 includes a NAND storage unit 54 configured as a 3D structure containing multiple layers of 2D memory arrays. Each memory cell of each layer's memory array stores one weight value, and all the weight values of a layer's array together constitute a weight pattern according to predetermined features. A weight pattern is, for example, the shape data of something to be recognized, such as a face, ears, eyes, a nose, a mouth, or another object. Each weight pattern is stored as a 2D memory array in one layer of the 3D NAND storage unit 54.

Because the memory cell array structure 56 of the artificial intelligence accelerator 20 is wired to correspond to the input data, the weight pattern stored in the memory cells can undergo a convolution operation with the input data 50 received and converted by the receiving element 52. This convolution is generally, for example, a matrix multiplication that produces an output value. Performing the convolution through the memory cell array structure 56 against one layer's weight pattern yields the output data 58, which can indicate how well the input data 50 matches the weight pattern. The convolution can be carried out in the usual manner of the field, without particular restriction, so its details are not described here. In terms of function, each layer's weight pattern acts like a filter for one object, achieving recognition by measuring the degree to which the input data 50 matches it.
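As a minimal sketch of this filter-like matching, assuming a plain element-wise multiply-and-sum stands in for the convolution step (the 3×3 values are hypothetical; the accelerator performs this step in analog inside the memory array rather than in software):

```python
import numpy as np

# One layer's stored weight pattern and one input patch (hypothetical values).
pattern = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [1, 0, 1]])
patch = np.array([[1, 0, 1],
                  [0, 1, 1],
                  [1, 0, 1]])

# Element-wise multiply and sum: the larger the score, the closer the match.
score = int(np.sum(patch * pattern))
print(score)  # 5, the maximum possible for this pattern's active positions
```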

FIG. 2 is a schematic diagram of the operating mechanism of an artificial intelligence accelerator according to an embodiment of the invention. Referring to FIGS. 1 and 2, the input data 50 is, for example, the digital data of an image. Taking motion-detection video as an example, part or all of each actual image the camera captures is checked by the artificial intelligence accelerator 20 for the presence of at least one of the various objects stored in the storage unit 54. With rising image resolution, one image contains a large amount of data. The storage unit 54 is a 3D stack of 2D memory cell arrays; one layer's memory cell array contains i bit lines for the input data and j select lines corresponding to the weight rows. That is, the storage unit 54 holding the weights is composed of multiple layers of i×j matrices, where i and j are large integers. The input data 50 is received on the bit lines of the storage unit 54, which carry the pixel data of the image. Through the peripheral processing circuits, the input data 50 and the weights undergo a convolution operation, including matrix multiplication, and the computed result is output as the output data 58.

A more direct convolution could operate one bit against one weight at a time. However, because the amount of data to process is large, the overall storage unit would also be large, forming a sizeable processing chip. Its operation would be slower, and the energy (heat) dissipated by a large chip during operation would also be greater. For the functions an artificial intelligence accelerator must provide, faster recognition is needed while operating energy should also be reduced.

FIG. 3 is a schematic planning diagram of an artificial intelligence accelerator according to an embodiment of the invention. Referring to FIG. 3, the present invention further proposes a way to organize the accelerator's computation. The accelerator of the invention still receives the whole input data 50 in parallel, but divides the input data 50 (also called the input data group) into multiple sub input data groups 102_1, ..., 102_p. Each sub input data group is handled by one processing tile 100_1, ..., 100_p, which performs the convolution for that sub input data group 102_1, ..., 102_p; each processing tile thus performs only part of the overall convolution. If, for example, the input data 50 spans i bit lines, the i bit lines are divided into p groups, where p is an integer of 2 or more. Each processing tile then contains i/p bit lines and receives one sub input data group 102_1, ..., 102_p of i/p bits of data. Here i is assumed divisible by p; if the p processing tiles cannot divide the i bit lines evenly, the last tile simply handles the remaining bit lines, planned according to actual need without restriction.

Under the architecture of FIG. 3, p processing tiles 100_1, ..., 100_p handle the weight pattern of the currently selected layer for the convolution operation. The overall input data is correspondingly divided into p sub input data groups 102_1, ..., 102_p and fed to the corresponding processing tiles 100_1, ..., 100_p. After the convolutions processed by the p tiles, the output values 104_1, ..., 104_p are obtained, for example as current values. The shift and add processing described later then recovers the result corresponding to convolving the whole input data group with the whole weight pattern.

In the partitioning of FIG. 3, the portion of the weight pattern stored in each processing tile is convolved directly with the sub input data group. The efficiency of the convolution can likely be raised further still, so in one embodiment the invention additionally partitions the weights into blocks.

FIG. 4 is a schematic planning diagram of an artificial intelligence accelerator according to an embodiment of the invention. Referring to FIG. 4, the overall predetermined input data group contains, for example, i data items ordered 0 to i-1, i.e. the binary values a_0 ... a_{i-1}; each bit a serves as the input of one bit line, so input arrives over i bit lines. In one embodiment the i items are divided into p groups, i.e. the sub input data groups 102_1, 102_2, .... Each sub input data group 102_1, 102_2, ... contains, say, i/p items, and the processing tiles 100_1, 100_2, ... are arranged in sequence, receiving the corresponding sub input data groups in the order of the overall input data group: the first tile receives a_0 through a_{i/p-1}, the next receives a_{i/p} through a_{2·i/p-1}, and so on. The sub input data groups 102_1, 102_2, ... are received by the receiving element 66, which includes, for example, a sense amplifier 60 to sense the digital input data; a bit line decoding circuit 62 produces the corresponding logic output, and a voltage switch 64 feeds the data in. The receiving element 66 is configured according to actual need, and the invention does not restrict its circuit configuration.

Each sub input data group 102_1, 102_2, ... is convolved by its corresponding processing tile 100_1, 100_2, .... Each tile's convolution is one part of the overall convolution, and the tiles process their received sub input data groups 102_1, 102_2, ... in parallel. The sub input data groups pass through the receiving element 66 into the associated memory cells of the storage unit 90.

In one embodiment, the number of memory cell rows storing weights is, for example, j, where j is a large integer; that is, each bit line has j memory cells, each storing one weight value. A memory cell row may also be called a select line. In one embodiment the j memory cells are divided into, say, q weight blocks 92; in embodiments where j is divisible by q, each weight block contains j/q memory cells. Seen from the output, one memory cell also corresponds to one bit of a binary string. The weight blocks likewise follow the weight ordering, with positions 0 to j-1 divided into q weight blocks 92.

The entire convolution must produce a summation value, denoted Sum, as shown in equation (1):

Sum = a * W (1)

where a denotes the input data group and W the two-dimensional array of the selected layer of weights in the storage unit.

Take, for example, a sub input data group of 8 bits, written as the binary string [a_0 a_1 ... a_7], e.g. [10011010], which corresponds to one decimal value. Similarly, each weight block is represented by a bit string: for example, the first weight block contains [W_0 ... W_{j/q-1}], and in order the last weight block is [W_{(q-1)·j/q} ... W_{j-1}]; each likewise represents one decimal value.

The entire convolution operation is then expressed as equation (2):

$$\begin{aligned}
\mathrm{SUM} = {} & \bigl([W_0 \ldots W_{j/q-1}]\cdot 2^{0} + \cdots + [W_{(q-1)\cdot j/q} \ldots W_{j-1}]\cdot 2^{j\cdot(q-1)/q}\bigr)\cdot 2^{0}\cdot [a_0 \ldots a_{i/p-1}] \\
& + \bigl([W_0 \ldots W_{j/q-1}]\cdot 2^{0} + \cdots + [W_{(q-1)\cdot j/q} \ldots W_{j-1}]\cdot 2^{j\cdot(q-1)/q}\bigr)\cdot 2^{i/p}\cdot [a_{i/p} \ldots a_{2\cdot i/p-1}] \\
& + \cdots \\
& + \bigl([W_0 \ldots W_{j/q-1}]\cdot 2^{0} + \cdots + [W_{(q-1)\cdot j/q} \ldots W_{j-1}]\cdot 2^{j\cdot(q-1)/q}\bigr)\cdot 2^{i\cdot(p-1)/p}\cdot [a_{(p-1)\cdot i/p} \ldots a_{i-1}]
\end{aligned} \tag{2}$$

For the weight pattern stored in the i×j two-dimensional array of FIG. 2, Sum is the value expected from convolving the weight pattern with the whole input data group (a_0 ... a_{i-1}). The convolution is integrated into the planning of the memory cell array structure: the wiring convolves the bit-aligned input data with the weight pattern stored in the selected layer's memory cells. The detailed matrix convolution is well understood in this technical field and is not described in detail here. In the embodiments of the invention, the weight data is divided according to the convolution so that multiple processing tiles 100_1, 100_2, ... compute in parallel, and within each tile the multiple weight blocks 92 can also compute in parallel. One embodiment further proposes, for each processing tile, using shifting and adding to recombine the divided weight blocks into the result a single whole weight block would give; likewise, shifting and adding across the multiple processing tiles sums the divided tiles into the overall desired operation value.
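The decomposition of equation (2) can be checked numerically. Below is a minimal Python sketch that treats the input word and one weight column as plain integers and redistributes the powers of two exactly as the tiles and blocks do (sizes and values are hypothetical; in the accelerator each block product is produced and sensed in analog, not computed digitally as here):

```python
def split_chunks(x, total_bits, n_chunks):
    """Split an integer's bits into n_chunks equal groups, least significant first."""
    width = total_bits // n_chunks
    mask = (1 << width) - 1
    return [(x >> (k * width)) & mask for k in range(n_chunks)]

i, p = 16, 4                     # input bits, number of processing tiles
j, q = 16, 4                     # weight bits, weight blocks per tile

a = 0b1001101011010011           # i-bit input data group (hypothetical)
W = 0b0110010111001010           # j-bit weight column (hypothetical)

a_sub = split_chunks(a, i, p)    # sub input data groups, i/p bits each
W_blk = split_chunks(W, j, q)    # weight blocks, j/q bits each

SUM = 0
for n, a_n in enumerate(a_sub):                    # one tile per sub group
    tile_out = 0
    for m, W_m in enumerate(W_blk):                # one weight block at a time
        tile_out += (a_n * W_m) << (m * (j // q))  # block-wise shift and add
    SUM += tile_out << (n * (i // p))              # tile-level shift and add

assert SUM == a * W              # matches the direct, unpartitioned operation
```

The assertion holds because the two nested sums simply regroup the powers of two in the product of the two binary expansions, which is what equation (2) states.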

Each processing tile 100_1, 100_2, ... is also provided with a processing circuit 70 that performs the convolution. In addition, the tiles 100_1, 100_2, ... are provided with a block-wise output circuit 80, comprising multiple stages of shift and add operations, which processes the parallel zeroth-stage output data, in bit (memory cell) order, to obtain the data corresponding to, for example, [W_0 ... W_{j/q-1}], .... Shift and add operations between the processing tiles then produce the final overall convolution result.

Under this plan, the storage used to process one weight block of one processing tile is 2^(i/p+j/q). Overall there are p processing tiles, each with q weight blocks, so the number of memory cells required can be reduced to p·q·2^(i/p+j/q).

The following describes in more detail how the overall operation result is obtained from the partitioned weight blocks and partitioned processing tiles.

FIG. 5 is a schematic diagram of the memory cell architecture of a storage unit according to an embodiment of the invention. Referring to FIG. 5, a processing tile's storage unit contains, for each bit line BL_1, BL_2, ..., multiple memory cell strings distributed vertically and connected to the bit line BL, forming a 3D structure. Each memory cell of a string belongs to one layer's memory cell array and stores one weight value of that layer's weight pattern. The memory cell strings on bit lines BL_1, BL_2, ... are enabled by the select lines SSL, and the memory cells corresponding to multiple select lines SSL form a weight block, as marked Block_n. The input data enters on the bit line BL and, under control, flows into the corresponding memory cells, performing the convolution; the results are then combined and output at the output terminal SL_n. The storage unit contains q blocks, as marked Block_n*q.

FIG. 6 is a schematic diagram of the mechanism that sums multiple weight blocks within one processing tile, according to an embodiment of the invention. Referring to FIG. 6, the storage unit 300 of one processing tile is divided into multiple weight blocks 302. Each weight block 302 is convolved with the sub input data group, and the operation value of each weight block 302 is output in parallel, as the thick arrows show; sense amplifiers SA then each output a sensed signal, for example a current value. Because the weights follow a binary ordering and are output in parallel, a decimal value must be recovered. One embodiment of the invention therefore provides a block-wise output circuit: an adder 312 adds each pair of adjacent output values, after the output value belonging to the higher-order bits of the pair first passes through a shifter 308 that moves it by a predetermined number of bits to its corresponding bit position. For example, since one weight block spans j/q bits (memory cells), the higher-order output value must be shifted up by j/q bits, so the shifter 308 of the first shift-and-add stage shifts by j/q bits. After the first-stage addition, each result represents a value of 2·j/q bits. The subsequent second shift-and-add stage works by the same mechanism, but the shift amount of the shifter 314 is 2·j/q bits. And so on: the last stage has only two input values, so only one shifter 316 is needed, with a shift amount of, for example, 2^(log2(q)-1)·j/q bits, yielding the convolution result of one processing tile.
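A sketch of this pairwise reduction tree, assuming the number of inputs is a power of two (the function name and structure are illustrative, not taken from the patent): each stage pairs adjacent values, shifts the higher-order member, adds, and doubles the shift for the next stage.

```python
def shift_add_tree(values, base_shift):
    """Reduce a list of block outputs to a single value.

    Stage 1 shifts the higher-order member of each adjacent pair by
    base_shift before adding; every later stage doubles the shift, so for
    q inputs the last of the log2(q) stages shifts by 2**(log2(q)-1)*base_shift.
    """
    shift = base_shift
    while len(values) > 1:
        values = [lo + (hi << shift)
                  for lo, hi in zip(values[0::2], values[1::2])]
        shift *= 2
    return values[0]

# Four block outputs with j/q = 4: the tree recovers the sum of v_m * 2**(4*m).
blocks = [3, 9, 5, 1]
assert shift_add_tree(blocks, 4) == sum(v << (4 * m) for m, v in enumerate(blocks))
```

The same tree, with a base shift of i/p bits, serves the tile-level summation circuit of FIG. 7 described below.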

Note additionally that the weight blocks of one layer's weight pattern may also be distributed across multiple different processing tiles, according to how the weight blocks are planned and combined; that is, the weight blocks stored in one processing tile need not all hold weight data of the same layer. Conversely, the weight blocks of one layer's weight data may be spread across multiple tiles, which can then compute in parallel: each of the involved processing tiles operates only on the blocks belonging to the layer being processed, and the operation data belonging to the same layer is combined afterwards.

The following describes the shift and add operations that integrate multiple processing tiles. FIG. 7 is a schematic diagram of the operating mechanism of the summation circuit between multiple processing tiles according to an embodiment of the invention. Referring to FIG. 7, the p processing tiles 100_1, 100_2, ..., 100_p each perform shift and add operations on the output values of FIG. 6; here each tile 100_1, 100_2, ..., 100_p refers to the convolution result, within the same layer's weight pattern, for its respective sub input data.

As in FIG. 6, the input data group is a binary string, but each sub input data group spans, say, i/p bits. The first shift-and-add stage therefore again takes each pair of adjacent outputs and adds them in the adder 352, the value belonging to the higher-order bits first passing through the shifter 350 for an i/p-bit shift. The shifter 354 of the next stage shifts by 2·i/p bits, and the shifter 356 of the last stage by 2^(log2(p)-1)·i/p bits. After the last shift-and-add stage, the summation value Sum of equation (1) is obtained.

The summation value Sum at this stage is a preliminary value; in practical applications it must subsequently be normalized, for example by the normalization processing circuit 400, to obtain the normalized summation value. The normalization circuit performs, for example, the operation of equation (3):

$$\mathrm{Sum}_{\mathrm{norm}} = \alpha \cdot \mathrm{Sum} + \beta \tag{3}$$

The constant α (404) is a scaling value applied to the summation value Sum through the multiplier 402, after which the adder 406 applies the offset β (408).

The normalized summation value then passes through the quantization circuit 500, which uses the divider 502 to divide it by the base d (504), quantizing it as in equation (4):

$$a' = \left\lfloor \frac{\mathrm{Sum}_{\mathrm{norm}}}{d} + 0.5 \right\rfloor \tag{4}$$

where adding 0.5 and taking the integer part effects rounding. In general, the better the input data group matches the layer's feature pattern, the larger the quantized value a'.
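A brief sketch of these two post-processing steps with hypothetical values for α, β, d, and Sum, reading equation (4) as a rounded division:

```python
Sum = 5523                         # raw summation value (hypothetical)
alpha, beta = 0.013, -4.0          # scaling and offset constants (hypothetical)
d = 16                             # quantization base (hypothetical)

sum_norm = alpha * Sum + beta      # equation (3): scale, then offset
a_quant = int(sum_norm / d + 0.5)  # equation (4): divide by the base and round
print(sum_norm, a_quant)           # about 67.8 and 4
```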

After the convolution with one layer's weight pattern is completed, the word lines select the next layer's weight pattern and the convolution continues.

FIG. 8 is a schematic diagram of an overall application configuration of an artificial intelligence accelerator according to an embodiment of the invention. Referring to FIG. 8, the artificial intelligence accelerator 602 of the overall system 600 communicates bidirectionally with a host control unit 604. The control unit 604 obtains input data, for example the digital data of an image, from an external memory 700 and feeds it to the artificial intelligence accelerator 602 for feature pattern recognition, the result being returned to the control unit 604. The application of the overall system 600 can be configured as actually needed and is not limited to the arrangement shown.

An embodiment of the invention also provides a processing method for an artificial intelligence accelerator. FIG. 9 is a schematic flow diagram of the processing method according to an embodiment of the invention.

Referring to FIG. 9, an embodiment of the invention further provides a processing method for an artificial intelligence accelerator. The accelerator receives a binary input data group and performs a convolution operation with a selected one of multiple layers of overall weight patterns; the input data group is divided into multiple sub data groups. The method includes step S100, using multiple processing tiles, each of which performs the following. Step S102: use a receiving element to receive one sub data group. Step S104: use a weight storage section to store a partial weight pattern of the overall weight pattern, the weight storage section comprising multiple weight blocks, each of which stores, in bit order, one block portion of the partial weight pattern, where the memory cell array structure of the weight storage section is arranged, relative to the corresponding sub data group, so that convolving the sub data group with each block portion yields multiple weight operation values in sequence. Step S106: use a block-wise output circuit comprising multiple shifters and multiple adders to sum, through multiple stages of shift and add operations, the weight operation values into the weight output value that direct convolution of the sub data group with the partial weight pattern would yield. Step S108: use a summation output circuit comprising multiple shifters and multiple adders to sum, through multiple stages of shift and add operations, the weight output values into the summation value that direct convolution of the input data group with the overall weight pattern would yield.

As described above, embodiments of the invention divide the weight data of the storage unit among multiple processing tiles for the convolution operation, and further divide each tile's storage unit into multiple weight blocks processed separately. Shift and add operations then yield the final overall summation value. Because each processing tile's circuit is smaller, operating speed can rise, and the energy a tile consumes during processing, e.g. as heat, can fall.

Although the invention has been disclosed above by way of embodiments, they are not intended to limit it. Anyone with ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.

100_1, 100_2, 100_(p-1), 100_p: processing tiles
350, 354, 356: shifters
352, 358: adders
400: processing circuit
402: multiplier
404: constant
406: adder
408: offset
500: quantization circuit
502: divider
504: base

Claims (10)

1. An artificial intelligence accelerator, receiving a binary input data group for a convolution operation with a selected one of multiple layers of overall weight patterns, the input data group being divided into multiple sub data groups, the artificial intelligence accelerator comprising multiple processing tiles, each processing tile comprising: a receiving element, receiving one of the sub data groups; a weight storage section, storing a partial weight pattern of the overall weight pattern, wherein the weight storage section comprises multiple weight blocks, each weight block storing, in bit order, one block portion of the partial weight pattern, wherein the memory cell array structure of the weight storage section, relative to the corresponding sub data group, is arranged so that convolving the sub data group with each block portion yields multiple weight operation values in sequence; and a block-wise output circuit, comprising multiple shifters and multiple adders, summing the multiple weight operation values through multiple stages of shift and add operations into the weight output value expected from direct convolution of the sub data group with the weight pattern; and a summation output circuit, comprising multiple shifters and multiple adders, summing the multiple weight output values through multiple stages of shift and add operations into the summation value expected from direct convolution of the input data group with the overall weight pattern.

2. The artificial intelligence accelerator of claim 1, wherein the input data group contains i bits divided into p sub data groups, i and p being integers, and each sub data group contains i/p bits.

3. The artificial intelligence accelerator of claim 1, wherein the input data group contains i bits, the number of the multiple processing tiles is p, the input data group is divided into p sub data groups, i and p are integers greater than or equal to 2, i is greater than p, and each sub data group contains i/p bits.

4. The artificial intelligence accelerator of claim 3, wherein the number of the multiple weight blocks in the weight storage section is q, the weight storage section contains j bits, j and q are integers greater than or equal to 2, j is greater than q, and each weight block contains j/q memory cells.

5. The artificial intelligence accelerator of claim 4, wherein the total number of memory cells across the weight storage sections of the multiple processing tiles is p·q·2^(i/p+j/q).
6. A processing method for an artificial intelligence accelerator, the accelerator receiving a binary input data group for a convolution operation with a selected one of multiple layers of overall weight patterns, the input data group being divided into multiple sub data groups, the method comprising: using multiple processing tiles, each performing: using a receiving element to receive one of the sub data groups; using a weight storage section to store a partial weight pattern of the overall weight pattern, wherein the weight storage section comprises multiple weight blocks, each weight block storing, in bit order, one block portion of the partial weight pattern, wherein the memory cell array structure of the weight storage section, relative to the corresponding sub data group, is arranged so that convolving the sub data group with each block portion yields multiple weight operation values in sequence; and using a block-wise output circuit comprising multiple shifters and multiple adders to sum the multiple weight operation values, through multiple stages of shift and add operations, into the weight output value expected from direct convolution of the sub data group with the weight pattern; and using a summation output circuit comprising multiple shifters and multiple adders to sum the multiple weight output values, through multiple stages of shift and add operations, into the summation value expected from direct convolution of the input data group with the overall weight pattern.

7. The processing method of claim 6, wherein the input data group contains i bits divided into p sub data groups, i and p being integers, and each sub data group contains i/p bits.

8. The processing method of claim 6, wherein the input data group contains i bits, the number of the multiple processing tiles is p, the input data group is divided into p sub data groups, i and p are integers greater than or equal to 2, i is greater than p, and each sub data group contains i/p bits.
9. The processing method of claim 8, wherein the number of the multiple weight blocks in the weight storage section is q, the weight storage section contains j bits, j and q are integers greater than or equal to 2, j is greater than q, each weight block contains j/q memory cells, and the total number of memory cells across the weight storage sections of the multiple processing tiles is p·q·2^(i/p+j/q).

10. The processing method of claim 9, wherein for the operation of the block-wise output circuit, each stage of the shift and add operation includes using at least one of the shifters and at least one of the adders, and the multiple input values of each stage are processed in adjacent pairs, the input value belonging to the higher-order bits passing through the shifter and then being added by the adder to the input value of the lower-order bits, the result being output to the next stage, with the single value output by the last stage serving as the weight output value of the corresponding processing tile.
TW109103471A 2020-02-05 2020-02-05 Artificial intelligence accelerator and operation thereof TWI727643B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109103471A TWI727643B (en) 2020-02-05 2020-02-05 Artificial intelligence accelerator and operation thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109103471A TWI727643B (en) 2020-02-05 2020-02-05 Artificial intelligence accelerator and operation thereof

Publications (2)

Publication Number Publication Date
TWI727643B true TWI727643B (en) 2021-05-11
TW202131316A TW202131316A (en) 2021-08-16

Family

ID=77036256

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109103471A TWI727643B (en) 2020-02-05 2020-02-05 Artificial intelligence accelerator and operation thereof

Country Status (1)

Country Link
TW (1) TWI727643B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190339981A1 (en) * 2017-07-30 2019-11-07 NeuroBlade, Ltd. Memory-based distributed processor architecture
WO2019173104A1 (en) * 2018-03-06 2019-09-12 DinoplusAI Holdings Limited Ai accelerator virtualization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party

Title
Wei Gao and Pingqiang Zhou, "Customized High Performance and Energy Efficient Communication Networks for AI Chips", IEEE Access, vol. 7, 17 May 2019 *

Also Published As

Publication number Publication date
TW202131316A (en) 2021-08-16

Similar Documents

Publication Publication Date Title
Kang Accelerator-aware pruning for convolutional neural networks
US20220027717A1 (en) Convolutional Neural Network Hardware Configuration
CN109063825B (en) Convolutional neural network accelerator
US20180247182A1 (en) Information Processing Apparatus, Image Recognition Apparatus, and Parameter Setting Method for Convolutional Neural Network
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN112434801B (en) Convolution operation acceleration method for carrying out weight splitting according to bit precision
GB2568081A (en) End-to-end data format selection for hardware implementation of deep neural network
CN110362293B (en) Multiplier, data processing method, chip and electronic equipment
GB2568082A (en) Hierarchical mantissa bit length selection for hardware implementation of deep neural network
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
US20230221924A1 (en) Apparatus and Method for Processing Floating-Point Numbers
Ahn et al. Deeper weight pruning without accuracy loss in deep neural networks: Signed-digit representation-based approach
CN112561049B (en) Resource allocation method and device of DNN accelerator based on memristor
TWI727643B (en) Artificial intelligence accelerator and operation thereof
US11500767B2 (en) Method and device for determining a global memory size of a global memory size for a neural network
CN111258544B (en) Multiplier, data processing method, chip and electronic equipment
EP4206993A1 (en) Configurable pooling processing unit for neural network accelerator
US11847429B2 (en) Apparatus and method for processing floating-point numbers
CN209879493U (en) Multiplier and method for generating a digital signal
CN113220626A (en) Artificial intelligence accelerator and processing method thereof
Yang et al. Value-driven synthesis for neural network ASICs
CN111290994B (en) Discrete three-dimensional processor
US20220222044A1 (en) Multiplication-and-accumulation circuits and processing-in-memory devices having the same
EP4345691A1 (en) Methods and systems for performing channel equalisation on a convolution layer in a neural network
CN111260069B (en) Data processing device, method, chip and electronic equipment