TW201928796A - Integrated circuit chip device and related product having the advantages of small calculation amount and low power consumption - Google Patents

Info

Publication number: TW201928796A
Application number: TW107144037A
Authority: TW (Taiwan)
Prior art keywords: data, processing circuit, basic processing, basic, circuit
Other languages: Chinese (zh)
Inventors: 劉少禮, 宋新開, 王秉睿, 張堯, 胡帥
Original assignee: 大陸商北京中科寒武紀科技有限公司
Application filed by 大陸商北京中科寒武紀科技有限公司
Application granted; granted publication: TWI767097B
Classifications

    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/382 Information transfer, e.g. on bus using universal interface adapter
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present disclosure provides an integrated circuit chip device and a related product. The integrated circuit chip device comprises a main processing circuit and a plurality of basic processing circuits. At least one of the main processing circuit or the plurality of basic processing circuits includes a data type conversion circuit configured to perform conversion between floating-point data and fixed-point data. The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column. The technical solution provided by the present disclosure has the advantages of a small amount of calculation and low power consumption.

Description

Integrated circuit chip device and related products

The present disclosure relates to the field of neural networks, and in particular to an integrated circuit chip device and related products.

An artificial neural network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the neuron network of the human brain from the perspective of information processing, builds simple models, and forms different networks according to different connection schemes. In engineering and academia it is often simply called a neural network. A neural network is a computational model composed of a large number of interconnected nodes (also called neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such operations involve a large amount of computation and high power consumption.

Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed and the efficiency of a computing device.

According to a first aspect, an integrated circuit chip device is provided, including a main processing circuit and a plurality of basic processing circuits;

the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k of the basic processing circuits, the k basic processing circuits being: the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;

the plurality of basic processing circuits include a data type conversion circuit for performing conversion between floating-point data and fixed-point data;

the main processing circuit is configured to perform the successive operations of a neural network operation and to transfer data to and from the k basic processing circuits;

the k basic processing circuits are configured to forward data between the main processing circuit and the remaining basic processing circuits;

the plurality of basic processing circuits are configured to determine, according to the transferred data and the type of operation, whether to enable the data type conversion circuit to convert the data type of the transferred data, to perform the operations of the neural network in parallel according to the transferred data, and to transmit the operation results to the main processing circuit through the k basic processing circuits.

According to a second aspect, a neural network computing device is provided, which includes one or more of the integrated circuit chip devices provided in the first aspect.

According to a third aspect, a combined processing device is provided, which includes the neural network computing device provided in the second aspect, a universal interconnection interface, and a general-purpose processing device;

the neural network computing device is connected to the general-purpose processing device through the universal interconnection interface.

According to a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.

According to a fifth aspect, an electronic device is provided, which includes the chip of the fourth aspect.

According to a sixth aspect, a neural network operation method is provided. The method is applied in an integrated circuit chip device, the integrated circuit chip device includes the integrated circuit chip device of the first aspect, and the integrated circuit chip device is configured to perform the operations of a neural network.

It can be seen that, with the embodiments of the present disclosure, a data type conversion circuit is provided to convert the type of a data block before the operation, which saves transmission resources and computing resources, so the solution has the advantages of low power consumption and a small amount of calculation.

To enable those skilled in the art to better understand the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.

In the device provided by the first aspect, the main processing circuit is configured to obtain a data block to be calculated and an operation instruction, divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction, split the distribution data block into a plurality of basic data blocks, distribute the plurality of basic data blocks to the k basic processing circuits, and broadcast the broadcast data block to the k basic processing circuits;

the plurality of basic processing circuits are configured to convert, according to the received basic data block, broadcast data block, and operation instruction, the basic data block and the broadcast data block into a basic data block and a broadcast data block of the fixed-point data type, perform an inner product operation on the basic data block and the broadcast data block in the fixed-point data type to obtain a fixed-point operation result, convert the fixed-point operation result into a floating-point operation result, and transmit it to the main processing circuit through the k basic processing circuits;

the main processing circuit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction.

Optionally, the main processing circuit is specifically configured to broadcast the broadcast data block to the k basic processing circuits in a single broadcast.

In the device provided by the first aspect, the main processing circuit is configured to, when the operation result is the result of an inner product operation, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be calculated and the operation instruction.

In the device provided by the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the k basic processing circuits in multiple broadcasts.

In the device provided by the first aspect, the plurality of basic processing circuits are specifically configured to convert the partial broadcast data block and the basic data block into the fixed-point data type, perform one inner product operation in the fixed-point data type to obtain a fixed-point inner product result, accumulate the fixed-point inner product results to obtain a fixed-point partial operation result, convert the fixed-point partial operation result into a floating-point operation result, and send it to the main processing circuit through the k basic processing circuits.

In the device provided by the first aspect, the plurality of basic processing circuits are specifically configured to reuse the partial broadcast data block n times, performing in the fixed-point data type the inner product operations of the partial broadcast data block with the n basic data blocks to obtain n fixed-point partial processing results, accumulate the n fixed-point partial processing results respectively to obtain n fixed-point partial operation results, enable the data type conversion circuit to convert the n fixed-point partial operation results into n floating-point partial operation results, and send them to the main processing circuit through the k basic processing circuits, where n is an integer greater than or equal to 2.

In the device provided by the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit;

the plurality of basic processing circuits include basic registers or basic on-chip cache circuits.

In the device provided by the first aspect, the main processing circuit includes one or any combination of a vector operator circuit, an arithmetic logic unit (ALU) circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access (DMA) circuit, a data type conversion circuit, and a data rearrangement circuit.

In the device provided by the first aspect, the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.

In the device provided by the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block;

if the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.

In the method provided by the sixth aspect, the neural network operation includes one or any combination of a convolution operation, a matrix-times-matrix operation, a matrix-times-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

Referring to FIG. 1a, FIG. 1a shows an integrated circuit chip device provided by the present disclosure. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits arranged in an m*n array, where m and n are integers greater than or equal to 1 and at least one of m and n is greater than or equal to 2. For the plurality of basic processing circuits distributed in the m*n array, each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k of the basic processing circuits; the k basic processing circuits may be: the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column. In the integrated circuit chip device shown in FIG. 1a, the main processing circuit and/or the plurality of basic processing circuits may include a data type conversion circuit; specifically, some of the basic processing circuits may include the data type conversion circuit. For example, in an optional technical solution, the k basic processing circuits may be configured with the data type conversion circuit, so that the n basic processing circuits of the first row may each be responsible for the data type conversion step for the data of the m basic processing circuits of their own column. This arrangement improves operation efficiency and reduces power consumption: since the n basic processing circuits of the first row are the first to receive the data sent by the main processing circuit, converting the received data into fixed-point data reduces the amount of computation of the subsequent basic processing circuits and the amount of data transmitted to them. Likewise, configuring the m basic processing circuits of the first column with the data type conversion circuit also has the advantages of a small amount of calculation and low power consumption. In addition, with this structure the main processing circuit can adopt a dynamic data transmission strategy, for example broadcasting data to the m basic processing circuits of the first column and distributing data to the n basic processing circuits of the first row. The advantage is that different data are delivered to the basic processing circuits through different data input ports, so a basic processing circuit does not need to distinguish what kind of data it has received; it only needs to determine from which port the data was received to know what kind of data it is.
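
As an illustration only (the function name and the 0-based row/column indexing are assumptions, not part of the disclosure), the following sketch enumerates which basic processing circuits of an m*n array are directly connected to the main processing circuit under the scheme described above; circuits that belong to two of the three groups are counted once.

```python
def edge_connected_circuits(m, n):
    """Positions (row, col) of the basic processing circuits directly connected
    to the main processing circuit: the n circuits of the first row, the n
    circuits of the m-th (last) row, and the m circuits of the first column.
    A set is used so that corner circuits shared by two groups appear once."""
    first_row = {(0, c) for c in range(n)}
    last_row = {(m - 1, c) for c in range(n)}
    first_col = {(r, 0) for r in range(m)}
    return first_row | last_row | first_col

# Example: a 4 x 4 array, as in the 16-circuit embodiment described later.
circuits = edge_connected_circuits(4, 4)
print(len(circuits), sorted(circuits))
```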

The main processing circuit is configured to perform the successive operations of a neural network operation and to transfer data to the basic processing circuits connected to it; the successive operations include, but are not limited to, accumulation operations, ALU operations, activation operations, and so on.

The plurality of basic processing circuits are configured to perform the operations of the neural network in parallel according to the transferred data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The operations performed in parallel include, but are not limited to, inner product operations, matrix or vector multiplication operations, and so on.

The main processing circuit may include a data sending circuit, a data receiving circuit, or an interface. The data sending circuit may integrate a data distribution circuit and a data broadcasting circuit; in practical applications the data distribution circuit and the data broadcasting circuit may also be provided separately. Broadcast data is data that needs to be sent to every basic processing circuit. Distribution data is data that needs to be sent selectively to some of the basic processing circuits. Specifically, taking the convolution operation as an example, the convolution input data needs to be sent to all basic processing circuits, so it is broadcast data, while the convolution kernels need to be sent selectively to some of the basic processing circuits, so the convolution kernels are distribution data. Which basic processing circuit a piece of distribution data is sent to may be determined by the main processing circuit according to the load and other allocation policies. In the broadcast sending mode, the broadcast data is sent to every basic processing circuit in broadcast form. (In practical applications, the broadcast data may be sent to every basic processing circuit in a single broadcast or in multiple broadcasts; the embodiments of the present disclosure do not limit the number of broadcasts.) In the distribution sending mode, the distribution data is sent selectively to some of the basic processing circuits.

The main processing circuit (as shown in FIG. 1d) may include registers and/or an on-chip cache circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and other circuits. In practical applications, the main processing circuit may also include other circuits such as a conversion circuit (for example, a matrix transposition circuit), a data rearrangement circuit, or an activation circuit.

Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and so on. The inner product operator circuit, the vector operator circuit, and the accumulator circuit may be integrated circuits, or they may be separately provided circuits.

Optionally, the accumulator circuits of the n basic processing circuits of the m-th row may perform the accumulation step of the inner product operation, because the basic processing circuits of the m-th row can receive the product results of all the basic processing circuits of their own column. Performing the accumulation of the inner product operation in the n basic processing circuits of the m-th row allocates the computing resources effectively and has the advantage of saving power. This solution is especially suitable when m is large.

The circuits that perform the data type conversion may be assigned by the main processing circuit, either explicitly or implicitly. In the explicit manner, the main processing circuit may issue a special indication or instruction; when a basic processing circuit receives the special indication or instruction, it performs the data type conversion, and when it does not receive the special indication or instruction, it does not perform the conversion. In the implicit manner, for example, when a basic processing circuit receives floating-point data and determines that an inner product operation needs to be performed, it converts the data into the fixed-point type. For the explicit manner, the special instruction or indication may carry a decrementing sequence whose value decreases by 1 each time it passes through a basic processing circuit; each basic processing circuit reads the value of the decrementing sequence, performs the data type conversion if the value is greater than zero, and does not perform it if the value is equal to or less than zero. This setting is configured according to the basic processing circuits arranged in the array. For example, for the m basic processing circuits of the i-th column, if the main processing circuit wants the first 5 basic processing circuits to perform data type conversion, it issues a special instruction containing a decrementing sequence whose initial value may be 5; the value decreases by 1 after each basic processing circuit, so it is 1 at the fifth basic processing circuit and 0 at the sixth, at which point the sixth basic processing circuit no longer performs the data type conversion. In this way, the main processing circuit can dynamically configure which circuits perform the data type conversion and how many times it is performed.
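
A minimal sketch of the decrementing-sequence mechanism just described, assuming 0-based circuit indices within one column (the helper name is illustrative and not part of the disclosure): each circuit reads the counter, converts only if it is greater than zero, and passes the decremented value on.

```python
def propagate_conversion_flag(num_circuits, initial_count):
    """Decide which circuits of one column perform data type conversion under
    the decrementing-sequence scheme: the counter starts at initial_count and
    is decremented by 1 at each circuit; a circuit converts only while the
    value it reads is greater than zero."""
    decisions = []
    counter = initial_count
    for circuit_id in range(num_circuits):
        decisions.append((circuit_id, counter > 0))  # True means "perform conversion"
        counter -= 1                                  # pass the decremented value downstream
    return decisions

# Example: a column of 8 circuits where the first 5 should convert.
print(propagate_conversion_flag(num_circuits=8, initial_count=5))
```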

An embodiment of the present disclosure provides an integrated circuit chip device including one main processing circuit (which may also be called a main unit) and a plurality of basic processing circuits (which may also be called basic units); the structure of the embodiment is shown in FIG. 1b, where the dashed box is the internal structure of the neural network computing device, the gray-filled arrows indicate the data transmission paths between the main processing circuit and the basic processing circuit array, and the hollow arrows indicate the data transmission paths between the basic processing circuits (adjacent basic processing circuits) within the array. The height and width of the basic processing circuit array may be different, that is, the values of m and n may be different, and of course they may also be the same; the present disclosure does not limit these specific values.

The circuit structure of a basic processing circuit is shown in FIG. 1c. In the figure, the dashed box indicates the boundary of the basic processing circuit; the thick arrows crossing the dashed box indicate the data input and output channels (an arrow pointing into the dashed box is an input channel, and an arrow pointing out of the dashed box is an output channel); the rectangular boxes inside the dashed box indicate the storage unit circuits (registers and/or on-chip caches), which hold input data 1, input data 2, the multiplication or inner product results, and the accumulated data; and the diamond-shaped boxes indicate the operator circuits, including the multiplication or inner product operator and the adder.

In this embodiment, the neural network computing device includes one main processing circuit and 16 basic processing circuits (the 16 basic processing circuits are only for illustration; other numbers may be used in practical applications);

in this embodiment, each basic processing circuit has two data input interfaces and two data output interfaces; in the following description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in FIG. 1b) is called input 0, the vertical input interface (the vertical arrow pointing to the unit in FIG. 1b) is called input 1, each horizontal data output interface (the horizontal arrow pointing out of the unit in FIG. 1b) is called output 0, and the vertical data output interface (the vertical arrow pointing out of the unit in FIG. 1b) is called output 1.

The data input interfaces and data output interfaces of each basic processing circuit may be connected to different units, including the main processing circuit and other basic processing circuits;

in this example, input 0 of the four basic processing circuits 0, 4, 8, and 12 (numbered as in FIG. 1b) is connected to the data output interface of the main processing circuit;

in this example, input 1 of the four basic processing circuits 0, 1, 2, and 3 is connected to the data output interface of the main processing circuit;

in this example, output 1 of the four basic processing circuits 12, 13, 14, and 15 is connected to the data input interface of the main processing circuit;

in this example, the connections between the output interfaces of the basic processing circuits and the input interfaces of other basic processing circuits are shown in FIG. 1b and are not listed one by one;

specifically, if output interface S1 of unit S is connected to input interface P1 of unit P, unit P can receive, from its interface P1, the data that unit S sends to its interface S1.

This embodiment includes one main processing circuit. The main processing circuit is connected to an external device (that is, it has both input interfaces and output interfaces); some of the data output interfaces of the main processing circuit are connected to the data input interfaces of some of the basic processing circuits, and some of the data input interfaces of the main processing circuit are connected to the data output interfaces of some of the basic processing circuits.
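
For reference only, the connections to the main processing circuit listed above for the 16-circuit example can be summarized as below (the dictionary form is an illustrative assumption; all other interfaces connect adjacent basic processing circuits as shown in FIG. 1b).

```python
# Interfaces of the 16-circuit (4 x 4) example that are tied to the main
# processing circuit, as listed above.
main_connections = {
    "input 0 from main processing circuit": [0, 4, 8, 12],   # first column
    "input 1 from main processing circuit": [0, 1, 2, 3],    # first row
    "output 1 to main processing circuit": [12, 13, 14, 15], # last row
}

for interface, circuits in main_connections.items():
    print(f"{interface}: basic processing circuits {circuits}")
```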

Method for using the integrated circuit chip device

The data involved in the usage methods provided by the present disclosure may be of any data type, for example data represented by floating-point numbers of arbitrary bit width or data represented by fixed-point numbers of arbitrary bit width.

A schematic structure of the fixed-point data is shown in FIG. 1e, which illustrates one way of representing fixed-point data. In a computing system, one floating-point value occupies 32 bits of storage, whereas one fixed-point value, in particular one represented in the form shown in FIG. 1e, can occupy 16 bits or fewer. This conversion therefore greatly reduces the transmission overhead between computing units; in addition, data with fewer bits occupies less storage space, so the storage overhead is smaller, and the amount of computation is also reduced, so the computation overhead is smaller. However, the data type conversion itself also incurs some overhead, referred to below as the conversion overhead. For data with a large amount of computation and a large amount of storage, the conversion overhead is almost negligible compared with the subsequent computation, storage, and transmission overheads, so for such data the present disclosure adopts the technical solution of converting the data into the fixed-point type. Conversely, for data with a small amount of computation and a small amount of storage, the computation, storage, and transmission overheads are already small; since the precision of fixed-point data is slightly lower than that of floating-point data, and the precision of the calculation must be guaranteed under the premise of a small amount of computation, the fixed-point data is converted into floating-point data in this case, that is, a small overhead is added in order to improve the precision of the calculation.
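
The exact bit layout of FIG. 1e is not reproduced in this text; as a hedged sketch only, the following assumes a generic signed fixed-point format with a configurable number of fractional bits, to illustrate the conversion between 32-bit floating-point data and 16-bit fixed-point data and the small precision loss it introduces.

```python
import numpy as np

def float_to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize floating-point values to signed fixed-point values with
    `frac_bits` fractional bits stored in `total_bits` bits (a generic format
    assumed for illustration; the format of FIG. 1e may differ)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(np.asarray(x, dtype=np.float32) * scale), lo, hi).astype(np.int16)

def fixed_to_float(q, frac_bits=8):
    """Convert fixed-point values back to floating point."""
    return q.astype(np.float32) / (1 << frac_bits)

x = np.array([0.75, -1.5, 3.14159], dtype=np.float32)
q = float_to_fixed(x)
print(q, fixed_to_float(q))  # the round trip shows the small quantization error
```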

The operations that need to be completed in the basic processing circuits can be carried out in the following ways:

the main processing circuit may first convert the data type and then transfer the data to the basic processing circuits for the operation (for example, the main processing circuit may convert floating-point numbers into fixed-point numbers with a lower bit width and then transfer them to the basic processing circuits; the advantages are that the bit width of the transferred data and the total number of transferred bits are reduced, and the basic processing circuits perform the low-bit-width fixed-point operations with higher efficiency and lower power consumption);

a basic processing circuit may first perform data type conversion after receiving data and then perform the calculation (for example, the basic processing circuit receives floating-point numbers transferred from the main processing circuit and converts them into fixed-point numbers for the operation, which improves operation efficiency and reduces power consumption);

after a basic processing circuit computes a result, it may first perform data type conversion and then transfer the result to the main processing circuit (for example, a floating-point operation result computed by the basic processing circuit may first be converted into a low-bit-width fixed-point number and then transferred to the main processing circuit; the benefits are that the data bit width during transmission is reduced, the efficiency is higher, and power consumption is saved).

Method for using the basic processing circuits (as shown in FIG. 2a):

the main processing circuit receives the input data to be calculated from outside the device;

optionally, the main processing circuit processes the data using the various operation circuits of its own unit, such as the vector operation circuit, the inner product operator circuit, and the accumulator circuit;

the main processing circuit sends data through its data output interfaces to the basic processing circuit array (the collection of all basic processing circuits is called the basic processing circuit array), as shown in FIG. 2b;

the data may be sent here by sending it directly to a part of the basic processing circuits, that is, in the multiple-broadcast mode;

the data may also be sent by sending different data to different basic processing circuits, that is, in the distribution mode;

the basic processing circuit array performs the calculation on the data;

a basic processing circuit performs operations after receiving the input data;

optionally, after receiving data, a basic processing circuit transfers the data out through the data output interfaces of its own unit (to other basic processing circuits that did not receive data directly from the main processing circuit);

optionally, a basic processing circuit transfers its operation result out through its data output interfaces (an intermediate calculation result or the final calculation result);

the main processing circuit receives the output data returned from the basic processing circuit array;

optionally, the main processing circuit continues to process the data received from the basic processing circuit array (for example, an accumulation or activation operation);

after the main processing circuit finishes processing, it transfers the processing result from its data output interface to the outside of the device.

Using the circuit device to complete a matrix-times-vector operation;

(matrix-times-vector may be computed by taking the inner product of each row of the matrix with the vector and arranging these results into a vector in the order of the corresponding rows.)

The following describes the operation of multiplying a matrix S of size M rows by L columns with a vector P of length L, as shown in FIG. 2c.

This method uses all or some of the basic processing circuits of the neural network computing device; suppose K basic processing circuits are used;

the main processing circuit sends data from some or all of the rows of the matrix S to each of the K basic processing circuits;

in an optional solution, the control circuit of the main processing circuit sends the data of one row of the matrix S to a given basic processing circuit one number or one portion of numbers at a time (for example, when sending one number at a time, for a given basic processing circuit, the 1st number of the 3rd row is sent the 1st time, the 2nd number of the 3rd row is sent the 2nd time, the 3rd number of the 3rd row is sent the 3rd time, and so on; or, when sending a portion of numbers at a time, the first two numbers of the 3rd row (that is, the 1st and 2nd numbers) are sent the 1st time, the 3rd and 4th numbers of the 3rd row are sent the 2nd time, the 5th and 6th numbers of the 3rd row are sent the 3rd time, and so on);

in an optional solution, the control circuit of the main processing circuit sends the data of several rows of the matrix S to a given basic processing circuit one number or one portion of numbers from each row at a time (for example, for a given basic processing circuit, the 1st numbers of the 3rd, 4th, and 5th rows are sent the 1st time, the 2nd numbers of the 3rd, 4th, and 5th rows are sent the 2nd time, the 3rd numbers of the 3rd, 4th, and 5th rows are sent the 3rd time, and so on; or the first two numbers of each of the 3rd, 4th, and 5th rows are sent the 1st time, the 3rd and 4th numbers of each of the 3rd, 4th, and 5th rows are sent the 2nd time, the 5th and 6th numbers of each of the 3rd, 4th, and 5th rows are sent the 3rd time, and so on);

the control circuit of the main processing circuit sends the data of the vector P to the 0th basic processing circuit successively;

after the 0th basic processing circuit receives the data of the vector P, it sends the data to the next basic processing circuit connected to it, namely basic processing circuit 1;

specifically, some basic processing circuits cannot obtain all the data required for the calculation directly from the main processing circuit; for example, basic processing circuit 1 in FIG. 2d has only one data input interface connected to the main processing circuit, so it can obtain the data of the matrix S directly from the main processing circuit only, while the data of the vector P has to be passed to it by basic processing circuit 0; likewise, after receiving the data, basic processing circuit 1 also has to continue to pass the data of the vector P on to basic processing circuit 2.

Each basic processing circuit performs operations on the received data; the operations include, but are not limited to, inner product operations, multiplication operations, addition operations, and so on;

in an optional solution, a basic processing circuit computes the multiplication of one or more pairs of data at a time and then accumulates the results into its register and/or on-chip cache;

in an optional solution, a basic processing circuit computes the inner product of one or more pairs of vectors at a time and then accumulates the results into its register and/or on-chip cache;

after a basic processing circuit computes a result, it transfers the result out through its data output interface (that is, to other basic processing circuits connected to it);

in an optional solution, the calculation result may be the final result or an intermediate result of the inner product operation;

after a basic processing circuit receives a calculation result from another basic processing circuit, it transfers that data to other basic processing circuits or to the main processing circuit connected to it;

the main processing circuit receives the results of the inner product operations of all the basic processing circuits and processes these results to obtain the final result (the processing may be an accumulation operation, an activation operation, and so on).

An embodiment of implementing the matrix-times-vector method with the above computing device:

in an optional solution, the plurality of basic processing circuits used by the method are arranged as shown in FIG. 2d or FIG. 2e;

as shown in FIG. 2c, the data type conversion circuit of the main processing circuit converts the matrix S and the vector P into fixed-point data; the control circuit of the main processing circuit divides the M rows of data of the matrix S into K groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in this group of data is denoted Ai);

the method of grouping the M rows of data here may be any grouping method that does not allocate any row more than once;

in an optional solution, the following allocation is used: the j-th row is allocated to the (j % K)-th basic processing circuit (% is the remainder operation);

in an optional solution, for cases where the rows cannot be grouped evenly, a part of the rows may first be allocated evenly and the remaining rows may be allocated in an arbitrary manner.
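
A minimal sketch of the j % K row grouping just described, assuming 0-based row and circuit indices (the helper name is illustrative):

```python
def group_rows(M, K):
    """Assign each of the M rows of the matrix S to one of K basic processing
    circuits using the j % K rule, so that no row is allocated twice.
    Returns the row groups A0 .. A(K-1)."""
    groups = [[] for _ in range(K)]
    for j in range(M):
        groups[j % K].append(j)
    return groups

print(group_rows(M=10, K=4))  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```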

The control circuit of the main processing circuit sends the data of some or all of the rows of the matrix S to the corresponding basic processing circuits in turn;

in an optional solution, the control circuit of the main processing circuit sends, each time, one or more pieces of data from one row of the i-th group of data Ai for which the i-th basic processing circuit is responsible;

in an optional solution, the control circuit of the main processing circuit sends, each time, one or more pieces of data from each of some or all of the rows of the i-th group of data Ai for which the i-th basic processing circuit is responsible;

the control circuit of the main processing circuit sends the data of the vector P to the 1st basic processing circuit in sequence;

in an optional solution, the control circuit of the main processing circuit may send one or more pieces of data of the vector P at a time;

after the i-th basic processing circuit receives the data of the vector P, it sends the data to the (i+1)-th basic processing circuit connected to it;

after each basic processing circuit receives one or more pieces of data from one or more rows of the matrix S and one or more pieces of data from the vector P, it performs operations (including but not limited to multiplication or addition);

in an optional solution, a basic processing circuit computes the multiplication of one or more pairs of data at a time and then accumulates the results into its register and/or on-chip cache;

in an optional solution, a basic processing circuit computes the inner product of one or more pairs of vectors at a time and then accumulates the results into its register and/or on-chip cache;

in an optional solution, the data received by a basic processing circuit may also be an intermediate result, which is stored in its register and/or on-chip cache;

a basic processing circuit transfers its local calculation result to the next basic processing circuit or to the main processing circuit connected to it;

in an optional solution, corresponding to the structure of FIG. 2d, only the output interface of the last basic processing circuit of each column is connected to the main processing circuit; in this case, only the last basic processing circuit can transfer its local calculation result directly to the main processing circuit, and the calculation results of the other basic processing circuits have to be passed to their own next basic processing circuit, which passes them on to the circuit after it, until everything has been transferred to the last basic processing circuit; the last basic processing circuit accumulates its local calculation result and the results received from the other basic processing circuits of its column to obtain an intermediate result, and sends the intermediate result to the main processing circuit; alternatively, the last basic processing circuit may also send the results of the other basic processing circuits of its column and its local processing result directly to the main processing circuit.

In an optional solution, corresponding to the structure of FIG. 2e, every basic processing circuit has an output interface connected to the main processing circuit; in this case, every basic processing circuit transfers its local calculation result directly to the main processing circuit;

after a basic processing circuit receives a calculation result passed from another basic processing circuit, it transfers it to the next basic processing circuit or to the main processing circuit connected to it.

The main processing circuit receives the results of the M inner product operations as the operation result of the matrix-times-vector operation.
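
As a hedged, purely functional sketch of the matrix-times-vector flow described above (the fixed-point scaling helpers and the j % K row assignment are illustrative assumptions; the forwarding of data along the chain of basic processing circuits and its timing are not modeled), the following shows the conversion to fixed point, the per-circuit inner products, and the conversion of the result back to floating point.

```python
import numpy as np

def matrix_times_vector_sketch(S, P, K=4, frac_bits=8):
    """Illustrative model of the described flow: the main processing circuit
    converts S and P to fixed point, row j of S is handled by basic processing
    circuit j % K, each circuit computes the fixed-point inner products of its
    rows with P, and the main processing circuit converts the collected
    results back to floating point and arranges them in row order."""
    scale = 1 << frac_bits
    S_fix = np.round(S * scale).astype(np.int64)   # conversion in the main processing circuit
    P_fix = np.round(P * scale).astype(np.int64)

    M = S_fix.shape[0]
    result_fix = np.zeros(M, dtype=np.int64)
    for i in range(K):                               # basic processing circuit i
        for j in range(i, M, K):                     # its row group Ai (rows j with j % K == i)
            result_fix[j] = np.dot(S_fix[j], P_fix)  # fixed-point inner product
    return result_fix.astype(np.float64) / (scale * scale)  # back to floating point

S = np.random.rand(6, 5)
P = np.random.rand(5)
print(np.allclose(matrix_times_vector_sketch(S, P), S @ P, atol=0.05))
```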

使用所述電路裝置完成矩陣乘矩陣運算;Using the circuit device to complete a matrix multiplication matrix operation;

下面描述計算尺寸是M行L列的矩陣S和尺寸是L行N列的矩陣P的乘法的運算,(矩陣S中的每一行與矩陣P的每一列長度相同,如圖2f所示)The following describes the calculation of a multiplication of a matrix S whose size is M rows and L columns and a matrix P whose size is L rows and N columns. (Each row in matrix S has the same length as each column of matrix P, as shown in Figure 2f.)

本方法使用所述裝置如圖1b所示的實施例進行說明;The method is described by using the embodiment of the device as shown in FIG. 1b;

主處理電路的數據轉換運算電路將矩陣S和矩陣P轉換成定點類型的數據;The data conversion operation circuit of the main processing circuit converts the matrix S and the matrix P into fixed-point type data;

主處理電路的控制電路將矩陣S的部分或全部行中的數據發送到通過橫向數據輸入介面直接與主處理電路相連的那些基礎處理電路(例如,圖1b中最上方的灰色填充的竪向數據通路);The control circuit of the main processing circuit sends the data in some or all of the rows of the matrix S to those basic processing circuits that are directly connected to the main processing circuit through the horizontal data input interface (for example, the gray-filled vertical data at the top in Figure 1b) path);

在一種可選方案中,主處理電路的控制電路將矩陣S中某行的數據每次發送一個數或者一部分數給某個基礎處理電路;(例如,對於某一個基礎處理電路,第1次發送第3行第1個數,第2次發送第3行數據中的第2個數,第3次發送第3行的第3個數……,或者第1次發送第3行前兩個數,第二次發送第3行第3和第4個數,第三次發送第3行第5和第6個數……;)In an optional solution, the control circuit of the main processing circuit sends data of a row in the matrix S one number or a part of the number to a certain basic processing circuit at a time; (for example, for a certain basic processing circuit, the first transmission The first number in the third line, the second number in the third line of data is sent for the second time, the third number in the third line is sent in the third time ..., or the first two numbers in the third line are sent for the first time , Send the 3rd and 4th numbers of the 3rd line for the second time, send the 5th and 6th numbers of the 3rd line for the third time ...;)

In an optional solution, the control circuit of the main processing circuit sends, to a given basic processing circuit, one number or a portion of the numbers from each of several rows of the matrix S at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of rows 3, 4 and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4 and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4 and 5, and so on; or the 1st transmission sends the first two numbers of each of rows 3, 4 and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4 and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4 and 5, and so on);

The control circuit of the main processing circuit sends the data in some or all of the columns of the matrix P to those basic processing circuits that are directly connected to the main processing circuit through vertical data input interfaces (for example, the gray-filled horizontal data paths on the left side of the basic processing circuit array in FIG. 1b);

In an optional solution, the control circuit of the main processing circuit sends, to a given basic processing circuit, one number or a portion of the numbers from a certain column of the matrix P at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of column 3, the 2nd transmission sends the 2nd number of column 3, the 3rd transmission sends the 3rd number of column 3, and so on; or the 1st transmission sends the first two numbers of column 3, the 2nd transmission sends the 3rd and 4th numbers of column 3, the 3rd transmission sends the 5th and 6th numbers of column 3, and so on);

In an optional solution, the control circuit of the main processing circuit sends, to a given basic processing circuit, one number or a portion of the numbers from each of several columns of the matrix P at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of columns 3, 4 and 5, the 2nd transmission sends the 2nd number of each of columns 3, 4 and 5, the 3rd transmission sends the 3rd number of each of columns 3, 4 and 5, and so on; or the 1st transmission sends the first two numbers of each of columns 3, 4 and 5, the 2nd transmission sends the 3rd and 4th numbers of each of columns 3, 4 and 5, the 3rd transmission sends the 5th and 6th numbers of each of columns 3, 4 and 5, and so on);

After a basic processing circuit receives data of the matrix S, it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in FIG. 1b); after a basic processing circuit receives data of the matrix P, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in FIG. 1b);

Each basic processing circuit performs operations on the data it receives;

In an optional solution, the basic processing circuit computes the multiplication of one or more pairs of data at a time, and then accumulates the results into its register and/or on-chip cache;

In an optional solution, the basic processing circuit computes the inner product of one or more pairs of vectors at a time, and then accumulates the results into its register and/or on-chip cache;

After the basic processing circuit has computed a result, it can transmit the result out through its data output interface;

In an optional solution, the computed result may be the final result or an intermediate result of the inner product operation;

Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result from that interface; if it does not, it outputs the result toward the basic processing circuit that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits pass their operation results downward through their vertical output interfaces).

After a basic processing circuit receives a computed result from another basic processing circuit, it transmits that data to another basic processing circuit or to the main processing circuit connected to it;

The result is output toward the direction that allows direct output to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits pass their operation results downward through their vertical output interfaces);

Once the main processing circuit has received the inner product operation results from all of the basic processing circuits, the output result is obtained.
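For illustration only, the following is a minimal Python sketch of the accumulate-and-forward behaviour just described, modelled for a single basic processing circuit: it multiplies incoming operand pairs, accumulates partial sums in a local register, and either emits the result directly or passes it toward a neighbour that can reach the main processing circuit. The class and method names are illustrative assumptions and do not describe the actual hardware.

```python
# Illustrative sketch only; names are hypothetical and do not describe real hardware.
class BasicProcessingCircuit:
    def __init__(self, has_direct_output):
        self.acc = 0.0                          # local register / on-chip cache accumulator
        self.has_direct_output = has_direct_output

    def receive(self, s_value, p_value):
        # multiply one pair of received operands and accumulate locally
        self.acc += s_value * p_value

    def emit(self, forward_to_neighbor, output_to_main):
        # a final or intermediate result leaves through the data output interface
        if self.has_direct_output:
            output_to_main(self.acc)            # directly connected to the main processing circuit
        else:
            forward_to_neighbor(self.acc)       # passed on until it reaches the main processing circuit
```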

Embodiment of the "matrix multiplied by matrix" method:

The method uses a basic processing circuit array arranged in the manner shown in FIG. 1b; assume that it has h rows and w columns;

The data type conversion operation circuit of the main processing circuit converts the matrix S and the matrix P into fixed-point type data;

The control circuit of the main processing circuit divides the row data of the matrix S into h groups, and the i-th basic processing circuit is responsible for the operations of the i-th group (the set of rows in that group of data is denoted Hi);

Here, the method of grouping the rows of data may be any grouping method that does not allocate a row repeatedly;

In an optional solution, the following allocation is adopted: the control circuit of the main processing circuit allocates the j-th row to the (j % h)-th basic processing circuit;

In an optional solution, for cases where the rows cannot be grouped evenly, a portion of the rows may first be allocated evenly, and the remaining rows may be allocated in an arbitrary manner.

The control circuit of the main processing circuit divides the W columns of data of the matrix P into w groups, and the i-th basic processing circuit is responsible for the operations of the i-th group (the set of columns in that group of data is denoted Wi);

Here, the method of grouping the W columns of data may be any grouping method that does not allocate a column repeatedly;

In an optional solution, the following allocation is adopted: the control circuit of the main processing circuit allocates the j-th column to the (j % w)-th basic processing circuit;

In an optional solution, for cases where the columns cannot be grouped evenly, a portion of the columns may first be allocated evenly, and the remaining columns may be allocated in an arbitrary manner.

The control circuit of the main processing circuit sends the data in some or all of the rows of the matrix S to the first basic processing circuit of each row of the basic processing circuit array;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th row of the basic processing circuit array, one or more data from one row of data in the i-th group of data Hi for which that circuit is responsible;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th row of the basic processing circuit array, one or more data from each of some or all of the rows in the i-th group of data Hi for which that circuit is responsible;

The control circuit of the main processing circuit sends the data in some or all of the columns of the matrix P to the first basic processing circuit of each column of the basic processing circuit array;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th column of the basic processing circuit array, one or more data from one column of data in the i-th group of data Wi for which that circuit is responsible;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th column of the basic processing circuit array, one or more data from each of some or all of the columns in the i-th group of data Wi for which that circuit is responsible;

After a basic processing circuit receives data of the matrix S, it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in FIG. 1b); after a basic processing circuit receives data of the matrix P, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in FIG. 1b);

Each basic processing circuit performs operations on the data it receives;

In an optional solution, the basic processing circuit computes the multiplication of one or more pairs of data at a time, and then accumulates the results into its register and/or on-chip cache;

In an optional solution, the basic processing circuit computes the inner product of one or more pairs of vectors at a time, and then accumulates the results into its register and/or on-chip cache;

After the basic processing circuit has computed a result, it can transmit the result out through its data output interface;

In an optional solution, the computed result may be the final result or an intermediate result of the inner product operation;

Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result from that interface; if it does not, it outputs the result toward the basic processing circuit that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits pass their operation results downward through their vertical output interfaces).

After a basic processing circuit receives a computed result from another basic processing circuit, it transmits that data to another basic processing circuit or to the main processing circuit connected to it;

The result is output toward the direction that allows direct output to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits pass their operation results downward through their vertical output interfaces);

Once the main processing circuit has received the inner product operation results from all of the basic processing circuits, the output result is obtained.
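Purely as an illustration of the data partitioning in this embodiment, the following Python sketch (using NumPy) assigns row groups of S to the h rows of the array and column groups of P to the w columns using the j % h and j % w allocation, and accumulates one inner product per (row, column) pair. It is a functional model under stated assumptions (floating-point arithmetic in place of the fixed-point conversion) and does not describe the circuit-level timing.

```python
import numpy as np

def array_matmul(S, P, h, w):
    """Functional sketch of an h-by-w basic processing circuit array computing S @ P."""
    rows, inner = S.shape
    inner2, cols = P.shape
    assert inner == inner2

    # group the rows of S and the columns of P with the j % h / j % w allocation
    H = [[j for j in range(rows) if j % h == i] for i in range(h)]
    W = [[j for j in range(cols) if j % w == i] for i in range(w)]

    out = np.zeros((rows, cols))
    for i in range(h):              # array row i is responsible for row group Hi
        for j in range(w):          # array column j is responsible for column group Wj
            for r in H[i]:
                for c in W[j]:
                    # each basic processing circuit computes inner products, whose
                    # results are then forwarded to the main processing circuit
                    out[r, c] = np.dot(S[r, :], P[:, c])
    return out

# usage: the result matches a direct matrix product
S = np.random.rand(6, 5); P = np.random.rand(5, 4)
assert np.allclose(array_matmul(S, P, h=3, w=2), S @ P)
```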

The terms "horizontal" and "vertical" used in the above description are only used to describe the example shown in FIG. 1b; in actual use, it is only necessary that the "horizontal" and "vertical" interfaces of each unit refer to two different interfaces.

Using the circuit device to complete a fully connected operation:

If the input data of the fully connected layer is a vector (that is, the case where the input of the neural network is a single sample), the weight matrix of the fully connected layer is used as the matrix S and the input vector is used as the vector P, and the operation is performed according to the matrix-multiplied-by-vector method of the device;

If the input data of the fully connected layer is a matrix (that is, the case where the input of the neural network is multiple samples), the weight matrix of the fully connected layer is used as the matrix S and the input matrix is used as the matrix P, or the weight matrix of the fully connected layer is used as the matrix P and the input matrix is used as the matrix S, and the operation is performed according to the matrix-multiplied-by-matrix method of the device;
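As a hedged illustration of this mapping (and not of the device's internal procedure), a fully connected layer in Python/NumPy reduces to exactly these two cases; the shapes below are assumptions chosen for the example:

```python
import numpy as np

weights = np.random.rand(8, 16)          # fully connected layer weight matrix (matrix S)

x_single = np.random.rand(16)            # single sample: matrix-multiplied-by-vector case
y_single = weights @ x_single            # S * P, where P is the input vector

x_batch = np.random.rand(16, 32)         # 32 samples: matrix-multiplied-by-matrix case
y_batch = weights @ x_batch              # S * P, where P is the input matrix
```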

Using the circuit device to complete a convolution operation:

The convolution operation is described below. In the following figures, one block represents one data element. The input data is shown in FIG. 3a (N samples; each sample has C channels; the feature map of each channel has a height of H and a width of W). The weights, that is, the convolution kernels, are shown in FIG. 3b (there are M convolution kernels; each convolution kernel has C channels; their height and width are KH and KW respectively). For the N samples of the input data, the rule of the convolution operation is the same; the process of performing the convolution operation on one sample is explained below. On one sample, each of the M convolution kernels performs the same operation; each convolution kernel produces one planar feature map, and the M convolution kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each planar position of the sample, and the kernel then slides along the H and W directions. For example, FIG. 3c shows the correspondence for a convolution kernel performing an inner product operation at the lower-right position of one sample of the input data; FIG. 3d shows the convolution position slid one cell to the left, and FIG. 3e shows the convolution position slid one cell upward.
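For illustration only, a direct (unoptimized) Python/NumPy sketch of this sliding inner-product definition for one sample is given below; a stride of 1 and no padding are assumed, matching the description of sliding one cell at a time.

```python
import numpy as np

def conv2d_single_sample(x, kernels):
    """x: (C, H, W) one input sample; kernels: (M, C, KH, KW).
    Returns M planar feature maps of shape (H - KH + 1, W - KW + 1)."""
    C, H, W = x.shape
    M, _, KH, KW = kernels.shape
    out = np.zeros((M, H - KH + 1, W - KW + 1))
    for m in range(M):                       # each of the M kernels yields one feature map
        for i in range(H - KH + 1):          # slide along the H direction
            for j in range(W - KW + 1):      # slide along the W direction
                window = x[:, i:i + KH, j:j + KW]
                out[m, i, j] = np.sum(window * kernels[m])   # inner product at this position
    return out
```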

This method is explained using the embodiment of the device shown in FIG. 1b;

The data type conversion operation circuit of the main processing circuit can convert the data in some or all of the convolution kernels of the weights into fixed-point type data; the control circuit of the main processing circuit sends the data of some or all of the convolution kernels of the weights to those basic processing circuits that are directly connected to the main processing circuit through horizontal data input interfaces (for example, the gray-filled vertical data paths at the top of FIG. 1b);

In an optional solution, the control circuit of the main processing circuit sends, to a given basic processing circuit, one number or a portion of the numbers of the data of a certain convolution kernel of the weights at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of row 3, the 2nd transmission sends the 2nd number of row 3, the 3rd transmission sends the 3rd number of row 3, and so on; or the 1st transmission sends the first two numbers of row 3, the 2nd transmission sends the 3rd and 4th numbers of row 3, the 3rd transmission sends the 5th and 6th numbers of row 3, and so on);

In another optional solution, the control circuit of the main processing circuit sends, to a given basic processing circuit, one number or a portion of the numbers of the data of each of several convolution kernels of the weights at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of rows 3, 4 and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4 and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4 and 5, and so on; or the 1st transmission sends the first two numbers of each of rows 3, 4 and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4 and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4 and 5, and so on);

The control circuit of the main processing circuit divides the input data according to the convolution positions, and sends the data at some or all of the convolution positions of the input data to those basic processing circuits that are directly connected to the main processing circuit through vertical data input interfaces (for example, the gray-filled horizontal data paths on the left side of the basic processing circuit array in FIG. 1b);

In an optional solution, the control circuit of the main processing circuit sends, to a given basic processing circuit, one number or a portion of the numbers of the data at a certain convolution position of the input data at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of column 3, the 2nd transmission sends the 2nd number of column 3, the 3rd transmission sends the 3rd number of column 3, and so on; or the 1st transmission sends the first two numbers of column 3, the 2nd transmission sends the 3rd and 4th numbers of column 3, the 3rd transmission sends the 5th and 6th numbers of column 3, and so on);

In another optional solution, the control circuit of the main processing circuit sends, to a given basic processing circuit, one number or a portion of the numbers of the data at each of several convolution positions of the input data at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of columns 3, 4 and 5, the 2nd transmission sends the 2nd number of each of columns 3, 4 and 5, the 3rd transmission sends the 3rd number of each of columns 3, 4 and 5, and so on; or the 1st transmission sends the first two numbers of each of columns 3, 4 and 5, the 2nd transmission sends the 3rd and 4th numbers of each of columns 3, 4 and 5, the 3rd transmission sends the 5th and 6th numbers of each of columns 3, 4 and 5, and so on);

After a basic processing circuit receives data of the weights, it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in FIG. 1b); after a basic processing circuit receives data of the input data, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in FIG. 1b);

Each basic processing circuit performs operations on the data it receives;

In an optional solution, the basic processing circuit computes the multiplication of one or more pairs of data at a time, and then accumulates the results into its register and/or on-chip cache;

In an optional solution, the basic processing circuit computes the inner product of one or more pairs of vectors at a time, and then accumulates the results into its register and/or on-chip cache;

After the basic processing circuit has computed a result, it can transmit the result out through its data output interface;

In an optional solution, the computed result may be the final result or an intermediate result of the inner product operation;

Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result from that interface; if it does not, it outputs the result toward the basic processing circuit that can output directly to the main processing circuit (for example, in FIG. 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits pass their operation results downward through their vertical output interfaces).

After a basic processing circuit receives a computed result from another basic processing circuit, it transmits that data to another basic processing circuit or to the main processing circuit connected to it;

The result is output toward the direction that allows direct output to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, and the other basic processing circuits pass their operation results downward through their vertical output interfaces);

Once the main processing circuit has received the inner product operation results from all of the basic processing circuits, the output result is obtained.
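One common software analogue of distributing kernels along one dimension of the array and convolution positions along the other is an im2col-style rearrangement, which turns the convolution into the matrix-multiplied-by-matrix case described above. The sketch below is offered only as an illustrative assumption of that mapping, not as the implementation required by this disclosure; its output agrees with the direct sliding-window definition sketched earlier.

```python
import numpy as np

def conv2d_as_matmul(x, kernels):
    """Sketch: rewrite the convolution of one sample as a matrix product.
    Rows of S hold the flattened kernels; columns of P hold the flattened
    receptive fields, one column per convolution position."""
    C, H, W = x.shape
    M, _, KH, KW = kernels.shape
    OH, OW = H - KH + 1, W - KW + 1

    S = kernels.reshape(M, C * KH * KW)                  # one kernel per row
    cols = [x[:, i:i + KH, j:j + KW].reshape(-1)
            for i in range(OH) for j in range(OW)]
    P = np.stack(cols, axis=1)                           # one convolution position per column

    return (S @ P).reshape(M, OH, OW)                    # M planar feature maps
```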

A method of using the circuit device to complete a bias-addition operation:

The vector operator circuit of the main processing circuit can be used to realize the function of adding two vectors or adding two matrices;

The vector operator circuit of the main processing circuit can be used to realize the function of adding one vector to every row, or to every column, of a matrix.

In an optional solution, the matrix may come from the result of a matrix-multiplied-by-matrix operation performed by the device;

In an optional solution, the vector may come from the result of a matrix-multiplied-by-vector operation performed by the device;

In an optional solution, the matrix may come from data received from the outside by the main processing circuit of the device.

In an optional solution, the vector may come from data received from the outside by the main processing circuit of the device.

The data sources include, but are not limited to, the above.
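For illustration, a bias addition of this kind reduces to simple broadcasting; the sketch below is an assumption about usage, not a description of the vector operator circuit itself, and adds a bias vector to every row or every column of a matrix:

```python
import numpy as np

def add_bias(matrix, bias, axis="row"):
    """Add the bias vector to every row (axis='row') or every column (axis='column')."""
    if axis == "row":
        return matrix + bias[np.newaxis, :]   # bias length equals the number of columns
    return matrix + bias[:, np.newaxis]       # bias length equals the number of rows

m = np.arange(6).reshape(2, 3)
print(add_bias(m, np.array([10, 20, 30]), axis="row"))
print(add_bias(m, np.array([100, 200]), axis="column"))
```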

A method of using the circuit device to complete an activation function operation:

Using the activation circuit of the main processing circuit, a vector is input and the activation vector of that vector is computed;

In an optional solution, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (the input of the activation function is one numerical value and its output is also one numerical value), computes one numerical value, and outputs it to the corresponding position of the output vector;

In an optional solution, the activation function may be: y = max(m, x), where x is the input value, y is the output value, and m is a constant;

In an optional solution, the activation function may be: y = tanh(x), where x is the input value and y is the output value;

In an optional solution, the activation function may be: y = sigmoid(x), where x is the input value and y is the output value;

In an optional solution, the activation function may be a piecewise linear function;

In an optional solution, the activation function may be any function that takes one number as input and outputs one number.
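As a hedged sketch of these element-wise options (not of the activation circuit itself), the listed functions can be written directly in Python/NumPy; the function and parameter names are illustrative:

```python
import numpy as np

def activate(x, kind="relu_like", m=0.0):
    """Apply one of the listed activation functions element-wise to the input vector x."""
    if kind == "relu_like":
        return np.maximum(m, x)                 # y = max(m, x); m = 0 gives the usual ReLU
    if kind == "tanh":
        return np.tanh(x)                       # y = tanh(x)
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))         # y = sigmoid(x)
    raise ValueError("unknown activation")

v = np.array([-2.0, -0.5, 0.0, 1.5])
print(activate(v), activate(v, "tanh"), activate(v, "sigmoid"))
```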

In an optional solution, the sources of the input vector include (but are not limited to):

An external data source of the device;

In an optional solution, the input data comes from the result of a matrix-multiplied-by-vector operation performed by the device;

In an optional solution, the input data comes from the result of a matrix-multiplied-by-matrix operation performed by the device;

A computation result of the main processing circuit of the device;

In an optional solution, the input data comes from the computation result obtained after the main processing circuit of the device performs the bias addition.

A method of using the device to implement BLAS (Basic Linear Algebra Subprograms):

GEMM computation refers to the matrix-matrix multiplication operation in the BLAS library. The usual form of this operation is: C = alpha*op( S )*op( P ) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op denotes some operation on the matrix S or P; in addition, some auxiliary integers are used as parameters to describe the width and height of the matrices S and P;

The steps of using the device to implement a GEMM computation are:

Before performing the op operations, the main processing circuit may perform a data type conversion on the input matrix S and matrix P;

The conversion circuit of the main processing circuit performs the respective op operation on the input matrix S and on the matrix P;

In an optional solution, op may be a matrix transposition operation; the matrix transposition operation is realized using the vector operation function or the data rearrangement function of the main processing circuit (as mentioned above, the main processing circuit has a data rearrangement circuit). In practical applications, the op operation may also be realized directly by the conversion circuit; for example, for a matrix transposition operation, the op operation is realized directly by the matrix transposition circuit;

In an optional solution, the op of a certain matrix may be empty, in which case the op operation is not performed;

The matrix multiplication between op(S) and op(P) is completed using the matrix-multiplied-by-matrix computation method;

The arithmetic and logic circuit of the main processing circuit is used to multiply each value in the result of op(S)*op(P) by alpha;

In an optional solution, when alpha is 1, the multiplication-by-alpha operation is not performed;

The arithmetic and logic circuit of the main processing circuit is used to realize the operation beta*C;

In an optional solution, when beta is 1, the multiplication-by-beta operation is not performed;

The arithmetic and logic circuit of the main processing circuit is used to realize the step of adding corresponding positions of the matrices alpha*op( S )*op( P ) and beta*C;

In an optional solution, when beta is 0, the addition operation is not performed;
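The following reference sketch in Python/NumPy mirrors these steps, with op restricted for illustration to "transpose or nothing" and with the alpha = 1, beta = 1 and beta = 0 shortcuts; it is a functional illustration under those assumptions, not the circuit-level procedure or the BLAS interface itself.

```python
import numpy as np

def gemm(S, P, C, alpha=1.0, beta=1.0, op_s=None, op_p=None):
    """C = alpha * op(S) * op(P) + beta * C, with op either None or 'transpose'."""
    S = S.T if op_s == "transpose" else S        # op(S)
    P = P.T if op_p == "transpose" else P        # op(P)
    prod = S @ P                                 # matrix-multiplied-by-matrix step
    if alpha != 1.0:
        prod = alpha * prod                      # skipped when alpha == 1
    if beta == 0.0:
        return prod                              # addition skipped when beta == 0
    scaled_c = C if beta == 1.0 else beta * C    # multiplication skipped when beta == 1
    return prod + scaled_c                       # element-wise addition of corresponding positions
```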

GEMV computation refers to the matrix-vector multiplication operation in the BLAS library. The usual form of this operation is: C = alpha*op( S )*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation on the matrix S;

The steps of using the device to implement a GEMV computation are:

Before performing the op operation, the main processing circuit may perform a data type conversion on the input matrix S and the vector P;

The conversion circuit of the main processing circuit performs the corresponding op operation on the input matrix S;

In an optional solution, op may be a matrix transposition operation; the matrix transposition circuit of the main processing circuit is used to realize the matrix transposition operation;

In an optional solution, the op of a certain matrix may be empty, in which case the op operation is not performed;

The matrix-vector multiplication between the matrix op(S) and the vector P is completed using the matrix-multiplied-by-vector computation method;

The arithmetic and logic circuit of the main processing circuit is used to multiply each value in the result of op(S)*P by alpha;

In an optional solution, when alpha is 1, the multiplication-by-alpha operation is not performed;

The arithmetic and logic circuit of the main processing circuit is used to realize the operation beta*C;

In an optional solution, when beta is 1, the multiplication-by-beta operation is not performed;

The arithmetic and logic circuit of the main processing circuit is used to realize the step of adding corresponding positions of alpha*op( S )*P and beta*C;

In an optional solution, when beta is 0, the addition operation is not performed;
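An analogous functional sketch for GEMV, under the same illustrative assumptions as the GEMM sketch above:

```python
import numpy as np

def gemv(S, P, C, alpha=1.0, beta=1.0, op_s=None):
    """C = alpha * op(S) * P + beta * C, where P and C are vectors."""
    S = S.T if op_s == "transpose" else S       # op(S)
    prod = S @ P                                # matrix-multiplied-by-vector step
    if alpha != 1.0:
        prod = alpha * prod
    if beta == 0.0:
        return prod
    return prod + (C if beta == 1.0 else beta * C)
```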

Implementing data type conversion:

The data type conversion operation circuit of the main processing circuit is used to implement the conversion of data types;

In an optional solution, the forms of data type conversion include, but are not limited to, converting floating-point numbers to fixed-point numbers, converting fixed-point numbers to floating-point numbers, and the like;
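For illustration only, the sketch below assumes one simple fixed-point scheme (a signed integer with a shared fractional bit width); this is an assumption chosen for the example and is not necessarily the exact format shown in FIG. 1e.

```python
def float_to_fixed(x, frac_bits=8, word_bits=16):
    """Quantize a float to a signed fixed-point integer with frac_bits fractional bits."""
    scaled = int(round(x * (1 << frac_bits)))
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))             # saturate to the representable range

def fixed_to_float(q, frac_bits=8):
    """Convert the fixed-point integer back to a float."""
    return q / (1 << frac_bits)

q = float_to_fixed(3.14159)
print(q, fixed_to_float(q))                     # 804 -> 3.140625 with 8 fractional bits
```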

A method of updating weights:

The vector operator circuit of the main processing circuit is used to realize the weight update function in the neural network training process; specifically, weight updating refers to a method of updating the weights using the gradients of the weights.

In an optional solution, the vector operator circuit of the main processing circuit is used to perform addition and subtraction on the two vectors of weights and weight gradients to obtain an operation result, and that operation result is the updated weights.

In an optional solution, the vector operator circuit of the main processing circuit is used to multiply or divide the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values; the vector operator circuit then performs addition and subtraction on the intermediate weights and the intermediate weight gradient values to obtain an operation result, and that operation result is the updated weights.

In an optional solution, a set of momentum values may first be computed using the gradients of the weights, and then addition and subtraction may be performed on the momentum values and the weights to obtain the updated weights.
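A minimal sketch of these options follows; the learning rate and momentum coefficient are conventional hyperparameters assumed for illustration and are not specified by the disclosure.

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Scale the gradient (an intermediate gradient value), then subtract it from the weights."""
    return w - lr * grad

def sgd_momentum_update(w, grad, velocity, lr=0.01, mu=0.9):
    """First compute a momentum term from the gradient, then update the weights with it."""
    velocity = mu * velocity + grad
    return w - lr * velocity, velocity

w = np.zeros(4); v = np.zeros(4); g = np.ones(4)
w, v = sgd_momentum_update(w, g, v)
print(w, v)
```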

A method of implementing the backward operation of a fully connected layer:

The backward operation of the fully connected layer can be divided into two parts; as shown in FIG. 4a, the solid arrows indicate the forward computation process of the fully connected layer, and the dashed part indicates the backward computation process of the fully connected layer.

As can be seen from FIG. 4a, the backward operation of the fully connected layer can be completed using the device's method of completing matrix multiplication operations described above;
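Read as two matrix products, the backward pass of a fully connected layer y = W x (bias omitted) can be sketched as below; these are the standard textbook formulas, given here only as an assumption to illustrate how both parts map onto the matrix multiplication method, not as the content of FIG. 4a.

```python
import numpy as np

def fc_backward(W, x, grad_y):
    """W: (out, in); x: (in, batch); grad_y: (out, batch).
    Returns the weight gradient and the gradient passed to the previous layer."""
    grad_W = grad_y @ x.T        # part 1: gradient with respect to the weights (matrix * matrix)
    grad_x = W.T @ grad_y        # part 2: gradient propagated to the input (matrix * matrix)
    return grad_W, grad_x
```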

Implementing the backward operation of a convolution layer:

The backward operation of the convolution layer can be divided into two parts; in FIG. 4a, the solid arrows indicate the forward computation process of the convolution layer, and FIG. 4b shows the backward computation process of the convolution layer.

The backward operation of the convolution layer shown in FIG. 4a and FIG. 4b can be completed using the device shown in FIG. 1a or the device shown in FIG. 1b. The forward operation or the backward operation being performed is in fact a plurality of operations of the neural network; the plurality of operations include, but are not limited to, one of or any combination of matrix-multiplied-by-matrix, matrix-multiplied-by-vector, convolution, activation, and other operations; the manner of performing these operations is as described elsewhere in this disclosure and is not repeated here.

The present disclosure also discloses a neural network operation device, which includes one or more chips as shown in FIG. 1a or FIG. 1b, configured to obtain data to be operated on and control information from other processing devices and to perform specified neural network operations, with the execution results transferred to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces and servers. When more than one chip as shown in FIG. 1a or FIG. 1b is included, the chips may be linked and transfer data through a specific structure, for example interconnected and transferring data through a PCIE bus, so as to support larger-scale neural network operations. In this case, the chips may share the same control system or have their own independent control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection manner may be any interconnection topology.

The neural network operation device has high compatibility and can be connected to various types of servers through a PCIE interface.

The present disclosure also discloses a combined processing device, which includes the above neural network operation device, a universal interconnection interface, and another processing device (that is, a general-purpose processing device). The neural network operation device interacts with the other processing device to jointly complete operations specified by the user. FIG. 4c is a schematic diagram of the combined processing device.

The other processing device includes one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, and the like. The number of processors included in the other processing device is not limited. The other processing device serves as the interface between the neural network operation device and external data and control, including data transfer, and completes basic control of the neural network operation device such as starting and stopping; the other processing device may also cooperate with the neural network operation device to jointly complete operation tasks.

The universal interconnection interface is used to transfer data and control instructions between the neural network operation device and the other processing device. The neural network operation device obtains the required input data from the other processing device and writes it to the on-chip storage device of the neural network operation device; it may obtain control instructions from the other processing device and write them to the on-chip control cache of the neural network operation device; it may also read data in the storage module of the neural network operation device and transfer it to the other processing device.

As shown in FIG. 4d, optionally, the structure further includes a storage device for storing data required by this operation unit/operation device or by other operation units, and it is particularly suitable for data that cannot be entirely stored in the internal storage of this neural network operation device or of the other processing device.

The combined processing device can serve as an SOC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as cameras, displays, mice, keyboards, network cards and Wi-Fi interfaces.

An embodiment of the present disclosure provides a neural network processor board card, which can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers (PCs), server computers, handheld or portable devices, tablet devices, smart homes, home appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronic devices, network personal computers, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.

Please refer to FIG. 5a, which is a schematic structural diagram of a neural network processor board card provided by an embodiment of the present disclosure. As shown in FIG. 5a, the neural network processor board card 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.

The present disclosure does not limit the specific structure of the neural network chip package structure 11; optionally, as shown in FIG. 5b, the neural network chip package structure 11 includes a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.

The specific form of the neural network chip 111 involved in the present disclosure is not limited; the neural network chip 111 includes, but is not limited to, a neural network die that integrates a neural network processor, and the die may be made of silicon material, germanium material, quantum material, molecular material, or the like. According to the actual situation (for example, a harsh environment) and different application requirements, the neural network die may be packaged so that most of the die is enclosed, while the pins on the die are connected to the outside of the package structure through conductors such as gold wires for circuit connection with outer layers.

The present disclosure does not limit the specific structure of the neural network chip 111; optionally, refer to the device shown in FIG. 1a or FIG. 1b.

The present disclosure does not limit the types of the first substrate 13 and the second substrate 113, which may be printed circuit boards (PCB) or printed wiring boards (PWB), or possibly other circuit boards. The materials used to fabricate the PCB are not limited either.

The second substrate 113 involved in the present disclosure is used to carry the neural network chip 111; the neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, is used to protect the neural network chip 111 and facilitates further packaging of the neural network chip package structure 11 with the first substrate 13.

The specific packaging manner of the second electrical and non-electrical connection device 112, and the structure corresponding to that packaging manner, are not limited; a suitable packaging manner may be selected and simply improved according to the actual situation and different application requirements, for example: Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA), among other packaging manners.

Flip chip packaging is suitable for cases where the post-packaging area is critical or where the lead inductance and the signal transmission time are sensitive. In addition, wire bonding may be used as the packaging manner to reduce cost and increase the flexibility of the package structure.

Ball Grid Array packaging can provide more pins, and the average lead length of the pins is short, enabling high-speed signal transmission; the package may also be replaced by Pin Grid Array (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA), and the like.

Optionally, the neural network chip 111 and the second substrate 113 are packaged using the Flip Chip Ball Grid Array packaging manner; for a schematic diagram of a specific neural network chip package structure, refer to FIG. 6. As shown in FIG. 6, the neural network chip package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.

The pads 22 are connected to the neural network chip 21, and solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, thereby connecting the neural network chip 21 and the second substrate 24, that is, realizing the packaging of the neural network chip 21.

The pins 26 are used to connect to an external circuit of the package structure (for example, the first substrate 13 of the neural network processor board card 10), which enables the transmission of external data and internal data and facilitates processing of data by the neural network chip 21 or by the neural network processor corresponding to the neural network chip 21. The present disclosure does not limit the type and quantity of the pins either; different pin forms may be selected according to different packaging technologies and arranged in accordance with certain rules.

Optionally, the neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between solder balls.

The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.

Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 is running. The heat dissipation device may be a metal sheet with good thermal conductivity, a heat sink, or a cooler, for example, a fan.

For example, as shown in FIG. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal paste 28, and a metal-housing heat sink 29. The thermal paste 28 and the metal-housing heat sink 29 are used to dissipate the heat generated when the neural network chip 21 is running.

Optionally, the neural network chip package structure 11 further includes a reinforcing structure, which is connected to the pads 22 and embedded in the solder balls 23 to enhance the connection strength between the solder balls 23 and the pads 22.

The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited here.

The present disclosure does not limit the specific form of the first electrical and non-electrical connection device 12 either; reference may be made to the description of the second electrical and non-electrical connection device 112, that is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or by a pluggable connection, which facilitates subsequent replacement of the first substrate 13 or of the neural network chip package structure 11.

Optionally, the first substrate 13 includes an interface for a memory unit used to expand the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR), and the like; expanding the memory improves the processing capability of the neural network processor.

The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) bus interface, and the like, used for data transmission between the package structure and external circuits, which can improve the operation speed and the convenience of operation.

The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged as the neural network chip package structure 11, and the neural network chip package structure 11 is packaged as the neural network processor board card 10; data interaction with an external circuit (for example, a computer motherboard) is performed through an interface (a slot or a connector) on the board card, that is, the functions of the neural network processor are realized directly by using the neural network processor board card 10, and the neural network chip 111 is protected. Moreover, other modules may be added to the neural network processor board card 10, which increases the application range and the operation efficiency of the neural network processor.

In one embodiment, the present disclosure discloses an electronic device, which includes the above neural network processor board card 10 or the above neural network chip package structure 11.

Electronic devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, dashboard cameras, navigators, sensors, camera heads, servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical equipment.

The vehicles include airplanes, ships, and/or cars; the home appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.

The specific embodiments described above further describe in detail the purposes, technical solutions, and beneficial effects of the present disclosure. It should be understood that the above are only specific embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

A, B, S‧‧‧matrices

P‧‧‧vector

10‧‧‧neural network processor board card

11‧‧‧neural network chip package structure

12‧‧‧first electrical and non-electrical connection device

13‧‧‧first substrate

111‧‧‧neural network chip

112‧‧‧second electrical and non-electrical connection device

113‧‧‧second substrate

1111‧‧‧storage unit

1112‧‧‧direct memory access unit

1113‧‧‧instruction cache unit

1114‧‧‧weight cache unit

1115‧‧‧input neuron cache unit

1116‧‧‧output neuron cache unit

1117‧‧‧control unit

1118‧‧‧operation unit

21‧‧‧neural network chip

22‧‧‧pad

23‧‧‧solder ball

24‧‧‧second substrate

25‧‧‧connection point on the second substrate 24

26‧‧‧pin

27‧‧‧insulating filler

28‧‧‧thermal paste

29‧‧‧metal-housing heat sink

FIG. 1a is a schematic structural diagram of an integrated circuit chip device.

FIG. 1b is a schematic structural diagram of another integrated circuit chip device.

FIG. 1c is a schematic structural diagram of a basic processing circuit.

FIG. 1d is a schematic structural diagram of a main processing circuit.

FIG. 1e is a schematic structural diagram of a fixed-point data type.

FIG. 2a is a schematic diagram of a method of using a basic processing circuit.

FIG. 2b is a schematic diagram of data transmission by a main processing circuit.

FIG. 2c is a schematic diagram of a matrix multiplied by a vector.

FIG. 2d is a schematic structural diagram of an integrated circuit chip device.

FIG. 2e is a schematic structural diagram of yet another integrated circuit chip device.

FIG. 2f is a schematic diagram of a matrix multiplied by a matrix.

FIG. 3a is a schematic diagram of convolution input data.

FIG. 3b is a schematic diagram of a convolution kernel.

FIG. 3c is a schematic diagram of an operation window of one three-dimensional data block of the input data.

FIG. 3d is a schematic diagram of another operation window of one three-dimensional data block of the input data.

FIG. 3e is a schematic diagram of yet another operation window of one three-dimensional data block of the input data.

FIG. 4a is a schematic diagram of the forward operation of a neural network.

FIG. 4b is a schematic diagram of the backward operation of a neural network.

FIG. 4c is a schematic structural diagram of a combined processing device disclosed in the present disclosure.

FIG. 4d is another schematic structural diagram of a combined processing device disclosed in the present disclosure.

FIG. 5a is a schematic structural diagram of a neural network processor board card provided by an embodiment of the present disclosure.

FIG. 5b is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure.

FIG. 5c is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.

FIG. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.

Claims (17)

1. An integrated circuit chip device, wherein the integrated circuit chip device comprises: a main processing circuit and a plurality of basic processing circuits; the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the other adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits among the plurality of basic processing circuits, the k basic processing circuits being: the n basic processing circuits in the first row, the n basic processing circuits in the m-th row, and the m basic processing circuits in the first column; the plurality of basic processing circuits include a data type conversion circuit configured to perform conversion between floating-point type data and fixed-point type data; the main processing circuit is configured to perform each successive operation in a neural network operation and to transmit data with the k basic processing circuits; the k basic processing circuits are configured to forward data between the main processing circuit and the plurality of basic processing circuits; and the plurality of basic processing circuits are configured to determine, according to the transmitted data and the type of the operation, whether to activate the data type conversion circuit to convert the data type of the transmitted data, to perform the operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the k basic processing circuits.

2. The integrated circuit chip device according to claim 1, wherein the main processing circuit is configured to obtain a data block to be computed and an operation instruction, divide the data block to be computed into a distribution data block and a broadcast data block according to the operation instruction, split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the k basic processing circuits, and broadcast the broadcast data block to the k basic processing circuits; the plurality of basic processing circuits are configured to convert, according to the received basic data block, the broadcast data block, and the operation instruction, the basic data block and the broadcast data block into a basic data block and a broadcast data block of a fixed-point data type, perform an inner product operation on the basic data block and the broadcast data block in the fixed-point data type to obtain an operation result of the fixed-point data type, convert the operation result of the fixed-point data type into an operation result of a floating-point data type, and transmit it to the main processing circuit through the k basic processing circuits; and the main processing circuit is configured to process the operation results to obtain an instruction result of the data block to be computed and the operation instruction.

3. The integrated circuit chip device according to claim 2, wherein the main processing circuit is specifically configured to broadcast the broadcast data block to the k basic processing circuits in a single broadcast.

4. The integrated circuit chip device according to claim 2, wherein the main processing circuit is configured to, when the operation result is a result of inner product processing, accumulate the operation results to obtain an accumulated result, and arrange the accumulated result to obtain the instruction result of the data block to be computed and the operation instruction.

5. The integrated circuit chip device according to claim 2, wherein the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the k basic processing circuits over multiple broadcasts.

6. The integrated circuit chip device according to claim 5, wherein the plurality of basic processing circuits are specifically configured to convert the partial broadcast data block and the basic data block into the fixed-point data type, perform one inner product processing in the fixed-point data type to obtain an inner product processing result of the fixed-point data type, accumulate the inner product processing result of the fixed-point data type to obtain a partial operation result of the fixed-point data type, convert the partial operation result of the fixed-point data type into a partial operation result of a floating-point data type, and send it to the main processing circuit through the k basic processing circuits.

7. The integrated circuit chip device according to claim 6, wherein the plurality of basic processing circuits are specifically configured to reuse the partial broadcast data block n times, performing, in the fixed-point data type, inner product operations of the partial broadcast data block with n basic data blocks to obtain n partial processing results of the fixed-point data type, accumulate the n partial processing results of the fixed-point data type respectively to obtain n partial operation results of the fixed-point data type, activate the data type operation circuit to convert the n partial operation results of the fixed-point data type into n partial operation results of the floating-point data type, and send them to the main processing circuit through the k basic processing circuits, where n is an integer greater than or equal to 2.

8. The integrated circuit chip device according to any one of claims 1 to 7, wherein the main processing circuit comprises a main register or a main on-chip cache circuit, and the plurality of basic processing circuits comprise basic registers or basic on-chip cache circuits.

9. The integrated circuit chip device according to claim 8, wherein the main processing circuit comprises one or any combination of a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a data type operation circuit, or a data rearrangement circuit.

10. The integrated circuit chip device according to claim 1, wherein the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.

11. The integrated circuit chip device according to claim 2, wherein, if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block; and if the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.

12. A neural network operation device, wherein the neural network operation device comprises one or more integrated circuit chip devices according to any one of claims 1 to 11.

13. A combined processing device, wherein the combined processing device comprises: a neural network operation device according to claim 12, a universal interconnection interface, and a general-purpose processing device; the neural network operation device is connected to the general-purpose processing device through the universal interconnection interface.

14. A chip, wherein the chip integrates the device according to any one of claims 1 to 13.

15. A smart device, wherein the smart device comprises the chip according to claim 14.

16. A neural network operation method, wherein the method is applied in an integrated circuit chip device, the integrated circuit chip device comprising the integrated circuit chip device according to any one of claims 1 to 11, and the integrated circuit chip device is configured to perform operations of a neural network.

17. The method according to claim 16, wherein the operations of the neural network include one or any combination of a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
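For orientation only, the following is a minimal software sketch of the data flow recited in claims 2 and 6: the main processing circuit splits the distribution data block into basic data blocks and broadcasts the broadcast data block; each basic processing circuit converts its operands to a fixed-point data type, performs the inner products with integer arithmetic, accumulates, and converts the result back to a floating-point data type. The fixed-point format (int values with a shared number of fractional bits, here 8) and the row-wise split into k blocks are illustrative assumptions and are not specified by the claims.

# Illustrative sketch only; it models the claimed data flow in software.
FRAC_BITS = 8                       # assumed fractional bit width of the fixed-point type
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Floating point -> fixed point: round to the nearest representable value."""
    return int(round(x * SCALE))

def to_float(x_fixed, frac_bits):
    """Fixed point -> floating point."""
    return x_fixed / (1 << frac_bits)

def basic_circuit_inner_products(basic_block, broadcast_block):
    """One basic processing circuit: convert both operands to fixed point, run the
    inner products with integer arithmetic, then convert the results back to float."""
    fixed_rows = [[to_fixed(v) for v in row] for row in basic_block]
    fixed_vec = [to_fixed(v) for v in broadcast_block]
    results = []
    for row in fixed_rows:
        acc = 0                                   # fixed-point accumulator
        for a, b in zip(row, fixed_vec):
            acc += a * b                          # product carries 2*FRAC_BITS fraction bits
        results.append(to_float(acc, 2 * FRAC_BITS))
    return results

def main_circuit_matmul(distribution_block, broadcast_block, k=2):
    """Main processing circuit: split the distribution data block into k row-wise
    basic data blocks, broadcast the broadcast data block once, collect the results."""
    rows = len(distribution_block)
    basic_blocks = [distribution_block[i::k] for i in range(k)]
    partials = [basic_circuit_inner_products(b, broadcast_block) for b in basic_blocks]
    # Re-interleave the partial results into the instruction result.
    result = [0.0] * rows
    for i, part in enumerate(partials):
        for j, value in enumerate(part):
            result[i + j * k] = value
    return result

if __name__ == "__main__":
    matrix = [[0.5, -1.25, 2.0],
              [1.0, 0.75, -0.5],
              [0.25, 0.25, 0.25],
              [2.0, 1.0, 0.0]]
    vector = [1.0, 2.0, 0.5]
    print(main_circuit_matmul(matrix, vector, k=2))   # expect [-1.0, 2.25, 0.875, 4.0]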
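Along the same lines, a hedged sketch of the reuse described in claim 7: a basic processing circuit applies one received partial broadcast data block to its n locally held basic data blocks and keeps n separate fixed-point accumulators, converting to floating point only after the last partial broadcast has been processed. It reuses to_fixed, to_float, and FRAC_BITS from the sketch above; the argument names are assumptions made for readability.

def process_partial_broadcast(partial_broadcast, local_basic_blocks, accumulators,
                              last_partial, frac_bits=FRAC_BITS):
    """Reuse one partial broadcast data block against n basic data blocks (n >= 2).
    `accumulators` holds one fixed-point partial result per basic data block."""
    pb_fixed = [to_fixed(v) for v in partial_broadcast]
    for idx, block in enumerate(local_basic_blocks):
        block_fixed = [to_fixed(v) for v in block]
        for a, b in zip(block_fixed, pb_fixed):
            accumulators[idx] += a * b            # n separate fixed-point partial results
    if not last_partial:
        return accumulators                        # keep accumulating in fixed point
    # After the last partial broadcast, convert the n partial results to floating
    # point before sending them back towards the main processing circuit.
    return [to_float(acc, 2 * frac_bits) for acc in accumulators]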
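Finally, a small illustrative mapping of claim 11, showing how an operation instruction could determine which operand becomes the broadcast data block and which becomes the distribution data block. The instruction names "MULT" and "CONV" and the dictionary-based operand container are assumptions, not part of the claims.

def select_blocks(operation, operands):
    """Return (broadcast_data_block, distribution_data_block) for an instruction."""
    if operation == "MULT":
        # Multiplication: the multiplier is broadcast, the multiplicand is distributed.
        return operands["multiplier"], operands["multiplicand"]
    if operation == "CONV":
        # Convolution: the input data is broadcast, the convolution kernel is distributed.
        return operands["input"], operands["kernel"]
    raise ValueError("unsupported operation instruction: " + operation)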
TW107144037A 2017-12-14 2018-12-07 Integrated circuit chip apparatus and related product TWI767097B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
??201711343642.1 2017-12-14
CN201711343642.1 2017-12-14
CN201711343642.1A CN109960673B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related product

Publications (2)

Publication Number Publication Date
TW201928796A true TW201928796A (en) 2019-07-16
TWI767097B TWI767097B (en) 2022-06-11

Family

ID=67018616

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107144037A TWI767097B (en) 2017-12-14 2018-12-07 Integrated circuit chip apparatus and related product

Country Status (2)

Country Link
CN (1) CN109960673B (en)
TW (1) TWI767097B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114115995A (en) * 2020-08-27 2022-03-01 华为技术有限公司 Artificial intelligence chip, operation board card, data processing method and electronic equipment

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8329022D0 (en) * 1983-10-31 1983-11-30 British Telecomm Floating point conversion
US5197130A (en) * 1989-12-29 1993-03-23 Supercomputer Systems Limited Partnership Cluster architecture for a highly parallel scalar/vector multiprocessor system
US5590345A (en) * 1990-11-13 1996-12-31 International Business Machines Corporation Advanced parallel array processor(APAP)
US5590356A (en) * 1994-08-23 1996-12-31 Massachusetts Institute Of Technology Mesh parallel computer architecture apparatus and associated methods
CN201311633Y (en) * 2008-11-07 2009-09-16 中国北车股份有限公司大连电力牵引研发中心 Function module realized by floating point divider based on FPGA
CN101794210A (en) * 2010-04-07 2010-08-04 上海交通大学 General matrix floating point multiplier based on FPGA (Field Programmable Gate Array)
CN102495719B (en) * 2011-12-15 2014-09-24 中国科学院自动化研究所 Vector floating point operation device and method
CN102665049B (en) * 2012-03-29 2014-09-17 中国科学院半导体研究所 Programmable visual chip-based visual image processing system
US9292790B2 (en) * 2012-11-20 2016-03-22 Qualcom Incorporated Piecewise linear neuron modeling
ES2738319T3 (en) * 2014-09-12 2020-01-21 Microsoft Technology Licensing Llc Computer system to train neural networks
CN104572011B (en) * 2014-12-22 2018-07-31 上海交通大学 Universal matrix fixed-point multiplication device based on FPGA and its computational methods
CN106599990B (en) * 2015-10-08 2019-04-09 上海兆芯集成电路有限公司 The neural pe array that neural network unit and collective with neural memory will be shifted from the data of neural memory column
US9870341B2 (en) * 2016-03-18 2018-01-16 Qualcomm Incorporated Memory reduction method for fixed point matrix multiply
CN105892989B (en) * 2016-03-28 2017-04-12 中国科学院计算技术研究所 Neural network accelerator and operational method thereof
CN109240746B (en) * 2016-04-26 2020-12-18 安徽寒武纪信息科技有限公司 Apparatus and method for performing matrix multiplication operation
CN107330515A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing artificial neural network forward operation
CN105956660A (en) * 2016-05-16 2016-09-21 浪潮集团有限公司 Neural network chip realization method used for real-time image identification
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)
CN106502626A (en) * 2016-11-03 2017-03-15 北京百度网讯科技有限公司 Data processing method and device
CN106844294B (en) * 2016-12-29 2019-05-03 华为机器有限公司 Convolution algorithm chip and communication equipment
CN106940815B (en) * 2017-02-13 2020-07-28 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN107292334A (en) * 2017-06-08 2017-10-24 北京深瞐科技有限公司 Image-recognizing method and device
CN107451658B (en) * 2017-07-24 2020-12-15 杭州菲数科技有限公司 Fixed-point method and system for floating-point operation

Also Published As

Publication number Publication date
CN109960673B (en) 2020-02-18
TWI767097B (en) 2022-06-11
CN109960673A (en) 2019-07-02

Similar Documents

Publication Publication Date Title
US11748605B2 (en) Integrated circuit chip device
TWI793225B (en) Method for neural network training and related product
TWI768159B (en) Integrated circuit chip apparatus and related product
WO2019114842A1 (en) Integrated circuit chip apparatus
TWI791725B (en) Neural network operation method, integrated circuit chip device and related products
CN110826712B (en) Neural network processor board card and related products
TWI793224B (en) Integrated circuit chip apparatus and related product
TWI767098B (en) Method for neural network forward computation and related product
TWI767097B (en) Integrated circuit chip apparatus and related product
CN109978152B (en) Integrated circuit chip device and related product
TWI768160B (en) Integrated circuit chip apparatus and related product
WO2019165946A1 (en) Integrated circuit chip device, board card and related product
CN109977071A (en) Neural network processor board and Related product
TWI795482B (en) Integrated circuit chip apparatus and related product
CN109978153B (en) Integrated circuit chip device and related product
CN109978130A (en) Integrated circuit chip device and Related product
CN109978154A (en) Integrated circuit chip device and Related product