TWI793224B - Integrated circuit chip apparatus and related product - Google Patents


Info

Publication number
TWI793224B
Authority
TW
Taiwan
Prior art keywords
processing circuit
data
circuit
basic
data block
Application number
TW107144033A
Other languages
Chinese (zh)
Other versions
TW201931219A (en)
Inventor
Inventor(s) waived the right to be named
Original Assignee
大陸商中科寒武紀科技股份有限公司
Application filed by 大陸商中科寒武紀科技股份有限公司
Publication of TW201931219A
Application granted
Publication of TWI793224B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present disclosure provides an integrated circuit chip device and related products. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit, or at least one of the plurality of basic processing circuits, includes a data type conversion circuit used to convert between floating-point data and fixed-point data. The technical solution provided by the present disclosure has the advantages of a small amount of computation and low power consumption.

Description

Integrated circuit chip device and related products

The present disclosure relates to the field of neural networks, and in particular to an integrated circuit chip device and related products.

An artificial neural network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the network of neurons in the human brain from an information-processing perspective, establishes a simple model, and composes different networks through different connection patterns. In engineering and academic circles it is often referred to simply as a neural network. A neural network is a computational model consisting of a large number of interconnected nodes (also called neurons). Existing neural network computation is carried out on a CPU (central processing unit) or a GPU (graphics processing unit); such computation involves a large amount of calculation and high power consumption.

Embodiments of the present disclosure provide an integrated circuit chip device and related products that can increase the processing speed and improve the efficiency of a computing device.

In a first aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit, or at least one of the plurality of basic processing circuits, includes a data type conversion circuit used to convert between floating-point data and fixed-point data.

The main processing circuit is configured to perform the successive operations of a neural network computation and to exchange data with the basic processing circuits.

The plurality of basic processing circuits are configured to perform neural network operations in parallel on the data transmitted by the main processing circuit, and to transmit the operation results back to the main processing circuit.

In a second aspect, a neural network computing device is provided. The neural network computing device includes one or more of the integrated circuit chip devices of the first aspect.

In a third aspect, a combined processing device is provided. The combined processing device includes the neural network computing device of the second aspect, a general interconnection interface, and a general-purpose processing device.

The neural network computing device is connected to the general-purpose processing device through the general interconnection interface.

In a fourth aspect, a chip is provided that integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.

In a fifth aspect, an electronic apparatus is provided that includes the chip of the fourth aspect.

In a sixth aspect, a neural network computation method is provided. The method is applied in the integrated circuit chip device of the first aspect, which is used to perform the computation of a neural network.

It can be seen that, in the disclosed embodiments, a data type conversion circuit is provided to convert the type of a data block before the operation, which saves transmission resources and computing resources; the scheme therefore has the advantages of low power consumption and a small amount of computation.

To help those skilled in the art better understand this disclosure, the technical solutions in its embodiments are described below clearly and completely with reference to the drawings. The described embodiments are clearly only some, not all, of the embodiments of this disclosure. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of this disclosure.

In the device of the first aspect, the integrated circuit chip device further includes a branch processing circuit arranged between the main processing circuit and at least one basic processing circuit; the branch processing circuit forwards data between the main processing circuit and the at least one basic processing circuit.

In the device of the first aspect, the main processing circuit is configured to obtain a data block to be computed and an operation instruction, to convert the data block into a fixed-point data block through the data type conversion circuit, to divide the fixed-point data block into a distribution data block and a broadcast data block according to the operation instruction, to split the distribution data block into multiple basic data blocks, to distribute the multiple basic data blocks to the at least one basic processing circuit, and to broadcast the broadcast data block to the at least one basic processing circuit.

The basic processing circuit is configured to perform an inner product operation on the basic data block and the broadcast data block in fixed-point form to obtain an operation result, and to send the operation result to the main processing circuit.

The main processing circuit is configured to process the operation results to obtain the instruction result of the operation instruction on the data block to be computed.
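The distribute/broadcast flow just described can be sketched as follows for a matrix-vector multiply. This is an illustrative simulation, not the patent's implementation: the fixed-point format (8 fractional bits), the function names, and the sequential emulation of the parallel basic circuits are all assumptions.

```python
# Hypothetical sketch: the "main circuit" converts floats to fixed point,
# splits the distribution data block (the matrix) into basic data blocks
# (rows), broadcasts the vector, and each "basic circuit" computes a
# fixed-point inner product that the main circuit assembles into the result.

FRAC_BITS = 8  # assumed fixed-point format with 8 fractional bits

def to_fixed(x):
    return round(x * (1 << FRAC_BITS))

def to_float(v):
    return v / (1 << FRAC_BITS)

def basic_circuit_inner_product(basic_block, broadcast_block):
    # a product of two Q.8 values carries 16 fractional bits; rescale back
    acc = sum(a * b for a, b in zip(basic_block, broadcast_block))
    return acc >> FRAC_BITS

def main_circuit_matvec(matrix, vector):
    fixed_rows = [[to_fixed(v) for v in row] for row in matrix]  # basic data blocks
    fixed_vec = [to_fixed(v) for v in vector]                    # broadcast data block
    results = [basic_circuit_inner_product(r, fixed_vec) for r in fixed_rows]
    return [to_float(r) for r in results]                        # instruction result

print(main_circuit_matvec([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.25]))  # → [1.0, 2.5]
```

Because both operands are integers after conversion, every multiply-accumulate in the inner loop is integer arithmetic, which is the source of the power and area savings the text claims for fixed-point operation.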

In the device of the first aspect, the main processing circuit is specifically configured to broadcast the broadcast data block to the multiple basic processing circuits in a single broadcast.

In the device of the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the basic data block and the broadcast data block in fixed-point form to obtain inner product results, to accumulate the inner product results into an operation result, and to send the operation result to the main processing circuit.

In the device of the first aspect, the main processing circuit is configured, when the operation result is an inner product result, to accumulate the operation results to obtain an accumulation result, and to arrange the accumulation results to obtain the instruction result of the operation instruction on the data block to be computed.

In the device of the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into multiple partial broadcast data blocks, and to broadcast the multiple partial broadcast data blocks to the multiple basic processing circuits over multiple broadcasts.

In the device of the first aspect, the basic processing circuit is specifically configured to perform one round of inner product processing on the partial broadcast data block and the basic data block in fixed-point form to obtain inner product results, to accumulate the inner product results into a partial operation result, and to send the partial operation result to the main processing circuit.

In the device of the first aspect, the basic processing circuit is specifically configured to reuse the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results, to accumulate the n partial processing results separately into n partial operation results, and to send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
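The n-fold reuse of a partial broadcast data block could be sketched as below. The function name and the two-fragment example are hypothetical; the point is that each broadcast fragment is fetched once and paired with every locally held basic data block, with each pairing accumulating into its own partial result.

```python
# Hypothetical sketch: one partial broadcast block is reused n times against
# n basic data blocks held by a basic processing circuit.

def reuse_broadcast(partial_broadcast, basic_blocks, partial_results):
    # partial_results[i] accumulates this fragment's contribution to output i
    for i, block in enumerate(basic_blocks):
        partial_results[i] += sum(a * b for a, b in zip(block, partial_broadcast))
    return partial_results

# the full broadcast vector [1, 2, 3, 4] arrives in two fragments
blocks = [[1, 0, 2, 0], [0, 1, 0, 3]]          # n = 2 basic data blocks
acc = [0, 0]
acc = reuse_broadcast([1, 2], [b[:2] for b in blocks], acc)  # fragment 1
acc = reuse_broadcast([3, 4], [b[2:] for b in blocks], acc)  # fragment 2
print(acc)  # → [7, 14], i.e. the full inner products of each block with [1, 2, 3, 4]
```

Reusing the fragment n times means the broadcast data crosses the interconnect once rather than n times, which is the data-reuse saving the passage describes.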

In the device of the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit.

The basic processing circuit includes a basic register or a basic on-chip cache circuit.

In the device of the first aspect, the main processing circuit includes one or any combination of: a vector arithmetic circuit, an arithmetic logic unit (ALU) circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access (DMA) circuit, a data type conversion circuit, or a data rearrangement circuit.

In the device of the first aspect, the main processing circuit is configured to obtain a data block to be computed and an operation instruction, to divide the data block into a distribution data block and a broadcast data block according to the operation instruction, to split the distribution data block into multiple basic data blocks, to distribute the multiple basic data blocks to the at least one basic processing circuit, and to broadcast the broadcast data block to the at least one basic processing circuit.

The basic processing circuit is configured to convert the basic data block and the broadcast data block into fixed-point data blocks, to perform an inner product operation on the fixed-point data blocks to obtain an operation result, to convert the operation result into floating-point data, and to send it to the main processing circuit.

The main processing circuit is configured to process the operation results to obtain the instruction result of the operation instruction on the data block to be computed.

In the device of the first aspect, the branch processing circuit comprises multiple branch processing circuits; the main processing circuit is connected to each of the multiple branch processing circuits, and each branch processing circuit is connected to at least one basic processing circuit.

In the device of the first aspect, the data is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, or an n-dimensional data block.

In the device of the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block.

If the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.
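The operand selection rule above could be expressed as a small dispatch table. The opcode strings and dictionary layout are hypothetical illustrations, not part of the disclosed instruction set.

```python
# Hypothetical sketch: the operation instruction decides which operand is
# broadcast to all basic circuits and which is split up and distributed.

def classify_operands(opcode, operands):
    if opcode == "MUL":   # multiplier broadcast, multiplicand distributed
        return {"broadcast": operands["multiplier"],
                "distribute": operands["multiplicand"]}
    if opcode == "CONV":  # input data broadcast, convolution kernels distributed
        return {"broadcast": operands["input"],
                "distribute": operands["kernel"]}
    raise ValueError(f"unsupported opcode: {opcode}")

plan = classify_operands("CONV", {"input": "feature_map", "kernel": "weights"})
print(plan)  # → {'broadcast': 'feature_map', 'distribute': 'weights'}
```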

In the method of the sixth aspect, the neural network operation includes one or any combination of: a convolution operation, a matrix-by-matrix multiplication, a matrix-by-vector multiplication, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, or an activation operation.

Referring to FIG. 1a, FIG. 1a is a schematic structural diagram of an integrated circuit chip device. As shown in FIG. 1a, the chip device includes a main processing circuit, basic processing circuits, and (optionally) branch processing circuits. Among them:

The main processing circuit may include registers and/or an on-chip cache circuit, and may also include a control circuit, a vector arithmetic circuit, an ALU (arithmetic logic unit) circuit, an accumulator circuit, a DMA (direct memory access) circuit, and the like. In practical applications, other circuits may also be added, such as a conversion circuit (for example, a matrix transpose circuit), a data rearrangement circuit, or an activation circuit.

Optionally, the main processing circuit may include a data type conversion circuit, which can be used to convert received or transmitted data from floating-point data to fixed-point data; in practical applications, it can also convert fixed-point data to floating-point data. The present invention does not limit the specific form of this data type conversion circuit.
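The text does not fix a conversion format, so the following is only a minimal sketch under assumed parameters: a signed 16-bit fixed-point representation with 8 fractional bits and saturation on overflow.

```python
# Hypothetical sketch of the float <-> fixed conversion the circuit performs.
# The Q-format (16 total bits, 8 fractional) and saturation are assumptions.

def float_to_fixed(x, frac_bits=8, total_bits=16):
    v = round(x * (1 << frac_bits))                      # scale and round
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, v))                           # saturate

def fixed_to_float(v, frac_bits=8):
    return v / (1 << frac_bits)

print(float_to_fixed(1.5))     # 384
print(fixed_to_float(384))     # 1.5
print(float_to_fixed(1000.0))  # saturates to 32767
```

The direction of conversion (float to fixed before computation, fixed back to float afterwards) mirrors the embodiments described in the aspects above.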

The main processing circuit also includes a data transmitting circuit, a data receiving circuit, or interfaces. The data transmitting circuit may integrate a data distribution circuit and a data broadcast circuit; in practical applications, the distribution circuit and the broadcast circuit may also be provided separately, and the transmitting and receiving circuits may be integrated into a single data transceiver circuit. Broadcast data is data that must be sent to every basic processing circuit. Distribution data is data that must be sent selectively to some of the basic processing circuits; the specific selection may be determined by the main processing circuit according to the load and the computation scheme. In the broadcast mode, the broadcast data is sent to every basic processing circuit in broadcast form (in practice, in a single broadcast or in several broadcasts; the embodiments of this application do not limit the number of broadcasts). In the distribution mode, the distribution data is sent selectively to some of the basic processing circuits.

When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits. The data may be the same or different: specifically, when data is sent by distribution, the data received by the individual basic processing circuits may differ, although some of them may also receive identical data.

Specifically, when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and every receiving basic processing circuit receives the same data.

Optionally, the vector arithmetic circuit of the main processing circuit can perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of a vector and a constant; or an arbitrary operation on each element of a vector. The successive operations may specifically be vector-and-constant addition, subtraction, multiplication, or division, activation operations, accumulation operations, and the like.

Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit, and may also include one or any combination of an inner product arithmetic circuit, a vector arithmetic circuit, an accumulator circuit, and the like. These inner product, vector, and accumulator circuits may be integrated circuits or separately arranged circuits.

The chip device may optionally include one or more branch processing circuits. When branch processing circuits are present, the main processing circuit is connected to the branch processing circuits, and the branch processing circuits are connected to the basic processing circuits; the inner product circuit of a basic processing circuit performs inner product operations between data blocks, the control circuit of the main processing circuit controls the data receiving or transmitting circuit to exchange external data, and controls the data transmitting circuit to distribute external data to the branch processing circuits, which relay data to and from the main processing circuit and the basic processing circuits. The structure shown in FIG. 1a suits the computation of complex data: because the number of units the main processing circuit can connect to is limited, branch processing circuits are added between the main processing circuit and the basic processing circuits to attach more basic processing circuits and thereby compute complex data blocks. The connection structure between the branch processing circuits and the basic processing circuits may be arbitrary and is not limited to the H-shaped structure of FIG. 1a.

Optionally, the path from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, and the path from the basic processing circuits back to the main processing circuit is a gather structure. Broadcast, distribution, and gather are defined as follows: in a distribution or broadcast structure, the number of basic processing circuits exceeds the number of main processing circuits, i.e. one main processing circuit corresponds to multiple basic processing circuits, so the path from the main processing circuit to the multiple basic processing circuits is a broadcast or distribution structure; conversely, the path from the multiple basic processing circuits to the main processing circuit may be a gather structure.
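The main-to-branch-to-basic topology and its gather path could be simulated as below. The class names and the fixed tree shape are illustrative assumptions; the patent leaves the connection structure arbitrary.

```python
# Hypothetical sketch of the main -> branch -> basic topology: the main
# circuit distributes basic data blocks through branch circuits, each branch
# forwards one block to each attached basic circuit, and results are gathered
# back in order (a gather structure).

class BasicCircuit:
    def compute(self, basic_block, broadcast_block):
        return sum(a * b for a, b in zip(basic_block, broadcast_block))

class BranchCircuit:
    def __init__(self, leaves):
        self.leaves = leaves  # basic circuits attached to this branch
    def forward(self, basic_blocks, broadcast_block):
        return [leaf.compute(blk, broadcast_block)
                for leaf, blk in zip(self.leaves, basic_blocks)]

class MainCircuit:
    def __init__(self, branches):
        self.branches = branches
    def run(self, basic_blocks, broadcast_block):
        results, i = [], 0
        for br in self.branches:
            n = len(br.leaves)
            results += br.forward(basic_blocks[i:i + n], broadcast_block)
            i += n
        return results  # many basic circuits gathered into one main circuit

main = MainCircuit([BranchCircuit([BasicCircuit(), BasicCircuit()]),
                    BranchCircuit([BasicCircuit(), BasicCircuit()])])
print(main.run([[1, 0], [0, 1], [1, 1], [2, 2]], [3, 4]))  # → [3, 4, 7, 14]
```

One main circuit here fans out to four basic circuits through two branches, matching the "one main, many basic" definition of a distribution structure given above.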

A basic processing circuit receives the data distributed or broadcast by the main processing circuit, saves it in its on-chip cache, can perform computation to produce results, and can send data to the main processing circuit.

The data involved in a basic processing circuit can be of any data type: data represented by floating-point numbers of any bit width or by fixed-point numbers of any bit width. Likewise, all of the arithmetic and storage circuits involved can be arithmetic and storage circuits for any data type they can process, whether floating-point or fixed-point numbers of any bit width.

Optionally, every basic processing circuit may include a data type conversion circuit, or only some of the basic processing circuits may be so equipped; the data type conversion circuit can convert received or transmitted data from floating-point data to fixed-point data, and can also convert fixed-point data to floating-point data. The present invention does not limit the specific form of this data type conversion circuit.

Optionally, the vector arithmetic circuit of a basic processing circuit can perform vector operations on two type-converted vectors; in practical applications, the inner product circuit of the basic processing circuit can compute the inner product of the two converted vectors, and the accumulator circuit can accumulate the results of the inner product operation.

In one option, the two vectors can be stored in the on-chip cache and/or registers, and the basic processing circuit can fetch the two vectors to perform an operation as the computation requires. The operation includes but is not limited to: an inner product operation, a multiplication operation, an addition operation, or other operations.

In one option, the results of the inner product operation can be accumulated into the on-chip cache and/or registers. The advantage of this option is that it reduces the volume of data transmitted between the basic processing circuits and the main processing circuit, improving computational efficiency and reducing data transmission power.

In one option, the results of the inner product operation are not accumulated but are transmitted directly as results. The advantage of this option is that it reduces the computation inside the basic processing circuit and improves the basic processing circuit's efficiency.

In one option, each basic processing circuit can perform inner product operations on multiple groups of two vectors, and can also accumulate the results of the multiple groups of inner product operations separately.

In one option, the multiple groups of two-vector data can be stored in the on-chip cache and/or registers.

In one option, the results of the multiple groups of inner product operations can be accumulated separately into the on-chip cache and/or registers.

In one option, the result of each group's inner product operation can be transmitted directly as a result without accumulation.

In one option, each basic processing circuit can compute inner products between the same vector and multiple other vectors (a "one-to-many" inner product, i.e. one of the two vectors in every group is shared), accumulating the inner product result corresponding to each vector separately. This technical solution allows the same set of weights to be computed repeatedly against different input data, increasing data reuse, reducing the internal data traffic of the basic processing circuits, improving computational efficiency, and lowering power consumption.
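A minimal sketch of the "one-to-many" inner product follows; the function name and example values are hypothetical. The shared vector is fetched once and reused for every input, which is the data-reuse benefit described above.

```python
# Hypothetical sketch: one shared vector (e.g. a set of weights) is paired
# with several inputs, each pair accumulating into its own result.

def one_to_many_inner_product(shared, inputs):
    # shared is fetched once; every input vector reuses it
    return [sum(w * x for w, x in zip(shared, vec)) for vec in inputs]

weights = [1, -1, 2]                       # the vector shared by all groups
inputs = [[1, 2, 3], [0, 1, 0], [2, 2, 2]] # the per-group, non-shared vectors
print(one_to_many_inner_product(weights, inputs))  # → [5, -1, 4]
```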

具體地,計算內積使用的數據中,各組共享的向量和每組的另一個向量(即每組之間不同的那個向量)的數據來源可以不同:Specifically, in the data used to calculate the inner product, the data sources of the shared vector of each group and the other vector of each group (that is, the vector that is different between each group) can be different:

在一種可選方案中,在計算內積時,各組共享的向量來自主處理電路或者分支處理電路的廣播或者分發;In an optional solution, when calculating the inner product, the vectors shared by each group are broadcast or distributed from the main processing circuit or the branch processing circuit;

在一種可選方案中,在計算內積時,各組共享的向量來自片上緩存;In an optional solution, when calculating the inner product, the vectors shared by each group come from the on-chip cache;

在一種可選方案中,在計算內積時,各組共享的向量來自寄存器;In an optional solution, when calculating the inner product, the vector shared by each group comes from a register;

在一種可選方案中,在計算內積時,每組的另一個非共享向量來自主處理電路或者分支處理電路的廣播或者分發;In an optional solution, when calculating the inner product, another non-shared vector of each group is broadcast or distributed from the main processing circuit or the branch processing circuit;

在一種可選方案中，在計算內積時，每組的另一個非共享向量來自片上緩存；In an optional solution, when calculating the inner product, the other, non-shared vector of each group comes from the on-chip cache;

在一種可選方案中,在計算內積時,每組的另一個非共享向量來自寄存器;In an alternative, another non-shared vector for each group comes from a register when computing the inner product;

在一種可選方案中，在進行多組的內積運算時，每組共享的向量在基礎處理電路的片上緩存和/或寄存器中保留任意份數；In an optional solution, when multiple groups of inner product operations are performed, any number of copies of each group's shared vector may be kept in the on-chip cache and/or registers of the basic processing circuit;

在一種可選方案中,共享向量可以對應每組內積各保留一份;In an optional solution, the shared vector can be reserved for each group of inner products;

在一種可選方案中,共享向量可以只保留一份;In an optional solution, only one copy of the shared vector can be reserved;

具體地,多組內積運算的結果可以分別累加到片上緩存和/或寄存器中;Specifically, the results of multiple sets of inner product operations can be respectively accumulated in on-chip caches and/or registers;

具體地,各組內積運算的結果可以不進行累加,直接作為結果傳輸;Specifically, the results of the inner product operations of each group may not be accumulated, but directly transmitted as the result;
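The "one-to-many" inner product described above can be illustrated with a short sketch (Python is used here for illustration only and is not part of the disclosed circuit; the function name `one_to_many_inner_product` and the data values are hypothetical). The point is that the shared vector is loaded once and reused against every non-shared vector, with each group accumulated into its own result.

```python
def one_to_many_inner_product(shared, vectors):
    """Compute the inner product of `shared` with each vector in `vectors`.

    The shared operand is read once (modeling its retention in a register or
    on-chip cache), so every group's inner product reuses the same data.
    """
    results = []
    for v in vectors:
        acc = 0
        for a, b in zip(shared, v):
            acc += a * b  # per-group accumulation, kept separate per vector
        results.append(acc)
    return results
```

In the hardware described above, the shared vector would reside in the basic processing circuit's on-chip cache or registers rather than in a Python list.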

參閱圖1a所示的結構，其包含一主處理電路（可以執行向量操作），多個基礎處理電路（可以執行內積操作）。這樣組合的好處是：裝置不僅能使用基礎處理電路執行矩陣和向量乘法運算，也能使用主處理電路執行其他任意的向量運算，使裝置在有限的硬件電路的配置下，能夠更快地完成更多的運算，減少了與裝置外部進行數據傳輸的次數，提高了計算效率，降低了功耗。另外，本芯片在基礎處理電路和/或主處理電路均可以設置數據類型轉換運算電路，這樣在進行神經網絡計算時能夠將浮點類型數據轉換成定點類型數據，也可以將定點類型數據轉換成浮點類型數據，並且本芯片可以依據各個電路（主要是主處理電路和基礎處理電路）的運算量（即負載量）動態地分配由哪個電路將數據類型進行轉換，這樣能夠減少數據計算的複雜程度，降低功耗，並且動態地分配數據類型的轉換能夠不影響芯片的計算效率。該分配的方式包括但不限於：負載均衡、負載最小值分配等等方式。Refer to the structure shown in Figure 1a, which includes a main processing circuit (capable of performing vector operations) and multiple basic processing circuits (capable of performing inner product operations). The advantage of this combination is that the device can not only use the basic processing circuits to perform matrix and vector multiplication operations, but also use the main processing circuit to perform any other vector operations, so that the device can complete more operations more quickly under a limited hardware circuit configuration, reducing the number of data transmissions with the outside of the device, improving computing efficiency, and reducing power consumption. In addition, a data type conversion operation circuit can be provided in the basic processing circuits and/or the main processing circuit of this chip, so that when performing neural network calculations, floating-point data can be converted into fixed-point data and fixed-point data can be converted into floating-point data. Moreover, the chip can dynamically decide, according to the operation amount (i.e., the load) of each circuit (mainly the main processing circuit and the basic processing circuits), which circuit performs the data type conversion. This reduces the complexity of the data calculation and reduces power consumption, and dynamically allocating the data type conversion can avoid affecting the computing efficiency of the chip. The allocation methods include, but are not limited to, load balancing, minimum-load allocation, and the like.

該定點類型數據的一種結構示意圖如圖1d所示，如圖1d所示，為一種定點類型數據的表達方法，對於計算系統，1個浮點數據的存儲位數為32bit，而對於定點數據，尤其是採用如圖1d所示的定點類型的數據進行數據的表示，其1個定點數據的存儲位數可以做到16Bit以下，所以對於此轉換來說，可以極大地減少計算器之間的傳輸開銷，另外，對於計算器來說，較少比特位的數據存儲的空間也較小，即存儲開銷會較小，計算量也會減少，即計算開銷會減少，所以能夠減少計算開銷以及存儲的開銷，但是對於數據類型的轉換也是需要有部分的開銷的，下面簡稱轉換開銷，對於計算量大、數據存儲量大的數據，轉換開銷相對於後續的計算開銷、存儲開銷以及傳輸開銷來說幾乎可以忽略不計，所以對於計算量大、數據存儲量大的數據，本披露採用了將數據類型轉換成定點類型的數據的技術方案，反之，對於計算量小、數據存儲量小的數據，此時由於本身計算開銷、存儲開銷以及傳輸開銷就比較小，此時如果使用定點數據，由於定點數據的精度會略低於浮點數據，在計算量較小的前提下，需要保證計算的精度，所以這裡將定點類型的數據轉換成浮點數據，即通過增加較小的開銷來達到提高計算精度的目的。A structural diagram of the fixed-point data type is shown in Figure 1d, which illustrates one method of expressing fixed-point data. For a computing system, one floating-point datum occupies 32 bits of storage, whereas one fixed-point datum, in particular one represented in the fixed-point format shown in Figure 1d, can occupy 16 bits or fewer. This conversion can therefore greatly reduce the transmission overhead between computing units. In addition, data with fewer bits requires less storage space, i.e., the storage overhead is smaller, and the amount of calculation is also reduced, i.e., the calculation overhead is smaller, so both the calculation overhead and the storage overhead can be reduced. However, the data type conversion itself also incurs some overhead, hereinafter referred to as the conversion overhead. For data with a large amount of calculation and a large amount of storage, the conversion overhead is almost negligible compared with the subsequent calculation, storage, and transmission overheads; for such data, the present disclosure therefore adopts the technical solution of converting the data into the fixed-point type. Conversely, for data with a small amount of calculation and a small amount of storage, the calculation, storage, and transmission overheads are already small; if fixed-point data were used, its precision, which is slightly lower than that of floating-point data, would matter, and since the accuracy of the calculation must be guaranteed when the amount of calculation is small, the fixed-point data is here converted into floating-point data, i.e., the accuracy of the calculation is improved at a small cost in overhead.
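The floating-point/fixed-point trade-off discussed above can be sketched as follows. This is an illustrative Q-format conversion, not the specific bit layout of Figure 1d; the parameters `frac_bits` and `total_bits` are hypothetical choices used only to show the narrower storage and the bounded precision loss.

```python
def float_to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize a float to a signed fixed-point integer with `frac_bits`
    fractional bits, saturating to the `total_bits`-wide signed range."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = int(round(x * scale))
    return max(lo, min(hi, q))  # saturate instead of wrapping on overflow

def fixed_to_float(q, frac_bits=8):
    """Recover the (approximate) real value from the fixed-point integer."""
    return q / (1 << frac_bits)
```

With 8 fractional bits the round-trip error is at most half a quantization step (1/512), which models the "slightly lower precision" of fixed-point data mentioned above.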

參閱圖1b所示的裝置，圖1b所示的裝置為無分支處理電路的計算裝置，如圖1b所示的裝置，其包括：主處理電路以及N個基礎處理電路，其中，主處理電路（具體的結構如圖1c所示）與N個基礎處理電路可以直接或間接連接，如為間接連接的方式時，一種可選的方案如圖1a所示可以包括N/4個分支處理電路，每個分支處理電路分別連接4個基礎處理電路，對於主處理電路以及N個基礎處理電路分別包含的電路可以參見上述如圖1a所示的描述，這裡不再贅述，這裡需要說明的是，上述基礎處理電路還可以設置在分支處理電路內，另外，每個分支處理電路連接基礎處理電路的數量也可以不局限於4個，廠家可以根據實際需要進行配置。該上述主處理電路和/或N個基礎處理電路均可以包括數據類型轉換運算電路，具體的，可以是主處理電路包括數據類型轉換運算電路，也可以是N個基礎處理電路或其中的一部分包括數據類型轉換電路，也可以是主處理電路和N個基礎處理電路或其中的一部分均包括。上述主處理電路可以根據神經網絡計算指令動態地分配數據類型轉換步驟的操作實體，具體的，主處理電路可以根據自身的負載確定是否對接收到的數據執行數據類型轉換步驟，具體的，可以將負載的值設置多個區間，每個區間對應分配數據類型轉換步驟的執行主體，例如，以3個區間為例，區間1的負載值較低，可以由主處理電路單獨執行數據類型轉換步驟，區間2負載值位於區間1以及區間3之間，可以由主處理電路或N個基礎處理電路共同執行數據類型轉換步驟，區間3負載值較高，可以由N個基礎處理電路執行數據類型轉換步驟。對此，可以以明示的方式來執行，例如主處理電路可以配置一個特殊指示或指令，當基礎處理電路接收到該特殊指示或指令時，確定執行數據類型轉換步驟，如基礎處理電路未接收到特殊指示或指令時，確定不執行數據類型轉換步驟。又如，可以以暗示的方式來執行，例如，基礎處理電路接收到數據類型為浮點類型的數據且確定需要執行內積運算時，將該數據類型轉換成定點類型的數據。Referring to the device shown in Figure 1b, the device shown in Figure 1b is a computing device without branch processing circuits and includes a main processing circuit and N basic processing circuits, where the main processing circuit (whose specific structure is shown in Figure 1c) and the N basic processing circuits can be connected directly or indirectly. In the case of an indirect connection, one optional solution, as shown in Figure 1a, may include N/4 branch processing circuits, each of which is connected to 4 basic processing circuits. For the circuits included in the main processing circuit and the N basic processing circuits, refer to the description of Figure 1a above, which is not repeated here. It should be noted that the basic processing circuits may also be arranged inside the branch processing circuits, and the number of basic processing circuits connected to each branch processing circuit is not limited to 4; the manufacturer can configure it according to actual needs. The main processing circuit and/or the N basic processing circuits may each include a data type conversion operation circuit; specifically, the main processing circuit may include it, the N basic processing circuits or a part of them may include it, or both the main processing circuit and the N basic processing circuits (or a part of them) may include it. The main processing circuit can dynamically allocate the entity that performs the data type conversion step according to the neural network calculation instruction. Specifically, the main processing circuit can determine, according to its own load, whether to perform the data type conversion step on the received data. More specifically, the load value can be divided into multiple intervals, each interval corresponding to an entity assigned to perform the data type conversion step. Taking 3 intervals as an example: in interval 1 the load value is low, and the data type conversion step can be performed by the main processing circuit alone; in interval 2 the load value lies between interval 1 and interval 3, and the data type conversion step can be performed by the main processing circuit together with the N basic processing circuits; in interval 3 the load value is high, and the data type conversion step can be performed by the N basic processing circuits. This can be done explicitly; for example, the main processing circuit can issue a special indication or instruction, and when a basic processing circuit receives the special indication or instruction, it determines that the data type conversion step is to be performed, while if it does not receive the special indication or instruction, it determines that the data type conversion step is not to be performed. It can also be done implicitly; for example, when a basic processing circuit receives floating-point data and determines that an inner product operation needs to be performed, it converts the data into the fixed-point type.
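The three-interval load dispatch described above might be sketched as follows. The thresholds `LOW` and `HIGH` and the function `assign_conversion` are hypothetical stand-ins for the main processing circuit's decision logic; the patent does not fix concrete interval boundaries.

```python
LOW, HIGH = 30, 70  # hypothetical load thresholds (e.g., percent utilization)

def assign_conversion(load):
    """Return which entity performs the data type conversion step."""
    if load < LOW:
        return "main"        # interval 1: main processing circuit alone
    elif load <= HIGH:
        return "main+basic"  # interval 2: shared by main and basic circuits
    else:
        return "basic"       # interval 3: the N basic processing circuits
```

In the explicit variant described above, the result of such a decision would be signaled to the basic processing circuits via a special indication or instruction.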

下面提供一種採用如圖1a所示的裝置實現計算的方法，該計算的方法具體可以為神經網絡的計算方式，例如神經網絡的正向運算，神經網絡的訓練，在實際應用中，正向運算依據不同的輸入數據可以執行矩陣乘矩陣、卷積運算、激活運算、變換運算等等運算，上述運算均可以採用如圖1a所示的裝置實現。The following provides a method of performing calculations using the device shown in Figure 1a. The calculation method may specifically be a neural network calculation, such as a forward operation of a neural network or the training of a neural network. In practical applications, the forward operation can perform operations such as matrix-times-matrix, convolution, activation, and transformation, depending on the input data, and all of the above operations can be implemented by the device shown in Figure 1a.

主處理電路的數據轉換運算電路先對數據的類型進行轉換然後由控制電路傳輸給基礎處理電路運算，例如，主處理電路的數據轉換運算電路可以將浮點數轉換成位寬更低的定點數再傳輸給基礎處理電路，其優點是可以減少傳輸數據的位寬，減少傳輸的總比特數量，基礎處理電路執行低位寬定點運算的效率也更高，功耗更低。The data conversion operation circuit of the main processing circuit first converts the type of the data, and the control circuit then transmits the data to the basic processing circuits for operation. For example, the data conversion operation circuit of the main processing circuit can convert floating-point numbers into fixed-point numbers of lower bit width before transmitting them to the basic processing circuits. The advantages are that the bit width of the transmitted data and the total number of transmitted bits are reduced, and the basic processing circuits execute low-bit-width fixed-point operations with higher efficiency and lower power consumption.

如基礎處理電路接收到的數據為浮點數據，那麼基礎處理電路可以收到數據後由數據轉換運算電路先進行數據類型轉化然後再進行計算，例如，基礎處理電路收到主處理電路傳輸過來的浮點數，數據轉換運算電路將其轉換為定點數，然後基礎處理電路的內積運算器電路、向量運算器電路或累加器電路進行運算，提高運算效率，降低功耗。If the data received by a basic processing circuit is floating-point data, the basic processing circuit can first have its data conversion operation circuit convert the data type after receiving the data and then perform the calculation. For example, after the basic processing circuit receives a floating-point number transmitted from the main processing circuit, the data conversion operation circuit converts it into a fixed-point number, and then the inner product operator circuit, vector operator circuit, or accumulator circuit of the basic processing circuit performs the operation, improving operation efficiency and reducing power consumption.

基礎處理電路計算出結果之後可以先進行數據類型轉換然後再傳輸給主處理電路，例如，基礎處理電路計算出的浮點數運算結果可以先轉換為低位寬的定點數然後再傳輸給主處理電路，其好處是降低了傳輸過程的數據位寬，效率更高，而且節約了功耗。After a basic processing circuit computes a result, it can first perform data type conversion and then transmit the result to the main processing circuit. For example, a floating-point operation result computed by the basic processing circuit can first be converted into a fixed-point number of lower bit width and then transmitted to the main processing circuit. The advantages are that the data bit width during transmission is reduced, the efficiency is higher, and power consumption is saved.

主處理電路將待計算的數據傳輸到全部或者一部分基礎處理電路上；以矩陣乘以向量計算為例，主處理電路的控制電路可以將矩陣數據拆分每列作為一個基礎數據，例如m*n矩陣，可以拆分成n個m行的向量，主處理電路的控制電路將拆分後的n個m行的向量分發給多個基礎處理電路。對於向量，主處理電路的控制電路可以將向量整體廣播給每個基礎處理電路。如果m的值比較大，那麼控制電路可以先將m*n矩陣拆分成x*n個向量，以x=2為例，具體的可以拆分成2n個向量，每個向量包含m/2行，即將n個m行的向量中每個向量均分成2個向量，以第一個向量為例，如n個m行的向量的第一個向量為1000行，那麼均分成2個向量可以為，將前500行組成第一向量，將後500行組成第二向量，控制電路通過2個廣播將2個向量廣播給多個基礎處理電路。The main processing circuit transmits the data to be calculated to all or some of the basic processing circuits. Taking matrix-times-vector calculation as an example, the control circuit of the main processing circuit can split the matrix data so that each column serves as one piece of basic data; for example, an m*n matrix can be split into n column vectors of m rows each, and the control circuit of the main processing circuit distributes the n split vectors to the multiple basic processing circuits. For the vector, the control circuit of the main processing circuit can broadcast the whole vector to every basic processing circuit. If the value of m is relatively large, the control circuit can first split the m*n matrix into x*n vectors. Taking x=2 as an example, the matrix is split into 2n vectors, each containing m/2 rows, i.e., each of the n vectors of m rows is evenly divided into 2 vectors. Taking the first one as an example, if the first of the n vectors of m rows has 1000 rows, dividing it evenly into 2 vectors can mean forming the first vector from the first 500 rows and the second vector from the last 500 rows, and the control circuit then broadcasts the 2 vectors to the multiple basic processing circuits by means of 2 broadcasts.
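The column split described above (an m*n matrix split into n column vectors, each optionally halved for the x=2 case) can be sketched as follows. This is an illustrative model only; the helper names are hypothetical, and matrices are represented as lists of rows.

```python
def split_columns(matrix):
    """Return the n column vectors (each of m elements) of an m*n matrix
    given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

def halve(vector):
    """Split one column vector into two halves for two separate broadcasts,
    e.g. rows 0..499 and rows 500..999 in the 1000-row example above."""
    mid = len(vector) // 2
    return vector[:mid], vector[mid:]
```

For x values other than 2, `halve` would generalize to slicing the vector into x roughly equal segments, one per broadcast.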

所述數據傳輸的方式可以是廣播或者分發,或者其他任何可能的傳輸方式;The data transmission method may be broadcast or distribution, or any other possible transmission method;

基礎處理電路接收到數據後,執行運算,得到運算結果;After the basic processing circuit receives the data, it executes the calculation and obtains the calculation result;

基礎處理電路將運算結果傳輸回主處理電路;The basic processing circuit transmits the operation result back to the main processing circuit;

所述運算結果可以是中間運算結果,也可以是最終運算結果。The operation result may be an intermediate operation result or a final operation result.

使用如圖1a所示裝置完成矩陣乘向量的運算;Use the device shown in Figure 1a to complete the operation of multiplying the vector by the matrix;

（矩陣乘向量可以是矩陣中的每一行分別與向量進行內積運算，並將這些結果按對應行的順序擺放成一個向量。）(Matrix-times-vector can mean that each row of the matrix performs an inner product operation with the vector, and the results are arranged into a vector in the order of the corresponding rows.)

下面描述計算尺寸是M行L列的矩陣S和長度是L的向量P的乘法的運算，如下圖2a所示，（矩陣S中的每一行與向量P長度相同，它們中的數據按位置一一對應）所述神經網絡計算裝置擁有K個基礎處理電路：The following describes the operation of multiplying a matrix S of size M rows by L columns with a vector P of length L, as shown in Figure 2a below (each row of the matrix S has the same length as the vector P, and their data correspond one-to-one by position). The neural network computing device has K basic processing circuits:

參閱圖2，圖2提供了一種矩陣乘向量的實現方法，具體可以包括：Referring to Figure 2, Figure 2 provides an implementation method of matrix-times-vector, which may specifically include:

步驟S201，主處理電路的數據轉換運算電路將矩陣S中的每一行數據轉換成定點類型的數據，主處理電路的控制電路將其分發到K個基礎處理電路中的某一個上，基礎處理電路將接收到的分發數據保存在基礎處理電路的片上緩存和/或寄存器中；Step S201: the data conversion operation circuit of the main processing circuit converts each row of data in the matrix S into fixed-point data, and the control circuit of the main processing circuit distributes each row to one of the K basic processing circuits; the basic processing circuit stores the received distribution data in its on-chip cache and/or registers;

在一種可選方案中，如果矩陣S的行數M<=K，則主處理電路的控制電路給K個基礎處理電路分別分發S矩陣的一行；In an optional solution, if the number of rows M of the matrix S satisfies M<=K, the control circuit of the main processing circuit distributes one row of the matrix S to each of the K basic processing circuits;

在一種可選方案中,如果矩陣S的行數M>K,則主處理電路的控制電路給每個基礎處理電路分別分發S矩陣中一行或多行的數據。In an optional solution, if the number of rows M>K of the matrix S, the control circuit of the main processing circuit distributes the data of one or more rows in the matrix S to each basic processing circuit.

分發到第i個基礎處理電路的S中的行的集合為Ai,共有Mi個行,如圖2c表示第i個基礎處理電路上將要執行的計算。The set of rows in S distributed to the i-th basic processing circuit is Ai, and there are a total of Mi rows. Figure 2c shows the calculation to be performed on the i-th basic processing circuit.

在一種可選方案中，在每個基礎處理電路中，例如第i個基礎處理電路中，可以將接收到的分發數據例如矩陣Ai保存在第i個基礎處理電路的寄存器和/或片上緩存中；優點是減少了之後的分發數據的數據傳輸量，提高了計算效率，降低了功耗。In an optional solution, in each basic processing circuit, for example the i-th basic processing circuit, the received distribution data, such as the matrix Ai, can be stored in the registers and/or on-chip cache of the i-th basic processing circuit. The advantages are that the amount of data transmission for subsequent distribution is reduced, calculation efficiency is improved, and power consumption is reduced.

步驟S202,主處理電路的數據類型運算電路將向量P轉換成定點類型的數據,主處理電路的控制電路將定點類型的向量P中各部分以廣播的方式傳輸給K個基礎處理電路;Step S202, the data type operation circuit of the main processing circuit converts the vector P into fixed-point data, and the control circuit of the main processing circuit transmits each part of the fixed-point vector P to K basic processing circuits in a broadcast manner;

在一種可選方案中，主處理電路的控制電路可以將向量P中各部分只廣播一次到各個基礎處理電路的寄存器或者片上緩存中，第i個基礎處理電路對這一次得到的向量P的數據進行充分地復用，完成對應於矩陣Ai中每一行的內積運算。優點是，減少從主處理電路到基礎處理電路的向量P的重復傳輸的數據傳輸量，提高執行效率，降低傳輸功耗。In an optional solution, the control circuit of the main processing circuit can broadcast each part of the vector P only once into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of the vector P obtained from this single broadcast to complete the inner product operation corresponding to each row of the matrix Ai. The advantages are that the data transmission amount of repeated transmissions of the vector P from the main processing circuit to the basic processing circuits is reduced, execution efficiency is improved, and transmission power consumption is reduced.

在一種可選方案中，主處理電路的控制電路可以將向量P中各部分多次廣播到各個基礎處理電路的寄存器或者片上緩存中，第i個基礎處理電路對每次得到的向量P的數據不進行復用，分次完成對應於矩陣Ai中的每一行的內積運算；優點是，減少基礎處理電路內部的單次傳輸的向量P的數據傳輸量，並可以降低基礎處理電路緩存和/或寄存器的容量，提高執行效率，降低傳輸功耗，降低成本。In an optional solution, the control circuit of the main processing circuit can broadcast each part of the vector P multiple times into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of the vector P obtained each time, completing the inner product operations corresponding to the rows of the matrix Ai in batches. The advantages are that the data transmission amount of a single transmission of the vector P inside the basic processing circuit is reduced, the required capacity of the basic processing circuit's cache and/or registers can be lowered, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.

在一種可選方案中，主處理電路的控制電路可以將向量P中各部分多次廣播到各個基礎處理電路的寄存器或者片上緩存中，第i個基礎處理電路對每次得到的向量P的數據進行部分復用，完成對應於矩陣Ai中的每一行的內積運算；優點是，減少從主處理電路到基礎處理電路的數據傳輸量，也減少基礎處理電路內部的數據傳輸量，提高執行效率，降低傳輸功耗。In an optional solution, the control circuit of the main processing circuit can broadcast each part of the vector P multiple times into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of the vector P obtained each time to complete the inner product operation corresponding to each row of the matrix Ai. The advantages are that the amount of data transmitted from the main processing circuit to the basic processing circuits is reduced, the amount of data transmitted inside the basic processing circuit is also reduced, execution efficiency is improved, and transmission power consumption is reduced.

步驟S203,K個基礎處理電路的內積運算器電路計算矩陣S和向量P的數據的內積,例如第i個基礎處理電路,計算矩陣Ai的數據和向量P的數據的內積;Step S203, the inner product operator circuit of the K basic processing circuits calculates the inner product of the data of the matrix S and the vector P, for example, the ith basic processing circuit calculates the inner product of the data of the matrix Ai and the data of the vector P;

步驟S204,K個基礎處理電路的累加器電路將內積運算的結果進行累加得到累加結果,將累加結果以定點類型形式傳輸回主處理電路。In step S204, the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operation to obtain an accumulation result, and transmit the accumulation result back to the main processing circuit in the form of a fixed-point type.

在一種可選方案中，可以將每次基礎處理電路執行內積運算得到的部分和（部分和即累加結果的一部分，例如累加結果為：F1*G1+F2*G2+F3*G3+F4*G4+F5*G5，那麼部分和可以為：F1*G1+F2*G2+F3*G3的值）傳輸回主處理電路進行累加；優點是，減少了基礎處理電路內部的運算量，提高基礎處理電路的運算效率。In an optional solution, the partial sum obtained each time the basic processing circuit performs an inner product operation (a partial sum is a part of the accumulation result; for example, if the accumulation result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3) can be transmitted back to the main processing circuit for accumulation. The advantages are that the amount of computation inside the basic processing circuit is reduced and the operation efficiency of the basic processing circuit is improved.

在一種可選方案中，也可以將每次基礎處理電路執行的內積運算得到的部分和保存在基礎處理電路的寄存器和/或片上緩存中，累加結束之後傳輸回主處理電路；優點是，減少了基礎處理電路和主處理電路之間的數據傳輸量，提高了運算效率，降低了數據傳輸功耗。In an optional solution, the partial sum obtained each time the basic processing circuit performs an inner product operation can also be stored in the registers and/or on-chip cache of the basic processing circuit, and transmitted back to the main processing circuit after the accumulation is completed. The advantages are that the amount of data transmission between the basic processing circuit and the main processing circuit is reduced, the operation efficiency is improved, and the power consumption of data transmission is reduced.

在一種可選方案中，也可以將每次基礎處理電路執行的內積運算得到的部分和在部分情況下保存在基礎處理電路的寄存器和/或片上緩存中進行累加，部分情況下傳輸到主處理電路進行累加，累加結束之後傳輸回主處理電路；優點是，減少了基礎處理電路和主處理電路之間的數據傳輸量，提高了運算效率，降低了數據傳輸功耗，減少了基礎處理電路內部的運算量，提高基礎處理電路的運算效率。In an optional solution, the partial sums obtained from the inner product operations performed by the basic processing circuit can, in some cases, be stored in the registers and/or on-chip cache of the basic processing circuit for accumulation, and in other cases be transmitted to the main processing circuit for accumulation, and transmitted back to the main processing circuit after the accumulation is completed. The advantages are that the amount of data transmission between the basic processing circuit and the main processing circuit is reduced, the operation efficiency is improved, the power consumption of data transmission is reduced, the amount of computation inside the basic processing circuit is reduced, and the operation efficiency of the basic processing circuit is improved.
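Steps S201 to S204 above can be sketched end to end as follows, assuming K basic processing circuits and a round-robin distribution of the rows of S (the distribution order is an assumption; the disclosure leaves it open). Data type conversion is omitted for brevity.

```python
def matrix_times_vector(S, P, K=4):
    """Model of steps S201-S204: distribute rows, broadcast P, compute and
    accumulate inner products, and reassemble the result vector."""
    # S201: distribute the rows of S over K basic processing circuits
    circuits = [[] for _ in range(K)]
    for idx, row in enumerate(S):
        circuits[idx % K].append((idx, row))  # circuit idx % K holds row idx
    # S202: broadcast P (every circuit receives the same vector)
    result = [0] * len(S)
    for Ai in circuits:
        for idx, row in Ai:
            # S203/S204: inner product with accumulation, sent back to the
            # main processing circuit and placed at the row's position
            result[idx] = sum(a * p for a, p in zip(row, P))
    return result
```

The loops over circuits run sequentially here, whereas in the device the K basic processing circuits operate in parallel.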

參閱圖2b,使用如圖1a所示的裝置完成矩陣乘矩陣的運算;Referring to Fig. 2b, use the device as shown in Fig. 1a to complete the operation of matrix multiplication matrix;

下面描述計算尺寸是M行L列的矩陣S和尺寸是L行N列的矩陣W的乘法的運算,(矩陣S中的每一行與矩陣W的每一列長度相同,如圖2d所示)所述神經網絡計算裝置擁有K個基礎處理電路:The following describes the operation of calculating the multiplication of a matrix S with a size of M rows and L columns and a matrix W with a size of L rows and N columns, (each row in the matrix S has the same length as each column of the matrix W, as shown in Figure 2d) The neural network computing device has K basic processing circuits:

步驟S201b,主處理電路的控制電路將矩陣S中的每一行數據分發到K個基礎處理電路中的某一個上,基礎處理電路將接收到的數據保存在片上緩存和/或寄存器中;Step S201b, the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit stores the received data in the on-chip cache and/or register;

在一種可選方案中，如果S的行數M<=K，則主處理電路的控制電路給M個基礎處理電路分別分發S矩陣的一行；In an optional solution, if the number of rows M of S satisfies M<=K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits;

在一種可選方案中,如果S的行數M>K,主處理電路的控制電路給每個基礎處理電路分別分發S矩陣中一行或多行的數據。In an optional solution, if the number of rows of S is M>K, the control circuit of the main processing circuit distributes the data of one or more rows in the matrix S to each basic processing circuit.

S中有Mi行分發到第i個基礎處理電路,這Mi行的集合稱為Ai,如圖2e表示第i個基礎處理電路上將要執行的計算。There are Mi rows in S distributed to the i-th basic processing circuit, and the set of Mi rows is called Ai, as shown in Figure 2e, which shows the computation to be performed on the i-th basic processing circuit.

在一種可選方案中,在每個基礎處理電路中,例如第i個基礎處理電路中:In an optional solution, in each basic processing circuit, for example, in the i-th basic processing circuit:

接收由主處理電路分發的矩陣Ai，將矩陣Ai保存在第i個基礎處理電路的寄存器和/或片上緩存中；優點是減少了之後的數據傳輸量，提高了計算效率，降低了功耗。The matrix Ai distributed by the main processing circuit is received and stored in the registers and/or on-chip cache of the i-th basic processing circuit. The advantages are that the subsequent data transmission amount is reduced, calculation efficiency is improved, and power consumption is reduced.

步驟S202b,主處理電路的控制電路將矩陣W中各部分以廣播的方式傳輸給各個基礎處理電路;Step S202b, the control circuit of the main processing circuit transmits each part in the matrix W to each basic processing circuit in a broadcast manner;

在一種可選方案中，可以將矩陣W中各部分只廣播一次到各個基礎處理電路的寄存器或者片上緩存中，第i個基礎處理電路對這一次得到的矩陣W的數據進行充分地復用，完成對應於矩陣Ai中每一行的內積運算；本實施例中的復用具體可以為基礎處理電路在計算中重復使用，例如矩陣W的數據的復用，可以是對矩陣W的數據多次使用。In an optional solution, each part of the matrix W can be broadcast only once into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of the matrix W obtained from this single broadcast to complete the inner product operation corresponding to each row of the matrix Ai. The reuse in this embodiment specifically means repeated use by the basic processing circuit during calculation; for example, reuse of the data of the matrix W can mean using the data of the matrix W multiple times.

在一種可選方案中，主處理電路的控制電路可以將矩陣W中各部分多次廣播到各個基礎處理電路的寄存器或者片上緩存中，第i個基礎處理電路對每次得到的矩陣W的數據不進行復用，分次完成對應於矩陣Ai中的每一行的內積運算；In an optional solution, the control circuit of the main processing circuit can broadcast each part of the matrix W multiple times into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of the matrix W obtained each time, completing the inner product operations corresponding to the rows of the matrix Ai in batches;

在一種可選方案中，主處理電路的控制電路可以將矩陣W中各部分多次廣播到各個基礎處理電路的寄存器或者片上緩存中，第i個基礎處理電路對每次得到的矩陣W的數據進行部分復用，完成對應於矩陣Ai中的每一行的內積運算；In an optional solution, the control circuit of the main processing circuit can broadcast each part of the matrix W multiple times into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of the matrix W obtained each time to complete the inner product operation corresponding to each row of the matrix Ai;

在一種可選方案中,每個基礎處理電路,例如第i個基礎處理電路,計算矩陣Ai的數據和矩陣W的數據的內積;In an optional solution, each basic processing circuit, such as the i-th basic processing circuit, calculates the inner product of the data of the matrix Ai and the data of the matrix W;

步驟S203b,每個基礎處理電路的累加器電路將內積運算的結果進行累加並傳輸回主處理電路。In step S203b, the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit.

在一種可選方案中,基礎處理電路可以將每次執行內積運算得到的部分和傳輸回主處理電路進行累加;In an optional solution, the basic processing circuit may transfer the partial sum obtained by performing the inner product operation each time back to the main processing circuit for accumulation;

在一種可選方案中,也可以將每次基礎處理電路執行的內積運算得到的部分和保存在基礎處理電路的寄存器和/或片上緩存中,累加結束之後傳輸回主處理電路;In an optional solution, the partial sum obtained by the inner product operation performed by the basic processing circuit each time may also be stored in the register and/or on-chip cache of the basic processing circuit, and transferred back to the main processing circuit after the accumulation is completed;

在一種可選方案中，也可以將每次基礎處理電路執行的內積運算得到的部分和在部分情況下保存在基礎處理電路的寄存器和/或片上緩存中進行累加，部分情況下傳輸到主處理電路進行累加，累加結束之後傳輸回主處理電路。In an optional solution, the partial sums obtained from the inner product operations performed by the basic processing circuit can, in some cases, be stored in the registers and/or on-chip cache of the basic processing circuit for accumulation, and in other cases be transmitted to the main processing circuit for accumulation, and transmitted back to the main processing circuit after the accumulation is completed.
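The matrix-times-matrix flow of steps S201b to S203b can be sketched as follows. Functionally, each output row is the set of inner products of one row of S with the columns of the broadcast matrix W; the row-to-circuit assignment is indicated only in a comment, since the circuits work independently on disjoint rows.

```python
def matrix_times_matrix(S, W, K=4):
    """Model of S201b-S203b: rows of S are distributed over K circuits,
    W is broadcast, and each circuit emits its rows of the product."""
    N = len(W[0])  # W is L rows by N columns
    result = [[0] * N for _ in range(len(S))]
    for idx, row in enumerate(S):       # row idx resides on circuit idx % K
        for j in range(N):              # one inner product per column of W
            result[idx][j] = sum(row[l] * W[l][j] for l in range(len(W)))
    return result
```

As in the matrix-times-vector sketch, the sequential loops here stand in for K circuits computing in parallel.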

參閱圖3a,使用如圖1a所示的裝置完成全連接運算:Referring to Figure 3a, use the device shown in Figure 1a to complete the full connection operation:

如果全連接層的輸入數據是一個向量（即神經網絡的輸入是單個樣本的情況），則以全連接層的權值矩陣作為矩陣S，輸入向量作為向量P，按照所述裝置的使用方法執行如圖2所示的矩陣乘向量的運算；If the input data of the fully connected layer is a vector (i.e., the case where the input of the neural network is a single sample), the weight matrix of the fully connected layer is taken as the matrix S and the input vector as the vector P, and the matrix-times-vector operation shown in Figure 2 is executed according to the usage method of the device;

如果全連接層的輸入數據是一個矩陣（即神經網絡的輸入是多個樣本作為batch的情況），則以全連接層的權值矩陣作為矩陣S，輸入向量作為矩陣W，或者以全連接層的權值矩陣作為矩陣W，輸入向量作為矩陣S，按照所述裝置的使用方法執行如圖2c所示的矩陣乘矩陣的運算；If the input data of the fully connected layer is a matrix (i.e., the case where the input of the neural network is multiple samples forming a batch), the weight matrix of the fully connected layer is taken as the matrix S and the input vector as the matrix W, or the weight matrix of the fully connected layer is taken as the matrix W and the input vector as the matrix S, and the matrix-times-matrix operation shown in Figure 2c is executed according to the usage method of the device;

參閱圖3b,使用如圖1a所示的裝置完成卷積運算:Referring to Figure 3b, the convolution operation is performed using the setup shown in Figure 1a:

對於一個卷積層,記其卷積核的數量為M;For a convolutional layer, record the number of convolution kernels as M;

步驟S301，主處理電路的控制電路將卷積層權值中的每一個卷積核的權值分發到K個基礎處理電路中的某一個上，保存在基礎處理電路的片上緩存和/或寄存器中；Step S301: the control circuit of the main processing circuit distributes the weights of each convolution kernel among the convolution layer weights to one of the K basic processing circuits, where they are stored in the on-chip cache and/or registers of the basic processing circuit;

在一種可選方案中，如果卷積核的個數M<=K，則主處理電路的控制電路給M個基礎處理電路分別分發一個卷積核的權值；In an optional solution, if the number of convolution kernels M<=K, the control circuit of the main processing circuit distributes the weights of one convolution kernel to each of M basic processing circuits;

在一種可選方案中,如果卷積核的個數M>K,主處理電路的控制電路給每個基礎處理電路分別分發一個或多個卷積核的權值。In an optional solution, if the number of convolution kernels M>K, the control circuit of the main processing circuit distributes weights of one or more convolution kernels to each basic processing circuit.

共有Mi個卷積核分發到第i個基礎處理電路,這些卷積核權值的集合稱為Ai。A total of Mi convolution kernels are distributed to the i-th basic processing circuit, and the set of these convolution kernel weights is called Ai.

在一種可選方案中,在每個基礎處理電路中,例如第i個基礎處理電路中:In an optional solution, in each basic processing circuit, for example, in the i-th basic processing circuit:

將收到的由主處理電路分發的卷積核權值Ai保存在其寄存器和/或片上緩存中;Save the received convolution kernel weight Ai distributed by the main processing circuit in its register and/or on-chip cache;

步驟S302,主處理電路的控制電路將輸入數據T中各部分以廣播的方式傳輸給各個基礎處理電路;Step S302, the control circuit of the main processing circuit transmits each part of the input data T to each basic processing circuit in a broadcast manner;

在一種可選方案中,主處理電路的控制電路可以將輸入數據T中各部分只廣播一次到各個基礎處理電路的寄存器或者片上緩存中,第i個基礎處理電路對這一次得到的輸入數據T的數據進行充分地復用,完成對應與Ai中每一個卷積核的內積運算;In an optional solution, the control circuit of the main processing circuit can broadcast each part of the input data T to the registers or on-chip caches of each basic processing circuit only once, and the i-th basic processing circuit performs a The data is fully multiplexed, and the inner product operation corresponding to each convolution kernel in Ai is completed;

在一種可選方案中,主處理電路的控制電路可以將輸入數據T中各部分多次廣播到各個基礎處理電路的寄存器或者片上緩存中,第i個基礎處理電路對每次得到的輸入數據T的數據不進行復用,分次完成對應於Ai中的每一個卷積核的內積運算;In an optional solution, the control circuit of the main processing circuit can broadcast each part of the input data T to the registers or on-chip buffers of each basic processing circuit for multiple times, and the i-th basic processing circuit performs each input data T The data of is not multiplexed, and the inner product operation corresponding to each convolution kernel in Ai is completed in stages;

在一種可選方案中,主處理電路的控制電路可以將輸入數據T中各部分多次廣播到各個基礎處理電路的寄存器或者片上緩存中,第i個基礎處理電路對每次得到的輸入數據T的數據進行部分復用,完成對應於Ai中的每一個卷積核的內積運算;In an optional solution, the control circuit of the main processing circuit can broadcast each part of the input data T to the registers or on-chip buffers of each basic processing circuit for multiple times, and the i-th basic processing circuit performs each input data T The data of Ai is partially multiplexed, and the inner product operation corresponding to each convolution kernel in Ai is completed;

步驟S303,每個基礎處理電路計算卷積核和輸入數據T的數據內積,例如第i個基礎處理電路,計算Ai的每一個卷積核和輸入數據T的數據的內積;Step S303, each basic processing circuit calculates the data inner product of the convolution kernel and the input data T, for example, the i-th basic processing circuit calculates the inner product of each convolution kernel of Ai and the data of the input data T;

步驟S304,每個基礎處理電路的累加器電路將內積運算的結果進行累加並傳輸回主處理電路:Step S304, the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit:

在一種可選方案中,可基礎處理電路以將每次執行內積運算得到的部分和傳輸回主處理電路進行累加;In an optional solution, the basic processing circuit can transfer the partial sum obtained by performing the inner product operation each time back to the main processing circuit for accumulation;

在一種可選方案中,基礎處理電路也可以將每次執行的內積運算得到的部分和保存在基礎處理電路的寄存器和/或片上緩存中,累加結束之後傳輸回主處理電路;In an optional solution, the basic processing circuit may also store the partial sum obtained by the inner product operation performed each time in the register and/or on-chip cache of the basic processing circuit, and transfer it back to the main processing circuit after the accumulation is completed;

在一種可選方案中,基礎處理電路也可以將每次執行的內積運算得到的部分和在部分情況下保存在基礎處理電路的寄存器和/或片上緩存中進行累加,部分情況下傳輸到主處理電路進行累加,累加結束之後傳輸回主處理電路;In an optional solution, the basic processing circuit can also store the sum of the inner product operation performed each time in the register and/or on-chip cache of the basic processing circuit for accumulation, and in some cases transfer it to the main The processing circuit performs accumulation, and after the accumulation is completed, it is transmitted back to the main processing circuit;
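The data flow of steps S301 to S304 can be sketched as follows. This is a minimal Python model, not the apparatus itself: the circuits are modeled as plain functions, the kernels are assumed to be already flattened into vectors matching one input window of T, and the round-robin distribution policy is an illustrative assumption.

```python
def distribute_kernels(kernels, K):
    # S301: deal the M kernel weight sets onto K basic processing circuits;
    # the list at index i models the set Ai held by the i-th circuit
    groups = [[] for _ in range(K)]
    for m, kernel in enumerate(kernels):
        groups[m % K].append(kernel)
    return groups

def inner_product(a, b):
    # S303: multiply corresponding elements and accumulate
    return sum(x * y for x, y in zip(a, b))

def convolve(kernels, window, K):
    # S302: the input window is broadcast once and fully reused by each circuit
    groups = distribute_kernels(kernels, K)
    results = []
    for Ai in groups:                     # each basic processing circuit
        for kernel in Ai:                 # one inner product per kernel in Ai
            results.append(inner_product(kernel, window))
    return results                        # S304: sums returned to the main circuit

kernels = [[1, 0, -1], [2, 2, 2]]         # M = 2 flattened kernels (assumed data)
window = [3, 4, 5]                        # one flattened window of the input T
print(convolve(kernels, window, K=4))     # [-2, 24]
```

With M <= K each circuit holds at most one kernel, matching the first optional scheme above; with M > K the modulo distribution gives each circuit one or more kernels, matching the second.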

Method of updating weights using the apparatus shown in Figure 1a:

The vector operator circuit of the main processing circuit is used to implement the weight update function of the neural network training process; specifically, weight update refers to the method of updating the weights using the gradients of the weights.

In an optional scheme, the vector operator circuit of the main processing circuit performs addition and subtraction on the two vectors, the weights and the weight gradients, to obtain an operation result, and that operation result is the updated weights.

In an optional scheme, the vector operator circuit of the main processing circuit first multiplies or divides the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values, and then performs addition and subtraction on the intermediate weights and intermediate weight gradient values to obtain the operation result, which is the updated weights.

In an optional scheme, a set of momentum values may first be computed using the weight gradients, and the momentum values are then added to or subtracted from the weights to obtain the updated weights;
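The plain and momentum-based update schemes above can be sketched as follows; this is an illustrative Python model of the vector operator's scale and add/subtract steps only, and the learning rate lr and momentum coefficient mu are assumed hyperparameters not fixed by the text.

```python
def sgd_update(weights, grads, lr=0.5):
    # weights minus scaled gradients: one add/subtract pass of the vector unit
    return [w - lr * g for w, g in zip(weights, grads)]

def momentum_update(weights, grads, velocity, lr=0.5, mu=0.9):
    # first compute a set of momentum values from the weight gradients,
    # then add/subtract them with the weights to obtain the updated weights
    velocity = [mu * v + g for v, g in zip(velocity, grads)]
    weights = [w - lr * v for w, v in zip(weights, velocity)]
    return weights, velocity

w, g = [1.0, 2.0], [0.5, -0.5]
print(sgd_update(w, g))                      # [0.75, 2.25]
w2, v2 = momentum_update(w, g, [0.0, 0.0])   # first step: velocity starts at zero
print(w2, v2)                                # [0.75, 2.25] [0.5, -0.5]
```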

Method of implementing the backward operation of a fully connected layer using the apparatus shown in Figure 1a

The backward operation of a fully connected layer can be divided into two parts: in Figure 4a below, the solid arrows indicate the forward computation process of the fully connected layer, and Figure 4b shows the backward computation process of the fully connected layer.

The backward operation of the fully connected layer shown in Figures 4a and 4b can be completed using the apparatus shown in Figure 1a with the matrix-times-matrix method shown in Figure 2b;

The apparatus shown in Figure 1a is used to implement the backward operation of a convolutional layer;

The backward operation of a convolutional layer can be divided into two parts: in Figure 5a below, the solid arrows indicate the forward computation process of the convolutional layer, and Figure 5b shows the backward computation process of the convolutional layer.

The backward operation of the convolutional layer shown in Figures 5a and 5b can be completed using the apparatus shown in Figure 1a with the method shown in Figure 3b.

Method of implementing BLAS (Basic Linear Algebra Subprograms) functions using the apparatus shown in Figure 1a

GEMM computation refers to the matrix-matrix multiplication operation in the BLAS library. The operation is usually expressed as: C = alpha*op( S )*op( P ) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op denotes some operation on matrix S or P; in addition, some auxiliary integers serve as parameters to specify the width and height of matrices S and P;

The steps of implementing GEMM computation using the apparatus of Figure 1a include:

the data type conversion operation circuit of the main processing circuit may perform data type conversion on matrix S and matrix P;

the conversion circuit of the main processing circuit performs the respective op operations on the input matrices S and P;

In an optional scheme, op may be a matrix transposition operation; the matrix transposition operation can be implemented using the matrix transposition circuit of the main processing circuit;

In an optional scheme, after the op operations on matrix S and matrix P have been performed, the data conversion operation circuit of the main processing circuit may additionally perform a data type conversion operation, i.e., the data conversion operation circuit converts the data types of op( S ) and op( P ) from floating-point data to fixed-point data, and then the matrix multiplication operation shown in Figure 2b is performed.

In an optional scheme, the op of a given matrix may be empty, in which case the op operation is not performed;

the matrix multiplication of op(S) and op(P) is completed using the apparatus shown in Figure 1a with the matrix-times-matrix computation method described in Figure 2b;

the arithmetic logic unit of the main processing circuit multiplies each value in the result of op(S)*op(P) by alpha;

In an optional scheme, when alpha is 1, the multiplication by alpha is not performed;

the arithmetic logic unit of the main processing circuit implements the operation beta*C;

In an optional scheme, when beta is 1, the multiplication by beta is not performed;

the vector operator circuit of the main processing circuit adds the corresponding positions of the matrices alpha*op( S )*op( P ) and beta*C to obtain the result of the GEMM computation.

In an optional scheme, when beta is 0, this addition step is not performed;
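The GEMM steps above can be sketched in Python as follows. op is modeled as an optional transpose (an empty op is a no-op), plain nested lists stand in for the device's matrices, and the alpha = 1 / beta = 1 / beta = 0 shortcut cases are left to the general path since they yield the same result.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    # inner products of the rows of A with the columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gemm(alpha, S, P, beta, C, op_s=None, op_p=None):
    S = transpose(S) if op_s == "T" else S   # op may be empty (skipped)
    P = transpose(P) if op_p == "T" else P
    SP = matmul(S, P)
    # element-wise: alpha * op(S)*op(P) + beta * C
    return [[alpha * sp + beta * c for sp, c in zip(r1, r2)]
            for r1, r2 in zip(SP, C)]

S = [[1, 2], [3, 4]]
P = [[1, 0], [0, 1]]           # identity, so op(S)*op(P) == S here
C = [[1, 1], [1, 1]]
print(gemm(2, S, P, 1, C))     # [[3, 5], [7, 9]]
```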

GEMV computation refers to the matrix-vector multiplication operation in the BLAS library. The operation is usually expressed as: C = alpha*op( S )*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation on matrix S;

The steps of implementing GEMV computation using the apparatus of Figure 1a are:

the data type conversion operation circuit of the main processing circuit may perform data type conversion on the input matrix S and the vector P;

the conversion circuit of the main processing circuit performs the corresponding op operation on the input matrix S;

In an optional scheme, op may be a matrix transposition operation; the conversion circuit of the main processing circuit is used to implement the matrix transposition operation;

In an optional scheme, the op of a given matrix may be empty, in which case the transposition operation is not performed;

the matrix-vector multiplication of matrix op(S) and vector P is completed using the apparatus shown in Figure 1a with the matrix-times-vector computation method described in Figure 2a;

the arithmetic logic unit of the main processing circuit multiplies each value in the result of op(S)*P by alpha;

In an optional scheme, when alpha is 1, the multiplication by alpha is not performed;

the arithmetic logic unit of the main processing circuit implements the operation beta*C;

In an optional scheme, when beta is 1, the multiplication by beta is not performed;

the vector operator circuit of the main processing circuit adds the corresponding positions of alpha*op( S )*P and beta*C to obtain the GEMV result.

In an optional scheme, when beta is 0, the addition step is not performed;
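The corresponding GEMV sketch, with P and C as vectors; again op is modeled as an optional transpose, and the scalar shortcut cases are left to the general path. An illustrative model only, not the device's arithmetic.

```python
def gemv(alpha, S, P, beta, C, op_s=None):
    if op_s == "T":                        # op may be a transpose, or empty
        S = [list(row) for row in zip(*S)]
    SP = [sum(s * p for s, p in zip(row, P)) for row in S]   # op(S)*P
    return [alpha * sp + beta * c for sp, c in zip(SP, C)]

S = [[1, 2], [3, 4]]
print(gemv(2, S, [1, 1], 1, [10, 20]))          # [16, 34]
print(gemv(1, S, [1, 0], 0, [0, 0], op_s="T"))  # [1, 2]
```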

Method of implementing an activation function using the apparatus of Figure 1a

A vector is input to the activation circuit of the main processing circuit, and the activation vector of that vector is computed;

In an optional scheme, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (whose input is a single number and whose output is also a single number), computing a number that is output to the corresponding position of the output vector;

In an optional scheme, the activation function may be: y = max(m, x), where x is the input value, y is the output value, and m is a constant;

In an optional scheme, the activation function may be: y = tanh(x), where x is the input value and y is the output value;

In an optional scheme, the activation function may be: y = sigmoid(x), where x is the input value and y is the output value;

In an optional scheme, the activation function may be a piecewise linear function;

In an optional scheme, the activation function may be any function that takes a number as input and outputs a number.
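The element-wise activation described above can be sketched as follows: each scalar of the input vector passes through a single-number-in, single-number-out function, here using the max(m, x), tanh, and sigmoid variants listed in the text; the sample input values are illustrative.

```python
import math

def activate(vec, fn):
    # apply fn to each value, writing the result to the corresponding position
    return [fn(x) for x in vec]

relu_like = lambda x, m=0.0: max(m, x)            # y = max(m, x)
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))    # y = sigmoid(x)

print(activate([-2.0, 0.0, 3.0], relu_like))      # [0.0, 0.0, 3.0]
print(activate([0.0], math.tanh))                 # [0.0]
print(activate([0.0], sigmoid))                   # [0.5]
```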

In an optional scheme, the sources of the input vector include (but are not limited to):

an external data source of the apparatus;

In an optional scheme, the input data comes from the result of a matrix-times-vector operation performed by the apparatus;

In an optional scheme, the input data comes from the result of a matrix-times-matrix operation performed by the apparatus;

a computation result of the main processing circuit of the apparatus;

In an optional scheme, the input data comes from the computation result obtained after the main processing circuit of the apparatus applies a bias.

It should be noted that the above activation operation may be implemented by the arithmetic logic circuit and accumulator circuit within the main processing circuit, or a separate activation circuit may be added to the main processing circuit to implement the activation operation.

The apparatus of Figure 1a is used to implement the bias addition operation:

the vector operator circuit of the main processing circuit can implement the function of adding two vectors or two matrices;

the vector operator circuit of the main processing circuit can implement the function of adding a vector to every row, or to every column, of a matrix.

In an optional scheme, the matrix may come from the result of a matrix-times-matrix operation performed by the apparatus;

In an optional scheme, the matrix may come from the result of a matrix-times-vector operation performed by the apparatus;

In an optional scheme, the matrix may come from data received externally by the main processing circuit of the apparatus.

In an optional scheme, the vector may come from data received externally by the main processing circuit of the apparatus.

The data sources include, but are not limited to, the above.
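The bias step can be sketched as follows: the vector operator adds a bias vector to every row of a matrix, or to every column, where the matrix may be, for example, the result of a matrix multiplication. Nested lists again stand in for the device's data.

```python
def add_bias_rows(M, bias):
    # add the bias vector to each row of the matrix
    return [[x + b for x, b in zip(row, bias)] for row in M]

def add_bias_cols(M, bias):
    # add the bias vector down each column (one bias value per row)
    return [[x + b for x in row] for row, b in zip(M, bias)]

M = [[1, 2], [3, 4]]
print(add_bias_rows(M, [10, 20]))   # [[11, 22], [13, 24]]
print(add_bias_cols(M, [10, 20]))   # [[11, 12], [23, 24]]
```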

The apparatus of Figure 1a is used to implement data type conversion:

the data type conversion operation circuit of the main processing circuit is used to implement the conversion of data types;

In an optional scheme, the data type conversion operation circuit of the main processing circuit performs data type conversion on a set of data;

In an optional scheme, the forms of data type conversion include, but are not limited to: converting floating-point numbers to fixed-point numbers, converting fixed-point numbers to floating-point numbers, and so on;
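Float-to-fixed and fixed-to-float conversion can be sketched as follows; the fixed-point format used here (number of fractional bits, rounding mode, no saturation) is an assumption, since the text leaves the format open.

```python
def float_to_fixed(x, frac_bits=8):
    # scale by 2**frac_bits and round to the nearest integer
    return int(round(x * (1 << frac_bits)))

def fixed_to_float(q, frac_bits=8):
    # inverse scaling back to a floating-point value
    return q / (1 << frac_bits)

q = float_to_fixed(1.5)         # 384 with 8 fractional bits
print(q, fixed_to_float(q))     # 384 1.5
```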

The present invention also provides a chip that contains a computing device, the computing device comprising:

a main processing circuit, in which the data involved may be of any data type; in an optional scheme, it may be data represented by floating-point numbers of any bit width, or data represented by fixed-point numbers of any bit width; all of the operation circuits and storage circuits involved may be operation circuits and storage circuits for any data type; in an optional scheme, they may be operation circuits and storage circuits for floating-point numbers of any bit width, or for fixed-point numbers of any bit width.

In an optional scheme, the main processing circuit includes a data type conversion operation circuit;

In an optional scheme, the main processing circuit includes a vector operation unit that performs data type conversion;

Specifically, it contains a data input interface for receiving input data;

In an optional scheme, the source of the received data may be: the exterior of the neural network operation circuit apparatus, or some or all of the basic processing circuits of the neural network operation circuit apparatus;

In an optional scheme, there may be multiple data input interfaces; specifically, a data output interface for outputting data may be included;

In an optional scheme, the destination of the output data may be: the exterior of the neural network operation apparatus, or some or all of the basic processing circuits of the neural network operation circuit apparatus;

In an optional scheme, there may be multiple data output interfaces;

In an optional scheme, the main processing circuit includes an on-chip cache and/or registers;

In an optional scheme, the main processing circuit contains an operation unit that can perform data operations;

In an optional scheme, the main processing circuit contains an arithmetic operation unit;

In an optional scheme, the main processing circuit contains a vector operation unit that can perform operations on a set of data simultaneously; specifically, the arithmetic operations and/or vector operations may be operations of any type, including but not limited to: addition, subtraction, multiplication, and division of two numbers; addition, subtraction, multiplication, and division of a number and a constant; exponential, power, logarithmic, and various nonlinear operations on a number; comparison and logical operations on two numbers; and so on. Likewise: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of each element of a vector with a constant; exponential, power, logarithmic, and various nonlinear operations on each element of a vector; comparison and logical operations on each pair of corresponding elements of two vectors; and so on.

In an optional scheme, the main processing circuit includes a data rearrangement unit for transmitting data to the basic processing circuits in a certain order, or for rearranging data in place in a certain order;

In an optional scheme, the ordering of the data includes: transforming the dimension order of a multi-dimensional data block; the ordering of the data may also include: partitioning a data block into sub-blocks to be sent to different basic processing circuits.
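The data rearrangement unit's two orderings can be sketched for the two-dimensional case as follows: a dimension-order transform (modeled here as a transpose) and the partitioning of a data block into sub-blocks for different basic processing circuits. Both are illustrative models over nested lists, not the unit's actual mechanism.

```python
def permute_2d(block):
    # dimension-order transform: for a 2-D block, swap the two dimensions
    return [list(col) for col in zip(*block)]

def split_rows(block, n):
    # partition a block row-wise into up to n chunks, one per basic circuit
    k = (len(block) + n - 1) // n
    return [block[i:i + k] for i in range(0, len(block), k)]

B = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(permute_2d(B))      # [[1, 3, 5, 7], [2, 4, 6, 8]]
print(split_rows(B, 2))   # [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
```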

The computing device further includes a plurality of basic processing circuits, each of which is used to compute the inner product of two vectors. The computation method is: for the two sets of numbers received by the basic processing circuit, the corresponding elements of the two sets are multiplied together and the products are accumulated; the inner product result is transmitted out, where, depending on the position of the basic processing circuit, it may be transmitted to other basic processing circuits or directly to the main processing circuit.

The data involved in the basic processing circuits may be of any data type; in an optional scheme, it may be data represented by floating-point numbers of any bit width, or data represented by fixed-point numbers of any bit width; all of the operation circuits and storage circuits involved may be operation circuits and storage circuits for any data type; in an optional scheme, they may be operation circuits and storage circuits for floating-point numbers of any bit width, or for fixed-point numbers of any bit width.

In an optional scheme, the basic processing circuit includes a data type conversion operation circuit;

In an optional scheme, the basic processing circuit includes a vector operation unit that performs data type conversion;

Specifically, it includes a storage unit composed of an on-chip cache and/or registers;

Specifically, it includes one or more data input interfaces for receiving data;

In an optional scheme, it includes two data input interfaces, from each of which one or more data items can be obtained at a time;

In an optional scheme, the basic processing circuit may store input data received from a data input interface in its registers and/or on-chip cache;

The sources of the data received by the above data input interfaces may be other basic processing circuits and/or the main processing circuit, namely:

the main processing circuit of the neural network operation circuit apparatus;

other basic processing circuits of the neural network operation circuit apparatus (the neural network operation circuit apparatus has multiple basic processing circuits);

Specifically, it includes one or more data output interfaces for transmitting output data;

In an optional scheme, one or more data items may be transmitted out through a data output interface;

Specifically, the data transmitted out through a data output interface may be one or any combination of: data received from a data input interface, data stored in the on-chip cache and/or registers, a multiplier operation result, an accumulator operation result, or an inner product operator operation result.

In an optional scheme, it contains three data output interfaces, two of which correspond to the two data input interfaces, each passing on the data received by the previous layer from its data input interface, while the third data output interface is responsible for outputting operation results;

Specifically, the destinations of the data transmitted by the data output interfaces may be the following (the data sources above and the data destinations here determine the connection relationships of the basic processing circuits within the apparatus):

the main processing circuit of the neural network operation circuit apparatus;

other basic processing circuits of the neural network operation circuit apparatus (the neural network operation circuit apparatus has multiple basic processing circuits);

Specifically, it includes an arithmetic operation circuit, which may specifically be one or any combination of: one or more multiplier circuits, one or more accumulator circuits, and one or more circuits that perform the inner product operation of two sets of numbers.

In an optional scheme, a multiplication of two numbers can be performed, and the result can be stored in the on-chip cache and/or registers, or accumulated directly into the registers and/or on-chip cache;

In an optional scheme, the inner product operation of two sets of data can be performed, and the result can be stored in the on-chip cache and/or registers, or accumulated directly into the registers and/or on-chip cache;

In an optional scheme, a data accumulation operation can be performed, accumulating data into the on-chip cache and/or registers;

Specifically, the data accumulated by the accumulator circuit may be one or any combination of: data received from a data input interface, data stored in the on-chip cache and/or registers, a multiplier operation result, an accumulator operation result, or an inner product operator operation result.

It should be noted that the terms "data input interface" and "data output interface" used in the above description of the basic processing circuit refer to the data input and output interfaces of each basic processing circuit, rather than the data input and output interfaces of the apparatus as a whole.

The present disclosure also discloses a neural network operation apparatus, which includes one or more chips as shown in Figure 1a or Figure 1b, used to obtain data to be operated on and control information from other processing devices and to perform specified neural network operations; the execution results are transferred to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one chip as shown in Figure 1a or Figure 1b is included, the chips can be linked and transfer data through a specific structure, for example, interconnected and transferring data through a PCIE bus, so as to support larger-scale neural network operations. In this case, the chips may share the same control system or have their own independent control systems; they may share memory, or each accelerator may have its own memory. Furthermore, their interconnection may be any interconnection topology.

The neural network operation apparatus has high compatibility and can be connected to various types of servers through a PCIE interface.

The present disclosure also discloses a combined processing apparatus, which includes the above neural network operation apparatus, a universal interconnection interface, and another processing device (i.e., a general-purpose processing device). The neural network operation apparatus interacts with the other processing device to jointly complete operations specified by the user. Figure 4c below is a schematic diagram of the combined processing apparatus.

The other processing device includes one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor. The number of processors included in the other processing device is not limited. The other processing device serves as the interface between the neural network operation apparatus and external data and control, including data transfer, and completes basic control of the neural network operation apparatus such as starting and stopping; the other processing device may also cooperate with the neural network operation apparatus to jointly complete operation tasks.

The universal interconnection interface is used to transfer data and control instructions between the neural network operation apparatus and the other processing device. The neural network operation apparatus obtains the required input data from the other processing device and writes it into an on-chip storage device of the neural network operation apparatus; it may obtain control instructions from the other processing device and write them into an on-chip control cache of the neural network operation apparatus; it may also read the data in the storage module of the neural network operation apparatus and transfer it to the other processing device.

As shown in Figure 4d, optionally, the structure further includes a storage device for saving the data required by this operation unit/operation apparatus or by other operation units; it is especially suitable for data to be operated on that cannot be fully stored in the internal storage of this neural network operation apparatus or of the other processing device.

The combined processing apparatus can serve as an SoC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the universal interconnection interface of the combined processing apparatus is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.

本披露實施例提供了一種神經網絡處理器板卡,可用於眾多通用或專用的計算系統環境或配置中。例如:個人計算機、服務器計算機、手持設備或便攜式設備、平板型設備、智能家居、家電、多處理器系統、基於微處理器的系統、機器人、可編程的消費電子設備、網絡個人計算機(personal computer,PC)、小型計算機、大型計算機、包括以上任何系統或設備的分布式計算環境等等。The embodiment of the present disclosure provides a neural network processor board, which can be used in many general-purpose or special-purpose computing system environments or configurations. Examples: personal computers, server computers, handheld or portable devices, tablet-type devices, smart homes, home appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronics, networked personal computers , PC), minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, etc.

Please refer to Fig. 5c, a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure. As shown in Fig. 5c, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.

The present disclosure does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in Fig. 5d, the neural network chip package structure 11 includes a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.

The specific form of the neural network chip 111 involved in this disclosure is not limited. It includes, but is not limited to, a die integrating a neural network processor, and the die may be made of silicon, germanium, quantum materials, molecular materials, or the like. Depending on the actual situation (for example, a harsh environment) and different application requirements, the die can be packaged so that most of it is enclosed, with the pins on the die connected to the outside of the package structure through conductors such as gold wires, for electrical connection to outer layers of circuitry.

The present disclosure does not limit the specific structure of the neural network chip 111; optionally, refer to the devices shown in Fig. 1a or Fig. 1b.

The types of the first substrate 13 and the second substrate 113 are not limited; they may be printed circuit boards (PCBs), printed wiring boards (PWBs), or other circuit boards. The materials used to fabricate the PCB are likewise not limited.

The second substrate 113 carries the neural network chip 111. The neural network chip package structure 11, obtained by connecting the neural network chip 111 to the second substrate 113 through the second electrical and non-electrical connection device 112, protects the neural network chip 111 and facilitates further packaging of the package structure 11 with the first substrate 13.

The packaging method used by the second electrical and non-electrical connection device 112, and the structure corresponding to that method, are not limited; a suitable packaging method may be selected and adapted according to the actual situation and application requirements, for example: Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA).

Flip-chip mounting is suitable where a small footprint after packaging is required, or where lead inductance and signal transmission time are critical. Alternatively, wire bonding can be used to reduce cost and increase the flexibility of the package structure.

A ball grid array provides more pins with a short average lead length, supporting high-speed signal transfer. The package may instead use a Pin Grid Array (PGA), a Zero Insertion Force (ZIF) socket, a Single Edge Contact Connection (SECC), a Land Grid Array (LGA), and so on.

Optionally, the neural network chip 111 and the second substrate 113 are packaged using a Flip Chip Ball Grid Array; a schematic diagram of the resulting neural network chip package structure is shown in Fig. 6. As shown in Fig. 6, the package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.

The pads 22 are connected to the neural network chip 21; solder balls 23, formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, connect the neural network chip 21 to the second substrate 24, thereby completing the packaging of the neural network chip 21.

The pins 26 connect to circuitry outside the package structure (for example, the first substrate 13 of the neural network processor board 10), enabling transfer of external and internal data so that the neural network chip 21, or the neural network processor it embodies, can process the data. The type and number of pins are not limited; different pin forms can be selected according to the packaging technology used and arranged according to its rules.

Optionally, the package structure further includes an insulating filler placed in the gaps among the pads 22, solder balls 23, and connection points 25 to prevent interference between solder balls.

The insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.

Optionally, the package structure further includes a heat dissipation device for removing the heat generated by the neural network chip 21 during operation, for example a metal plate with good thermal conductivity, a heat sink, or a cooler such as a fan.

For example, as shown in Fig. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal paste 28, and a metal-housing heat sink 29. The thermal paste 28 and the metal-housing heat sink 29 dissipate the heat generated by the neural network chip 21 during operation.

Optionally, the package structure 11 further includes a reinforcement structure connected to the pads 22 and embedded in the solder balls 23, to strengthen the connection between the solder balls 23 and the pads 22.

The reinforcement structure may be a metal-wire structure or a columnar structure; it is not limited here.

The specific form of the first electrical and non-electrical connection device 12 is likewise not limited; refer to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be mounted by soldering, or the second substrate 113 and the first substrate 13 may be joined by connecting wires or by a pluggable connection, which makes it easy to later replace the first substrate 13 or the neural network chip package structure 11.

Optionally, the first substrate 13 includes interfaces for memory units that expand storage capacity, for example Synchronous Dynamic Random Access Memory (SDRAM) or Double Data Rate SDRAM (DDR); expanding the memory increases the processing capacity of the neural network processor.

The first substrate 13 may also include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) bus interface, and so on, for data transfer between the package structure and external circuitry, improving computing speed and ease of operation.

The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the package structure 11 is mounted on the neural network processor board 10. The board exchanges data with external circuitry (for example, a computer motherboard) through its interface (a slot or connector), so the functions of the neural network processor are realized directly by using the board 10 while the neural network chip 111 is protected. Further modules can also be added to the board 10, widening the application range and improving the computing efficiency of the neural network processor.

In one embodiment, the present disclosure provides an electronic device that includes the neural network processor board 10 or the neural network chip package structure 11 described above.

Electronic devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.

The vehicles include airplanes, ships, and/or road vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes magnetic resonance imaging machines, B-mode ultrasound machines, and/or electrocardiographs.

The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the above are only specific embodiments of the present disclosure and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of this disclosure shall fall within its scope of protection.

Reference numerals: S201, S202, S203, S204, S201b, S202b, S203b, S301, S302, S303, S304: steps; A, Ai, B, S: matrices; P: vector; 10: neural network processor board; 11: neural network chip package structure; 12: first electrical and non-electrical connection device; 13: first substrate; 111: neural network chip; 112: second electrical and non-electrical connection device; 113: second substrate; 1111: storage unit; 1112: direct memory access unit; 1113: instruction cache unit; 1114: weight cache unit; 1115: input neuron cache unit; 1116: output neuron cache unit; 1117: control unit; 1118: operation unit; 21: neural network chip; 22: pads; 23: solder balls; 24: second substrate; 25: connection points on the second substrate 24; 26: pins; 27: insulating filler; 28: thermal paste; 29: metal-housing heat sink

Fig. 1a is a schematic structural diagram of an integrated circuit chip device.

Fig. 1b is a schematic structural diagram of another integrated circuit chip device.

Fig. 1c is a schematic structural diagram of a basic processing circuit.

Fig. 1d is a schematic structural diagram of a fixed-point data type.

Fig. 2 is a schematic flow chart of a matrix-times-vector operation.

Fig. 2a is a schematic diagram of a matrix multiplied by a vector.

Fig. 2b is a schematic flow chart of a matrix-times-matrix operation.

Fig. 2c is a schematic diagram of matrix Ai multiplied by a vector.

Fig. 2d is a schematic diagram of matrix A multiplied by matrix B.

Fig. 2e is a schematic diagram of matrix Ai multiplied by matrix B.

Fig. 3a is a schematic diagram of neural network training.

Fig. 3b is a schematic diagram of a convolution operation.

Fig. 4a is a schematic diagram of the forward operation of a neural network.

Fig. 4b is a schematic diagram of the backward operation of a neural network.

Fig. 4c is a schematic structural diagram of a combined processing device also disclosed herein.

Fig. 4d is a schematic diagram of another structure of the combined processing device disclosed herein.

Fig. 5a is a schematic diagram of another forward operation of a neural network.

Fig. 5b is a schematic diagram of another backward operation of a neural network.

Fig. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure;

Fig. 5d is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure;

Fig. 5e is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure;

Fig. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure;

Fig. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.

Claims (21)

An integrated circuit chip device, comprising a main processing circuit and a plurality of basic processing circuits, wherein the main processing circuit, or at least one of the basic processing circuits, includes a data type operation circuit for converting between floating-point data and fixed-point data according to the amount of computation and the amount of data storage; the main processing circuit performs the successive operations of a neural network computation and transfers data to and from the basic processing circuits; and the plurality of basic processing circuits perform operations of the neural network in parallel on the data transferred by the main processing circuit and transmit the operation results to the main processing circuit. The integrated circuit chip device of claim 1, further comprising a branch processing circuit disposed between the main processing circuit and at least one basic processing circuit, for forwarding data between the main processing circuit and the at least one basic processing circuit. 
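The data type operation circuit of claim 1 converts between floating-point and fixed-point representations. A minimal software sketch of such a conversion follows; the 16-bit width, 8 fractional bits, and function names are illustrative assumptions, not the patented circuit:

```python
def float_to_fixed(values, frac_bits=8, total_bits=16):
    """Quantize floats to signed fixed-point integers with `frac_bits`
    fractional bits, saturating at the representable range (assumed policy)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]

def fixed_to_float(values, frac_bits=8):
    """Recover approximate floats from fixed-point integers."""
    scale = 1 << frac_bits
    return [v / scale for v in values]

# Round trip with 8 fractional bits: 1.5 -> 384 -> 1.5
fixed = float_to_fixed([1.5, -0.25])
print(fixed)                  # [384, -64]
print(fixed_to_float(fixed))  # [1.5, -0.25]
```

Values exactly representable at the chosen precision round-trip losslessly; out-of-range values saturate rather than wrap.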
The integrated circuit chip device of claim 1, wherein the main processing circuit obtains a data block to be computed and an operation instruction, converts the data block to be computed into a fixed-point data block through the data type operation circuit, divides the fixed-point data block into a distribution data block and a broadcast data block according to the operation instruction, splits the distribution data block into a plurality of basic data blocks, distributes the basic data blocks to the at least one basic processing circuit, and broadcasts the broadcast data block to the at least one basic processing circuit; the basic processing circuit performs an inner product operation on its basic data block and the broadcast data block in fixed-point form to obtain an operation result and sends the operation result to the main processing circuit; and the main processing circuit processes the operation results to obtain the instruction result of the operation instruction for the data block to be computed. The integrated circuit chip device of claim 2 or 3, wherein the main processing circuit broadcasts the broadcast data block to the plurality of basic processing circuits in a single broadcast. 
The integrated circuit chip device of claim 4, wherein the basic processing circuit performs inner product processing on the basic data block and the broadcast data block in fixed-point form to obtain an inner product result, accumulates the inner product results into an operation result, and sends the operation result to the main processing circuit. The integrated circuit chip device of claim 4, wherein the main processing circuit, when the operation result is the result of inner product processing, accumulates the operation results to obtain an accumulated result, and arranges the accumulated results to obtain the instruction result of the operation instruction for the data block to be computed. The integrated circuit chip device of claim 2 or 3, wherein the main processing circuit divides the broadcast data block into a plurality of partial broadcast data blocks and broadcasts the partial broadcast data blocks to the plurality of basic processing circuits over multiple broadcasts. 
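The distribute/broadcast/inner-product scheme of claims 3-6 maps naturally onto a row-wise split for a matrix-times-vector operation: the matrix is the distribution data block (one row per basic data block), the vector is broadcast once, and each basic processing circuit returns one inner product. The sketch below is an illustrative software model under those assumptions (the names `basic_circuit` and `main_circuit_matvec` are hypothetical), not the hardware itself:

```python
def basic_circuit(basic_block, broadcast_block):
    """One basic processing circuit: inner product of its row slice
    with the broadcast vector, accumulated term by term."""
    return sum(a * b for a, b in zip(basic_block, broadcast_block))

def main_circuit_matvec(matrix, vector):
    """Main processing circuit: split the distribution block into one
    basic data block per row, broadcast the vector once, then collect
    and arrange the per-circuit results into the instruction result."""
    basic_blocks = [row for row in matrix]                      # split step
    results = [basic_circuit(b, vector) for b in basic_blocks]  # parallel in hardware
    return results                                              # arranged result

A = [[1, 2], [3, 4]]
x = [5, 6]
print(main_circuit_matvec(A, x))  # [17, 39]
```

Each list element corresponds to what one basic processing circuit would send back; the main circuit's "arranging" step is simply preserving row order here.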
The integrated circuit chip device of claim 7, wherein the basic processing circuit performs one round of inner product processing on a partial broadcast data block and the basic data block in fixed-point form to obtain an inner product result, accumulates the inner product results into a partial operation result, and sends the partial operation result to the main processing circuit. The integrated circuit chip device of claim 8, wherein the basic processing circuit reuses each partial broadcast data block n times, performing inner product operations between that partial broadcast data block and n basic data blocks to obtain n partial processing results, accumulates them respectively to obtain n partial operation results, and sends the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2. The integrated circuit chip device of claim 1, wherein the main processing circuit includes a main register or a main on-chip cache circuit, and the basic processing circuit includes a basic register or a basic on-chip cache circuit. 
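Claims 7-9 describe sending the broadcast operand in partial blocks and reusing each partial block against several basic data blocks before the next broadcast round, trading broadcast bandwidth for local reuse. A toy schedule under assumed block sizes (the function name and `chunk` parameter are illustrative, not from the patent):

```python
def matvec_partial_broadcast(matrix, vector, chunk=2):
    """Broadcast `vector` in chunks; each chunk (partial broadcast block)
    is reused against the matching column slice of every row (the n-fold
    reuse of claim 9), and per-chunk partial inner products are
    accumulated into each row's partial operation result."""
    n_rows = len(matrix)
    acc = [0] * n_rows                       # partial operation results
    for start in range(0, len(vector), chunk):
        part = vector[start:start + chunk]   # one partial broadcast block
        for i in range(n_rows):              # reuse `part` for every basic block
            acc[i] += sum(a * b
                          for a, b in zip(matrix[i][start:start + chunk], part))
    return acc

A = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 1, 1, 1]
print(matvec_partial_broadcast(A, x))  # [10, 26]
```

The final accumulated values equal the full inner products, so the multi-round broadcast changes only the schedule, not the mathematical result.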
The integrated circuit chip device of claim 10, wherein the main processing circuit includes one or any combination of: a vector operation circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a data type operation circuit, or a data rearrangement circuit. The integrated circuit chip device of claim 1, wherein the main processing circuit obtains the data block to be computed and the operation instruction, divides the data block to be computed into the distribution data block and the broadcast data block according to the operation instruction, splits the distribution data block into a plurality of basic data blocks, distributes the basic data blocks to the at least one basic processing circuit, and broadcasts the broadcast data block to the at least one basic processing circuit; the basic processing circuit converts the basic data block and the broadcast data block into fixed-point data blocks, performs an inner product operation on the fixed-point data blocks to obtain an operation result, converts the operation result into floating-point data, and sends it to the main processing circuit; and the main processing circuit processes the operation results to obtain the instruction result of the operation instruction for the data block to be computed. 
The integrated circuit chip device of claim 2, wherein the branch processing circuit comprises a plurality of branch processing circuits, the main processing circuit is connected to each of the plurality of branch processing circuits, and each branch processing circuit is connected to at least one basic processing circuit. The integrated circuit chip device of claim 1, wherein the data are one or any combination of: vectors, matrices, three-dimensional data blocks, four-dimensional data blocks, and n-dimensional data blocks. The integrated circuit chip device of claim 3, wherein, if the operation instruction is a multiplication instruction, the main processing circuit determines the multiplier data block to be the broadcast data block and the multiplicand data block to be the distribution data block; and if the operation instruction is a convolution instruction, the main processing circuit determines the input data block to be the broadcast data block and the convolution kernel to be the distribution data block. A neural network computing device, comprising one or more integrated circuit chip devices according to any one of claims 1-15. 
A combined processing device, comprising: a neural network computing device according to claim 16, a universal interconnection interface, and a general-purpose processing device; the neural network computing device is connected to the general-purpose processing device through the universal interconnection interface. A chip integrating the integrated circuit chip device of any one of claims 1-15, the neural network computing device of claim 16, or the combined processing device of claim 17. A smart device, comprising the chip of claim 18. A neural network operation method, applied in an integrated circuit chip device comprising the integrated circuit chip device of any one of claims 1-15, the integrated circuit chip device being used to perform neural network operations. The method of claim 18, wherein the neural network operations include one or any combination of: convolution, matrix-matrix multiplication, matrix-vector multiplication, bias, fully connected, GEMM, GEMV, and activation operations. 
TW107144033A 2017-12-14 2018-12-07 Integrated circuit chip apparatus and related product TWI793224B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201711346333.X 2017-12-14
CN201711346333.XA CN109961135B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related product

Publications (2)

Publication Number Publication Date
TW201931219A TW201931219A (en) 2019-08-01
TWI793224B true TWI793224B (en) 2023-02-21

Family

ID=67018606

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107144033A TWI793224B (en) 2017-12-14 2018-12-07 Integrated circuit chip apparatus and related product

Country Status (2)

Country Link
CN (4) CN111160543B (en)
TW (1) TWI793224B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977071A (en) * 2017-12-27 2019-07-05 北京中科寒武纪科技有限公司 Neural network processor board and Related product
CN109978130A (en) * 2017-12-28 2019-07-05 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060811A1 (en) * 2015-04-28 2017-03-02 Intel Corporation Matrix operands for linear algebra operations
US20170102950A1 (en) * 2003-05-23 2017-04-13 Ip Reservoir, Llc Intelligent Data Storage and Processing Using FPGA Devices
TW201734894A (en) * 2014-07-22 2017-10-01 英特爾股份有限公司 Weight-shifting processor, method and system
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing convolutional neural networks forward operation

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0334624A3 (en) * 1988-03-23 1993-03-31 Du Pont Pixel Systems Limited Microcoded computer system
JP3271186B2 (en) * 1989-10-06 2002-04-02 ソニー株式会社 Learning processing device
JPH05346914A (en) * 1992-06-16 1993-12-27 Matsushita Electron Corp Neuro processor
US5583964A (en) * 1994-05-02 1996-12-10 Motorola, Inc. Computer utilizing neural network and method of using same
US5590356A (en) * 1994-08-23 1996-12-31 Massachusetts Institute Of Technology Mesh parallel computer architecture apparatus and associated methods
JP3790307B2 (en) * 1996-10-16 2006-06-28 株式会社ルネサステクノロジ Data processor and data processing system
JP2969115B1 (en) * 1998-11-25 1999-11-02 株式会社日立製作所 Semiconductor device
JP3889195B2 (en) * 1999-02-03 2007-03-07 株式会社東芝 Image processing apparatus, image processing system, and image processing method
GB2369899A (en) * 2000-07-20 2002-06-12 Volodya Vovk Data labelling device and method thereof
US7571303B2 (en) * 2002-10-16 2009-08-04 Akya (Holdings) Limited Reconfigurable integrated circuit
CN100410871C (en) * 2003-07-23 2008-08-13 联发科技股份有限公司 Digital signal processor applying skip type floating number operational method
WO2005111843A2 (en) * 2004-05-11 2005-11-24 Massively Parallel Technologies, Inc. Methods for parallel processing communication
CN101424645B (en) * 2008-11-20 2011-04-20 上海交通大学 Soldered ball surface defect detection device and method based on machine vision
JP5423110B2 (en) * 2009-04-09 2014-02-19 セイコーエプソン株式会社 Information processing apparatus, arithmetic processing method, and electronic apparatus
FR3011659B1 (en) * 2013-10-04 2015-10-16 Commissariat Energie Atomique Electronic circuit, in particular suitable for implementing a neural network, and neural system
CN104572011B (en) * 2014-12-22 2018-07-31 上海交通大学 Universal matrix fixed-point multiplication device based on FPGA and its computational methods
WO2017038104A1 (en) * 2015-09-03 2017-03-09 株式会社Preferred Networks Installation device and installation method
CN105843775B (en) * 2016-04-06 2018-12-04 中国科学院计算技术研究所 On piece data divide reading/writing method, system and its apparatus
WO2017177446A1 (en) * 2016-04-15 2017-10-19 北京中科寒武纪科技有限公司 Apparatus and method supporting discrete data representation for backward training of an artificial neural network
CN107330515A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 Apparatus and method for performing an artificial neural network forward operation
CN106126481B (en) * 2016-06-29 2019-04-12 华为技术有限公司 Computing system and electronic device
CN106447034B (en) * 2016-10-27 2019-07-30 中国科学院计算技术研究所 Neural network processor based on data compression, design method, and chip
CN106502626A (en) * 2016-11-03 2017-03-15 北京百度网讯科技有限公司 Data processing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170102950A1 (en) * 2003-05-23 2017-04-13 Ip Reservoir, Llc Intelligent Data Storage and Processing Using FPGA Devices
TW201734894A (en) * 2014-07-22 2017-10-01 Intel Corporation Weight-shifting processor, method and system
US20170060811A1 (en) * 2015-04-28 2017-03-02 Intel Corporation Matrix operands for linear algebra operations
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 Apparatus and method for performing a convolutional neural network forward operation
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 Method for optimizing an artificial neural network

Also Published As

Publication number Publication date
CN111160543A (en) 2020-05-15
CN111091189A (en) 2020-05-01
CN109961135A (en) 2019-07-02
CN111105024B (en) 2024-03-01
CN111160543B (en) 2023-08-29
TW201931219A (en) 2019-08-01
CN109961135B (en) 2020-06-23
CN111105024A (en) 2020-05-05
CN111091189B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
TWI793225B (en) Method for neural network training and related product
US11748605B2 (en) Integrated circuit chip device
TWI791725B (en) Neural network operation method, integrated circuit chip device and related products
US20230120704A1 (en) Integrated circuit chip apparatus
TWI768159B (en) Integrated circuit chip apparatus and related product
TWI793224B (en) Integrated circuit chip apparatus and related product
TWI767098B (en) Method for neural network forward computation and related product
TWI795482B (en) Integrated circuit chip apparatus and related product
CN110197267B (en) Neural network processor board card and related product
CN109978152B (en) Integrated circuit chip device and related product
WO2019165946A1 (en) Integrated circuit chip device, board card and related product
TWI767097B (en) Integrated circuit chip apparatus and related product
TWI768160B (en) Integrated circuit chip apparatus and related product
CN109978153B (en) Integrated circuit chip device and related product
US11983621B2 (en) Integrated circuit chip device
WO2019165940A1 (en) Integrated circuit chip apparatus, board card and related product