CN109960673A - Integrated circuit chip device and Related product - Google Patents
- Publication number
- CN109960673A (application CN201711343642.1A)
- Authority
- CN
- China
- Prior art keywords
- circuit
- data
- basic processing
- data block
- main processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/382—Information transfer, e.g. on bus using universal interface adapter
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present disclosure provides an integrated circuit chip device and a related product. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. At least one circuit among the main processing circuit and the plurality of basic processing circuits includes a data type conversion circuit for performing conversion between floating-point data and fixed-point data. The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column. The technical solution provided by the present disclosure has the advantages of a small amount of computation and low power consumption.
Description
Technical field
The present disclosure relates to the field of neural networks, and in particular to an integrated circuit chip device and a related product.
Background technique
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neural network of the human brain from an information-processing perspective, establishes a simple model, and forms different networks according to different connection schemes. In engineering and academia it is often referred to simply as a neural network or neural-network-like model. A neural network is a computational model consisting of a large number of interconnected nodes (or neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such operations are computationally intensive and power-hungry.
Summary of the invention
Embodiments of the present disclosure provide an integrated circuit chip device and a related product, which can increase the processing speed and improve the efficiency of a computing device.
In a first aspect, an integrated circuit chip device is provided, including a main processing circuit and a plurality of basic processing circuits.

The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits among the plurality of basic processing circuits, the k basic processing circuits being: the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column.

The plurality of basic processing circuits include a data type conversion circuit for performing conversion between floating-point data and fixed-point data.

The main processing circuit is configured to perform the respective sequential operations in a neural network operation and to transmit data to the k basic processing circuits.

The k basic processing circuits are configured to forward data between the main processing circuit and the remaining basic processing circuits.

The plurality of basic processing circuits are configured to determine, according to the type of the transmitted data and the operation, whether to start the data type conversion circuit to convert the data type of the transmitted data, to perform the operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the k basic processing circuits.
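The array topology of the first aspect can be sketched numerically. The following is an illustrative model (not from the patent text) of the m*n array, identifying the k circuits wired to the main processing circuit: row 1, row m, and column 1. All names are hypothetical, and the set-based deduplication of the shared corner circuits is an assumption.

```python
def k_connected(m, n):
    """Return the (row, col) positions connected to the main processing circuit."""
    first_row = [(1, c) for c in range(1, n + 1)]  # n circuits of row 1
    last_row = [(m, c) for c in range(1, n + 1)]   # n circuits of row m
    first_col = [(r, 1) for r in range(1, m + 1)]  # m circuits of column 1
    # Circuits (1,1) and (m,1) lie on both a border row and column 1;
    # a set counts each physical circuit once.
    return sorted(set(first_row + last_row + first_col))

positions = k_connected(m=4, n=4)
print(len(positions))  # 4 + 4 + 4 border circuits minus the 2 shared corners = 10
```

Every interior circuit exchanges data only with its array neighbours; only the border circuits returned above talk to the main processing circuit directly.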
In a second aspect, a neural network operation device is provided, which includes one or more of the integrated circuit chip devices provided in the first aspect.

In a third aspect, a combined processing device is provided, which includes the neural network operation device provided in the second aspect, a universal interconnection interface, and a general-purpose processing device; the neural network operation device is connected to the general-purpose processing device through the universal interconnection interface.

In a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.

In a fifth aspect, an electronic apparatus is provided, which includes the chip of the fourth aspect.

In a sixth aspect, a neural network operation method is provided, which is applied in an integrated circuit chip device; the integrated circuit chip device includes the integrated circuit chip device described in the first aspect, and the integrated circuit chip device is configured to perform neural network operations.
It can be seen that the embodiments of the present disclosure provide a data type conversion circuit to convert the type of a data block before the operation, which saves transmission resources and computing resources; the solution therefore has the advantages of low power consumption and a small amount of computation.
Brief description of the drawings
Fig. 1a is a schematic structural diagram of an integrated circuit chip device.

Fig. 1b is a schematic structural diagram of another integrated circuit chip device.

Fig. 1c is a schematic structural diagram of a basic processing circuit.

Fig. 1d is a schematic structural diagram of a main processing circuit.

Fig. 1e is a schematic structural diagram of a fixed-point data type.

Fig. 2a is a schematic diagram of an application method of the basic processing circuits.

Fig. 2b is a schematic diagram of data transmission by the main processing circuit.

Fig. 2c is a schematic diagram of matrix-multiply-vector.

Fig. 2d is a schematic structural diagram of an integrated circuit chip device.

Fig. 2e is a schematic structural diagram of another integrated circuit chip device.

Fig. 2f is a schematic diagram of matrix-multiply-matrix.

Fig. 3a is a schematic diagram of convolution input data.

Fig. 3b is a schematic diagram of a convolution kernel.

Fig. 3c is a schematic diagram of an operation window of a three-dimensional data block of the input data.

Fig. 3d is a schematic diagram of another operation window of a three-dimensional data block of the input data.

Fig. 3e is a schematic diagram of yet another operation window of a three-dimensional data block of the input data.

Fig. 4a is a schematic diagram of a neural network forward operation.

Fig. 4b is a schematic diagram of a neural network backward operation.

Fig. 4c is a schematic structural diagram of a combined processing device also disclosed by the present disclosure.

Fig. 4d is another schematic structural diagram of a combined processing device also disclosed by the present disclosure.

Fig. 5a is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure;

Fig. 5b is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure;

Fig. 5c is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure;

Fig. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure;

Fig. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present disclosure without creative effort fall within the scope of protection of the present disclosure.
In the device provided in the first aspect, the main processing circuit is configured to obtain a data block to be computed and an operation instruction, and to divide the data block to be computed into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block into a plurality of basic data blocks, distribute the basic data blocks to the k basic processing circuits, and broadcast the broadcast data block to the k basic processing circuits.

The plurality of basic processing circuits are configured to convert, according to the received basic data block, broadcast data block, and operation instruction, the basic data block and the broadcast data block into fixed-point data; to perform an inner-product operation on the basic data block and the broadcast data block in the fixed-point data type to obtain an operation result in the fixed-point data type; and to convert the fixed-point operation result into a floating-point operation result and transmit it to the main processing circuit through the k basic processing circuits.

The main processing circuit is configured to process the operation results to obtain the instruction result of the data block to be computed and the operation instruction.
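The data flow above can be illustrated numerically. This is a minimal sketch, assuming plain Python lists stand in for data blocks and a hypothetical Q8 integer scale stands in for the fixed-point conversion; none of these names come from the patent.

```python
SCALE = 1 << 8  # hypothetical Q8 fixed-point scale

def to_fixed(xs):
    """Convert a floating-point block to fixed-point integers."""
    return [int(round(x * SCALE)) for x in xs]

def to_float(acc):
    """Convert an accumulated fixed-point inner product back to float."""
    return acc / (SCALE * SCALE)  # an inner product multiplies two scaled values

def split(distribution_block, size):
    """Split the distribution data block into basic data blocks."""
    return [distribution_block[i:i + size]
            for i in range(0, len(distribution_block), size)]

broadcast_block = [0.5, -1.25, 2.0]
basic_blocks = split([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)

fb = to_fixed(broadcast_block)
results = []
for block in basic_blocks:                    # one basic processing circuit each
    fx = to_fixed(block)
    acc = sum(a * b for a, b in zip(fx, fb))  # fixed-point inner product
    results.append(to_float(acc))             # convert back before sending
print(results)  # [4.0, 7.75]
```

The values here are exactly representable in Q8, so the round trip is lossless; in general the fixed-point step trades a small precision loss for narrower data.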
Optionally, the main processing circuit is specifically configured to broadcast the broadcast data block to the k basic processing circuits in a single broadcast.
In the device provided in the first aspect, the main processing circuit is configured to, when the operation results are inner-product results, accumulate the operation results to obtain an accumulation result, and to arrange the accumulation result to obtain the instruction result of the data block to be computed and the operation instruction.
In the device provided in the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and to broadcast the partial broadcast data blocks to the k basic processing circuits over multiple broadcasts.
In the device provided in the first aspect, the plurality of basic processing circuits are specifically configured to convert the partial broadcast data block and the basic data block into fixed-point data, to perform inner-product processing in the fixed-point data type to obtain inner-product results in the fixed-point data type, to accumulate these inner-product results to obtain a partial operation result in the fixed-point data type, and to convert the partial operation result into a floating-point operation result and send it to the main processing circuit through the k basic processing circuits.
In the device provided in the first aspect, the plurality of basic processing circuits are specifically configured to reuse the partial broadcast data block n times, performing, in the fixed-point data type, inner-product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results in the fixed-point data type; the n partial processing results are accumulated respectively to obtain n partial operation results in the fixed-point data type; the data type conversion circuit is started to convert the n partial operation results from the fixed-point data type into n floating-point partial operation results, which are sent to the main processing circuit through the k basic processing circuits, where n is an integer greater than or equal to 2.
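The reuse step above can be sketched as follows. This is an illustrative toy (fixed-point conversion omitted for brevity, names hypothetical): one partial broadcast data block stays in the basic circuit's buffer and is multiplied against n basic data blocks, so a single broadcast yields n partial results.

```python
def reuse_broadcast(partial_broadcast, basic_blocks):
    """Inner product of one buffered broadcast block with each of n basic blocks."""
    results = []
    for block in basic_blocks:  # n >= 2 basic data blocks, one pass each
        acc = sum(a * b for a, b in zip(block, partial_broadcast))
        results.append(acc)     # one partial processing result per basic block
    return results

print(reuse_broadcast([1, 2], [[3, 4], [5, 6], [7, 8]]))  # [11, 17, 23]
```

Reusing the buffered block n times amortizes one broadcast over n inner products, which is the transmission saving the paragraph describes.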
In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip buffer circuit; each of the plurality of basic processing circuits includes a basic register or a basic on-chip buffer circuit.
In the device provided in the first aspect, the main processing circuit includes one or any combination of: a vector arithmetic unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a data type conversion circuit, and a data rearrangement circuit.
In the device provided in the first aspect, the data are one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
In the device provided in the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block; if the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.
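The instruction-dependent role assignment can be sketched as a small dispatch table. The opcode strings and dictionary layout are illustrative assumptions, not the patent's instruction encoding.

```python
def classify_operands(opcode, operands):
    """Assign operands to broadcast/distribute roles per the operation instruction."""
    if opcode == "MUL":
        # multiplier is broadcast, multiplicand is distributed
        return {"broadcast": operands["multiplier"],
                "distribute": operands["multiplicand"]}
    if opcode == "CONV":
        # convolution input is broadcast, kernel is distributed
        return {"broadcast": operands["input"],
                "distribute": operands["kernel"]}
    raise ValueError(f"unsupported opcode: {opcode}")

roles = classify_operands("CONV", {"input": "X", "kernel": "W"})
print(roles)  # {'broadcast': 'X', 'distribute': 'W'}
```

The split reflects which operand every circuit needs in full (broadcast) versus which can be partitioned across circuits (distribute).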
In the method provided in the sixth aspect, the operations of the neural network include one or any combination of: a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
Referring to Fig. 1a, Fig. 1a shows an integrated circuit chip device provided by the present disclosure. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits arranged in an m*n array, where m and n are integers greater than or equal to 1 and at least one of m and n is greater than or equal to 2. For the plurality of basic processing circuits distributed in the m*n array, each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits of the plurality, which may be: the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column. In the integrated circuit chip device shown in Fig. 1a, the main processing circuit and/or the plurality of basic processing circuits may include a data type conversion circuit; specifically, a part of the basic processing circuits may include data type conversion circuits. For example, in an optional technical solution, the k basic processing circuits may be configured with data type conversion circuits, so that each of the n basic processing circuits of the 1st row may be responsible for the data type conversion step for the data of the m basic processing circuits of its column. This arrangement can improve operation efficiency and reduce power consumption, because the n basic processing circuits of the 1st row are the first to receive the data sent by the main processing circuit; converting the received data into fixed-point data reduces the amount of computation of the subsequent basic processing circuits and the amount of data transmitted to them. Similarly, configuring the m basic processing circuits of the 1st column with data type conversion circuits also has the advantages of a small amount of computation and low power consumption. In addition, according to this structure, the main processing circuit may adopt a dynamic data sending strategy; for example, the main processing circuit broadcasts data to the m basic processing circuits of the 1st column and sends distribution data to the n basic processing circuits of the 1st row. The advantage is that different data are transmitted into the basic processing circuits through different data input ports, so that a basic processing circuit need not distinguish which kind of data it has received; it only needs to determine from which receiving port the data came to know which kind of data it is.
The main processing circuit is configured to perform the respective sequential operations in the neural network operation and to transmit data to the basic processing circuits connected to it; the above sequential operations include, but are not limited to, accumulation operations, ALU operations, and activation operations.
The plurality of basic processing circuits are configured to perform the operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The operations performed in parallel in the neural network include, but are not limited to, inner-product operations and matrix or vector multiplication operations.
The main processing circuit may include a data sending circuit, a data receiving circuit, or an interface. The data sending circuit may integrate a data distribution circuit and a data broadcast circuit; in practical applications, of course, the data distribution circuit and the data broadcast circuit may also be set separately. Broadcast data are data that need to be sent to every basic processing circuit; distribution data are data that need to be sent selectively to a part of the basic processing circuits. Specifically, taking a convolution operation as an example, the convolution input data need to be sent to all basic processing circuits, so they are all broadcast data, while the convolution kernel needs to be sent selectively to a part of the basic data blocks, so the convolution kernel is distribution data. The way of selecting the basic processing circuits to which distribution data are sent may be determined specifically by the main processing circuit according to the load and other allocation methods. For the broadcast sending mode, broadcast data are sent to every basic processing circuit in broadcast form. (In practical applications, the broadcast data may be sent to every basic processing circuit by a single broadcast or by multiple broadcasts; the specific embodiments of the present disclosure do not limit the number of broadcasts.) For the distribution sending mode, distribution data are sent selectively to a part of the basic processing circuits.
The main processing circuit (as shown in Fig. 1d) may include a register and/or an on-chip buffer circuit, and may further include circuits such as a control circuit, a vector arithmetic unit circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, and a DMA (Direct Memory Access) circuit; in practical applications, of course, other circuits such as a conversion circuit (e.g., a matrix transposition circuit), a data rearrangement circuit, or an activation circuit may also be added to the main processing circuit.
Each basic processing circuit may include a basic register and/or a basic on-chip buffer circuit, and may further include one or any combination of an inner-product arithmetic unit circuit, a vector arithmetic unit circuit, an accumulator circuit, and the like. The inner-product arithmetic unit circuit, the vector arithmetic unit circuit, and the accumulator circuit may be integrated circuits, or they may be circuits that are set separately.
Optionally, the accumulator circuits of the n basic processing circuits of the m-th row may perform the accumulation step of the inner-product operation, because the basic processing circuits of the m-th row can receive the product results of all the basic processing circuits in their respective columns. Performing the accumulation step of the inner-product operation with the n basic processing circuits of the m-th row can effectively allocate computing resources and has the advantage of saving power consumption. This technical solution is especially applicable when m is large.
The main processing circuit may allocate which circuits perform the data type conversion; specifically, the allocation may be made in an explicit or implicit manner. In the explicit manner, the main processing circuit may configure a special instruction or indication: when a basic processing circuit receives the special instruction or indication, it performs the data type conversion, and when it does not receive the special instruction or indication, it does not perform the data type conversion. In the implicit manner, for example, when a basic processing circuit receives data of the floating-point type and determines that an inner-product operation needs to be performed, it converts the data into fixed-point data. For the explicit configuration manner, the special instruction or indication may be configured with a descending sequence whose value is decremented by 1 each time it passes through a basic processing circuit; a basic processing circuit reads the value of the descending sequence and performs the data type conversion if the value is greater than zero, and does not perform it if the value is less than or equal to zero. This arrangement is configured according to the basic processing circuits allocated in the array. For example, for the m basic processing circuits of the i-th column, if the main processing circuit needs the first 5 basic processing circuits to perform the data type conversion, it issues a special instruction containing a descending sequence whose initial value may be 5; the value of the descending sequence is decremented by 1 at each basic processing circuit, so at the 5th basic processing circuit the value of the descending sequence is 1, and at the 6th basic processing circuit the value is 0, at which point the 6th basic processing circuit does not perform the data type conversion. In this way, the main processing circuit can dynamically configure the executors and the number of executions of the data type conversion.
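The descending-sequence mechanism can be sketched as a counter carried by the special instruction. This is an illustrative model under the example above (initial value 5, a column of 8 circuits); the function name and interface are hypothetical.

```python
def conversion_flags(initial, num_circuits):
    """Which circuits in a column perform conversion for a given initial value."""
    flags, value = [], initial
    for _ in range(num_circuits):
        flags.append(value > 0)  # a circuit converts if the value it reads is > 0
        value -= 1               # decremented as the instruction passes through
    return flags

print(conversion_flags(5, 8))  # first 5 circuits convert, the remaining 3 do not
```

Choosing the initial value thus lets the main processing circuit pick, per column, how many downstream circuits perform conversion without addressing them individually.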
An embodiment of the present disclosure provides an integrated circuit chip device including a main processing circuit (which may also be called a main unit) and a plurality of basic processing circuits (which may also be called basic units). The structure of the embodiment is shown in Fig. 1b, in which the dashed box encloses the internal structure of the neural network operation device; the grey-filled arrows indicate the data transmission paths between the main processing circuit and the basic processing circuit array, and the hollow arrows indicate the data transmission paths between the basic processing circuits (adjacent basic processing circuits) within the array. The length and width of the basic processing circuit array may differ, i.e., the values of m and n may differ or, of course, may be the same; the present disclosure does not limit the specific values.
The circuit structure of a basic processing circuit is shown in Fig. 1c. In the figure, the dashed box indicates the boundary of the basic processing circuit; the block arrows crossing the dashed box indicate data input and output channels (an arrow pointing into the dashed box is an input channel, and an arrow pointing out of the dashed box is an output channel); the rectangular boxes inside the dashed box indicate storage unit circuits (registers and/or on-chip caches), including input data 1, input data 2, multiplication or inner-product results, and accumulation data; the diamond indicates arithmetic unit circuits, including a multiplication or inner-product arithmetic unit and an adder.
In the present embodiment, the neural network operation device includes a main processing circuit and 16 basic processing circuits (the 16 basic processing circuits are only an example; other numbers may be used in practical applications).
In the present embodiment, each basic processing circuit has two data input interfaces and two data output interfaces. In the subsequent description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in Fig. 1b) is called input 0, and the vertical input interface (the vertical arrow pointing to the unit in Fig. 1b) is called input 1; each horizontal data output interface (the horizontal arrow pointing out of the unit in Fig. 1b) is called output 0, and each vertical data output interface (the vertical arrow pointing out of the unit in Fig. 1b) is called output 1.
The data input interfaces and data output interfaces of each basic processing circuit may be connected to different units, including the main processing circuit and other basic processing circuits.
In this example, input 0 of the four basic processing circuits 0, 4, 8, and 12 (the numbering is shown in Fig. 1b) is connected to the data output interface of the main processing circuit.

In this example, input 1 of the four basic processing circuits 0, 1, 2, and 3 is connected to the data output interface of the main processing circuit.

In this example, output 1 of the four basic processing circuits 12, 13, 14, and 15 is connected to the data input interface of the main processing circuit.

In this example, the connections between the output interfaces of the basic processing circuits and the input interfaces of other basic processing circuits are shown in Fig. 1b and will not be enumerated one by one.
Specifically, the output interface S1 of unit S being connected to the input interface P1 of unit P indicates that unit P can receive, through its P1 interface, the data that unit S sends to its S1 interface.
The present embodiment includes a main processing circuit, which is connected to an external device (i.e., it has both input interfaces and output interfaces); a part of the data output interfaces of the main processing circuit are connected to the data input interfaces of a part of the basic processing circuits, and a part of the data input interfaces of the main processing circuit are connected to the data output interfaces of a part of the basic processing circuits.
Application method of the integrated circuit chip device
The data involved in the application method provided by the present disclosure may be data of any data type; for example, they may be data represented by floating-point numbers of any bit width, or data represented by fixed-point numbers of any bit width.
A schematic structural diagram of the fixed-point data type is shown in Fig. 1e, which is one representation of fixed-point data. For a computing system, the storage bit number of one floating-point datum is 32 bits, while for fixed-point data, especially data represented using the fixed-point type shown in Fig. 1e, the storage bit number of one fixed-point datum can be reduced to 16 bits or fewer. This conversion can therefore greatly reduce the transmission overhead between arithmetic units; in addition, for the arithmetic units, the storage space for data of fewer bits is also smaller, i.e., the storage overhead is smaller, and the amount of computation is also reduced, i.e., the computation overhead is reduced, so both the computation and storage overheads can be reduced. However, the data type conversion itself also incurs a partial overhead, hereinafter called the conversion overhead. For data with a large amount of computation and a large data storage volume, the conversion overhead is almost negligible compared with the subsequent computation, storage, and transmission overheads, so for such data the present disclosure adopts the technical solution of converting the data into fixed-point data. Conversely, for data with a small amount of computation and a small data storage volume, the computation, storage, and transmission overheads are themselves relatively small; in this case, since the precision of fixed-point data is slightly lower than that of floating-point data, and on the premise that the amount of computation is small, the precision of the computation must be guaranteed, so here the fixed-point data are converted into floating-point data, i.e., the precision of the computation is improved at the cost of a small increase in overhead.
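The 32-bit-to-16-bit saving discussed above can be illustrated with a rough sketch, assuming a hypothetical Q8.8 fixed-point layout (8 integer bits, 8 fraction bits); real hardware formats, including the one in Fig. 1e, differ, and this only shows the bit-width trade-off.

```python
import struct

FRAC_BITS = 8  # hypothetical Q8.8: 8 fraction bits

def float_to_fixed16(x):
    """Quantize a float to a 16-bit fixed-point integer (with range check)."""
    v = int(round(x * (1 << FRAC_BITS)))
    if not -(1 << 15) <= v < (1 << 15):
        raise OverflowError("value out of Q8.8 range")
    return v

def fixed16_to_float(v):
    """Recover the (rounded) float from the fixed-point integer."""
    return v / (1 << FRAC_BITS)

x = 3.14159
fx = float_to_fixed16(x)
print(len(struct.pack("f", x)) * 8)                 # 32 bits to store the float
print(fx.bit_length() <= 16, fixed16_to_float(fx))  # True 3.140625
```

The round trip loses precision below 2^-8, which is exactly the trade-off the text describes: acceptable for large computations, but a reason to stay in floating point when the workload is small.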
The operations to be completed in the basic processing circuits may be performed using the following methods:

The main processing circuit first converts the type of the data and then transmits them to the basic processing circuits (for example, the main processing circuit may convert floating-point numbers into fixed-point numbers of lower bit width and transmit them to the basic processing circuits; the advantage is that the bit width of the transmitted data and the total number of transmitted bits are reduced, and the basic processing circuits execute low-bit-width fixed-point operations with higher efficiency and lower power consumption).

The basic processing circuits may first perform data type conversion after receiving the data and then compute (for example, a basic processing circuit receives floating-point numbers transmitted from the main processing circuit, converts them into fixed-point numbers, and then performs the operations, which improves operation efficiency and reduces power consumption).

The basic processing circuits may first perform data type conversion on the computed results and then transmit them to the main processing circuit (for example, a floating-point operation result computed by a basic processing circuit may first be converted into a fixed-point number of low bit width and then transmitted to the main processing circuit; the benefit is that the data bit width of the transmission process is reduced, with higher efficiency and saved power consumption).
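The three placements of the conversion step listed above can be sketched side by side. All names and the Q8 scale are illustrative assumptions: (a) the main circuit converts before sending, (b) the basic circuit converts after receiving, (c) the basic circuit converts its result before sending it back.

```python
SCALE = 256  # hypothetical Q8 scale

def fix(x):
    return int(round(x * SCALE))    # float -> fixed

def unfix(v):
    return v / SCALE                # fixed -> float

def mode_a(x, y):
    # (a) operands quantized in the main circuit before transmission
    return unfix(fix(x)) * unfix(fix(y))

def mode_b(x, y):
    # (b) basic circuit converts received floats, computes in fixed point
    return unfix(fix(x) * fix(y) // SCALE)

def mode_c(x, y):
    # (c) compute in float, quantize only the result for the return link
    return unfix(fix(x * y))

print(mode_a(0.5, 2.0), mode_b(0.5, 2.0), mode_c(0.5, 2.0))  # 1.0 1.0 1.0
```

On exactly representable inputs the three modes agree; in general they differ in where rounding happens and, as the text notes, in how many bits cross each link.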
Application method of the basic processing circuits (see Fig. 2a):

The main processing circuit receives the input data to be computed from outside the device;

Optionally, the main processing circuit processes the data using the various arithmetic circuits of this unit, such as the vector arithmetic circuit, the inner-product arithmetic unit circuit, and the accumulator circuit;

The main processing circuit sends data through its data output interface to the basic processing circuit array (the set of all basic processing circuits is called the basic processing circuit array) (as shown in Fig. 2b);

The way of sending data here may be to broadcast data to a part of the basic processing circuits, i.e., the multiple-broadcast mode;

The way of sending data here may be to send different data to different basic processing circuits respectively, i.e., the distribution mode;

The basic processing circuit array computes on the data;

A basic processing circuit performs operations after receiving the input data;

Optionally, a basic processing circuit transmits the data out from the data output interface of this unit after receiving them (to be transferred to other basic processing circuits that do not receive data directly from the main processing circuit);

Optionally, a basic processing circuit transmits an operation result (an intermediate or final computation result) out from its data output interface;

The main processing circuit receives the output data returned from the basic processing circuit array;

Optionally, the main processing circuit continues to process the data received from the basic processing circuit array (e.g., accumulation or activation operations);

When the main processing circuit has finished processing, it transfers the processing result from its data output interface to the outside of the device.
Completing a matrix-times-vector operation using the circuit device:
(Matrix-times-vector can be computed by taking the inner product of each row of the matrix with the vector, and placing these results into a vector in the order of the corresponding rows.)
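The parenthesized definition above (each row of the matrix forms an inner product with the vector, and the results are stacked in row order) can be sketched directly; this is a plain-Python illustration of the arithmetic, not the circuit-level dataflow:

```python
def mat_vec(S, P):
    """Multiply an M x L matrix S (list of rows) by a length-L vector P
    by taking the inner product of each row with P, in row order."""
    return [sum(s * p for s, p in zip(row, P)) for row in S]

S = [[1, 2], [3, 4], [5, 6]]   # M = 3 rows, L = 2 columns
P = [10, 1]
print(mat_vec(S, P))           # [12, 34, 56]
```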
The following describes the multiplication of a matrix S of size M rows by L columns with a vector P of length L, as shown in Fig. 2c.
This method uses all or part of the basic processing circuits of the neural network computing device; suppose K basic processing circuits are used.
The main processing circuit sends data from some or all rows of matrix S to each of the K basic processing circuits;
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, of a certain row of matrix S to a given basic processing circuit each time. (For example, when sending one number at a time: for a given basic processing circuit, the 1st transmission sends the 1st number of the 3rd row, the 2nd transmission sends the 2nd number of the 3rd row, the 3rd transmission sends the 3rd number of the 3rd row, and so on; when sending a portion of the numbers at a time: the 1st transmission sends the first two numbers of the 3rd row (i.e., the 1st and 2nd numbers), the 2nd transmission sends the 3rd and 4th numbers of the 3rd row, the 3rd transmission sends the 5th and 6th numbers of the 3rd row, and so on.)
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, from each of several rows of matrix S to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of each of rows 3, 4, and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4, and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4, and 5, and so on; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4, and 5, and so on.)
The control circuit of the main processing circuit sends the data of vector P step by step to the 0th basic processing circuit;
After the 0th basic processing circuit receives data of vector P, it sends that data to the next basic processing circuit connected to it, i.e., basic processing circuit 1;
Specifically, some basic processing circuits cannot obtain all the data needed for the calculation directly from the main processing circuit. For example, basic processing circuit 1 in Fig. 2d has only one data input interface connected to the main processing circuit, so it can obtain only the data of matrix S directly from the main processing circuit; the data of vector P must be forwarded to it by basic processing circuit 0. Likewise, after receiving the data, basic processing circuit 1 must continue to forward the data of vector P to basic processing circuit 2.
Each basic processing circuit performs operations on the received data, including but not limited to inner product operations, multiplication operations, addition operations, etc.;
In an optional scheme, a basic processing circuit computes the multiplication of one or more pairs of data each time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, a basic processing circuit computes the inner product of one or more pairs of vectors each time, and then accumulates the result into its register and/or on-chip cache;
After a basic processing circuit computes a result, it transmits the result out through its data output interface (to other basic processing circuits connected to it);
In an optional scheme, the calculation result may be the final result or an intermediate result of the inner product operation;
After a basic processing circuit receives a calculation result from another basic processing circuit, it transmits that data to another basic processing circuit connected to it or to the main processing circuit;
The main processing circuit receives the inner product results from each basic processing circuit and processes them into the final result (the processing may be an accumulation operation, an activation operation, etc.).
Embodiment of the matrix-times-vector method realized with the above computing device:
In an optional scheme, the multiple basic processing circuits used in the method are arranged as shown in Fig. 2d or Fig. 2e;
As shown in Fig. 2c, the data type conversion circuit of the main processing circuit converts matrix S and vector P into fixed-point data; the control circuit of the main processing circuit divides the M rows of matrix S into K groups, with the i-th basic processing circuit responsible for the operations on the i-th group (the set of rows in this group of data is denoted Mi);
The M rows of data may be grouped here by any allocation scheme that assigns each row exactly once;
In an optional scheme, the following allocation is used: row j is assigned to the (j % K)-th basic processing circuit (% denotes the remainder operation);
In an optional scheme, for cases where the rows cannot be grouped evenly, a portion of the rows may first be allocated evenly, and the remaining rows distributed in any manner.
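The j % K allocation above can be sketched as follows; `assign_rows` is a hypothetical helper name, and the uneven-grouping variant (even allocation of a prefix, arbitrary allocation of the remainder) is not shown:

```python
def assign_rows(M, K):
    """Assign each of M row indices to one of K basic processing
    circuits round-robin: row j goes to circuit j % K."""
    groups = [[] for _ in range(K)]
    for j in range(M):
        groups[j % K].append(j)
    return groups

print(assign_rows(7, 3))   # [[0, 3, 6], [1, 4], [2, 5]]
```

Note that each row is assigned exactly once, which is the only constraint the text places on the grouping.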
The control circuit of the main processing circuit successively sends the data of some or all rows of matrix S to the corresponding basic processing circuits each time;
In an optional scheme, the control circuit of the main processing circuit sends to the i-th basic processing circuit, each time, one or more data of one row of the i-th group of data Mi for which it is responsible;
In an optional scheme, the control circuit of the main processing circuit sends to the i-th basic processing circuit, each time, one or more data of each of some or all rows of the i-th group of data Mi for which it is responsible;
The control circuit of the main processing circuit successively sends the data of vector P to the 1st basic processing circuit;
In an optional scheme, the control circuit of the main processing circuit can send one or more data of vector P each time;
After the i-th basic processing circuit receives data of vector P, it sends that data to the (i+1)-th basic processing circuit connected to it;
After each basic processing circuit receives one or more data from one or several rows of matrix S and one or more data from vector P, it performs operations (including but not limited to multiplication or addition);
In an optional scheme, a basic processing circuit computes the multiplication of one or more pairs of data each time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, a basic processing circuit computes the inner product of one or more pairs of vectors each time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, the data received by a basic processing circuit may also be an intermediate result, stored in its register and/or on-chip cache;
Each basic processing circuit transmits its local calculation result to the next basic processing circuit connected to it or to the main processing circuit;
In an optional scheme, corresponding to the structure of Fig. 2d, only the output interface of the last basic processing circuit of each column is connected to the main processing circuit. In this case, only the last basic processing circuit can transmit its local calculation result directly to the main processing circuit; the calculation results of the other basic processing circuits must be passed to the next basic processing circuit in the column, and each next circuit passes them onward until all results reach the last basic processing circuit. The last basic processing circuit performs an accumulation over its local calculation result and the received results of the other basic processing circuits of this column to obtain an intermediate result, and sends the intermediate result to the main processing circuit; alternatively, the last basic processing circuit may transmit the results of the other circuits of this column, together with its local processing result, directly to the main processing circuit.
In an optional scheme, corresponding to the structure of Fig. 2e, every basic processing circuit has an output interface connected to the main processing circuit; in this case, each basic processing circuit transmits its local calculation result directly to the main processing circuit;
After a basic processing circuit receives a calculation result passed over by another basic processing circuit, it transmits it to the next basic processing circuit connected to it or to the main processing circuit.
The main processing circuit receives the M inner product results, which constitute the result of the matrix-times-vector operation.
Completing a matrix-times-matrix operation using the circuit device:
The following describes the multiplication of a matrix S of size M rows by L columns with a matrix P of size L rows by N columns (each row of matrix S has the same length as each column of matrix P, as shown in Fig. 2f).
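The operation being distributed in what follows is ordinary matrix-matrix multiplication: entry (i, j) of the result is the inner product of row i of S with column j of P. A plain-Python reference sketch (not the array dataflow):

```python
def mat_mat(S, P):
    """Multiply an M x L matrix S by an L x N matrix P; entry (i, j) of
    the result is the inner product of row i of S with column j of P."""
    M, L, N = len(S), len(P), len(P[0])
    return [[sum(S[i][k] * P[k][j] for k in range(L)) for j in range(N)]
            for i in range(M)]

S = [[1, 2], [3, 4]]   # 2 x 2
P = [[5, 6], [7, 8]]   # 2 x 2
print(mat_mat(S, P))   # [[19, 22], [43, 50]]
```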
The method is illustrated using the embodiment of the device shown in Fig. 1b;
The data type conversion circuit of the main processing circuit converts matrix S and matrix P into fixed-point data;
The control circuit of the main processing circuit sends data of some or all rows of matrix S to those basic processing circuits that are directly connected to the main processing circuit through the lateral data input interface (for example, the grey-filled vertical data paths at the top of Fig. 1b);
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, of a certain row of matrix S to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of the 3rd row, the 2nd transmission sends the 2nd number of the 3rd row, the 3rd transmission sends the 3rd number of the 3rd row, and so on; or the 1st transmission sends the first two numbers of the 3rd row, the 2nd transmission sends the 3rd and 4th numbers of the 3rd row, the 3rd transmission sends the 5th and 6th numbers of the 3rd row, and so on.)
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, from each of several rows of matrix S to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of each of rows 3, 4, and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4, and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4, and 5, and so on; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4, and 5, and so on.)
The control circuit of the main processing circuit sends data of some or all columns of matrix P to those basic processing circuits that are directly connected to the main processing circuit through the vertical data input interface (for example, the grey-filled lateral data paths on the left side of the basic processing circuit array in Fig. 1b);
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, of a certain column of matrix P to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of the 3rd column, the 2nd transmission sends the 2nd number of the 3rd column, the 3rd transmission sends the 3rd number of the 3rd column, and so on; or the 1st transmission sends the first two numbers of the 3rd column, the 2nd transmission sends the 3rd and 4th numbers of the 3rd column, the 3rd transmission sends the 5th and 6th numbers of the 3rd column, and so on.)
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, from each of several columns of matrix P to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of each of columns 3, 4, and 5, the 2nd transmission sends the 2nd number of each of columns 3, 4, and 5, the 3rd transmission sends the 3rd number of each of columns 3, 4, and 5, and so on; or the 1st transmission sends the first two numbers of each of columns 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of columns 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of columns 3, 4, and 5, and so on.)
After a basic processing circuit receives data of matrix S, it transmits that data through its lateral data output interface to the next basic processing circuit connected to it (for example, the white-filled lateral data paths in the middle of the basic processing circuit array in Fig. 1b); after a basic processing circuit receives data of matrix P, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in Fig. 1b);
Each basic processing circuit performs operations on the received data;
In an optional scheme, a basic processing circuit computes the multiplication of one or more pairs of data each time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, a basic processing circuit computes the inner product of one or more pairs of vectors each time, and then accumulates the result into its register and/or on-chip cache;
After a basic processing circuit computes a result, it can transmit the result out through its data output interface;
In an optional scheme, the calculation result may be the final result or an intermediate result of the inner product operation;
Specifically, if a basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in Fig. 1b, the basic processing circuits of the bottom row output their results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces).
After a basic processing circuit receives a calculation result from another basic processing circuit, it transmits that data to another basic processing circuit connected to it or to the main processing circuit;
It outputs the result in the direction that can reach the main processing circuit directly (for example, in Fig. 1b, the basic processing circuits of the bottom row output their results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces);
The main processing circuit receives the inner product results from each basic processing circuit, from which the output result is obtained.
Embodiment of the "matrix-times-matrix" method:
The method uses a basic processing circuit array arranged as shown in Fig. 1b; suppose it has h rows and w columns;
The data type conversion circuit of the main processing circuit converts matrix S and matrix P into fixed-point data;
The control circuit of the main processing circuit divides the rows of matrix S into h groups, with the i-th basic processing circuit responsible for the operations on the i-th group (the set of rows in this group of data is denoted Hi);
The rows may be grouped here by any allocation scheme that assigns each row exactly once;
In an optional scheme, the following allocation is used: the control circuit of the main processing circuit assigns row j to the (j % h)-th basic processing circuit;
In an optional scheme, for cases where the rows cannot be grouped evenly, a portion of the rows may first be allocated evenly, and the remaining rows distributed in any manner.
The control circuit of the main processing circuit divides the columns of matrix P into w groups, with the i-th basic processing circuit responsible for the operations on the i-th group (the set of columns in this group of data is denoted Wi);
The columns may be grouped here by any allocation scheme that assigns each column exactly once;
In an optional scheme, the following allocation is used: the control circuit of the main processing circuit assigns column j to the (j % w)-th basic processing circuit;
In an optional scheme, for cases where the columns cannot be grouped evenly, a portion of the columns may first be allocated evenly, and the remaining columns distributed in any manner.
The control circuit of the main processing circuit sends data of some or all rows of matrix S to the first basic processing circuit of each row of the basic processing circuit array;
In an optional scheme, the control circuit of the main processing circuit sends, each time, one or more data of one row of the i-th group of data Hi for which it is responsible to the first basic processing circuit of the i-th row of the array;
In an optional scheme, the control circuit of the main processing circuit sends, each time, one or more data of each of some or all rows of the i-th group of data Hi for which it is responsible to the first basic processing circuit of the i-th row of the array;
The control circuit of the main processing circuit sends data of some or all columns of matrix P to the first basic processing circuit of each column of the basic processing circuit array;
In an optional scheme, the control circuit of the main processing circuit sends, each time, one or more data of one column of the i-th group of data Wi for which it is responsible to the first basic processing circuit of the i-th column of the array;
In an optional scheme, the control circuit of the main processing circuit sends, each time, one or more data of each of some or all columns of the i-th group of data Wi for which it is responsible to the first basic processing circuit of the i-th column of the array;
After a basic processing circuit receives data of matrix S, it transmits that data through its lateral data output interface to the next basic processing circuit connected to it (for example, the white-filled lateral data paths in the middle of the basic processing circuit array in Fig. 1b); after a basic processing circuit receives data of matrix P, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in Fig. 1b);
Each basic processing circuit performs operations on the received data;
In an optional scheme, a basic processing circuit computes the multiplication of one or more pairs of data each time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, a basic processing circuit computes the inner product of one or more pairs of vectors each time, and then accumulates the result into its register and/or on-chip cache;
After a basic processing circuit computes a result, it can transmit the result out through its data output interface;
In an optional scheme, the calculation result may be the final result or an intermediate result of the inner product operation;
Specifically, if a basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, the basic processing circuits of the bottom row output their results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces).
After a basic processing circuit receives a calculation result from another basic processing circuit, it transmits that data to another basic processing circuit connected to it or to the main processing circuit;
It outputs the result in the direction that can reach the main processing circuit directly (for example, the basic processing circuits of the bottom row output their results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces);
The main processing circuit receives the inner product results from each basic processing circuit, from which the output result is obtained.
" transverse direction " used in above description, the words such as " vertical " are intended merely to example shown in statement Fig. 1 b, actually make
Two different interfaces are represented with " transverse direction " " vertical " interface for only needing to distinguish each unit.
Completing a fully-connected operation using the circuit device:
If the input data of the fully-connected layer is a vector (i.e., the case where the input of the neural network is a single sample), the weight matrix of the fully-connected layer is taken as matrix S and the input vector as vector P, and the operation is executed according to the matrix-times-vector method of the device;
If the input data of the fully-connected layer is a matrix (i.e., the case where the input of the neural network is multiple samples), the weight matrix of the fully-connected layer is taken as matrix S and the input matrix as matrix P, or the weight matrix of the fully-connected layer is taken as matrix P and the input matrix as matrix S, and the operation is executed according to the matrix-times-matrix method of the device;
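The reduction above can be sketched as follows. This is a minimal illustration assuming the weight matrix has one row per output neuron and, for the multi-sample case, one input column per sample; the function name `fully_connected` and that layout are assumptions for illustration, not the patent's wording:

```python
def fully_connected(W, X):
    """Fully-connected layer via matrix multiplication. W is an
    (out_features x in_features) weight matrix; X is either a single
    sample (a flat list) or a batch given as an in_features x n_samples
    matrix, one column per sample."""
    if X and not isinstance(X[0], list):   # single sample: matrix-times-vector
        return [sum(w * x for w, x in zip(row, X)) for row in W]
    # multiple samples: matrix-times-matrix, one output column per sample
    return [[sum(row[k] * X[k][j] for k in range(len(X)))
             for j in range(len(X[0]))] for row in W]

W = [[1, 0], [0, 2]]
print(fully_connected(W, [3, 4]))       # [3, 8]
print(fully_connected(W, [[3], [4]]))   # [[3], [8]]
```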
Completing a convolution operation using the circuit device:
The convolution operation is described below. In the figures, one square represents one data element. The input data is shown in Fig. 3a (N samples, each sample having C channels, with the feature map of each channel having height H and width W); the weights, i.e., the convolution kernels, are shown in Fig. 3b (M convolution kernels, each having C channels, with height KH and width KW). The rule of the convolution operation is the same for all N samples of the input data, so the following explains the process of performing a convolution operation on one sample. On one sample, each of the M convolution kernels performs the same operation; each kernel operation yields one planar feature map, so the M convolution kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps). One convolution kernel performs an inner product operation at each planar position of the sample, and then slides along the H and W directions. For example, Fig. 3c shows the corresponding position at which a convolution kernel performs the inner product operation at the lower-right corner of one sample of the input data; Fig. 3d shows the convolution position slid one cell to the left, and Fig. 3e shows the convolution position slid one cell upward.
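The sliding inner product described above can be sketched naively for a single sample. Stride 1 and no padding are assumed here, since the patent does not specify them; each kernel's inner product sums over all C channels and all KH x KW kernel positions:

```python
def conv2d(sample, kernels):
    """Naive convolution of one sample (C x H x W nested lists) with M
    kernels (each C x KH x KW): each kernel slides over all valid H/W
    positions, taking an inner product across channels and kernel cells,
    yielding M planar feature maps."""
    C, H, W = len(sample), len(sample[0]), len(sample[0][0])
    KH, KW = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for k in kernels:
        fmap = [[sum(sample[c][i + di][j + dj] * k[c][di][dj]
                     for c in range(C)
                     for di in range(KH)
                     for dj in range(KW))
                 for j in range(W - KW + 1)]
                for i in range(H - KH + 1)]
        out.append(fmap)
    return out

# one channel, 3x3 input, one 2x2 kernel -> one 2x2 feature map
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
k = [[[[1, 0], [0, 1]]]]
print(conv2d(x, k))   # [[[6, 8], [12, 14]]]
```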
This method is illustrated using the embodiment of the device shown in Fig. 1b;
The data type conversion circuit of the main processing circuit can convert the data of some or all of the weight convolution kernels into fixed-point data; the control circuit of the main processing circuit sends the data of some or all of the weight convolution kernels to those basic processing circuits that are directly connected to the main processing circuit through the lateral data input interface (for example, the grey-filled vertical data paths at the top of Fig. 1b);
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, of the data of a certain convolution kernel in the weights to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of the 3rd row, the 2nd transmission sends the 2nd number of the 3rd row, the 3rd transmission sends the 3rd number of the 3rd row, and so on; or the 1st transmission sends the first two numbers of the 3rd row, the 2nd transmission sends the 3rd and 4th numbers of the 3rd row, the 3rd transmission sends the 5th and 6th numbers of the 3rd row, and so on.)
In another case, in an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, from the data of each of several convolution kernels in the weights to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of each of rows 3, 4, and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4, and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4, and 5, and so on; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4, and 5, and so on.)
The control circuit of the main processing circuit partitions the input data according to the convolution positions; the control circuit of the main processing circuit sends the data at some or all of the convolution positions in the input data to those basic processing circuits that are directly connected to the main processing circuit through the vertical data input interface (for example, the grey-filled lateral data paths on the left side of the basic processing circuit array in Fig. 1b);
In an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, of the data at a certain convolution position in the input data to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of the 3rd column, the 2nd transmission sends the 2nd number of the 3rd column, the 3rd transmission sends the 3rd number of the 3rd column, and so on; or the 1st transmission sends the first two numbers of the 3rd column, the 2nd transmission sends the 3rd and 4th numbers of the 3rd column, the 3rd transmission sends the 5th and 6th numbers of the 3rd column, and so on.)
In another case, in an optional scheme, the control circuit of the main processing circuit sends one number, or a portion of the numbers, from the data at each of several convolution positions in the input data to a given basic processing circuit each time. (For example, for a given basic processing circuit, the 1st transmission sends the 1st number of each of columns 3, 4, and 5, the 2nd transmission sends the 2nd number of each of columns 3, 4, and 5, the 3rd transmission sends the 3rd number of each of columns 3, 4, and 5, and so on; or the 1st transmission sends the first two numbers of each of columns 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of columns 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of columns 3, 4, and 5, and so on.)
After a basic processing circuit receives the weight data, it transmits that data through its lateral data output interface to the next basic processing circuit connected to it (for example, the white-filled lateral data paths in the middle of the basic processing circuit array in Fig. 1b); after a basic processing circuit receives the input data, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in Fig. 1b);
Each basic processing circuit performs operations on the received data;
In an optional scheme, a basic processing circuit computes the multiplication of one or more pairs of data each time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, a basic processing circuit computes the inner product of one or more pairs of vectors each time, and then accumulates the result into its register and/or on-chip cache;
After a basic processing circuit computes a result, it can transmit the result out through its data output interface;
In an optional scheme, the calculation result may be the final result or an intermediate result of the inner product operation;
Specifically, if a basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in Fig. 1b, the basic processing circuits of the bottom row output their results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces).
After a basic processing circuit receives a calculation result from another basic processing circuit, it transmits that data to another basic processing circuit connected to it or to the main processing circuit;
It outputs the result in the direction that can reach the main processing circuit directly (for example, the basic processing circuits of the bottom row output their results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces);
The main processing circuit receives the inner product results from each basic processing circuit, from which the output result is obtained.
The method for completing to add bias operation using the circuit device;
The function that two vectors or two matrixes are added may be implemented using the vector operation device circuit of main process task circuit;
The every a line for a vector being added to a matrix may be implemented using the vector operation device circuit of main process task circuit
On or each column on function.
In an optional scheme, the matrix may come from the result of a matrix-multiply-matrix operation executed by the device;
In an optional scheme, the vector may come from the result of a matrix-multiply-vector operation executed by the device;
In an optional scheme, the matrix may come from data received from outside by the main processing circuit of the device;
In an optional scheme, the vector may come from data received from outside by the main processing circuit of the device;
The data sources include, but are not limited to, the above.
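The bias-addition behavior described above can be sketched with NumPy broadcasting; the function name `add_bias` and the axis convention are illustrative assumptions, not terms from the disclosure:

```python
import numpy as np

def add_bias(matrix, bias, axis=1):
    """Add a bias vector onto every row (axis=1) or every column (axis=0)
    of a matrix, as the vector arithmetic unit circuit would."""
    if axis == 1:                    # bias added onto each row
        return matrix + bias
    return matrix + bias[:, None]    # bias added onto each column

m = np.arange(6).reshape(2, 3)       # e.g. the result of a matrix multiply
b = np.array([10, 20, 30])
print(add_bias(m, b))                # [[10 21 32] [13 24 35]]
```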
The method of completing the activation function operation using the circuit device:
The activation circuit of the main processing circuit takes a vector as input and computes the activation vector of that vector;
In an optional scheme, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (the input of the activation function is one numerical value, and its output is also one numerical value), computing one value that is written to the corresponding position of the output vector;
In an optional scheme, the activation function may be y = max(m, x), where x is the input value, y is the output value, and m is a constant;
In an optional scheme, the activation function may be y = tanh(x), where x is the input value and y is the output value;
In an optional scheme, the activation function may be y = sigmoid(x), where x is the input value and y is the output value;
In an optional scheme, the activation function may be a piecewise linear function;
In an optional scheme, the activation function may be any function that takes one number as input and outputs one number.
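The elementwise scheme above can be sketched as follows; `apply_activation` and the constant m = 0 are assumptions for the sketch (the disclosure allows any scalar-in, scalar-out function):

```python
import numpy as np

def apply_activation(vec, fn):
    """The activation circuit applies a scalar activation function to each
    element of the input vector and writes the result to the corresponding
    position of the output vector."""
    return np.array([fn(x) for x in vec])

relu_like = lambda x, m=0.0: max(m, x)          # y = max(m, x)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))    # y = sigmoid(x)

v = np.array([-1.0, 0.0, 2.0])
print(apply_activation(v, relu_like))           # [0. 0. 2.]
print(apply_activation(v, np.tanh))             # y = tanh(x), elementwise
```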
In an optional scheme, the sources of the input vector include (but are not limited to):
external data sources of the device;
in an optional scheme, the input data comes from the result of a matrix-multiply-vector operation executed by the device;
in an optional scheme, the input data comes from the result of a matrix-multiply-matrix operation executed by the device;
results computed by the main processing circuit of the device;
in an optional scheme, the input data comes from the result computed after the main processing circuit of the device applies the bias.
The method of implementing BLAS (Basic Linear Algebra Subprograms) using the device:
GEMM computation refers to the matrix-matrix multiplication operation in the BLAS library. The usual representation of this operation is: C = alpha*op(S)*op(P) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op represents some operation on matrix S or P; in addition, some auxiliary integers serve as parameters describing the widths and heights of matrices S and P;
The steps of implementing the GEMM computation using the device are:
Before performing the op operations, the main processing circuit may convert the data type of the input matrices S and P;
The conversion circuit of the main processing circuit performs the respective op operation on the input matrices S and P;
In an optional scheme, op may be a matrix transposition; the matrix transposition can be implemented using the vector operation function or the data rearrangement function of the main processing circuit (it was noted earlier that the main processing circuit has a data rearrangement circuit). In practical applications, the op may also be implemented directly by the conversion circuit; for example, for a matrix transposition, the op operation is implemented directly by the matrix transposition circuit;
In an optional scheme, the op of a given matrix may be empty, in which case the op operation is not performed;
The matrix multiplication between op(S) and op(P) is completed using the matrix-multiply-matrix computation method;
Each value in the result of op(S)*op(P) is multiplied by alpha using the arithmetic logic circuit of the main processing circuit;
In an optional scheme, when alpha is 1, the multiplication by alpha is not performed;
The operation beta*C is implemented using the arithmetic logic circuit of the main processing circuit;
In an optional scheme, when beta is 1, the multiplication by beta is not performed;
Using the arithmetic logic circuit of the main processing circuit, the step of adding corresponding positions of alpha*op(S)*op(P) and beta*C is implemented;
In an optional scheme, when beta is 0, the addition is not performed;
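The GEMM steps above, including skipping the alpha, beta, and op stages when they are trivial, can be sketched as follows; the `op_s`/`op_p` callables and the function signature are assumptions for the sketch:

```python
import numpy as np

def gemm(S, P, C, alpha=1.0, beta=1.0, op_s=None, op_p=None):
    """C = alpha*op(S)*op(P) + beta*C, following the step order in the text."""
    s = op_s(S) if op_s is not None else S   # op may be empty: skip it
    p = op_p(P) if op_p is not None else P
    prod = s @ p                             # matrix-multiply-matrix step
    if alpha != 1.0:                         # skip the multiply when alpha = 1
        prod = alpha * prod
    if beta == 0.0:                          # skip the addition when beta = 0
        return prod
    return prod + (C if beta == 1.0 else beta * C)

S = np.array([[1.0, 2.0], [3.0, 4.0]])
P = np.eye(2)
C = np.ones((2, 2))
print(gemm(S, P, C, alpha=2.0, beta=1.0, op_s=np.transpose))  # 2*S.T + C
```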
GEMV computation refers to the matrix-vector multiplication operation in the BLAS library. The usual representation of this operation is: C = alpha*op(S)*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op represents some operation on matrix S;
The steps of implementing the GEMV computation using the device are:
Before performing the op operation, the main processing circuit may convert the data type of the input matrix S and vector P;
The conversion circuit of the main processing circuit performs the corresponding op operation on the input matrix S;
In an optional scheme, op may be a matrix transposition; the matrix transposition circuit of the main processing circuit implements the matrix transposition operation;
In an optional scheme, the op of a given matrix may be empty, in which case the op operation is not performed;
The matrix-vector multiplication between matrix op(S) and vector P is completed using the matrix-multiply-vector computation method;
Each value in the result of op(S)*P is multiplied by alpha using the arithmetic logic circuit of the main processing circuit;
In an optional scheme, when alpha is 1, the multiplication by alpha is not performed;
The operation beta*C is implemented using the arithmetic logic circuit of the main processing circuit;
In an optional scheme, when beta is 1, the multiplication by beta is not performed;
Using the arithmetic logic circuit of the main processing circuit, the step of adding corresponding positions of the vectors alpha*op(S)*P and beta*C is implemented;
In an optional scheme, when beta is 0, the addition is not performed;
Implementing data type conversion:
The conversion of data types is implemented using the data type conversion computing circuit of the main processing circuit;
In an optional scheme, the forms of data type conversion include, but are not limited to: converting floating-point numbers to fixed-point numbers, converting fixed-point numbers to floating-point numbers, and the like;
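A minimal software model of the float-to-fixed and fixed-to-float conversions follows; the scaling-by-2^frac_bits scheme, the 16-bit word, and saturation are assumptions for the sketch, since the disclosure does not fix a specific fixed-point format:

```python
def float_to_fixed(x, frac_bits=8, word_bits=16):
    """Quantize a float to a signed fixed-point integer with frac_bits
    fractional bits, saturating to the representable range."""
    scaled = int(round(x * (1 << frac_bits)))
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))

def fixed_to_float(q, frac_bits=8):
    """Convert the fixed-point integer back to a float."""
    return q / (1 << frac_bits)

q = float_to_fixed(3.14159)
print(q, fixed_to_float(q))   # 804 3.140625
```

Working in the cheaper fixed-point domain for inner products and converting back to floating point for the final result is what gives the scheme its claimed reduction in computation and power.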
The method of updating weights:
The weight update function in the neural network training process is implemented using the vector arithmetic unit circuit of the main processing circuit; specifically, weight update refers to the method of updating the weights using the weight gradients.
In an optional scheme, the vector arithmetic unit circuit of the main processing circuit performs addition or subtraction on the two vectors, the weights and the weight gradients, to obtain an operation result, which is the updated weights.
In an optional scheme, the vector arithmetic unit circuit of the main processing circuit multiplies or divides the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values; the vector arithmetic unit circuit then performs addition or subtraction on the intermediate weights and intermediate weight gradient values to obtain an operation result, which is the updated weights.
In an optional scheme, a group of momentum values may first be computed using the weight gradients, and the momentum and the weights are then added or subtracted to obtain the updated weights.
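The momentum variant above can be sketched as a classical momentum update; the learning rate, momentum coefficient, and function name are illustrative assumptions, not values fixed by the disclosure:

```python
def update_weights(w, grad, velocity, lr=0.01, momentum=0.9):
    """One weight-update step: first compute momentum values from the
    gradients, then add them to the weights, as the vector arithmetic
    unit circuit would do elementwise."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_velocity)]
    return new_w, new_velocity

w, vel = [1.0, 2.0], [0.0, 0.0]
w, vel = update_weights(w, [0.5, -0.5], vel)
print(w)   # weights nudged against the gradient direction
```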
The method of implementing the backward operation of a fully connected layer:
The backward operation of a fully connected layer can be divided into two parts. In Fig. 4a, the solid arrows indicate the forward computation process of the fully connected layer, and the dotted parts indicate the backward computation process of the fully connected layer.
As can be seen from Fig. 4a, the device can complete the backward operation of the fully connected layer using the method of completing matrix multiplication with the device described above;
Implementing the backward operation of a convolutional layer:
The backward operation of a convolutional layer can be divided into two parts. In Fig. 4a, the solid arrows indicate the forward computation process of the convolutional layer, and Fig. 4b shows the backward computation process of the convolutional layer.
The backward operation of the convolutional layer shown in Fig. 4a and Fig. 4b can be completed using the device shown in Fig. 1a or the device shown in Fig. 1b. Executing the forward or backward operation is in practice executing a plurality of neural network operations, including but not limited to one or any combination of: matrix-multiply-matrix, matrix-multiply-vector, convolution, activation, and other operations; the manner of these operations is as described in the present disclosure and is not repeated here.
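As a sketch of why the backward pass reduces to the matrix multiplications the device already supports: for a fully connected layer Y = X·W, both gradient parts are themselves matrix products (the shapes and names below are illustrative assumptions):

```python
import numpy as np

def fc_backward(X, W, dY):
    """Backward operation of a fully connected layer Y = X @ W:
    both parts are matrix-multiply-matrix operations."""
    dW = X.T @ dY     # gradient w.r.t. the weights
    dX = dY @ W.T     # gradient propagated to the previous layer
    return dX, dW

X = np.array([[1.0, 2.0]])       # one input sample
W = np.array([[3.0], [4.0]])     # 2 inputs -> 1 output
dY = np.array([[1.0]])           # upstream gradient
dX, dW = fc_backward(X, W, dY)
print(dX, dW.ravel())            # [[3. 4.]] [1. 2.]
```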
The present disclosure also discloses a neural network computing device, which includes one or more chips as shown in Fig. 1a or Fig. 1b, configured to obtain data to be computed and control information from other processing devices, execute a specified neural network operation, and pass the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one chip as shown in Fig. 1a or Fig. 1b is included, the chips can be linked through a specific structure and transmit data between one another, for example, interconnected through a PCIe bus and transmitting data, so as to support larger-scale neural network operations. In that case, the chips may share a common control system or each have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, their interconnection may be any interconnection topology.
The neural network computing device has high compatibility and can be connected to various types of servers through a PCIe interface.
The present disclosure also discloses a combined processing device, which includes the above neural network computing device, a general interconnection interface, and other processing devices (i.e., general-purpose processing devices). The neural network computing device interacts with the other processing devices to jointly complete the operation specified by the user. Fig. 4c is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network computing device and external data and control, carrying data and performing basic control of the neural network computing device such as starting and stopping; the other processing devices may also cooperate with the neural network computing device to complete computing tasks together.
The general interconnection interface is used to transmit data and control instructions between the neural network computing device and the other processing devices. The neural network computing device obtains the required input data from the other processing devices and writes it to on-chip storage of the neural network computing device; it may obtain control instructions from the other processing devices and write them to an on-chip control cache of the neural network computing device; it may also read data from a storage module of the neural network computing device and transmit it to the other processing devices.
As shown in Fig. 4d, optionally, the structure further includes a storage device for storing data required by this computing unit/computing device or by other computing units, and is particularly suitable for data whose required computation data cannot be fully saved in the internal storage of this neural network computing device or the other processing devices.
The combined processing device can serve as an SoC (system on chip) for equipment such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the die area of the control portion, increasing processing speed, and reducing overall power consumption. In this case, the general interconnection interface of the combined processing device is connected to certain components of the equipment, such as cameras, displays, mice, keyboards, network cards, and WiFi interfaces.
An embodiment of the present disclosure provides a neural network processor board, which can be used in numerous general-purpose or special-purpose computing system environments or configurations, such as: personal computers, server computers, handheld or portable devices, tablet devices, smart homes, household appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronic devices, network personal computers (PCs), minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like.
Referring to Fig. 5a, Fig. 5a is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure. As shown in Fig. 5a, the above neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in Fig. 5b, the above neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 involved in the present disclosure is not limited; the above neural network chip 111 includes, but is not limited to, a neural network chip integrating a neural network processor, and the chip may be made of silicon material, germanium material, quantum material, molecular material, or the like. According to the actual situation (for example, a harsher environment) and different application requirements, the above neural network chip may be packaged so that most of the neural network chip is enclosed, with the pins on the neural network chip connected to the outside of the package structure through conductors such as gold wires for circuit connection with the outer layers.
The present disclosure does not limit the specific structure of the neural network chip 111; optionally, refer to the device shown in Fig. 1a or Fig. 1b.
The present disclosure does not limit the types of the first substrate 13 and the second substrate 113; they may be printed circuit boards (PCBs) or printed wiring boards (PWBs), or other circuit boards. The material used to make the PCB is also not limited.
The second substrate 113 involved in the present disclosure is used to carry the above neural network chip 111; the neural network chip package structure 11, obtained by connecting the above neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, is used to protect the neural network chip 111 and to facilitate further packaging of the neural network chip package structure 11 with the first substrate 13.
The specific packaging manner of the above second electrical and non-electrical connection device 112, and the structure corresponding to that packaging manner, are not limited; an appropriate packaging manner may be selected and simply improved according to the actual situation and different application requirements, such as: Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA) packaging, and the like.
Flip Chip is suitable for cases with demanding requirements on the area after packaging, or with sensitivity to conductor inductance and signal transmission time. In addition, the Wire Bonding packaging manner can be used, reducing cost and increasing the flexibility of the package structure.
Ball Grid Array can provide more pins, the average conductor length of the pins is short, and it supports high-speed signal transmission; the packaging may also be replaced by Pin Grid Array (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA), and the like.
Optionally, the neural network chip 111 and the second substrate 113 are packaged using the Flip Chip Ball Grid Array packaging manner; for a schematic diagram of the specific neural network chip package structure, refer to Fig. 6. As shown in Fig. 6, the above neural network chip package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 and the second substrate 24, that is, realizing the packaging of the neural network chip 21.
The pins 26 are used to connect with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling transmission of external data and internal data and facilitating the processing of data by the neural network chip 21 or the neural network processor corresponding to the neural network chip 21. The types and number of the pins are also not limited in the present disclosure; different pin forms may be selected according to different packaging technologies and arranged following certain rules.
Optionally, the above neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between solder balls.
The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the above neural network chip package structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 runs. The heat dissipation device may be a piece of metal with good thermal conductivity, a heat sink, or a radiator, for example, a fan.
For example, as shown in Fig. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal housing heat sink 29. The thermal grease 28 and the metal housing heat sink 29 are used to dissipate the heat generated when the neural network chip 21 runs.
Optionally, the above neural network chip package structure 11 further includes a reinforcing structure connected to the pads 22 and embedded in the solder balls 23, to enhance the bonding strength between the solder balls 23 and the pads 22.
The reinforcing structure may be a metal wire structure or a columnar structure, without limitation here.
The specific form of the first electrical and non-electrical connection device 12 is also not limited in the present disclosure; referring to the description of the second electrical and non-electrical connection device 112, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or in a pluggable manner, facilitating subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit for expanding storage capacity, such as: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR), and the like; expanding the memory improves the processing capability of the neural network processor.
The first substrate 13 may also include a Peripheral Component Interconnect-Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like, for data transmission between the package structure and external circuits, which can improve computing speed and ease of operation.
The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged as the neural network chip package structure 11, and the neural network chip package structure 11 is packaged as the neural network processor board 10, which performs data interaction with an external circuit (for example, a computer motherboard) through an interface (slot or ferrule) on the board; that is, the function of the neural network processor is realized directly by using the neural network processor board 10, and the neural network chip 111 is protected. Other modules may also be added to the neural network processor board 10, improving the application range and computing efficiency of the neural network processor.
In one embodiment, the present disclosure discloses an electronic device, which includes the above neural network processor board 10 or the above neural network chip package structure 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicles include aircraft, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
The specific embodiments described above further describe in detail the purpose, technical solutions, and beneficial effects of the present disclosure. It should be understood that the above are merely specific embodiments of the present disclosure and do not limit the present disclosure; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.
Claims (17)
1. An integrated circuit chip device, characterized in that the integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits among the plurality of basic processing circuits, the k basic processing circuits being: the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column;
the plurality of basic processing circuits include a data type conversion circuit for executing conversion between floating-point type data and fixed-point type data;
the main processing circuit is configured to execute each successive operation in the neural network operation and to transmit data with the k basic processing circuits;
the k basic processing circuits are configured to forward data between the main processing circuit and the plurality of basic processing circuits;
the plurality of basic processing circuits are configured to determine, according to the type of the transmitted data and the operation, whether to start the data type conversion circuit to execute conversion of the data type of the transmitted data, to execute the operation in the neural network in a parallel manner according to the transmitted data, and to transmit the operation result to the main processing circuit through the k basic processing circuits.
2. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is configured to obtain a data block to be computed and an operation instruction, and to divide, according to the operation instruction, the data block to be computed into a distribution data block and a broadcast data block; to split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the k basic processing circuits, and broadcast the broadcast data block to the k basic processing circuits;
the plurality of basic processing circuits are configured to, according to the received basic data block, broadcast data block, and operation instruction, convert the basic data block and the broadcast data block into a basic data block and a broadcast data block of fixed-point data type, execute inner-product operations on the basic data block and the broadcast data block in fixed-point data type to obtain operation results in fixed-point data type, convert the fixed-point data type operation results into floating-point type operation results, and transmit them to the main processing circuit through the k basic processing circuits;
the main processing circuit is configured to process the operation results to obtain the instruction result of the data block to be computed and the operation instruction.
3. The integrated circuit chip device according to claim 2, characterized in that
the main processing circuit is specifically configured to broadcast the broadcast data block to the k basic processing circuits in a single broadcast.
4. The integrated circuit chip device according to claim 2, characterized in that
the main processing circuit is configured to, when the operation result is the result of inner-product processing, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be computed and the operation instruction.
5. The integrated circuit chip device according to claim 2, characterized in that
the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the k basic processing circuits over multiple broadcasts.
6. The integrated circuit chip device according to claim 5, characterized in that
the plurality of basic processing circuits are specifically configured to convert the partial broadcast data block and the basic data block into fixed-point data type, execute inner-product processing in fixed-point data type to obtain inner-product processing results of fixed-point data type, accumulate the inner-product processing results of fixed-point data type to obtain partial operation results of fixed-point data type, convert the partial operation results of fixed-point type into floating-point type operation results, and send them to the main processing circuit through the k basic processing circuits.
7. The integrated circuit chip device according to claim 6, characterized in that
the plurality of basic processing circuits are specifically configured to reuse the partial broadcast data block n times, executing, in fixed-point data type, the inner-product operations of that partial broadcast data block with the n basic data blocks to obtain n partial processing results of fixed-point data type; to accumulate the n partial processing results of fixed-point data type respectively to obtain n partial operation results of fixed-point data type; and to start the data type conversion circuit to convert the n partial operation results of fixed-point data type into n partial operation results of floating-point type and send them to the main processing circuit through the k basic processing circuits, where n is an integer greater than or equal to 2.
8. The integrated circuit chip device according to any one of claims 1-7, characterized in that
the main processing circuit includes: a main register or a main on-chip cache circuit;
the plurality of basic processing circuits include: basic registers or basic on-chip cache circuits.
9. The integrated circuit chip device according to claim 8, characterized in that
the main processing circuit includes: one of, or any combination of, a vector arithmetic unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a data type computing circuit, and a data rearrangement circuit.
10. The integrated circuit chip device according to claim 1, characterized in that
the data is: one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
11. The integrated circuit chip device according to claim 2, characterized in that
if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block;
if the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.
12. A neural network computing device, characterized in that the neural network computing device includes one or more integrated circuit chip devices according to any one of claims 1-11.
13. A combined processing device, characterized in that the combined processing device includes: the neural network computing device according to claim 12, a general interconnection interface, and a general-purpose processing device;
the neural network computing device is connected to the general-purpose processing device through the general interconnection interface.
14. A chip, characterized in that the chip integrates the device according to any one of claims 1-13.
15. A smart device, characterized in that the smart device includes the chip according to claim 14.
16. A neural network operation method, characterized in that the method is applied in an integrated circuit chip device, the integrated circuit chip device including the integrated circuit chip device according to any one of claims 1-11, and the integrated circuit chip device is configured to execute the operation of a neural network.
17. The method according to claim 16, characterized in that the operation of the neural network includes: one of, or any combination of, a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
Priority Applications (15)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711343642.1A CN109960673B (en) | 2017-12-14 | 2017-12-14 | Integrated circuit chip device and related product |
TW107144037A TWI767097B (en) | 2017-12-14 | 2018-12-07 | Integrated circuit chip apparatus and related product |
PCT/CN2019/073453 WO2019114842A1 (en) | 2017-12-14 | 2019-01-28 | Integrated circuit chip apparatus |
US16/721,875 US11562216B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,879 US11507809B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,882 US11586891B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,883 US20200192632A1 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,888 US11704545B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,892 US11507810B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,885 US11308389B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US17/010,761 US11562219B2 (en) | 2017-12-14 | 2020-09-02 | Integrated circuit chip apparatus |
US17/688,853 US11900242B2 (en) | 2017-12-14 | 2022-03-07 | Integrated circuit chip apparatus |
US17/688,844 US11900241B2 (en) | 2017-12-14 | 2022-03-07 | Integrated circuit chip apparatus |
US18/085,273 US20230120704A1 (en) | 2017-12-14 | 2022-12-20 | Integrated circuit chip apparatus |
US18/085,332 US20230121164A1 (en) | 2017-12-14 | 2022-12-20 | Integrated circuit chip apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711343642.1A CN109960673B (en) | 2017-12-14 | 2017-12-14 | Integrated circuit chip device and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109960673A true CN109960673A (en) | 2019-07-02 |
CN109960673B CN109960673B (en) | 2020-02-18 |
Family
ID=67018616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711343642.1A Active CN109960673B (en) | 2017-12-14 | 2017-12-14 | Integrated circuit chip device and related product |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109960673B (en) |
TW (1) | TWI767097B (en) |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2149162A (en) * | 1983-10-31 | 1985-06-05 | British Telecomm | Fixed point to floating point conversion |
EP0570950A2 (en) * | 1992-05-22 | 1993-11-24 | International Business Machines Corporation | Advanced parallel array processor (APAP) |
EP0512007A4 (en) * | 1989-12-29 | 1995-11-29 | Supercomputer Systems Ltd | Cluster architecture for a highly parallel scalar/vector multiprocessor system |
US5752068A (en) * | 1994-08-23 | 1998-05-12 | Massachusetts Institute Of Technology | Mesh parallel computer architecture apparatus and associated methods |
CN201311633Y (en) * | 2008-11-07 | 2009-09-16 | 中国北车股份有限公司大连电力牵引研发中心 | Function module realized by floating point divider based on FPGA |
CN101794210A (en) * | 2010-04-07 | 2010-08-04 | 上海交通大学 | General matrix floating point multiplier based on FPGA (Field Programmable Gate Array) |
CN102495719A (en) * | 2011-12-15 | 2012-06-13 | 中国科学院自动化研究所 | Vector floating point operation device and method |
CN102665049A (en) * | 2012-03-29 | 2012-09-12 | 中国科学院半导体研究所 | Programmable visual chip-based visual image processing system |
CN104572011A (en) * | 2014-12-22 | 2015-04-29 | 上海交通大学 | FPGA (Field Programmable Gate Array)-based general matrix fixed-point multiplier and calculation method thereof |
CN105892989A (en) * | 2016-03-28 | 2016-08-24 | 中国科学院计算技术研究所 | Neural network accelerator and operational method thereof |
CN105956660A (en) * | 2016-05-16 | 2016-09-21 | 浪潮集团有限公司 | Neural network chip realization method used for real-time image identification |
CN106502626A (en) * | 2016-11-03 | 2017-03-15 | 北京百度网讯科技有限公司 | Data processing method and device |
CN106844294A (en) * | 2016-12-29 | 2017-06-13 | 华为机器有限公司 | Convolution algorithm chip and communication equipment |
CN106940815A (en) * | 2017-02-13 | 2017-07-11 | 西安交通大学 | A kind of programmable convolutional neural networks Crypto Coprocessor IP Core |
CN107239829A (en) * | 2016-08-12 | 2017-10-10 | 北京深鉴科技有限公司 | A kind of method of optimized artificial neural network |
CN107292334A (en) * | 2017-06-08 | 2017-10-24 | 北京深瞐科技有限公司 | Image-recognizing method and device |
CN107330515A (en) * | 2016-04-29 | 2017-11-07 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing artificial neural network forward operation |
CN107451658A (en) * | 2017-07-24 | 2017-12-08 | 杭州菲数科技有限公司 | Floating-point operation fixed point method and system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9292790B2 (en) * | 2012-11-20 | 2016-03-22 | Qualcomm Incorporated | Piecewise linear neuron modeling
ES2738319T3 (en) * | 2014-09-12 | 2020-01-21 | Microsoft Technology Licensing Llc | Computer system to train neural networks |
CN106650923B (en) * | 2015-10-08 | 2019-04-09 | 上海兆芯集成电路有限公司 | Neural network unit with neural memory and neural processing unit and sequencer |
US9870341B2 (en) * | 2016-03-18 | 2018-01-16 | Qualcomm Incorporated | Memory reduction method for fixed point matrix multiply |
CN111104164A (en) * | 2016-04-26 | 2020-05-05 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing matrix multiplication operation |
2017
- 2017-12-14 CN CN201711343642.1A patent/CN109960673B/en active Active
2018
- 2018-12-07 TW TW107144037A patent/TWI767097B/en active
Non-Patent Citations (1)
Title |
---|
FANG Rui et al.: "Design of an FPGA Parallel Acceleration Scheme for Convolutional Neural Networks", Computer Engineering and Applications *
Also Published As
Publication number | Publication date |
---|---|
TWI767097B (en) | 2022-06-11 |
TW201928796A (en) | 2019-07-16 |
CN109960673B (en) | 2020-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109961138A (en) | Neural network training method and Related product | |
CN109961136A (en) | Integrated circuit chip device and Related product | |
CN109978131A (en) | Integrated circuit chip device and Related product | |
CN111105033B (en) | Neural network processor board card and related products | |
WO2019114842A1 (en) | Integrated circuit chip apparatus | |
CN109961135A (en) | Integrated circuit chip device and Related product | |
CN109961131A (en) | Neural network forward operation method and Related product | |
CN109977446A (en) | Integrated circuit chip device and Related product | |
CN109978151A (en) | Neural network processor board and Related product | |
CN109978152A (en) | Integrated circuit chip device and Related product | |
CN109977071A (en) | Neural network processor board and Related product | |
CN109978150A (en) | Neural network processor board and Related product | |
CN110197264A (en) | Neural network processor board and Related product | |
CN109960673A (en) | Integrated circuit chip device and Related product | |
CN109978157A (en) | Integrated circuit chip device and Related product | |
CN109961133A (en) | Integrated circuit chip device and Related product | |
CN109978156A (en) | Integrated circuit chip device and Related product | |
CN109978148B (en) | Integrated circuit chip device and related product | |
CN109978147A (en) | Integrated circuit chip device and Related product | |
CN110197267A (en) | Neural network processor board and Related product | |
CN109978154A (en) | Integrated circuit chip device and Related product | |
CN110490315A (en) | The reversed operation Sparse methods and Related product of neural network | |
WO2019165946A1 (en) | Integrated circuit chip device, board card and related product | |
CN109978130A (en) | Integrated circuit chip device and Related product | |
CN109978153A (en) | Integrated circuit chip device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant after: Zhongke Cambrian Technology Co., Ltd
Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.
GR01 | Patent grant | ||