CN109961137A - Integrated circuit chip device and related product


Info

Publication number
CN109961137A
CN109961137A; CN201711347408.6A; CN201711347408A
Authority
CN
China
Prior art keywords
circuit
data
main processing circuit
data block
basic processing circuit
Legal status
Granted
Application number
CN201711347408.6A
Other languages
Chinese (zh)
Other versions
CN109961137B (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Application filed by Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201711347408.6A (granted as CN109961137B)
Priority to TW107144035A (granted as TWI795482B)
Priority to PCT/CN2019/073453 (published as WO2019114842A1)
Publication of CN109961137A
Priority to US16/721,888 (granted as US11704545B2)
Priority to US16/721,882 (granted as US11586891B2)
Priority to US16/721,885 (granted as US11308389B2)
Priority to US16/721,892 (granted as US11507810B2)
Priority to US16/721,875 (granted as US11562216B2)
Priority to US16/721,883 (published as US20200192632A1)
Priority to US16/721,879 (granted as US11507809B2)
Priority to US17/010,761 (granted as US11562219B2)
Application granted
Publication of CN109961137B
Priority to US17/688,853 (granted as US11900242B2)
Priority to US17/688,844 (granted as US11900241B2)
Priority to US18/085,273 (published as US20230120704A1)
Priority to US18/085,332 (published as US20230121164A1)
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present disclosure provides an integrated circuit chip device and related products. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits. The basic processing circuits include a data type conversion circuit configured to perform conversion between floating-point data and fixed-point data. The main processing circuit is configured to perform the sequential operations of a neural network computation and to transmit data to the plurality of basic processing circuits. The plurality of basic processing circuits are configured to decide, under the control of the operation on the transmitted data, whether to start the data type conversion circuit to convert the type of the transmitted data; to perform the operations of the neural network in parallel on the transmitted data or on the converted data; and to transmit the operation results to the main processing circuit. The technical solution provided by the present disclosure has the advantages of a small amount of computation and low power consumption.

Description

Integrated circuit chip device and related product
Technical field
The present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
Background art
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An artificial neural network abstracts the neuronal network of the human brain from an information-processing perspective, builds simple models of it, and composes different networks according to different connection patterns. In engineering and academia it is often referred to directly as a neural network or a neural-network-like model. A neural network is a computational model consisting of a large number of nodes (or neurons) connected to one another. Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such operations are computationally intensive and power hungry.
Summary of the invention
Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed of a computing device and improve its efficiency.
In a first aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits;
the basic processing circuits include a data type conversion circuit configured to perform conversion between floating-point data and fixed-point data;
the main processing circuit is configured to perform the sequential operations of a neural network computation and to transmit data to the plurality of basic processing circuits;
the plurality of basic processing circuits are configured to decide, under the control of the operation on the transmitted data, whether to start the data type conversion circuit to convert the type of the transmitted data; to perform the operations of the neural network in parallel on the transmitted data or on the converted transmitted data; and to transmit the operation results to the main processing circuit.
In a second aspect, a neural network computing device is provided. The neural network computing device includes one or more of the integrated circuit chip devices provided in the first aspect.
In a third aspect, a combined processing device is provided. The combined processing device includes the neural network computing device provided in the second aspect, a universal interconnect interface, and a general-purpose processing device;
the neural network computing device is connected to the general-purpose processing device through the universal interconnect interface.
In a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.
In a fifth aspect, an electronic apparatus is provided, which includes the chip of the fourth aspect.
In a sixth aspect, a neural network operation method is provided. The method is applied in an integrated circuit chip device; the integrated circuit chip device is the integrated circuit chip device described in the first aspect and is configured to perform neural network operations.
As can be seen, in the embodiments of the present disclosure a data type conversion circuit is provided and the type of a data block is converted before the operation, which saves transmission resources and computing resources; the solution therefore has the advantages of low power consumption and a small amount of computation.
Detailed description of the invention
Fig. 1a is a schematic structural diagram of an integrated circuit chip device.
Fig. 1b is a schematic structural diagram of another integrated circuit chip device.
Fig. 1c is a schematic structural diagram of a basic processing circuit.
Fig. 1d is a schematic configuration diagram of a fixed-point data type.
Fig. 2 is a flow diagram of a matrix-times-vector operation.
Fig. 2a is a schematic diagram of matrix-times-vector.
Fig. 2b is a schematic diagram of a matrix-times-matrix procedure.
Fig. 2c is a schematic diagram of the matrix Ai multiplied by the vector B.
Fig. 2d is a schematic diagram of the matrix A multiplied by the matrix B.
Fig. 2e is a schematic diagram of the matrix Ai multiplied by the matrix B.
Fig. 3a is a schematic diagram of neural network training.
Fig. 3b is a schematic diagram of a convolution operation.
Fig. 4a is a schematic diagram of a neural network forward operation.
Fig. 4b is a schematic diagram of a neural network backward operation.
Fig. 4c is a schematic structural diagram of a combined processing device also disclosed by the present disclosure.
Fig. 4d is another schematic structural diagram of a combined processing device also disclosed by the present disclosure.
Fig. 5a is a schematic diagram of another neural network forward operation.
Fig. 5b is a schematic diagram of another neural network backward operation.
Fig. 5c is a schematic structural diagram of a neural network processor board card provided by an embodiment of the present disclosure.
Fig. 5d is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 5e is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
Specific embodiment
To enable those skilled in the art to better understand the solution of the present disclosure, the technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some of the embodiments of the present disclosure, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
In the device provided in the first aspect, the device further includes a branch circuit arranged between the main processing circuit and the plurality of basic processing circuits and configured to forward the transmitted data between the main processing circuit and the plurality of basic processing circuits.
In the device provided in the first aspect, the main processing circuit is configured to obtain a data block to be computed and an operation instruction, to divide the data block to be computed into a distribution data block and a broadcast data block according to the operation instruction, to split the distribution data block to obtain a plurality of basic data blocks, to distribute the plurality of basic data blocks to the circuits connected to it, and to broadcast the broadcast data block to the circuits connected to it;
the basic processing circuit is configured to start, according to the operation, the conversion of the basic data block and the broadcast data block into the fixed-point data type, to perform inner-product operations in the fixed-point data type to obtain operation results of the fixed-point data type, and to start the data type conversion circuit to convert the operation results of the fixed-point data type into operation results of the floating-point type, which are sent to the main processing circuit;
the main processing circuit is configured to process the operation results of the floating-point type to obtain the instruction result of the data block to be computed and of the operation instruction.
In the device provided in the first aspect, the main processing circuit is specifically configured to broadcast the broadcast data block to the plurality of basic processing circuits in a single broadcast.
In the device provided in the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and to broadcast the plurality of partial broadcast data blocks to the plurality of basic processing circuits in multiple broadcasts.
In the device provided in the first aspect, the basic processing circuit is specifically configured to perform inner-product processing on the partial broadcast data block and the basic data block in the fixed-point type to obtain an inner-product processing result, to accumulate the inner-product processing results to obtain a partial operation result, and to convert the partial operation result into the floating-point type and send it to the main processing circuit.
In the device provided in the first aspect, the basic processing circuit is specifically configured to reuse the partial broadcast data block n times, performing in the fixed-point data type the inner-product operations of the partial broadcast data block with n basic data blocks to obtain n partial processing results of the fixed-point data type; to convert the n partial processing results of the fixed-point data type into n partial processing results of the floating-point type; to accumulate the n partial processing results of the floating-point type separately to obtain n partial operation results; and to send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
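For illustration only, the following Python sketch mirrors this n-fold reuse of a partial broadcast data block; the Q-format scaling, the function name, and the use of NumPy in place of the inner-product operator circuit are assumptions of the sketch, not details given by the disclosure.
```python
import numpy as np

def reuse_broadcast_block(broadcast_part, basic_blocks, frac_bits=12):
    """Reuse one partial broadcast data block against n basic data blocks.

    Inner products run on fixed-point (integer) values; each partial result
    is converted back to floating point before it is accumulated, as in the
    n-fold multiplexing described above.
    """
    scale = 1 << frac_bits
    b_fix = np.round(np.asarray(broadcast_part) * scale).astype(np.int64)
    results = []
    for block in basic_blocks:                    # n basic data blocks
        blk_fix = np.round(np.asarray(block) * scale).astype(np.int64)
        acc = int(np.dot(blk_fix, b_fix))         # fixed-point inner product
        results.append(acc / scale**2)            # fixed -> floating point
    return results                                # n partial operation results
```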
In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit;
or the basic processing circuit includes a basic register or a basic on-chip cache circuit.
In the device provided in the first aspect, the main processing circuit includes one or any combination of: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a data type conversion circuit, and a data rearrangement circuit.
In the device provided in the first aspect, the data include one or any combination of: vectors, matrices, three-dimensional data blocks, four-dimensional data blocks, and n-dimensional data blocks.
In the device provided in the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block;
if the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.
In the method provided in the sixth aspect, the operations of the neural network include one or any combination of: a convolution operation, a matrix-times-matrix operation, a matrix-times-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
Referring to Fig. 1a, Fig. 1a is a schematic structural diagram of an integrated circuit chip device. As shown in Fig. 1a, the chip device includes: a main processing circuit, basic processing circuits, and branch processing circuits (optional). Among these,
the main processing circuit may include a register and/or an on-chip cache circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and the like; in practical applications, the main processing circuit may additionally include a conversion circuit (e.g., a matrix transposition circuit), a data rearrangement circuit, an activation circuit, or other circuits;
optionally, the main processing circuit may include a data type conversion circuit, which may be configured to convert received or sent data from the floating-point type to the fixed-point type; in practical applications it may of course also convert data of the fixed-point type into data of the floating-point type. The present disclosure does not limit the concrete form of the data type conversion circuit.
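As a rough intuition for such a conversion, a minimal Python sketch is given below; the word width, the number of fractional bits, and the function names are assumptions of the sketch (the disclosure does not fix a concrete format, and the format of Fig. 1d may differ).
```python
import numpy as np

def float_to_fixed(x, frac_bits=8, word_bits=16):
    """Quantize floating-point data to signed fixed-point integers."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return np.clip(np.round(np.asarray(x) * scale), lo, hi).astype(np.int32)

def fixed_to_float(q, frac_bits=8):
    """Recover a floating-point approximation from fixed-point integers."""
    return np.asarray(q, dtype=np.float64) / (1 << frac_bits)

x = np.array([0.5, -1.25, 3.1416])
q = float_to_fixed(x)      # lower-bit-width data for transmission and compute
print(fixed_to_float(q))   # approximately [0.5, -1.25, 3.140625]
```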
The main processing circuit further includes a data transmitting circuit, a data receiving circuit, or interfaces. A data distribution circuit and a data broadcast circuit may be integrated into the data transmitting circuit; in practical applications they may of course also be provided separately. The data transmitting circuit and the data receiving circuit may likewise be integrated to form a data transceiving circuit. Broadcast data are data that need to be sent to every basic processing circuit. Distribution data are data that need to be sent selectively to some of the basic processing circuits; the concrete selection may be determined specifically by the main processing circuit according to its load and the computation scheme. In the broadcast transmission mode, the broadcast data are sent to each basic processing circuit in broadcast form. (In practical applications, the broadcast data may be sent to each basic processing circuit in a single broadcast, or in multiple broadcasts; the embodiments of the present application do not limit the number of broadcasts.) In the distribution transmission mode, the distribution data are sent selectively to some of the basic processing circuits.
When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits (the data may be identical or different). Specifically, if data are sent by distribution, the data received by each receiving basic processing circuit may differ; of course, some of the basic processing circuits may also receive the same data.
Specifically, when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and each basic processing circuit that receives data receives the same data.
Optionally, the vector operator circuit of the main processing circuit may perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of a vector and a constant; or any operation applied to each element of a vector. Among these, a sequential operation may specifically be: addition, subtraction, multiplication, or division of a vector and a constant, an activation operation, an accumulation operation, and the like.
Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit, and may further include one or any combination of an inner-product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner-product operator circuit, the vector operator circuit, and the accumulator circuit may be integrated circuits, or they may be circuits that are provided separately.
Optionally, the chip device may also include one or more branch processing circuits. Where branch processing circuits are provided, the main processing circuit is connected to the branch processing circuits, and the branch processing circuits are connected to the basic processing circuits; the inner-product operator circuit of a basic processing circuit is used to perform inner-product operations between data blocks; the control circuit of the main processing circuit controls the data receiving circuit or the data transmitting circuit to transceive external data, and controls the data transmitting circuit to distribute external data to the branch processing circuits; and a branch processing circuit is used to transceive data of the main processing circuit or of the basic processing circuits. The structure shown in Fig. 1a is suited to the computation of complex data: since the number of units that the main processing circuit can connect to is limited, branch processing circuits need to be added between the main processing circuit and the basic processing circuits to give access to more basic processing circuits and thereby realize computation on complex data blocks. The connection structure of the branch processing circuits and the basic processing circuits may be arbitrary and is not limited to the H-shaped structure of Fig. 1a. Optionally, the structure from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, and the structure from the basic processing circuits to the main processing circuit is a gather structure. Broadcast, distribution, and gather are defined as follows: in a distribution or broadcast structure, the number of basic processing circuits is greater than the number of main processing circuits, i.e., one main processing circuit corresponds to multiple basic processing circuits, so the structure from the main processing circuit to the multiple basic processing circuits is a broadcast or distribution structure; conversely, the structure from the multiple basic processing circuits to the main processing circuit may be a gather structure.
A basic processing circuit receives the data distributed or broadcast by the main processing circuit, saves them in its on-chip cache, performs the operation to produce a result, and may send data to the main processing circuit.
The data involved in a basic processing circuit may be data of any type: they may be data represented by floating-point numbers of any bit width or by fixed-point numbers of any bit width. All the operation circuits and storage circuits involved may likewise be operation circuits and storage circuits capable of handling any type of data: they may be operation circuits and storage circuits for floating-point numbers of any bit width, or operation circuits and storage circuits for fixed-point numbers of any bit width.
Optionally, each basic processing circuit may include a data type conversion circuit, or data type conversion circuits may be configured in only some of the basic processing circuits. A data type conversion circuit may be used to convert received or sent data from the floating-point type to the fixed-point type, and may also convert data of the fixed-point type into data of the floating-point type. The present disclosure does not limit the concrete form of the data type conversion circuit.
Optionally, the vector operator circuit of a basic processing circuit may perform vector operations on the two vectors after data type conversion; in practical applications, the inner-product operator circuit of a basic processing circuit may of course perform inner-product operations on the two converted vectors, and the accumulator circuit may also accumulate the results of the inner-product operations.
In one alternative, the two vectors may be stored in an on-chip cache and/or a register, and the basic processing circuit may extract the two vectors to perform operations as the actual computation requires. These operations include but are not limited to: inner-product operations, multiplication operations, addition operations, or other operations.
In one alternative, the results of inner-product operations may be accumulated into an on-chip cache and/or a register. The advantage of this alternative is that it reduces the amount of data transmitted between the basic processing circuits and the main processing circuit, improves operation efficiency, and reduces data transmission power consumption.
In one alternative, the result of an inner-product operation is transmitted directly as a result, without accumulation. The advantage of this solution is that it reduces the amount of computation inside the basic processing circuit and improves the operation efficiency of the basic processing circuit.
In one alternative, each basic processing circuit may perform inner-product operations of multiple groups of two vectors, and may also accumulate the results of the multiple groups of inner-product operations separately;
in one alternative, the multiple groups of two-vector data may be stored in an on-chip cache and/or a register;
in one alternative, the results of the multiple groups of inner-product operations may be accumulated separately into an on-chip cache and/or a register;
in one alternative, the result of each group of inner-product operations may be transmitted directly as a result, without accumulation;
in one alternative, each basic processing circuit may perform inner-product operations of one and the same vector with multiple vectors separately ("one-to-many" inner products, i.e., for multiple groups of inner products, one of the two vectors in each group is shared), and accumulate the inner-product result corresponding to each vector separately. With this solution, the same set of weights can be used repeatedly for computations on different input data, which increases data reuse, reduces the amount of data transmitted inside the basic processing circuits, improves computation efficiency, and reduces power consumption.
Specifically, among the data used in computing the inner products, the vector shared by the groups and the other vector of each group (i.e., the vector that differs between groups) may come from different sources:
In one alternative, when computing inner products, the shared vector of each group comes from a broadcast or distribution of the main processing circuit or of a branch processing circuit;
in one alternative, when computing inner products, the shared vector of each group comes from the on-chip cache;
in one alternative, when computing inner products, the shared vector of each group comes from a register;
in one alternative, when computing inner products, the non-shared other vector of each group comes from a broadcast or distribution of the main processing circuit or of a branch processing circuit;
in one alternative, when computing inner products, the non-shared other vector of each group comes from the on-chip cache;
in one alternative, when computing inner products, the non-shared other vector of each group comes from a register;
in one alternative, when performing multiple groups of inner-product operations, any number of copies of the vector shared by the groups may be retained in the on-chip cache and/or register of the basic processing circuit;
in one alternative, one copy of the shared vector may be retained for each group of inner products;
in one alternative, only a single copy of the shared vector may be retained.
Specifically, the results of the multiple groups of inner-product operations may be accumulated separately into an on-chip cache and/or a register;
specifically, the result of each group of inner-product operations may be transmitted directly as a result, without accumulation.
Referring to the structure shown in Fig. 1a, it contains one main processing circuit (which can perform vector operations) and multiple basic processing circuits (which can perform inner-product operations). The benefit of this combination is that the device can not only use the basic processing circuits to perform matrix and vector multiplication operations, but can also use the main processing circuit to perform any other vector operations, so that under a limited hardware circuit configuration the device can complete more operations faster, reducing the number of data transmissions with the outside of the device, improving computation efficiency, and reducing power consumption. In addition, this chip may provide data type conversion circuits in the basic processing circuits and/or the main processing circuit, so that floating-point data can be converted into fixed-point data during neural network computation and fixed-point data can also be converted into floating-point data; moreover, this chip can dynamically allocate which circuit performs the data type conversion according to the amount of computation (i.e., the load) of each circuit (mainly the main processing circuit and the basic processing circuits). This can reduce the complexity of the data computation and reduce power consumption, and the dynamic allocation of the data type conversion can be realized without affecting the computation efficiency of the chip. The manner of allocation includes but is not limited to: load balancing, load minimum allocation, and the like.
Referring to the device shown in Fig. 1b, the device shown in Fig. 1b is a computing device without branch processing circuits. The device shown in Fig. 1b comprises: a main processing circuit and N basic processing circuits, where the main processing circuit (whose specific structure is shown in Fig. 1c) may be connected to the N basic processing circuits directly or indirectly. When connected indirectly, one alternative may include N/4 branch processing circuits as shown in Fig. 1a, each branch processing circuit being separately connected to 4 basic processing circuits. For the circuits included in the main processing circuit and the N basic processing circuits, reference may be made to the description of Fig. 1a above, which is not repeated here. It should be explained here that the basic processing circuits may also be arranged inside the branch processing circuits, and in addition the number of basic processing circuits connected to each branch processing circuit is not limited to 4 and may be configured by the manufacturer according to actual needs. The main processing circuit and/or the N basic processing circuits may each include a data type conversion circuit: specifically, the main processing circuit may include a data type conversion circuit; the N basic processing circuits, or a subset of them, may include data type conversion circuits; or the main processing circuit and the N basic processing circuits, or subsets of them, may all include one. The main processing circuit can dynamically allocate the entity that performs the data type conversion step according to the neural network computation instruction. Specifically, the main processing circuit may determine, according to its own load, whether to perform the data type conversion step on the received data. Concretely, the value of the load may be divided into multiple intervals, each interval being associated with an entity that executes the data type conversion step. Taking 3 intervals as an example: in interval 1 the load value is low, and the data type conversion step may be performed by the main processing circuit alone; in interval 2 the load value lies between those of interval 1 and interval 3, and the data type conversion step may be performed by the main processing circuit or jointly by the N basic processing circuits; in interval 3 the load value is high, and the data type conversion step may be performed by the N basic processing circuits. The allocation may be signalled explicitly: for example, the main processing circuit may configure a special indication or instruction, and when a basic processing circuit receives the special indication or instruction, it determines that the data type conversion step is to be performed; when a basic processing circuit does not receive the special indication or instruction, it determines that the data type conversion step is not to be performed. The allocation may also work implicitly: for example, when a basic processing circuit receives data of the floating-point type and determines that an inner-product operation needs to be performed, it converts that data into data of the fixed-point type.
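A schematic of this interval-based allocation is sketched below in Python; the three thresholds and the returned labels are purely illustrative, since the disclosure does not specify interval boundaries.
```python
def conversion_executor(main_load, low=0.3, high=0.7):
    """Choose which circuit(s) perform the data type conversion step."""
    if main_load < low:       # interval 1: main circuit lightly loaded
        return "main processing circuit"
    if main_load < high:      # interval 2: conversion work is shared
        return "main and basic processing circuits"
    return "N basic processing circuits"  # interval 3: main circuit busy
```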
A method of computing using the device shown in Fig. 1a is provided below. The computation may specifically be a neural network computation, for example the forward operation of a neural network or the training of a neural network. In practical applications, the forward operation may, depending on the input data, perform operations such as matrix-times-matrix, convolution, activation, and transformation operations, all of which can be realized with the device shown in Fig. 1a.
The data type conversion circuit of the main processing circuit first converts the type of the data, and the control circuit then transmits the data to the basic processing circuits for computation. For example, the data type conversion circuit of the main processing circuit may convert a floating-point number into a fixed-point number of lower bit width and then transmit it to the basic processing circuits. The advantages are that the bit width of the transmitted data is reduced, the total number of bits transmitted is reduced, and the basic processing circuits execute the low-bit-width fixed-point computation with higher efficiency and lower power consumption.
If the data received by a basic processing circuit are floating-point data, the basic processing circuit may, after receiving the data, have its data type conversion circuit perform the data type conversion first and then compute. For example, the basic processing circuit receives a floating-point number transmitted from the main processing circuit, the data type conversion circuit converts it into a fixed-point number, and the inner-product operator circuit, vector operator circuit, or accumulator circuit of the basic processing circuit then performs the operation, which improves operation efficiency and reduces power consumption.
After a basic processing circuit has computed a result, it may likewise perform data type conversion first and then transmit the result to the main processing circuit. For example, a floating-point operation result computed by the basic processing circuit may first be converted into a fixed-point number of low bit width and then transmitted to the main processing circuit; the benefits are that the data bit width of the transmission process is reduced, efficiency is higher, and power consumption is saved.
The main processing circuit transmits the data to be computed to all or some of the basic processing circuits. Taking matrix-times-vector computation as an example, the control circuit of the main processing circuit may split the matrix data, taking each column as one piece of basic data; for example, an m*n matrix can be split into n vectors of m rows, and the control circuit of the main processing circuit distributes the n split vectors of m rows to the multiple basic processing circuits. For the vector, the control circuit of the main processing circuit may broadcast the vector in its entirety to each basic processing circuit. If the value of m is large, the control circuit may first split the m*n matrix into x*n vectors. Taking x=2 as an example, the matrix may specifically be split into 2n vectors, each containing m/2 rows, i.e., each of the n vectors of m rows is divided equally into 2 vectors. Taking the first vector as an example: if the first of the n vectors of m rows has 1000 rows, dividing it equally into 2 vectors may mean that the first 500 rows form the first vector and the last 500 rows form the second vector, and the control circuit then broadcasts the 2 vectors to the multiple basic processing circuits in 2 broadcasts.
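A small sketch of this splitting step, under the assumption that each column of an m*n matrix is one piece of basic data and that long columns are halved for two broadcast rounds (the helper name and NumPy usage are illustrative):
```python
import numpy as np

def split_for_broadcast(mat, parts=2):
    """Split an m*n matrix into n column vectors, each cut into `parts`."""
    cols = [mat[:, j] for j in range(mat.shape[1])]   # n vectors of m rows
    return [np.array_split(c, parts) for c in cols]   # halved when parts=2

mat = np.arange(12.0).reshape(6, 2)     # m=6, n=2
pieces = split_for_broadcast(mat)       # 2 columns, each as two 3-row vectors
```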
The manner of data transmission may be broadcast or distribution, or any other possible transmission manner.
After receiving the data, a basic processing circuit performs the operation and obtains an operation result;
the basic processing circuit transmits the operation result back to the main processing circuit;
the operation result may be an intermediate operation result or a final operation result.
The operation of matrix-times-vector is completed using the device shown in Fig. 1a.
(Matrix-times-vector means performing inner-product operations of each row of the matrix with the vector, respectively, and arranging these results into a vector in the order of the corresponding rows.)
The following describes the computation of multiplying a matrix S of size M rows and L columns by a vector P of length L, as shown in Fig. 2a below (each row of the matrix S is of the same length as the vector P, and the data in them correspond position by position). The neural network computing device possesses K basic processing circuits:
Referring to Fig. 2, Fig. 2 provides an implementation method of matrix-times-vector, which may specifically include:
Step S201: the data type conversion circuit of the main processing circuit converts each row of data of the matrix S into data of the fixed-point type, and the control circuit of the main processing circuit distributes the rows to one of the K basic processing circuits; the basic processing circuits store the received distribution data in their on-chip caches and/or registers;
in one alternative, if the number of rows M of the matrix S is <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of the K basic processing circuits;
in one alternative, if the number of rows M of the matrix S is > K, the control circuit of the main processing circuit distributes the data of one or more rows of the matrix S to each basic processing circuit.
The set of rows of S distributed to the i-th basic processing circuit is denoted Ai, comprising Mi rows in total; Fig. 2c shows the computation to be performed on the i-th basic processing circuit.
In one alternative, in each basic processing circuit, for example in the i-th basic processing circuit, the received distribution data, e.g., the matrix Ai, may be stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantages are that the amount of distribution data transmitted afterwards is reduced, computation efficiency is improved, and power consumption is reduced.
Step S202: the data type conversion circuit of the main processing circuit converts the vector P into data of the fixed-point type, and the control circuit of the main processing circuit transmits the parts of the fixed-point vector P to the K basic processing circuits in a broadcast manner;
in one alternative, the control circuit of the main processing circuit may broadcast each part of the vector P only once into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit fully reuses the data of the vector P obtained this time, completing the inner-product operations corresponding to each row of the matrix Ai. The advantages are that the amount of vector P data repeatedly transmitted from the main processing circuit to the basic processing circuits is reduced, execution efficiency is improved, and transmission power consumption is reduced.
In one alternative, the control circuit of the main processing circuit may broadcast each part of the vector P multiple times into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit does not reuse the data of the vector P obtained each time, completing the inner-product operations corresponding to each row of the matrix Ai in batches. The advantages are that the amount of vector P data transmitted in a single transmission inside the basic processing circuits is reduced, the capacity of the caches and/or registers of the basic processing circuits can be reduced, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.
In one alternative, the control circuit of the main processing circuit may broadcast each part of the vector P multiple times into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit partially reuses the data of the vector P obtained each time, completing the inner-product operations corresponding to each row of the matrix Ai. The advantages are that the amount of data transmitted from the main processing circuit to the basic processing circuits is reduced, the amount of data transmitted inside the basic processing circuits is also reduced, execution efficiency is improved, and transmission power consumption is reduced.
Step S203: the inner-product operator circuits of the K basic processing circuits compute the inner products of the data of the matrix S and the vector P; for example, the i-th basic processing circuit computes the inner products of the data of the matrix Ai and the data of the vector P;
Step S204: the accumulator circuits of the K basic processing circuits accumulate the results of the inner-product operations to obtain accumulation results, and the accumulation results are transmitted back to the main processing circuit in fixed-point form.
In one alternative, the partial sums obtained from the inner-product operations performed by each basic processing circuit (a partial sum being part of the accumulation result: for example, if the accumulation result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3) may be transmitted back to the main processing circuit for accumulation; the advantages are that the amount of computation inside the basic processing circuits is reduced and the operation efficiency of the basic processing circuits is improved.
In one alternative, the partial sums obtained from the inner-product operations performed by each basic processing circuit may also be stored in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation ends; the advantages are that the amount of data transmitted between the basic processing circuits and the main processing circuit is reduced, operation efficiency is improved, and data transmission power consumption is reduced.
In one alternative, the partial sums obtained from the inner-product operations performed by each basic processing circuit may also, in some cases, be stored in the register and/or on-chip cache of the basic processing circuit for accumulation, and in some cases be transmitted to the main processing circuit for accumulation, then transmitted back to the main processing circuit after the accumulation ends; the advantages are that the amount of data transmitted between the basic processing circuits and the main processing circuit is reduced, operation efficiency is improved, data transmission power consumption is reduced, the amount of computation inside the basic processing circuits is reduced, and the operation efficiency of the basic processing circuits is improved.
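Steps S201 to S204 can be simulated end to end in a few lines of Python; the sketch below assumes a round-robin row assignment, a simple Q-format, and NumPy in place of the inner-product and accumulator circuits, none of which are mandated by the disclosure.
```python
import numpy as np

def to_fixed(x, frac_bits=12):
    return np.round(np.asarray(x) * (1 << frac_bits)).astype(np.int64)

def matvec_on_k_circuits(S, P, K=4, frac_bits=12):
    """Simulate S201-S204: distribute fixed-point rows of S over K basic
    processing circuits, broadcast fixed-point P, accumulate inner products."""
    S_fix, P_fix = to_fixed(S, frac_bits), to_fixed(P, frac_bits)  # S201, S202
    out_fix = np.zeros(S.shape[0], dtype=np.int64)
    for i in range(K):                         # i-th basic processing circuit
        for r in range(i, S.shape[0], K):      # rows of Ai (round-robin here)
            out_fix[r] = np.dot(S_fix[r], P_fix)    # S203: inner products
    return out_fix / float((1 << frac_bits) ** 2)   # S204: back to floating point

S, P = np.random.rand(6, 5), np.random.rand(5)
print(np.allclose(matvec_on_k_circuits(S, P), S @ P, atol=1e-2))  # True
```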
Referring to Fig. 2b, the operation of matrix-times-matrix is completed using the device shown in Fig. 1a.
The following describes the computation of multiplying a matrix S of size M rows and L columns by a matrix P of size L rows and N columns (each row of the matrix S is of the same length as each column of the matrix P, as shown in Fig. 2d). The neural network computing device possesses K basic processing circuits:
Step S201b: the control circuit of the main processing circuit distributes each row of data of the matrix S to one of the K basic processing circuits, and the basic processing circuits store the received data in their on-chip caches and/or registers;
in one alternative, if the number of rows M of S is <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits;
in one alternative, if the number of rows M of S is > K, the control circuit of the main processing circuit distributes the data of one or more rows of the matrix S to each basic processing circuit.
Mi rows of S are distributed to the i-th basic processing circuit, and the set of these Mi rows is denoted Ai; Fig. 2e shows the computation to be performed on the i-th basic processing circuit.
In one alternative, in each basic processing circuit, for example in the i-th basic processing circuit:
the received matrix Ai distributed by the main processing circuit is stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantages are that the amount of data transmitted afterwards is reduced, computation efficiency is improved, and power consumption is reduced.
Step S202b: the control circuit of the main processing circuit transmits the parts of the matrix P to the basic processing circuits in a broadcast manner;
in one alternative, each part of the matrix P may be broadcast only once into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit fully reuses the data of the matrix P obtained this time, completing the inner-product operations corresponding to each row of the matrix Ai. Reuse in this embodiment may specifically mean repeated use by the basic processing circuits during computation; for example, reuse of the data of the matrix P may mean that the data of the matrix P are used multiple times.
In one alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P multiple times into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit does not reuse the data of the matrix P obtained each time, completing the inner-product operations corresponding to each row of the matrix Ai in batches;
in one alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P multiple times into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit partially reuses the data of the matrix P obtained each time, completing the inner-product operations corresponding to each row of the matrix Ai;
in one alternative, each basic processing circuit, for example the i-th basic processing circuit, computes the inner products of the data of the matrix Ai and the data of the matrix P;
Step S203b: the accumulator circuit of each basic processing circuit accumulates the results of the inner-product operations and transmits them back to the main processing circuit.
In one alternative, each basic processing circuit may transmit the partial sums obtained from the inner-product operations back to the main processing circuit for accumulation;
in one alternative, the partial sums obtained from the inner-product operations performed by each basic processing circuit may also be stored in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation ends;
in one alternative, the partial sums obtained from the inner-product operations performed by each basic processing circuit may also, in some cases, be stored in the register and/or on-chip cache of the basic processing circuit for accumulation, and in some cases be transmitted to the main processing circuit for accumulation, then transmitted back to the main processing circuit after the accumulation ends.
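Under the same assumptions as the matrix-times-vector sketch above (and reusing its matvec_on_k_circuits), matrix-times-matrix can be viewed as broadcasting the columns of P one after another:
```python
import numpy as np

def matmul_on_k_circuits(S, P, K=4, frac_bits=12):
    """Each column of P is broadcast against the rows of S held by the K
    basic processing circuits; results are collected column by column."""
    cols = [matvec_on_k_circuits(S, P[:, j], K, frac_bits)
            for j in range(P.shape[1])]
    return np.stack(cols, axis=1)

S, P = np.random.rand(6, 5), np.random.rand(5, 3)
print(np.allclose(matmul_on_k_circuits(S, P), S @ P, atol=1e-2))  # True
```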
Referring to Fig. 3a, the fully connected operation is completed using the device shown in Fig. 1a:
if the input data of the fully connected layer is a vector (i.e., the case where the input of the neural network is a single sample), then with the weight matrix of the fully connected layer as the matrix S and the input vector as the vector P, the matrix-times-vector operation shown in Fig. 2 is executed according to usage method one of the device;
if the input data of the fully connected layer is a matrix (i.e., the case where the input of the neural network is a batch of multiple samples), then with the weight matrix of the fully connected layer as the matrix S and the input vectors as the matrix P (or with the weight matrix of the fully connected layer as the matrix P and the input vectors as the matrix S), the matrix-times-matrix operation shown in Fig. 2c is executed according to the device.
Referring to Fig. 3b, the convolution operation is completed using the device shown in Fig. 1a:
for a convolutional layer, denote the number of its convolution kernels by M;
Step S301: the control circuit of the main processing circuit distributes the weights of each convolution kernel of the convolutional layer to one of the K basic processing circuits, where they are stored in the on-chip cache and/or register of the basic processing circuit;
in one alternative, if the number M of convolution kernels is <= K, the control circuit of the main processing circuit distributes the weights of one convolution kernel to each of M basic processing circuits;
in one alternative, if the number M of convolution kernels is > K, the control circuit of the main processing circuit distributes the weights of one or more convolution kernels to each basic processing circuit.
Mi convolution kernels in total are distributed to the i-th basic processing circuit, and the set of their weights is denoted Ai.
In one alternative, in each basic processing circuit, for example in the i-th basic processing circuit:
the received convolution kernel weights Ai distributed by the main processing circuit are stored in its register and/or on-chip cache;
Step S302: the control circuit of the main processing circuit transmits the parts of the input data P to the basic processing circuits in a broadcast manner;
in one alternative, the control circuit of the main processing circuit may broadcast each part of the input data P only once into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit fully reuses the data of the input data P obtained this time, completing the inner-product operations corresponding to each convolution kernel in Ai;
in one alternative, the control circuit of the main processing circuit may broadcast each part of the input data P multiple times into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit does not reuse the data of the input data P obtained each time, completing the inner-product operations corresponding to each convolution kernel in Ai in batches;
in one alternative, the control circuit of the main processing circuit may broadcast each part of the input data P multiple times into the registers or on-chip caches of the basic processing circuits, and the i-th basic processing circuit partially reuses the data of the input data P obtained each time, completing the inner-product operations corresponding to each convolution kernel in Ai;
Step S303: each basic processing circuit computes the inner products of the convolution kernels and the data of the input data P; for example, the i-th basic processing circuit computes the inner products of each convolution kernel of Ai and the data of the input data P;
Step S304: the accumulator circuit of each basic processing circuit accumulates the results of the inner-product operations and transmits them back to the main processing circuit:
in one alternative, each basic processing circuit may transmit the partial sums obtained from each inner-product operation back to the main processing circuit for accumulation;
in one alternative, the basic processing circuit may also store the partial sums obtained from the inner-product operations performed each time in the register and/or on-chip cache of the basic processing circuit and transmit them back to the main processing circuit after the accumulation ends;
in one alternative, the basic processing circuit may also, in some cases, store the partial sums obtained from the inner-product operations performed each time in the register and/or on-chip cache of the basic processing circuit for accumulation, and in some cases transmit them to the main processing circuit for accumulation, then transmit them back to the main processing circuit after the accumulation ends.
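For intuition, the convolution flow of steps S301 to S304 is sketched below for a single-channel input with stride 1 and no padding; the kernel-to-circuit assignment comment and all names are assumptions of the sketch.
```python
import numpy as np

def conv2d_as_inner_products(inputs, kernels):
    """Each output value is the inner product of one convolution kernel
    with one input patch; the running sum plays the role of step S304."""
    H, W = inputs.shape
    kh, kw = kernels[0].shape
    out = np.zeros((len(kernels), H - kh + 1, W - kw + 1))
    for m, kern in enumerate(kernels):     # S301: kernel m held by one circuit
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):  # S302: broadcast patches of P
                patch = inputs[y:y + kh, x:x + kw]
                out[m, y, x] = np.sum(patch * kern)   # S303 + S304
    return out

img = np.random.rand(5, 5)
kernels = [np.random.rand(3, 3) for _ in range(2)]   # M = 2 convolution kernels
print(conv2d_as_inner_products(img, kernels).shape)  # (2, 3, 3)
```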
The method of updating weights using the device shown in Fig. 1a:
the vector operator circuit of the main processing circuit is used to realize the weight update function in the neural network training process. Specifically, weight update refers to the method of updating the weights using the gradients of the weights.
In one alternative, the vector operator circuit of the main processing circuit is used to perform addition and subtraction on the two vectors, the weights and the weight gradients, to obtain an operation result, and that operation result is the updated weights.
In one alternative, the vector operator circuit of the main processing circuit is used to multiply or divide the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values, and the vector operator circuit then performs addition and subtraction on the intermediate weights and the intermediate weight gradient values to obtain an operation result, and that operation result is the updated weights.
In one alternative, the gradients of the weights may first be used to compute a group of momenta, and the momenta and the weights are then used in addition and subtraction to obtain the updated weights.
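The alternatives above can be summarized by the following sketch (the learning rate and momentum factor are illustrative values, not parameters given by the disclosure):
```python
import numpy as np

def update_weights(w, grad, velocity=None, lr=0.01, mu=0.9):
    """Weight update on the vector operator circuit, with or without momentum."""
    if velocity is None:                 # plain update: scale, then subtract
        return w - lr * grad, None
    velocity = mu * velocity + grad      # first compute a group of momenta
    return w - lr * velocity, velocity   # then add/subtract with the weights

w, g, v = np.ones(4), np.full(4, 0.5), np.zeros(4)
w, v = update_weights(w, g, v)           # one momentum-based update step
```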
The method of realizing the backward operation of a fully connected layer using the device shown in Fig. 1a:
the backward operation of a fully connected layer can be divided into two parts. As shown in Fig. 4a below, the solid arrows indicate the forward computation process of the fully connected layer, and Fig. 4b indicates the backward computation process of the fully connected layer.
The backward operation of the fully connected layer shown in Fig. 4a and Fig. 4b may be completed with the device shown in Fig. 1a using the matrix-times-matrix method shown in Fig. 2b.
The backward operation of a convolutional layer is realized using the device shown in Fig. 1a.
The backward operation of a convolutional layer can be divided into two parts. In Fig. 5a below, the solid arrows indicate the forward computation process of the convolutional layer, and Fig. 5b indicates the backward computation process of the convolutional layer.
The backward operation of the convolutional layer shown in Fig. 5a and Fig. 5b may be completed with the device shown in Fig. 1a using the method shown in Fig. 3b.
BLAS (Basic Linear Algebra Subprograms) function is realized using device as shown in Figure 1a Method
A GEMM computation refers to the matrix-matrix multiplication operation in a BLAS library. The operation is usually expressed as: C = alpha*op(S)*op(P) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op denotes some operation applied to matrix S or P; in addition, some auxiliary integers serve as parameters specifying the width and height of the matrices S and P;
The steps of computing GEMM using the device of Fig. 1a include (an illustrative sketch follows these steps):
the data type conversion circuit of the main processing circuit may perform data type conversion on matrix S and matrix P;
the conversion circuit of the main processing circuit performs the respective op operation on the input matrices S and P;
In an optional embodiment, op may be a matrix transposition; the matrix transposition circuit of the main processing circuit may be used to implement the matrix transposition operation;
In an optional embodiment, after the op operations on matrices S and P have been executed, the data type conversion circuit of the main processing circuit may further execute a data type conversion, i.e. the data type conversion circuit converts the data type of op(S) and op(P) from floating point data into fixed point data, and the matrix multiplication shown in Fig. 2b is then executed.
In an optional embodiment, the op of a given matrix may be empty, in which case the op operation is not performed;
the matrix multiplication between op(S) and op(P) is completed with the device shown in Fig. 1a using the matrix-times-matrix computation method shown in Fig. 2b;
the arithmetic logic unit of the main processing circuit is used to multiply each value in the result of op(S)*op(P) by alpha;
In an optional embodiment, when alpha is 1, the multiplication by alpha is not performed;
the arithmetic logic unit of the main processing circuit is used to implement the operation beta*C;
In an optional embodiment, when beta is 1, the multiplication by beta is not performed;
the vector arithmetic circuit of the main processing circuit is used to add the corresponding positions of the matrices alpha*op(S)*op(P) and beta*C, obtaining the result of the GEMM computation.
In an optional embodiment, when beta is 0, this addition step is not performed;
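Purely as an illustration of the data flow above (not the device's actual circuits), the following Python sketch mirrors the GEMM steps on a host processor; the helper names to_fixed and to_float and the fixed-point scale are assumptions made for this sketch:

    import numpy as np

    def to_fixed(x, frac_bits=8):
        # Stand-in for the data type conversion circuit: scale floats by
        # 2**frac_bits and round to integers (a hypothetical fixed-point format).
        return np.round(x * (1 << frac_bits)).astype(np.int64)

    def to_float(x, frac_bits=8):
        # Stand-in for the fixed-point-to-floating-point conversion.
        return x.astype(np.float64) / (1 << frac_bits)

    def gemm(S, P, C, alpha=1.0, beta=1.0, op_s=None, op_p=None, frac_bits=8):
        S = op_s(S) if op_s is not None else S        # op may be empty
        P = op_p(P) if op_p is not None else P
        # convert op(S) and op(P) to fixed point, multiply, convert back;
        # the integer product carries 2*frac_bits fractional bits
        prod = to_float(to_fixed(S, frac_bits) @ to_fixed(P, frac_bits), 2 * frac_bits)
        if alpha != 1.0:                              # skipped when alpha is 1
            prod = alpha * prod
        if beta == 0.0:                               # final addition skipped when beta is 0
            return prod
        return prod + (beta * C if beta != 1.0 else C)

    S, P, C = np.random.rand(4, 3), np.random.rand(4, 5), np.zeros((3, 5))
    out = gemm(S, P, C, alpha=2.0, beta=0.0, op_s=np.transpose)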
A GEMV computation refers to the matrix-vector multiplication operation in a BLAS library. The operation is usually expressed as: C = alpha*op(S)*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation applied to matrix S;
The steps of computing GEMV using the device of Fig. 1a are (a sketch follows these steps):
the data type conversion circuit of the main processing circuit may perform data type conversion on the input matrix S and the vector P;
the conversion circuit of the main processing circuit performs the corresponding op operation on the input matrix S;
In an optional embodiment, op may be a matrix transposition; the conversion circuit of the main processing circuit is used to implement the matrix transposition operation;
In an optional embodiment, the op of a given matrix may be empty, in which case the transposition operation is not performed;
the matrix-vector multiplication between the matrix op(S) and the vector P is completed with the device shown in Fig. 1a using the matrix-times-vector computation method described in Fig. 2a;
the arithmetic logic unit of the main processing circuit is used to multiply each value in the result of op(S)*P by alpha;
In an optional embodiment, when alpha is 1, the multiplication by alpha is not performed;
the arithmetic logic unit of the main processing circuit is used to implement the operation beta*C;
In an optional embodiment, when beta is 1, the multiplication by beta is not performed;
the vector arithmetic circuit of the main processing circuit is used to add the corresponding positions of alpha*op(S)*P and beta*C, obtaining the result of the GEMV computation.
In an optional embodiment, when beta is 0, this addition step is not performed;
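A matching sketch for GEMV, reusing the hypothetical to_fixed and to_float helpers from the GEMM sketch above; again this only illustrates the step order, not the hardware:

    def gemv(S, P, C, alpha=1.0, beta=1.0, op_s=None, frac_bits=8):
        # Same flow as gemm, but the fixed-point product is a matrix-vector
        # multiplication (the Fig. 2a method) and C is a vector.
        S = op_s(S) if op_s is not None else S
        prod = to_float(to_fixed(S, frac_bits) @ to_fixed(P, frac_bits), 2 * frac_bits)
        if alpha != 1.0:
            prod = alpha * prod
        return prod if beta == 0.0 else prod + beta * C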
A method of implementing an activation function using the device of Fig. 1a (a sketch follows this subsection)
A vector is input to the activation circuit of the main processing circuit, which computes the activation vector of that vector;
In an optional embodiment, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (the input of the activation function is one numerical value and its output is also one numerical value), computing one numerical value that is output to the corresponding position of the output vector;
In an optional embodiment, the activation function may be y = max(m, x), where x is the input value, y is the output value, and m is a constant;
In an optional embodiment, the activation function may be y = tanh(x), where x is the input value and y is the output value;
In an optional embodiment, the activation function may be y = sigmoid(x), where x is the input value and y is the output value;
In an optional embodiment, the activation function may be a piecewise linear function;
In an optional embodiment, the activation function may be any function that takes one number as input and outputs one number.
In an optional embodiment, the sources of the input vector include (but are not limited to):
an external data source of the device;
In an optional embodiment, the input data comes from the result of a matrix-times-vector operation performed by the device;
In an optional embodiment, the input data comes from the result of a matrix-times-matrix operation performed by the device;
a computation result of the main processing circuit of the device;
In an optional embodiment, the input data comes from the computation result obtained after the main processing circuit of the device performs a bias addition.
It should be noted that the above activation operation may be implemented by the arithmetic logic circuit and accumulator circuit in the main processing circuit, or a separate activation circuit may be added to the main processing circuit to implement the activation operation.
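The following sketch, written in Python for illustration only, applies the element-wise activation functions listed above; the function name and the kind parameter are assumptions of this sketch, not part of the device:

    import numpy as np

    def activate(vec, kind="relu", m=0.0):
        # One numerical value in, one numerical value out, applied element-wise,
        # with each result written to the corresponding position of the output.
        if kind == "relu":                        # y = max(m, x)
            return np.maximum(m, vec)
        if kind == "tanh":                        # y = tanh(x)
            return np.tanh(vec)
        if kind == "sigmoid":                     # y = sigmoid(x)
            return 1.0 / (1.0 + np.exp(-vec))
        raise ValueError(f"unknown activation: {kind}")

    print(activate(np.array([-1.0, 0.5, 2.0]), kind="relu"))  # [0.  0.5 2. ]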
Implementing a bias addition operation using the device of Fig. 1a (a sketch follows this subsection):
the vector arithmetic circuit of the main processing circuit may be used to implement the function of adding two vectors or two matrices;
the vector arithmetic circuit of the main processing circuit may be used to implement the function of adding a vector to every row, or to every column, of a matrix.
In an optional embodiment, the matrix may come from the result of a matrix-times-matrix operation executed by the device;
In an optional embodiment, the matrix may come from the result of a matrix-times-vector operation executed by the device;
In an optional embodiment, the matrix may come from data received from outside by the main processing circuit of the device.
In an optional embodiment, the vector may come from data received from outside by the main processing circuit of the device.
The data sources include but are not limited to the above.
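For illustration, the row-wise and column-wise bias additions described above can be expressed with NumPy broadcasting; this is a host-side analogy, not the vector arithmetic circuit itself:

    import numpy as np

    M = np.arange(6.0).reshape(2, 3)    # e.g. a matrix-times-matrix result
    row_bias = np.array([10.0, 20.0, 30.0])
    col_bias = np.array([1.0, 2.0])

    M_rows = M + row_bias               # bias vector added to every row
    M_cols = M + col_bias[:, None]      # bias vector added to every column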
Implementing data type conversion using the device of Fig. 1a (a sketch follows this subsection):
the conversion of data types is implemented by the data type conversion circuit of the main processing circuit;
In an optional embodiment, the data type conversion of a group of data is implemented by the data type conversion circuit of the main processing circuit;
In an optional embodiment, the forms of data type conversion include but are not limited to: floating point numbers to fixed point numbers, fixed point numbers to floating point numbers, and so on;
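One common group-wise scheme, sketched below under assumptions made solely for illustration (a shared point position chosen per group of data, 16-bit words), converts a group of floats to fixed point and back:

    import numpy as np

    def float_to_fixed(x, width=16):
        # Choose one point position for the whole group so the largest
        # magnitude still fits in a signed `width`-bit integer, then round.
        max_int = (1 << (width - 1)) - 1
        point = int(np.floor(np.log2(max_int / max(np.abs(x).max(), 1e-12))))
        q = np.clip(np.round(x * 2.0**point), -max_int - 1, max_int)
        return q.astype(np.int16), point

    def fixed_to_float(q, point):
        # Inverse conversion back to floating point.
        return q.astype(np.float64) / 2.0**point

    q, point = float_to_fixed(np.array([0.5, -1.25, 3.0]))
    restored = fixed_to_float(q, point)   # close to the original values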
The present disclosure further provides a chip, which includes a computing device, the computing device including:
a main processing circuit, within which the data involved may be of any data type; in an optional embodiment, the data may be data represented as floating point numbers of any bit width, or data represented as fixed point numbers of any bit width; all the computing circuits and storage circuits involved may be computing circuits and storage circuits for any data type; in an optional embodiment, they may be computing circuits and storage circuits for floating point numbers of any bit width, or computing circuits and storage circuits for fixed point numbers of any bit width.
In an optional embodiment, the main processing circuit includes a data type conversion circuit;
In an optional embodiment, the main processing circuit includes a vector operation unit for executing data type conversion;
Specifically, the main processing circuit includes a data input interface for receiving input data;
In an optional embodiment, the received data may come from: the outside of the neural network computing circuit device, or some or all of the basic processing circuits of the neural network computing circuit device;
In an optional embodiment, there may be multiple data input interfaces; specifically, the main processing circuit may include a data output interface for outputting data;
In an optional embodiment, the output data may go to: the outside of the neural network computing device, or some or all of the basic processing circuits of the neural network computing circuit device;
In an optional embodiment, there may be multiple data output interfaces;
In an optional embodiment, the main processing circuit includes an on-chip cache and/or registers;
In an optional embodiment, the main processing circuit includes an arithmetic unit capable of executing data operations;
In an optional embodiment, the main processing circuit includes an arithmetic operation unit;
In an optional embodiment, the main processing circuit includes a vector operation unit capable of executing an operation on a group of data simultaneously; specifically, the arithmetic operation and/or vector operation may be any type of operation, including but not limited to: addition, subtraction, multiplication, and division of two numbers; addition, subtraction, multiplication, and division of a number and a constant; exponential, power, and logarithm operations on a number, as well as various nonlinear operations; comparison operations and logical operations on two numbers; addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of each element of a vector with a constant; exponential, power, logarithm, and various nonlinear operations on each element of a vector; and comparison operations and logical operations on each pair of corresponding elements in two vectors, and so on.
In an optional embodiment, the main processing circuit includes a data rearrangement unit for transmitting data to the basic processing circuits in a certain order, or for rearranging data in place in a certain order;
In an optional embodiment, the order of data arrangement includes: performing a dimension-order transformation on a multi-dimensional data block; the order of data arrangement may also include: partitioning a data block into sub-blocks to be sent to different basic processing circuits (a minimal sketch follows).
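A minimal sketch, assuming NumPy and an evenly divisible tile size (both assumptions of this illustration), of the two rearrangements just described, dimension-order transformation followed by partitioning into per-circuit sub-blocks:

    import numpy as np

    def rearrange(block, dim_order, tile):
        # Dimension-order transformation, then partition the first two
        # dimensions into tiles, one tile per basic processing circuit.
        block = np.transpose(block, dim_order)       # e.g. CHW -> HWC
        th, tw = tile
        return [block[i:i + th, j:j + tw]
                for i in range(0, block.shape[0], th)
                for j in range(0, block.shape[1], tw)]

    tiles = rearrange(np.arange(24.0).reshape(2, 3, 4), (1, 2, 0), (1, 2))
    # 6 tiles of shape (1, 2, 2), dispatched in row-major order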
The computing device further includes multiple basic processing circuits: each basic processing circuit is used to compute the inner product of two vectors; the computation method is that the basic processing circuit receives two groups of numbers, multiplies the corresponding elements of the two groups, and accumulates the results of the multiplications; the inner product result is transferred out, where, depending on the position of the basic processing circuit, it may be transferred to other basic processing circuits or transferred directly to the main processing circuit.
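In plain Python, the per-circuit computation just described amounts to the following (illustrative only; the real circuit operates on its received data streams):

    def inner_product(a, b):
        # Multiply corresponding elements of the two received groups of
        # numbers and accumulate the products (multiplier + accumulator).
        acc = 0
        for x, y in zip(a, b):
            acc += x * y
        return acc  # transferred onward to another circuit or the main circuit

    assert inner_product([1, 2, 3], [4, 5, 6]) == 32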
The data involved in the basic processing circuit may be data of any data type; in an optional embodiment, the data may be data represented as floating point numbers of any bit width, or data represented as fixed point numbers of any bit width; all the computing circuits and storage circuits involved may be computing circuits and storage circuits for any data type; in an optional embodiment, they may be computing circuits and storage circuits for floating point numbers of any bit width, or computing circuits and storage circuits for fixed point numbers of any bit width.
In an optional embodiment, the basic processing circuit includes a data type conversion circuit;
In an optional embodiment, the basic processing circuit includes a vector operation unit for executing data type conversion;
Specifically, the basic processing circuit includes a storage unit composed of an on-chip cache and/or registers;
Specifically, it includes one or more data input interfaces for receiving data;
In an optional embodiment, it includes two data input interfaces, from each of which one or more data can be obtained at a time;
In an optional embodiment, the basic processing circuit may store the input data received from a data input interface in its registers and/or on-chip cache;
The sources from which the above data input interfaces receive data may be other basic processing circuits and/or the main processing circuit, namely:
the main processing circuit of the neural network computing circuit device;
other basic processing circuits of the neural network computing circuit device (the neural network computing circuit device possesses multiple basic processing circuits);
Specifically, it includes one or more data output interfaces for transmitting output data;
In an optional embodiment, one or more data can be transferred out through a data output interface;
Specifically, the data transferred out through a data output interface may be one of, or any combination of: data received from a data input interface, data stored in the on-chip cache and/or registers, a multiplier computation result, an accumulator computation result, or an inner product unit computation result.
In an optional embodiment, it includes three data output interfaces, two of which correspond respectively to the two data input interfaces, each outputting the data received from the corresponding data input interface at the previous stage, and the third data output interface being responsible for outputting computation results;
Specifically, the destinations of the data transmitted through the data output interfaces (the data sources and data destinations mentioned here determine the connection relationships of the basic processing circuits in the device) may be:
the main processing circuit of the neural network computing circuit device;
other basic processing circuits of the neural network computing circuit device (the neural network computing circuit device possesses multiple basic processing circuits).
Specifically, the basic processing circuit includes an arithmetic circuit, which may specifically be: one or more multiplier circuits, one or more accumulator circuits, and/or one or more circuits executing the inner product operation of two groups of numbers, in any combination.
In an optional embodiment, it can execute the multiplication of two numbers, and the result can be stored in the on-chip cache and/or registers, or can be directly accumulated into the registers and/or on-chip cache;
In an optional embodiment, it can execute the inner product operation of two groups of data, and the result can be stored in the on-chip cache and/or registers, or can be directly accumulated into the registers and/or on-chip cache;
In an optional embodiment, it can execute the accumulation of data, accumulating data into the on-chip cache and/or registers;
Specifically, the data accumulated by the accumulator circuit may be one of, or any combination of: data received from a data input interface, data stored in the on-chip cache and/or registers, a multiplier computation result, an accumulator computation result, or an inner product unit computation result.
It should be noted that the "data input interface" and "data output interface" used in the above description of the basic processing circuit refer to the data input and output interfaces of each basic processing circuit, not the data input and output interfaces of the whole device.
The present disclosure also discloses a neural network computing device, which includes one or more chips as shown in Fig. 1a or Fig. 1b, used to obtain operational data and control information from other processing devices, execute specified neural network operations, and pass the execution results to peripheral devices through an I/O interface. Peripheral devices include, for example, a camera, a display, a mouse, a keyboard, a network card, a WiFi interface, or a server. When more than one chip as shown in Fig. 1a or Fig. 1b is included, the chips can be linked through a specific structure and transmit data, for example interconnected via a PCIE bus, to support larger-scale neural network operations. In this case, the chips may share one control system or have their own independent control systems; they may share memory, or each accelerator may have its own memory. In addition, their interconnection mode may be any interconnection topology.
The neural network computing device has high compatibility and can be connected with various types of servers through a PCIE interface.
The present disclosure also discloses a combined processing device, which includes the above neural network computing device, a universal interconnection interface, and other processing devices (i.e. general-purpose processing devices). The neural network computing device interacts with the other processing devices to jointly complete the operations specified by the user. Fig. 4c below is a schematic diagram of the combined processing device.
The other processing devices include one or more of general-purpose/special-purpose processor types such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network computing device and external data and control, including data transfer, and complete basic controls of the neural network computing device such as starting and stopping; the other processing devices may also cooperate with the neural network computing device to jointly complete computing tasks.
The universal interconnection interface is used to transmit data and control instructions between the neural network computing device and the other processing devices. The neural network computing device obtains the required input data from the other processing devices and writes it into the on-chip storage of the neural network computing device; it may obtain control instructions from the other processing devices and write them into an on-chip control cache of the neural network computing device; it may also read the data in the storage module of the neural network computing device and transmit it to the other processing devices.
As shown in Fig. 4d, optionally, the structure further includes a storage device for storing data required by this computing device or the other processing devices, and it is particularly suitable for data that is required for the computation but cannot be entirely stored in the internal storage of this neural network computing device or the other processing devices.
The combined processing device can be used as a system-on-chip (SoC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the die area of the control portion, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected with certain components of the equipment, such components being, for example, a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
An embodiment of the present disclosure provides a neural network processor board, which can be used in numerous general-purpose or special-purpose computing system environments or configurations, such as: personal computers, server computers, handheld or portable devices, tablet devices, smart homes, household appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronic devices, network personal computers (PCs), minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like.
Referring to Fig. 5c, Fig. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure. As shown in Fig. 5c, the above neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11; optionally, as shown in Fig. 5d, the above neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 involved in the present disclosure is not limited; the above neural network chip 111 includes but is not limited to a neural network chip integrating a neural network processor, and the above chip may be made of silicon material, germanium material, quantum material, molecular material, or the like. According to actual circumstances (for example, harsher environments) and different application requirements, the above neural network chip may be packaged such that most of the neural network chip is enclosed, with the pins on the neural network chip connected to the outside of the package structure through conductors such as gold wires for circuit connection with outer layers.
The present disclosure does not limit the specific structure of the neural network chip 111; optionally, reference can be made to the device shown in Fig. 1a or Fig. 1b.
The present disclosure does not limit the types of the first substrate 13 and the second substrate 113; they may be printed circuit boards (PCB) or printed wiring boards (PWB), or possibly other circuit boards. The materials used to make the PCB are also not limited.
The second substrate 113 involved in the present disclosure is used to carry the above neural network chip 111, and the neural network chip package structure 11 obtained by connecting the above neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is used to protect the neural network chip 111, facilitating further packaging of the neural network chip package structure 11 with the first substrate 13.
The specific packaging method of the above second electrical and non-electrical connection device 112 and the structure corresponding to the packaging method are not limited; a suitable packaging method may be selected and simply improved according to actual circumstances and different application requirements, such as: Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA) packaging methods.
Flip Chip is suitable for cases with high area requirements after packaging or with sensitivity to the inductance of the conductors and the signal transmission time. In addition, the Wire Bonding packaging method can be used, reducing cost and improving the flexibility of the package structure.
Ball Grid Array can provide more pins, and the average conductor length of the pins is short, providing the capability of high-speed signal transmission; the package may alternatively be replaced by Pin Grid Array (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA), and the like.
Optionally, the Flip Chip Ball Grid Array packaging method is used to package the neural network chip 111 and the second substrate 113; for a schematic diagram of a specific neural network chip package structure, reference can be made to Fig. 6. As shown in Fig. 6, the above neural network chip package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The pads 22 are connected with the neural network chip 21, and solder balls 23 are formed by welding between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 and the second substrate 24, that is, realizing the packaging of the neural network chip 21.
The pins 26 are used to connect with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling the transmission of external data and internal data and facilitating the processing of data by the neural network chip 21 or the neural network processor corresponding to the neural network chip 21. The type and number of the pins are also not limited by the present disclosure; different pin forms can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the above neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between solder balls.
The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the above neural network chip package structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 operates. The heat dissipation device may be a piece of metal with good thermal conductivity, a heat sink, or a cooler, for example, a fan.
For example, as shown in Fig. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal housing heat sink 29. The thermal grease 28 and the metal housing heat sink 29 are used to dissipate the heat generated when the neural network chip 21 operates.
Optionally, the above neural network chip package structure 11 further includes a reinforcing structure connected with the pads 22 and embedded in the solder balls 23, to enhance the connection strength between the solder balls 23 and the pads 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited here.
The specific form of the first electrical and non-electrical connection device 12 is also not limited by the present disclosure; reference can be made to the description of the second electrical and non-electrical connection device 112, that is, the neural network chip package structure 11 may be packaged by welding, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or in a pluggable manner, facilitating subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes interfaces for memory units to expand storage capacity, such as: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR), and the like, improving the processing capability of the neural network processor by expanding the memory.
The first substrate 13 may also include a Peripheral Component Interconnect-Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) bus interface, and the like, for data transmission between the package structure and external circuits, which can improve computation speed and convenience of operation.
The neural network processor is packaged as a neural network chip 111, the neural network chip 111 is packaged as a neural network chip package structure 11, and the neural network chip package structure 11 is packaged as a neural network processor board 10; data interaction with an external circuit (for example, a computer motherboard) is performed through an interface (a slot or a connector) on the board, that is, the functions of the neural network processor are realized directly by using the neural network processor board 10, while the neural network chip 111 is protected. Other modules may also be added to the neural network processor board 10, improving the application scope and computation efficiency of the neural network processor.
In one embodiment, the present disclosure discloses an electronic device, which includes the above neural network processor board 10 or the above neural network chip package structure 11.
The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a camera, a video camera, a projector, a watch, earphones, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument, and/or an electrocardiograph.
The specific embodiments described above further illustrate the purpose, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the foregoing are merely specific embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (17)

1. An integrated circuit chip device, characterized in that the integrated circuit chip device includes: a main processing circuit and multiple basic processing circuits;
the basic processing circuit includes a data type computation circuit; the data type computation circuit is used to execute conversion between floating point type data and fixed point type data;
the main processing circuit is used to execute each successive operation in a neural network computation and to transmit data to the multiple basic processing circuits;
the multiple basic processing circuits are used to determine, according to the operation of the transmitted data, whether to start the data type computation circuit to execute conversion on the type of the transmitted data; to execute operations in the neural network in a parallel manner according to the transmitted data or the converted transmitted data; and to transmit the operation results to the main processing circuit.
2. The device according to claim 1, characterized in that:
the device further includes a branch circuit, the branch circuit being arranged between the main processing circuit and the multiple basic processing circuits and being used to forward the transmitted data between the main processing circuit and the multiple basic processing circuits.
3. The integrated circuit chip device according to claim 1, characterized in that:
the main processing circuit is used to obtain a data block to be computed and a computation instruction, and to divide the data block to be computed into a distribution data block and a broadcast data block according to the computation instruction; to split the distribution data block to obtain multiple basic data blocks, distribute the multiple basic data blocks to the circuits connected to it, and broadcast the broadcast data block to the circuits connected to it;
the basic processing circuit is used to start the data type computation circuit according to the operation so as to convert the basic data block and the broadcast data block into fixed point data type, execute the inner product operation in fixed point data type to obtain an operation result of fixed point data type, start the data type computation circuit to convert the operation result of fixed point data type into a floating point type operation result, and send it to the main processing circuit;
the main processing circuit is used to process the floating point type operation results to obtain the instruction result of the data block to be computed and the computation instruction.
4. The integrated circuit chip device according to claim 3, characterized in that:
the main processing circuit is specifically used to broadcast the broadcast data block to the multiple basic processing circuits in a single broadcast.
5. The integrated circuit chip device according to claim 3, characterized in that:
the main processing circuit is specifically used to divide the broadcast data block into multiple partial broadcast data blocks, and to broadcast the multiple partial broadcast data blocks to the multiple basic processing circuits in multiple broadcasts.
6. The integrated circuit chip device according to claim 5, characterized in that:
the basic processing circuit is specifically used to execute the inner product processing of the partial broadcast data block and the basic data block in fixed point type to obtain an inner product processing result, accumulate the inner product processing results to obtain a partial operation result, convert the partial operation result into floating point type, and send it to the main processing circuit.
7. The integrated circuit chip device according to claim 5, characterized in that:
the basic processing circuit is specifically used to reuse the partial broadcast data block n times, executing in fixed point data type the inner product operations of that partial broadcast data block with n basic data blocks to obtain n partial processing results of fixed point data type; the n partial processing results of fixed point data type are converted into n partial processing results of floating point type, the n partial processing results of floating point type are respectively accumulated to obtain n partial operation results, and the n partial operation results are sent to the main processing circuit, n being an integer greater than or equal to 2.
8. The integrated circuit chip device according to claim 1, characterized in that:
the main processing circuit includes: a main register or a main on-chip cache circuit;
or the basic processing circuit includes: a basic register or a basic on-chip cache circuit.
9. The integrated circuit chip device according to claim 8, characterized in that:
the main processing circuit includes one of, or any combination of: a vector arithmetic circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a data type computation circuit, or a data rearrangement circuit.
10. The integrated circuit chip device according to claim 1, characterized in that:
the data includes one of, or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
11. The integrated circuit chip device according to claim 3, characterized in that:
if the computation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block;
if the computation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.
12. A neural network computing device, characterized in that the neural network computing device includes one or more integrated circuit chip devices according to any one of claims 1-11.
13. A combined processing device, characterized in that the combined processing device includes: the neural network computing device according to claim 12, a universal interconnection interface, and a general-purpose processing device;
the neural network computing device is connected with the general-purpose processing device through the universal interconnection interface.
14. A chip, characterized in that the chip integrates the device according to any one of claims 1-13.
15. A smart device, characterized in that the smart device includes the chip according to claim 14.
16. A neural network operation method, characterized in that the method is applied in an integrated circuit chip device, the integrated circuit chip device including the integrated circuit chip device according to any one of claims 1-11 and being used to execute operations of a neural network.
17. The method according to claim 16, characterized in that the operations of the neural network include one of, or any combination of: a convolution operation, a matrix-times-matrix operation, a matrix-times-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
CN201711347408.6A 2017-12-14 2017-12-14 Integrated circuit chip device and related product Active CN109961137B (en)

Priority Applications (15)

Application Number Priority Date Filing Date Title
CN201711347408.6A CN109961137B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related product
TW107144035A TWI795482B (en) 2017-12-14 2018-12-07 Integrated circuit chip apparatus and related product
PCT/CN2019/073453 WO2019114842A1 (en) 2017-12-14 2019-01-28 Integrated circuit chip apparatus
US16/721,883 US20200192632A1 (en) 2017-12-14 2019-12-19 Integrated circuit chip apparatus
US16/721,882 US11586891B2 (en) 2017-12-14 2019-12-19 Integrated circuit chip apparatus
US16/721,885 US11308389B2 (en) 2017-12-14 2019-12-19 Integrated circuit chip apparatus
US16/721,892 US11507810B2 (en) 2017-12-14 2019-12-19 Integrated circuit chip apparatus
US16/721,875 US11562216B2 (en) 2017-12-14 2019-12-19 Integrated circuit chip apparatus
US16/721,888 US11704545B2 (en) 2017-12-14 2019-12-19 Integrated circuit chip apparatus
US16/721,879 US11507809B2 (en) 2017-12-14 2019-12-19 Integrated circuit chip apparatus
US17/010,761 US11562219B2 (en) 2017-12-14 2020-09-02 Integrated circuit chip apparatus
US17/688,853 US11900242B2 (en) 2017-12-14 2022-03-07 Integrated circuit chip apparatus
US17/688,844 US11900241B2 (en) 2017-12-14 2022-03-07 Integrated circuit chip apparatus
US18/085,273 US20230120704A1 (en) 2017-12-14 2022-12-20 Integrated circuit chip apparatus
US18/085,332 US20230121164A1 (en) 2017-12-14 2022-12-20 Integrated circuit chip apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711347408.6A CN109961137B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related product

Publications (2)

Publication Number Publication Date
CN109961137A (en) 2019-07-02
CN109961137B (en) 2020-10-09

Family

ID=67018661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711347408.6A Active CN109961137B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related product

Country Status (2)

Country Link
CN (1) CN109961137B (en)
TW (1) TWI795482B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760933A (en) * 2016-02-18 2016-07-13 清华大学 Method and apparatus for fixed-pointing layer-wise variable precision in convolutional neural network
US20170061279A1 (en) * 2015-01-14 2017-03-02 Intel Corporation Updating an artificial neural network using flexible fixed point representation
CN106485318A (en) * 2015-10-08 2017-03-08 上海兆芯集成电路有限公司 There is the processor of mixing coprocessor/performance element neutral net unit
CN106611216A (en) * 2016-12-29 2017-05-03 北京旷视科技有限公司 Computing method and device based on neural network
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN107330515A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing artificial neural network forward operation
EP3242254A1 (en) * 2016-05-03 2017-11-08 Imagination Technologies Limited Convolutional neural network hardware configuration

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102120396B1 (en) * 2016-05-26 2020-06-08 더 가버닝 카운슬 오브 더 유니버시티 오브 토론토 Accelerator for deep neural networks
CN106126481B (en) * 2016-06-29 2019-04-12 华为技术有限公司 A kind of computing system and electronic equipment

Also Published As

Publication number Publication date
TWI795482B (en) 2023-03-11
TW201937415A (en) 2019-09-16
CN109961137B (en) 2020-10-09

Similar Documents

Publication Publication Date Title
WO2019129070A1 (en) Integrated circuit chip device
CN109961138A (en) Neural network training method and Related product
CN111242294B (en) Integrated circuit chip device and related products
CN109978131A (en) Integrated circuit chip device and Related product
CN109961134A (en) Integrated circuit chip device and Related product
WO2019114842A1 (en) Integrated circuit chip apparatus
CN109961135A (en) Integrated circuit chip device and Related product
CN109978150A (en) Neural network processor board and Related product
CN109977446A (en) Integrated circuit chip device and Related product
CN109977071A (en) Neural network processor board and Related product
CN109978147A (en) Integrated circuit chip device and Related product
CN109961131A (en) Neural network forward operation method and Related product
CN110197267A (en) Neural network processor board and Related product
CN109978151A (en) Neural network processor board and Related product
CN109961137A (en) Integrated circuit chip device and Related product
CN109978152A (en) Integrated circuit chip device and Related product
CN109978155A (en) Integrated circuit chip device and Related product
CN109978130A (en) Integrated circuit chip device and Related product
CN109978153A (en) Integrated circuit chip device and Related product
CN110197264A (en) Neural network processor board and Related product
CN109978157A (en) Integrated circuit chip device and Related product
CN109978156A (en) Integrated circuit chip device and Related product
CN109978148A (en) Integrated circuit chip device and Related product
CN109960673A (en) Integrated circuit chip device and Related product
CN109961133A (en) Integrated circuit chip device and Related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant after: Zhongke Cambrian Technology Co., Ltd
Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.
GR01 Patent grant
GR01 Patent grant