CN110826712B

CN110826712B - Neural network processor board card and related products

Info

Publication number: CN110826712B
Application number: CN201911333469.6A
Authority: CN
Inventors: 请求不公布姓名
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2017-12-14
Filing date: 2017-12-14
Publication date: 2024-01-09
Anticipated expiration: 2037-12-14
Also published as: CN109961136A; CN111160542A; CN110826712A; CN110909872B; CN111242294B; CN111160542B; CN110909872A; CN111242294A; CN109961136B

Abstract

The present disclosure provides a neural network processor board and related products, the neural network processor board comprising: the semiconductor device comprises a neural network chip packaging structure, a first electrical and non-electrical connection device and a first substrate; the neural network chip packaging structure comprises: the device comprises a neural network chip, a second electrical and non-electrical connection device and a second substrate, wherein the second substrate bears the neural network chip, and the second substrate is connected with the neural network chip through the second electrical and non-electrical connection device. The technical scheme provided by the disclosure has the advantages of small calculated amount and low power consumption.

Description

Neural network processor board card and related products

Technical Field

The present disclosure relates to the field of neural networks, and more particularly, to a neural network processor board card and related products.

Background

Artificial neural networks (Artificial Neural Network, ANN) are a growing research hotspot in the area of artificial intelligence since the 80 s of the 20 th century. The human brain nerve cell network is abstracted from the information processing perspective, a certain simple model is built, and different networks are formed according to different connection modes. Also commonly referred to in engineering and academia as neural networks or neural-like networks. A neural network is an operational model, which is formed by interconnecting a large number of nodes (or neurons). The operation of the existing neural network is realized based on a CPU (Central Processing Unit ) or a GPU (English: graphics Processing Unit, graphic processor), and the calculation amount of the operation is large and the power consumption is high.

Disclosure of Invention

The embodiment of the disclosure provides an integrated circuit chip device and related products, which can improve the processing speed of a computing device and improve the efficiency.

In a first aspect, there is provided an integrated circuit chip device comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit or at least one circuit of the plurality of basic processing circuits comprises: a data type operation circuit; the data type operation circuit is used for executing conversion between floating point type data and fixed point type data;

the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with other adjacent basic processing circuits, and the main processing circuit is connected with n basic processing circuits of the 1 st row, n basic processing circuits of the m th row and m basic processing circuits of the 1 st column;

the main processing circuit is used for executing each continuous operation in the neural network operation and transmitting data with the basic processing circuit connected with the main processing circuit;

the basic processing circuits are used for executing operation in the neural network in a parallel mode according to the transmitted data, and transmitting operation results to the main processing circuit through the basic processing circuit connected with the main processing circuit.

In a second aspect, a neural network computing device is provided, the neural network computing device comprising one or more of the integrated circuit chip devices provided in the first aspect.

In a third aspect, there is provided a combination processing apparatus including: the neural network operation device, the universal interconnection interface and the universal processing device provided in the second aspect;

the neural network operation device is connected with the general processing device through the general interconnection interface.

In a fourth aspect, there is provided a chip integrating the apparatus of the first aspect, the apparatus of the second aspect or the apparatus of the third aspect.

In a fifth aspect, an electronic device is provided, the electronic device comprising the chip of the fourth aspect.

In a sixth aspect, there is provided a method of operating a neural network, the method being applied within an integrated circuit chip device, the integrated circuit chip device comprising: the integrated circuit chip device of the first aspect for performing operations of a neural network.

It can be seen that, according to the embodiment of the disclosure, the data conversion operation circuit is provided to perform post-conversion operation on the type of the data block, so that transmission resources and calculation resources are saved, and therefore, the data conversion operation circuit has the advantages of low power consumption and small calculation amount.

Drawings

FIG. 1a is a schematic diagram of an integrated circuit chip device.

FIG. 1b is a schematic diagram of another integrated circuit chip device.

FIG. 1c is a schematic diagram of a basic processing circuit.

FIG. 1d is a schematic diagram of a main processing circuit.

FIG. 1e is a schematic block diagram of a fixed point data type.

FIG. 2a is a schematic diagram of a basic processing circuit.

FIG. 2b is a schematic diagram of a main processing circuit transmitting data.

Fig. 2c is a schematic diagram of matrix multiplication by a vector.

Fig. 2d is a schematic diagram of an integrated circuit chip device.

Fig. 2e is a schematic diagram of another integrated circuit chip device.

Fig. 2f is a schematic diagram of a matrix multiplied by a matrix.

Fig. 3a is a schematic diagram of convolved input data.

Fig. 3b is a schematic diagram of a convolution kernel.

Fig. 3c is a schematic diagram of an operation window of a three-dimensional data block of the input data.

FIG. 3d is a schematic diagram of another operation window of a three-dimensional data block of the input data.

Fig. 3e is a schematic diagram of a further operation window of a three-dimensional data block of the input data.

Fig. 4a is a schematic diagram of a neural network forward operation.

Fig. 4b is a schematic diagram of the neural network reverse operation.

Fig. 4c is a schematic diagram of a combined processing apparatus according to the present disclosure.

Fig. 4d is a schematic diagram of another embodiment of a combination processing apparatus according to the present disclosure.

Fig. 5a is a schematic structural diagram of a board card of a neural network processor according to an embodiment of the disclosure;

FIG. 5b is a schematic structural diagram of a neural network chip package structure according to an embodiment of the present disclosure;

FIG. 5c is a schematic diagram of a neural network chip according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure;

fig. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.

Detailed Description

In order that those skilled in the art will better understand the present disclosure, a more complete description of the same will be rendered by reference to the appended drawings, wherein it is to be understood that the embodiments are merely some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.

In the apparatus provided in the first aspect, the main processing circuit is configured to obtain a data block to be calculated and an operation instruction, convert the data block to be calculated into a fixed-point type data block by using the data type operation circuit, and divide the fixed-point type data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; splitting the distributed data blocks to obtain a plurality of basic data blocks, distributing the basic data blocks to a basic processing circuit connected with the basic data blocks, and broadcasting the broadcast data blocks to the basic processing circuit connected with the basic processing circuit;

The basic processing circuit is used for performing inner product operation on the basic data block and the broadcast data block in a fixed-point data type to obtain an operation result, and sending the operation result to the main processing circuit;

or forwarding the basic data block and the broadcast data block to other basic processing circuits to execute inner product operation with fixed-point data types to obtain an operation result, and sending the operation result to the main processing circuit;

the main processing circuit is used for converting the operation result into floating point type data through the data type operation circuit and processing the floating point type data to obtain the data block to be calculated and the instruction result of the operation instruction.

In the apparatus provided in the first aspect, the main processing circuit is specifically configured to send the broadcast data block to the base processing circuit connected thereto by one broadcast.

In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the basic data block and the broadcast data block in a fixed-point data type to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and send the operation result to the main processing circuit.

In the apparatus provided in the first aspect, the main processing circuit is configured to, when the operation result is a result of inner product processing, accumulate the operation result to obtain an accumulated result, and arrange the accumulated result to obtain the data block to be calculated and an instruction result of the operation instruction.

In the apparatus provided in the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the base processing circuit through a plurality of times; the plurality of partial broadcast data blocks are combined to form the broadcast data block.

In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block in a fixed-point data type to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main processing circuit.

In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to multiplex n times of the partial broadcast data blocks, perform inner product operations of the partial broadcast data blocks and the n basic data blocks to obtain n partial processing results, respectively accumulate the n partial processing results to obtain n partial operation results, and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.

In the apparatus provided in the first aspect, the main processing circuit includes: a master register or master on-chip cache circuit;

the base processing circuit includes: basic registers or basic on-chip cache circuits.

In the apparatus provided in the first aspect, the main processing circuit includes: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, a data type operation circuit, or a data rearrangement circuit.

In the apparatus provided in the first aspect, the main processing circuit is configured to obtain a data block to be calculated and an operation instruction, and divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; splitting the distributed data blocks to obtain a plurality of basic data blocks, distributing the basic data blocks to a basic processing circuit connected with the basic data blocks, and broadcasting the broadcast data blocks to the basic processing circuit connected with the basic processing circuit;

the basic processing circuit is used for converting the basic data block and the broadcast data block into fixed-point type data blocks, performing inner product operation according to the fixed-point type data blocks to obtain operation results, converting the operation results into floating point data and then sending the floating point data to the main processing circuit;

Or converting the basic data block and the broadcast data block into fixed-point type data blocks, forwarding the fixed-point type data blocks to other basic processing circuits to execute inner product operation to obtain operation results, converting the operation results into floating point data, and then sending the floating point data to the main processing circuit;

and the main processing circuit is used for processing the operation result to obtain the data block to be calculated and an instruction result of the operation instruction.

In the apparatus provided in the first aspect, the data is: vector, matrix, three-dimensional data block, four-dimensional data block, and n-dimensional data block.

In the apparatus provided in the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that a multiplier data block is a broadcast data block and a multiplicand data block is a distribution data block;

or if the operation instruction is a convolution instruction, the main processing circuit determines that the convolution input data block is a broadcast data block and the convolution kernel is a distribution data block.

In the method provided in the sixth aspect, the operation of the neural network includes: one or any combination of convolution operation, matrix multiplication matrix operation, matrix multiplication vector operation, paranoid operation, full connection operation, GEMM operation, GEMV operation and activation operation.

Referring to fig. 1a, fig. 1a is an integrated circuit chip device provided in the present disclosure, the integrated circuit chip device comprising: the system comprises a main processing circuit and a plurality of basic processing circuits, wherein the plurality of basic processing circuits are arranged in an array (m is an n array), the range of values of m and n is an integer greater than or equal to 1, and at least one value of m and n is greater than or equal to 2. For a plurality of basic processing circuits distributed in an m×n array, each basic processing circuit is connected with an adjacent basic processing circuit, the main processing circuit is connected with k basic processing circuits of the plurality of basic processing circuits, and the k basic processing circuits may be: n basic processing circuits of the 1 st row, n basic processing circuits of the m th row, and m basic processing circuits of the 1 st column. In the integrated circuit chip device shown in fig. 1a, the main processing circuit and/or the plurality of basic processing circuits may include a data type conversion operation circuit, and a part of the specific plurality of basic processing circuits may include a data type conversion circuit, for example, in an alternative solution, k basic processing circuits may be configured with the data type conversion circuit, so that the n basic processing circuits may be respectively responsible for performing a data type conversion step on data of m basic processing circuits in this column. This arrangement can improve the operation efficiency and reduce the power consumption because, for the n basic processing circuits of row 1, since the n basic processing circuits first receive the data transmitted by the main processing circuit, converting the received data into fixed-point type data can reduce the calculation amount of the subsequent basic processing circuits and the data transmission amount with the subsequent basic processing circuits, and similarly, configuring the data type conversion circuit for the m basic processing circuits of the first column has the advantages of small calculation amount and low power consumption. In addition, according to the structure, the main processing circuit can adopt a dynamic data transmission strategy, for example, the main processing circuit broadcasts data to m basic processing circuits in column 1, and the main processing circuit transmits distributed data to n basic processing circuits in row 1.

The main processing circuit is used for executing each continuous operation in the neural network operation and transmitting data with the basic processing circuit connected with the main processing circuit; the above successive operations are not limited to: accumulation operations, ALU operations, activation operations, and the like.

The basic processing circuits are used for executing operation in the neural network in a parallel mode according to the transmitted data, and transmitting operation results to the main processing circuit through the basic processing circuit connected with the main processing circuit. Performing operations in a neural network in parallel as described above includes, but is not limited to: inner product operations, matrix or vector multiplication operations, and the like.

The main processing circuit may include: the data transmitting circuit, the data receiving circuit or the interface can integrate the data distributing circuit and the data broadcasting circuit, and the data distributing circuit and the data broadcasting circuit can be respectively arranged in practical application. For broadcast data, i.e. data that needs to be sent to each basic processing circuit. For distributing data, that is, data that needs to be selectively sent to a part of the basic processing circuits, specifically, for example, convolution operation, convolution input data of the convolution operation needs to be sent to all basic processing circuits, all of which are broadcast data, and convolution kernel needs to be selectively sent to a part of basic data blocks, so that the convolution kernel is the distributing data. The manner in which the particular selection of distribution data is sent to that underlying processing circuit may be determined in particular by the main processing circuit in dependence upon the load and other distribution patterns. For the broadcast transmission scheme, broadcast data is transmitted in broadcast form to each base processing circuit. (in practical applications, broadcast data is transmitted to each basic processing circuit by means of one broadcast, or broadcast data may be transmitted to each basic processing circuit by means of multiple broadcasts, and the embodiments of the present disclosure are not limited to the number of broadcasts described above).

The main processing circuit (shown in fig. 1 d) may include a register and/or an on-chip buffer circuit, and may further include a control circuit, a vector arithmetic unit circuit, an ALU (arithmetic and logic unit, arithmetic logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access ) circuit, etc., although in practical applications, the main processing circuit may also include a conversion circuit (for example, a matrix transpose circuit), a data rearrangement circuit, an activation circuit, etc.

Each base processing circuit may include a base register and/or a base on-chip cache circuit; each base processing circuit may further include: an inner product operator circuit, a vector operator circuit, an accumulator circuit, or the like. The inner product arithmetic circuit, the vector arithmetic circuit, and the accumulator circuit may be integrated circuits, or may be individually provided.

Alternatively, the accumulator circuit of the n basic processing circuits in the mth row may perform the accumulation operation of the inner product operation, because for the m-th row of basic processing circuits, it can receive the product result of all the basic processing circuits in the column, and the accumulation operation of the inner product operation is performed by the n basic processing circuits in the mth row, so that the computing resources can be effectively allocated, and the advantage of saving power consumption is achieved. This solution is particularly applicable when the m number is large.

For circuits in which the data type conversion may be allocated for execution by the main processing circuit, in particular, circuits in which the execution may be allocated by a display or implicit manner, for display, the main processing circuit may be configured with a special instruction or instruction, and when the base processing circuit receives the special instruction or instruction, it is determined that the data type conversion is to be executed, and when the base processing circuit does not receive the special instruction or instruction, it is determined that the data type conversion is not to be executed. As another example, this may be performed implicitly, e.g., when the underlying processing circuitry receives data of which the data type is a floating point type and determines that an inner product operation needs to be performed, the data type is converted to fixed point type data. For the way the configuration is displayed, a special instruction or instruction may configure a decrementing sequence, the value of which is decremented by 1 per pass through a basic processing circuit, the basic processing circuit reads the value of the decrementing sequence, if the value is greater than zero, then performs the data type conversion, if the value is equal to or less than zero, then does not perform the data type conversion. The setting is configured according to the basic processing circuits allocated by the array, for example, for m basic processing circuits in the ith column, the main processing circuit needs the first 5 basic processing circuits to execute data type conversion, and then issues a special instruction, where the special instruction includes a decrementing sequence, an initial value of the decrementing sequence may be 5, and each time a basic processing circuit passes, the value of the decrementing sequence is decremented by 1, when the value of the decrementing sequence is 1 to the 5 th basic processing circuit, and when the decrementing sequence is 0 to the 6 th basic processing circuit, the 6 th basic processing circuit will not execute the data type conversion.

One embodiment of the present disclosure provides an integrated circuit chip device including a main processing circuit (also referred to as a main unit) and a plurality of base processing circuits (also referred to as base units); the structure of the embodiment is shown in FIG. 1 b; wherein the broken line frame is the internal structure of the neural network operation device; gray filled arrows represent data transmission paths between the main processing circuit and the basic processing circuit array, and open arrows represent data transmission paths between individual basic processing circuits (adjacent basic processing circuits) in the basic processing circuit array. The length, width and length of the basic processing circuit array may be different, that is, the values of m and n may be different, or may be the same, and the disclosure is not limited to the specific values of the values.

The circuit structure of the basic processing circuit is shown in fig. 1 c; in the figure, a dashed box represents the boundary of the basic processing circuit, and a thick arrow intersecting the dashed box represents a data input-output channel (pointing to the input channel within the dashed box, pointing to the output channel as the dashed box); the rectangular boxes in the dashed boxes represent memory cell circuits (registers and/or on-chip caches) including input data 1, input data 2, multiplication or inner product results, accumulated data; the diamond-shaped boxes represent an operator circuit including a multiply or inner product operator, an adder.

In this embodiment, the neural network computing device includes a main processing circuit and 16 basic processing circuits (the 16 basic processing circuits are merely for illustration, and other values may be adopted in practical applications);

in this embodiment, the basic processing circuit has two data input interfaces and two data output interfaces; in the following description of the present example, the horizontal input interface (the horizontal arrow in fig. 1b to the present unit) is referred to as input 0, and the vertical input interface (the vertical arrow in fig. 1b to the present unit) is referred to as input 1; each of the horizontal data output interfaces (horizontal arrow indicated from the unit in fig. 1 b) is referred to as output 0, and the vertical data output interfaces (vertical arrow indicated from the unit in fig. 1 b) is referred to as output 1.

The data input interface and the data output interface of each basic processing circuit can be respectively connected with different units, and the data input interface and the data output interface of each basic processing circuit comprise a main processing circuit and other basic processing circuits;

in this example, the inputs 0 of the four basic processing circuits 0,4,8,12 (numbered see fig. 1 b) are connected to the data output interface of the main processing circuit; in this example, the input 1 of the four basic processing circuits 0,1,2,3 is connected with the data output interface of the main processing circuit; in this example, the outputs 1 of the four basic processing circuits 12,13,14,15 are connected to the data input interface of the main processing circuit; in this example, the output interfaces of the basic processing circuit are connected with the input interfaces of other basic processing circuits, as shown in fig. 1b, which is not listed;

Specifically, the output interface S1 of the S unit is connected to the input interface P1 of the P unit, indicating that the P unit will be able to receive data from its P1 interface sent by the S unit to its S1 interface.

The embodiment comprises a main processing circuit, wherein the main processing circuit is connected with an external device (namely an input interface and an output interface), and a part of data output interfaces of the main processing circuit are connected with a part of data input interfaces of a basic processing circuit; a portion of the data input interfaces of the main processing circuit is coupled to a portion of the data output interfaces of the base processing circuit.

Method for using integrated circuit chip device

The data referred to in the usage method provided by the present disclosure may be any data type, for example, data represented by floating point numbers with any bit width or data represented by fixed point numbers with any bit width.

A schematic diagram of the fixed point type data is shown in fig. 1e, and as shown in fig. 1e, the structure diagram is a method for expressing fixed point type data, for a computing system, the storage Bit number of 1 floating point data is 32 bits, and for expressing fixed point data, especially for expressing data by adopting floating point type data shown in fig. 1e, the storage Bit number of 1 fixed point data can be less than 16 bits, so for the conversion, the transmission cost between calculators can be greatly reduced, in addition, for the calculator, the space for storing data with fewer bits is smaller, namely the storage cost is also reduced, namely the calculation cost is reduced, so that the calculation cost and the storage cost can be reduced, but for the conversion of the data type, the conversion cost is also required to be partially high, for the data with large calculation amount and the storage quantity, the conversion cost is almost negligible compared with the subsequent calculation cost, the storage cost and the transmission cost, for the data with large calculation quantity, the conversion cost is greatly reduced compared with the storage cost of the floating point type data, the calculation cost is reduced, the calculation cost is slightly higher than the calculation cost of the floating point type data with the calculation cost, the calculation cost is required to be slightly higher, and the calculation cost is slightly higher than the calculation cost is required to be higher than the floating point type data, the calculation cost is required to be slightly higher, the calculation cost is required to be higher than the floating point type data, and the calculation cost is slightly higher than the calculation cost is required to be compared with the calculation cost.

The operations that need to be completed in the basic processing circuitry can be performed using the following methods:

the main processing circuit converts the data type and then transmits the converted data type to the basic processing circuit for operation (for example, the main processing circuit can convert the floating point number into a fixed point number with lower bit width and then transmit the fixed point number to the basic processing circuit, the advantages are that the bit width of the transmitted data can be reduced, the total bit quantity of the transmitted data is reduced, the efficiency of the basic processing circuit for executing the fixed point operation with higher bit width and lower power consumption)

The basic processing circuit can convert the data type after receiving the data and then calculate (for example, the basic processing circuit receives the floating point number transmitted by the main processing circuit and then converts the floating point number into fixed point number to operate, thereby improving the operation efficiency and reducing the power consumption).

The basic processing circuit can perform data type conversion first and then transmit the result to the main processing circuit after calculating the result (for example, the floating point number operation result calculated by the basic processing circuit can be converted into a fixed point number with low bit width first and then transmitted to the main processing circuit, which has the advantages of reducing the data bit width in the transmission process, having higher efficiency and saving power consumption).

The method of use of the basic processing circuit (as in fig. 2 a);

The main processing circuit receives input data to be calculated from the outside of the device;

optionally, the main processing circuit performs arithmetic processing on the data using various arithmetic circuits, vector arithmetic circuits, inner product arithmetic circuits, accumulator circuits, and the like of the present unit;

the main processing circuit sends data (as shown in fig. 2 b) to the basic processing circuit array (the set of all basic processing circuits is referred to as the basic processing circuit array) through the data output interface;

the data transmission method may be a method of directly transmitting data to a part of the basic processing circuit, i.e. a multi-broadcast method;

the data transmission mode can respectively transmit different data, namely a distribution mode, to different basic processing circuits;

the basic processing circuit array calculates data;

the basic processing circuit receives input data and then carries out operation;

optionally, the basic processing circuit transmits the data from the data output interface of the unit after receiving the data; (to other base processing circuits that do not receive data directly from the main processing circuit.)

Optionally, the basic processing circuit transmits the operation result from the data output interface; (intermediate calculation results or final calculation results)

The main processing circuit receives output data returned from the basic processing circuit array;

optionally, the main processing circuitry continues processing (e.g., accumulation or activation operations) on data received from the underlying processing circuitry array;

and after the processing of the main processing circuit is finished, transmitting the processing result from the data output interface to the outside of the device.

Completing matrix multiplication vector operation by using the circuit device;

(a matrix-by-vector may be one in which each row in the matrix is respectively inner-product with the vector and the results are placed into a vector in the order of the corresponding row.)

The operation of calculating the multiplication of the matrix S of size M rows and L columns and the vector P of length L is described below, as shown in fig. 2 c.

This method applies to all or a portion of the underlying processing circuitry of the neural network computing device, assuming K underlying processing circuitry are applied;

the main processing circuit transmits data in part or all of the rows of the matrix S to each of the k basic processing circuits;

in an alternative, the control circuit of the main processing circuit sends one number or a part of the number of data of a certain row in the matrix S to a certain basic processing circuit at a time; (e.g., for one number per transmission, it may be for a certain basic processing circuit, 1 st transmission 3 rd line 1 st number, 2 nd transmission 3 rd line 2 nd number, 3 rd transmission 3 rd line 3 rd number … …, or for a part of the number per transmission, 1 st transmission 3 rd line first two numbers (i.e., 1 st, 2 nd numbers), 3 rd transmission 3 rd and 4 th numbers, 3 rd transmission 3 rd and 5 th and 6 th numbers … …; third transmission 3 rd line 3 th and … ….)

In an alternative scheme, the control circuit of the main processing circuit transmits a part of data of a plurality of rows in the matrix S to a certain basic processing circuit each time; (e.g., for a certain basic processing circuit, 1 st number of 3,4,5 th lines per line, 2 nd number of 3,4,5 th lines per line, 3 rd number of 3 th lines per line … … of 3 rd lines, 4,5 th lines per line, or 1 st two first numbers of 3,4,5 th lines per line, 3 rd and 4 th numbers of 3,4,5 th lines per line, third number of 3,4,5 th lines per line, 5 th and 6 th numbers … ….)

The control circuit of the main processing circuit sequentially transmits the data in the vector P to the 0 th basic processing circuit;

after the 0 th basic processing circuit receives the data of the vector P, the data is sent to the next basic processing circuit connected with the 0 th basic processing circuit, namely the basic processing circuit 1;

specifically, some basic processing circuits cannot directly obtain all data required for calculation from the main processing circuit, for example, the basic processing circuit 1 in fig. 2d has only one data input interface connected with the main processing circuit, so that only data of the matrix S can be directly obtained from the main processing circuit, while data of the vector P needs to be output to the basic processing circuit 1 by means of the basic processing circuit 0, and similarly, the basic processing circuit 1 also needs to continue to output data of the vector P to the basic processing circuit 2 after receiving the data.

Each base processing circuit performs operations on the received data including, but not limited to: inner product operations, multiplication operations, addition operations, and the like;

in one alternative, the base processing circuit calculates one or more sets of two multiplications at a time, and then accumulates the results on registers and or on-chip caches;

in one alternative, the base processing circuit calculates the inner products of one or more sets of two vectors at a time, and then accumulates the results onto a register and or on-chip cache;

after the basic processing circuit calculates the result, the result is transmitted out from the data output interface (namely, transmitted to other basic processing circuits connected with the basic processing circuit);

in one alternative, the calculation result may be the final result or the intermediate result of the inner product operation;

after receiving the calculation results from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or a main processing circuit connected with the basic processing circuit;

the main processing circuit receives the result of the inner product operation of each basic processing circuit, and processes the result to obtain a final result (the processing can be an accumulation operation or an activation operation, etc.).

The embodiment of the matrix multiplication vector method is realized by adopting the computing device:

In an alternative, the plurality of basic processing circuits used in the method are arranged as shown in FIG. 2d or FIG. 2e below;

as shown in fig. 2c, the data conversion operation circuit of the main processing circuit converts the matrix S and the matrix P into fixed-point type data; the control circuit of the main processing unit divides M rows of data of the matrix S into K groups, and the ith basic processing circuit is responsible for the operation of the ith group (the collection of rows in the group of data is marked as Ai) respectively;

the method for grouping M data is a grouping mode which is arbitrarily not repeatedly allocated;

in one alternative, the following allocation is used: dividing the j line to a j% K (% is a remainder operation) basic processing circuit;

in an alternative, a part of the rows may be allocated first on average for the case where the grouping cannot be averaged, and the remaining rows may be allocated in any manner.

The control circuit of the main processing circuit sequentially sends the data in part or all rows in the matrix S to the corresponding basic processing circuit each time;

in one alternative, the control circuit of the main processing circuit transmits one or more data of a row of data in the i-th group of data Mi for which it is responsible to the i-th basic processing circuit at a time;

In one alternative, the control circuit of the main processing circuit sends one or more data of each of part or all of the rows of the ith group of data Mi for which it is responsible to the ith base processing circuit each time;

the control circuit of the main processing circuit sequentially transmits the data in the vector P to the 1 st basic processing circuit;

in one alternative, the control circuitry of the main processing circuitry may send one or more data in vector P at a time;

the ith basic processing circuit receives the data of the vector P and then sends the data to the (i+1) th basic processing circuit connected with the ith basic processing circuit;

each basic processing circuit performs an operation (including but not limited to multiplication or addition) after receiving one or more data from a row or rows in the matrix S and one or more data from the vector P;

in an alternative, the data received by the base processing circuit may also be intermediate results, stored in registers and or on-chip caches;

The basic processing circuit transmits the local calculation result to the next basic processing circuit or the main processing circuit connected with the basic processing circuit;

in an alternative scheme, corresponding to the structure of fig. 2d, only the output interface of the last basic processing circuit of each column is connected with the main processing circuit, in this case, only the last basic processing circuit can directly transmit the local calculation result to the main processing circuit, the calculation result of other basic processing circuits is transmitted to the next basic processing circuit of the main processing circuit, the next basic processing circuit is transmitted to the next basic processing circuit until all the calculation result is transmitted to the last basic processing circuit, the last basic processing circuit performs accumulation calculation on the local calculation result and the received result of other basic processing circuits of the column to obtain an intermediate result, and the intermediate result is transmitted to the main processing circuit; it is of course also possible for the last basic processing circuit to send the results of the other basic circuits of the column as well as the local processing results directly to the main processing circuit.

In an alternative, corresponding to the architecture of fig. 2e, each basic processing circuit has an output interface connected to the main processing circuit, in which case each basic processing circuit directly transmits the local calculation result to the main processing circuit;

The basic processing circuit receives the calculation results transmitted by other basic processing circuits and transmits the calculation results to the next basic processing circuit or the main processing circuit connected with the basic processing circuit.

The main processing circuit receives the results of the M inner product operations as the operation results of the matrix multiplication vector.

Completing matrix multiplication matrix operation by using the circuit device;

the operation of calculating the multiplication of a matrix S of size M rows and L columns and a matrix P of size L rows and N columns (each row in the matrix S being the same length as each column of the matrix P, as shown in FIG. 2 f) is described below

The method is illustrated using the embodiment of the device shown in fig. 1 b;

the data conversion operation circuit of the main processing circuit converts the matrix S and the matrix P into fixed-point type data;

the control circuit of the main processing circuit sends the data in part or all of the rows of the matrix S to those basic processing circuits (e.g. the uppermost gray-filled vertical data path in fig. 1 b) which are directly connected to the main processing circuit via a lateral data input interface;

in an alternative, the control circuit of the main processing circuit sends one number or a part of the number of data of a certain row in the matrix S to a certain basic processing circuit at a time; (for example, for a certain basic processing circuit, 1 st transmission of 1 st number of 3 rd lines, 2 nd transmission of 2 nd number of 3 rd lines of 3 rd … … of 3 rd lines of 3 rd transmission, or 1 st transmission of two first numbers of 3 rd lines, 3 rd and 4 th numbers of 3 rd lines of 3, and 3 rd transmission of 5 th and 6 th numbers of … …;)

In an alternative, the control circuit of the main processing circuit sends a part of the data of a certain number of rows in the matrix S to a certain basic processing circuit each time; (e.g., for a certain basic processing circuit, 1 st number of 3,4,5 th lines per line, 2 nd number of 3,4,5 th lines per line, 3 rd number of 3,4,5 th lines per line … … th 3 th line, or 1 st two first numbers of 3,4,5 th lines per line, 3 rd and 4 th numbers of 3,4,5 th lines per line, 3 rd, 4 th, 5 th lines per line 5 th and 6 th numbers … … th; third line of 3,4,5 th lines per line.)

The control circuit of the main processing circuit sends the data in some or all columns in the matrix P to those basic processing circuits directly connected to the main processing circuit through the vertical data input interface (e.g. grey filled lateral data paths on the left side of the basic processing circuit array in fig. 1 b);

in an alternative, the control circuit of the main processing circuit sends one number or a part of the number of data of a certain column in the matrix P to a certain basic processing circuit at a time; (for example, for a certain basic processing circuit, 1 st transmission of 1 st number of 3 rd columns, 2 nd transmission of 2 nd numbers of 3 rd columns, 3 rd transmission of 3 rd numbers … … of 3 rd columns, or 1 st transmission of two numbers before 3 rd columns, 3 rd and 4 th numbers of 3 rd columns, and 3 rd transmission of 5 th and 6 th numbers … …; third transmission of 3 rd columns.)

In an alternative, the control circuit of the main processing circuit sends a part of the data of a certain column in the matrix P to a certain basic processing circuit each time; (e.g., for a certain basic processing circuit, 1 st number of 3,4,5 th column, 2 nd number of 3,4,5 th column, 3 rd number of … … of 3 rd column, 4,5 th column, or 1 st first two numbers of 3,4,5 th column, 3 rd and 4 th numbers of 3,4,5 th column, 3 th, 4,5 th column, 5 th and 6 th numbers … …; etc.)

After the basic processing circuit receives the data of the matrix S, it transmits the data to its connected next basic processing circuit (e.g. the white filled lateral data path in the middle of the basic processing circuit array in fig. 1 b) through its lateral data output interface; after the base processing circuit receives the data of the matrix P, the data is transmitted to the next base processing circuit connected with the base processing circuit through the vertical data output interface (for example, a white filled vertical data path in the middle of the base processing circuit array in fig. 1 b);

each basic processing circuit operates on the received data;

after the basic processing circuit calculates the result, the result can be transmitted out from the data output interface;

specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from the interface, and if not, the result is output in a direction toward the basic processing circuit that can directly output to the main processing circuit (for example, in fig. 1b, the bottom-most line of basic processing circuits directly output the result to the main processing circuit, and the other basic processing circuits transmit the operation result downward from the vertical output interface).

Outputting the result in a direction capable of being directly output to the main processing circuit (for example, in fig. 1b, the bottom row of basic processing circuits directly output the result to the main processing circuit, and the other basic processing circuits transmit the operation result downwards from the vertical output interface);

the main processing circuit receives the result of the inner product operation of each basic processing circuit, and an output result can be obtained.

Embodiments of the "matrix multiplication matrix" method:

the method uses a basic processing circuit array arranged in a manner shown in fig. 1b, and supposing that there are h rows and w columns;

the control circuit of the main processing circuit divides the h row data of the matrix S into h groups, and the ith basic processing circuit is responsible for the operation of the ith group (the collection of rows in the group of data is recorded as Hi) respectively;

the method for grouping the h data is a grouping mode which is arbitrarily not repeatedly allocated;

in one alternative, the following allocation is used: the control circuit of the main processing circuit distributes the j-th row to the j-th h base processing circuit;

The control circuit of the main processing circuit divides W column data of the matrix P into W groups, and the ith basic processing circuit is responsible for the operation of the ith group (the collection of rows in the group of data is marked as Wi) respectively;

the method for grouping the W column data is any grouping mode which can not be repeatedly allocated;

in one alternative, the following allocation is used: the control circuit of the main processing circuit distributes the j-th row to the j-th w basic processing circuit;

in an alternative, a part of columns may be allocated equally first for the case where grouping cannot be averaged, and allocated in an arbitrary manner for the remaining columns.

The control circuit of the main processing circuit transmits the data in part or all of the rows of the matrix S to the first basic processing circuit of each row in the basic processing circuit array;

in one alternative, the control circuit of the main processing circuit transmits one or more data of one row of data Hi of the i-th group of data in charge of it to the first basic processing circuit of the i-th row in the basic processing circuit array at a time;

in one alternative, the control circuit of the main processing circuit sends one or more data of each of part or all of the rows of the i-th group of data Hi for which it is responsible to the first basic processing circuit of the i-th row in the basic processing circuit array each time;

The control circuit of the main processing circuit transmits the data in part or all of the columns of the matrix P to the first basic processing circuit of each column in the basic processing circuit array;

in one alternative, the control circuit of the main processing circuit transmits one or more data of a column of data in the i-th group of data Wi for which it is responsible to the first basic processing circuit of the i-th column in the basic processing circuit array at a time;

in one alternative, the control circuit of the main processing circuit transmits one or more data of each column of part or all of the i-th set of data Ni for which it is responsible to the first basic processing circuit of the i-th column in the basic processing circuit array at a time;

Each basic processing circuit operates on the received data;

specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from the interface, and if not, the result is output in a direction toward the basic processing circuit that can directly output to the main processing circuit (for example, the lowermost line of basic processing circuits directly outputs the output result thereof to the main processing circuit, and the other basic processing circuits transmit the operation result downward from the vertical output interface).

Outputting the result in a direction capable of being directly output to the main processing circuit (for example, the bottom row of basic processing circuits directly output the result to the main processing circuit, and other basic processing circuits transmit the operation result downwards from the vertical output interface);

The words "horizontal", "vertical" and the like used in the above description are merely for describing the example shown in fig. 1b, and only the "horizontal" and "vertical" interfaces for distinguishing each unit are required to represent two different interfaces in actual use.

Using the circuit device to complete the full connection operation:

if the input data of the full connection layer is a vector (namely, if the input of the neural network is a single sample), taking a weight matrix of the full connection layer as a matrix S, taking the input vector as a vector P, and executing operation according to a matrix multiplication vector method of the device;

if the input data of the full connection layer is a matrix (i.e. the input of the neural network is a plurality of samples), taking the weight matrix of the full connection layer as a matrix S, the input vector as a matrix P, or taking the weight matrix of the full connection layer as a matrix P, the input vector as a matrix S, and performing operation according to the matrix multiplication matrix of the device;

Using the circuit arrangement to perform a convolution operation:

the convolution operation is described below, where a block in the following diagram represents a data, the input data is represented by fig. 3a (N samples, each sample having C channels, each channel having a feature map with height H and width W), and the weights, i.e., convolution kernels, are represented by fig. 3b (M convolution kernels, each convolution kernel having C channels, height and width KH and KW, respectively). The rule of convolution operation is the same for N samples of input data, the following explains the process of performing convolution operation on one sample, each of M convolution kernels performs the same operation on one sample, each convolution kernel operation obtains a plane feature map, M convolution kernels finally calculate to obtain M plane feature maps (for one sample, the output of convolution is M feature maps), for one convolution kernel, inner product operation is performed at each plane position of one sample, and then sliding is performed along H and W directions, for example, fig. 3c shows a corresponding map of one convolution kernel performing inner product operation at the position of the lower right corner in one sample of input data; fig. 3d shows the convolved position sliding one bin to the left and fig. 3e shows the convolved position sliding one bin up.

The method is illustrated using the embodiment of the device shown in fig. 1 b;

the data conversion operation circuit of the main processing circuit may convert the data in part or all of the convolution kernels of the weights into fixed-point type data, and the control circuit of the main processing circuit transmits the data in part or all of the convolution kernels of the weights to those basic processing circuits (for example, the uppermost gray-filled vertical data path in fig. 1 b) directly connected to the main processing circuit through the lateral data input interface;

in an alternative, the control circuit of the main processing circuit transmits one number or a part of the data of a certain convolution kernel in the weight to a certain basic processing circuit at a time; (for example, for a certain basic processing circuit, 1 st transmission of 1 st number of 3 rd lines, 2 nd transmission of 2 nd number of 3 rd lines of 3 rd … … of 3 rd lines of 3 rd transmission, or 1 st transmission of two first numbers of 3 rd lines, 3 rd and 4 th numbers of 3 rd lines of 3, and 3 rd transmission of 5 th and 6 th numbers of … …;)

In another alternative, the control circuit of the main processing circuit sends a part of the data of a certain convolution kernels in the weight to a certain basic processing circuit each time; (e.g., for a certain basic processing circuit, 1 st number of 3,4,5 th lines per line, 2 nd number of 3,4,5 th lines per line, 3 rd number of 3,4,5 th lines per line … … th 3 th line, or 1 st two first numbers of 3,4,5 th lines per line, 3 rd and 4 th numbers of 3,4,5 th lines per line, 3 rd, 4 th, 5 th lines per line 5 th and 6 th numbers … … th; third line of 3,4,5 th lines per line.)

The control circuitry of the main processing circuitry divides the input data by convolved locations, and the control circuitry of the main processing circuitry sends data in some or all of the convolved locations in the input data to those underlying processing circuitry directly connected to the main processing circuitry through the vertical data input interface (e.g., the gray-filled lateral data path to the left of the underlying processing circuitry array in fig. 1 b);

in one alternative, the control circuit of the main processing circuit transmits one number or a part of the data of a certain convolution position in the input data to a certain basic processing circuit at a time; (for example, for a certain basic processing circuit, 1 st transmission of 1 st number of 3 rd columns, 2 nd transmission of 2 nd numbers of 3 rd columns, 3 rd transmission of 3 rd numbers … … of 3 rd columns, or 1 st transmission of two numbers before 3 rd columns, 3 rd and 4 th numbers of 3 rd columns, and 3 rd transmission of 5 th and 6 th numbers … …; third transmission of 3 rd columns.)

In another alternative, the control circuit of the main processing circuit sends data of certain convolution positions in the input data to a certain basic processing circuit one number or a part of the numbers at a time; (e.g., for a certain basic processing circuit, 1 st number of 3,4,5 th column, 2 nd number of 3,4,5 th column, 3 rd number of … … of 3 rd column, 4,5 th column, or 1 st first two numbers of 3,4,5 th column, 3 rd and 4 th numbers of 3,4,5 th column, 3 th, 4,5 th column, 5 th and 6 th numbers … …; etc.)

After the base processing circuit receives the data of the weight, the data is transmitted to the next base processing circuit (for example, a white filled lateral data path in the middle of the base processing circuit array in fig. 1 b) connected with the base processing circuit through a lateral data output interface of the base processing circuit; after receiving the data of the input data, the basic processing circuit transmits the data to the next basic processing circuit connected with the basic processing circuit through a vertical data output interface (for example, a white filled vertical data path in the middle of the basic processing circuit array in fig. 1 b);

each basic processing circuit operates on the received data;

the backward operation of the convolution layer can be divided into two parts, as shown in fig. 4a, the solid arrow represents the forward calculation process of the convolution layer, and as shown in fig. 4b, the backward calculation process of the convolution layer.

The inverse operation of the convolution layers shown in fig. 4a and 4b may be performed using the apparatus shown in fig. 1a using the apparatus shown in fig. 1 b. In performing either the forward operation or the backward operation, the operations are actually a plurality of operations of the neural network, including but not limited to: the manner of one or any combination of matrix-by-matrix, matrix-by-vector, convolution operation, activation operation, etc., may be described in this disclosure and will not be repeated here.

Embodiments of the present disclosure provide a neural network processor board card that may be used in a number of general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, smart homes, appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronics, network personal computers (personal computer, PCs), minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

Referring to fig. 5a, fig. 5a is a schematic structural diagram of a board card of a neural network processor according to an embodiment of the disclosure. As shown in fig. 5a, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.

The present disclosure is not limited to the specific structure of the neural network chip package structure 11, and optionally, as shown in fig. 5b, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.

The particular form of the neural network chip 111 to which the present disclosure relates is not limited, and the neural network chip 111 includes, but is not limited to, a neural network chip integrating a neural network processor, and the chip may be made of a silicon material, a germanium material, a quantum material, a molecular material, or the like. According to practical situations (such as a harsher environment) and different application requirements, the neural network chip can be packaged, so that most of the neural network chip is wrapped, and pins on the neural network chip are connected to the outside of the packaging structure through conductors such as gold wires for circuit connection with the outer layer.

The specific structure of the neural network chip 111 is not limited in this disclosure, and optionally, reference is made to the apparatus shown in fig. 1a or 1 b.

The present disclosure is not limited to the type of first substrate 13 and second substrate 113, and may be a printed circuit board (printed circuit board, PCB) or (printed wiring board, PWB), as well as other circuit boards. The material for manufacturing the PCB is not limited.

The second substrate 113 according to the present disclosure is configured to carry the neural network chip 111, and the neural network chip package structure 11 obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is configured to protect the neural network chip 111, so as to facilitate further packaging the neural network chip package structure 11 and the first substrate 13.

The specific packaging manner and the corresponding structure of the second electrical and non-electrical connection device 112 are not limited, and suitable packaging manners may be selected and simply modified according to practical situations and different application requirements, for example: flip chip ball grid array packages (Flip Chip Ball Grid Array Package, FCBGAP), thin Quad Flat packages (Low-profile Quad Flat Package, LQFP), quad Flat packages with heat spreader (Quad Flat Package with Heat sink, HQFP), quad Flat packages without leads (QFN), or Fine-pitch Quad Flat packages (Fine-pitch Ball Grid Package, FBGA).

Flip Chip (Flip Chip) is suitable for use in situations where the area requirements after packaging are high or where the inductance of the wires and the transmission time of the signals are sensitive. Besides, the packaging mode of Wire Bonding (Wire Bonding) can be used, so that the cost is reduced, and the flexibility of the packaging structure is improved.

The Ball Grid Array (Ball Grid Array) can provide more pins, has short average wire length of the pins and has the function of transmitting signals at high speed, wherein the package can be replaced by Pin Grid Array Package (PGA), zero plugging force (Zero Insertion Force, ZIF), single-side contact connection (Single Edge Contact Connection, SECC), contact Array (LGA) and the like.

Optionally, the neural network chip 111 and the second substrate 113 are packaged by using a flip chip ball grid array (Flip Chip Ball Grid Array), and a schematic diagram of a specific neural network chip package structure may refer to fig. 6. As shown in fig. 6, the neural network chip package structure includes: the neural network chip 21, the bonding pads 22, the solder balls 23, the second substrate 24, the connection points 25 on the second substrate 24, and the pins 26.

The bonding pad 22 is connected with the neural network chip 21, and the bonding pad 22 and the second substrate 24 are connected by soldering to form a solder ball 23 between the bonding pad 22 and a connection point 25 on the second substrate 24, so that the neural network chip 21 is packaged.

The pins 26 are used for being connected with an external circuit (for example, the first substrate 13 on the board card 10 of the neural network processor) of the packaging structure, so that transmission of external data and internal data can be realized, and the neural network chip 21 or a neural network processor corresponding to the neural network chip 21 can process the data conveniently. The present disclosure is not limited by the type and number of pins, and different pin formats may be selected and arranged according to certain rules according to different packaging techniques.

Optionally, the neural network chip package structure further includes an insulating filler disposed in the gaps among the pads 22, the solder balls 23 and the connection points 25, for preventing interference between the solder balls.

Wherein, the material of the insulating filler can be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.

Optionally, the above-mentioned neural network chip packaging structure further includes a heat dissipating device for dissipating heat generated during operation of the neural network chip 21. The heat sink may be a sheet of metal, a heat sink or a heat sink with good thermal conductivity, such as a fan.

For example, as shown in fig. 6a, the neural network chip package structure 11 includes: the neural network chip 21, the bonding pads 22, the solder balls 23, the second substrate 24, the connection points 25 on the second substrate 24, the pins 26, the insulating filler 27, the heat dissipating paste 28, and the metal-case heat sink 29. The heat dissipating paste 28 and the metal-case heat sink 29 are used to dissipate heat generated when the neural network chip 21 is operating.

Optionally, the neural network chip package structure 11 further includes a reinforcement structure connected to the pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the pad 22.

The reinforcement structure may be a metal wire structure or a columnar structure, which is not limited herein.

The specific form of the first electrical and non-electrical device 12 is not limited in this disclosure, and reference may be made to the description of the second electrical and non-electrical device 112, that is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 may be connected to the first substrate 13 by connecting wires or plugging, so that the first substrate 13 or the neural network chip package structure 11 may be replaced later.

Optionally, the first substrate 13 includes an interface or the like for expanding a memory unit of a storage capacity, for example: synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM), double Rate synchronous dynamic random access memory (DDR), etc., the processing capacity of the neural network processor is improved by expanding the memory.

The first substrate 13 may further include a fast external device interconnect bus (Peripheral Component Interconnect-Express, PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an ethernet interface, a controller area network bus (Controller Area Network, CAN) interface, etc. for data transmission between the package structure and the external circuit, so as to improve the operation speed and the convenience of operation.

The neural network processor is packaged into a neural network chip 111, the neural network chip 111 is packaged into a neural network chip packaging structure 11, the neural network chip packaging structure 11 is packaged into a neural network processor board card 10, and data interaction is performed with an external circuit (such as a computer main board) through an interface (a slot or a plug-in core) on the board card, namely, the function of the neural network processor is directly realized by using the neural network processor board card 10, and the neural network chip 111 is protected. And other modules can be added on the neural network processor board card 10, so that the application range and the operation efficiency of the neural network processor are improved.

In one embodiment, the present disclosure discloses an electronic device that includes the neural network processor board 10 or the neural network chip package structure 11 described above.

The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.

While the foregoing is directed to embodiments of the present disclosure, other and further details of the invention may be had by the present disclosure, it is to be understood that the foregoing description is merely illustrative of the present disclosure and that no changes, substitutions, alterations, etc. that may be made without departing from the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A neural network processor board card, the neural network processor board card comprising: the semiconductor device comprises a neural network chip packaging structure, a first electrical and non-electrical connection device and a first substrate; the neural network chip packaging structure comprises: the device comprises a neural network chip, a second electrical and non-electrical connection device and a second substrate, wherein the second substrate bears the neural network chip and is connected with the neural network chip through the second electrical and non-electrical connection device;

The neural network chip: a main processing circuit and a plurality of basic processing circuits; the main processing circuit or at least one circuit of the plurality of basic processing circuits comprises: a data type operation circuit; the data type operation circuit is used for executing conversion between floating point type data and fixed point type data; the main processing circuit is connected with the second substrate through the second electric and non-electric connecting device;

the plurality of basic processing circuits are distributed in an m x n array; each basic processing circuit is connected with other adjacent basic processing circuits, and the main processing circuit is connected with n basic processing circuits of the 1 st row, n basic processing circuits of the m th row and m basic processing circuits of the 1 st column;

2. The neural network processor board of claim 1, wherein,

The neural network chip packaging structure further comprises: bonding pad, solder ball, reinforcement structure, the second base plate includes: a connection point and a pin;

the bonding pad is connected with the neural network chip, and the solder ball is formed between the bonding pad and the connection point of the second substrate; the pins are connected with the first substrate to realize transmission of external data and internal data; the reinforcing structure is connected with the bonding pad and is embedded in the solder ball.

3. The neural network processor board of claim 2, wherein the second substrate further comprises: insulating filler, heat dissipating paste and metal shell heat sink; the heat dissipation paste and the metal shell heat dissipation sheet are used for dissipating heat generated when the neural network chip operates.

4. The neural network processor board of any of claims 1-3, wherein the first substrate further comprises: a fast external device interconnect bus interface, a small form factor hot pluggable interface, an ethernet interface, or a controller area network bus interface.

5. A neural network processor board according to any one of claims 1 to 3, wherein the package structure of the neural network chip package structure is any one of the following packages:

Flip chip ball grid array package, thin quad flat package, quad flat package with heat spreader, leadless quad flat package, small pitch quad flat package.

6. The neural network processor board of claim 1, wherein,

the main processing circuit is used for acquiring a data block to be calculated and an operation instruction, converting the data block to be calculated into a fixed-point type data block through the data type operation circuit, and dividing the fixed-point type data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; splitting the distributed data blocks to obtain a plurality of basic data blocks, distributing the basic data blocks to a basic processing circuit connected with the basic data blocks, and broadcasting the broadcast data blocks to the basic processing circuit connected with the basic processing circuit;

The main processing circuit is used for converting the operation result into floating point type data through the data type operation circuit, and processing the floating point type data to obtain the data block to be calculated and an instruction result of an operation instruction;

the main processing circuit is specifically configured to send the broadcast data block to the base processing circuit connected to the main processing circuit through one broadcast;

the basic processing circuit is specifically configured to perform inner product processing on the basic data block and the broadcast data block in a fixed-point data type to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and send the operation result to the main processing circuit;

and the main processing circuit is used for accumulating the operation results to obtain an accumulated result when the operation results are the results of the inner product processing, and arranging the accumulated result to obtain the data block to be calculated and the instruction result of the operation instruction.

7. The neural network processor board of claim 6, wherein,

the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the basic processing circuit for a plurality of times; the plurality of partial broadcast data blocks are combined to form the broadcast data block;

The basic processing circuit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block in a fixed-point data type to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main processing circuit;

the basic processing circuit is specifically configured to multiplex n times of the partial broadcast data blocks, perform inner product operations on the partial broadcast data blocks and the n basic data blocks to obtain n partial processing results, respectively accumulate the n partial processing results to obtain n partial operation results, and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.

8. The neural network processor board of claim 1, wherein,

the main processing circuit is used for acquiring a data block to be calculated and an operation instruction, and dividing the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; splitting the distributed data blocks to obtain a plurality of basic data blocks, distributing the basic data blocks to a basic processing circuit connected with the basic data blocks, and broadcasting the broadcast data blocks to the basic processing circuit connected with the basic processing circuit;

the main processing circuit is used for processing the operation result to obtain the data block to be calculated and an instruction result of the operation instruction;

the data are: one or any combination of vectors, matrixes, three-dimensional data blocks, four-dimensional data blocks and n-dimensional data blocks;

if the operation instruction is a multiplication instruction, the main processing circuit determines that a multiplier data block is a broadcast data block and a multiplicand data block is a distribution data block;

9. A neural network computing device, comprising one or more neural network processor boards according to any one of claims 1-8.

10. A combination processing apparatus, characterized in that the combination processing apparatus comprises: the neural network computing device, the universal interconnect interface, and the universal processing device of claim 9;