CN111160541B

CN111160541B - Integrated circuit chip device and related products

Info

Publication number: CN111160541B
Application number: CN201911390541.9A
Authority: CN
Inventors: Name withheld on request
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2017-12-14
Filing date: 2017-12-14
Publication date: 2023-05-19
Anticipated expiration: 2037-12-14
Also published as: CN111105033B; TW201931220A; CN111160541A; CN111126588A; CN111105033A; TWI768159B; CN109961134B; CN109961134A; CN111126588B

Abstract

The present disclosure provides an integrated circuit chip device and related products. The integrated circuit chip device comprises a main processing circuit, k branch processing circuits and k groups of basic processing circuits, wherein the main processing circuit is connected to each of the k branch processing circuits, each of the k branch processing circuits corresponds to one group of the k groups of basic processing circuits, and each group of basic processing circuits comprises at least one basic processing circuit; each branch processing circuit includes a data type operation circuit for performing conversion between floating-point data and fixed-point data. The technical solution provided by the disclosure has the advantages of a small amount of calculation and low power consumption.

Description

Integrated circuit chip device and related products

Technical Field

The present disclosure relates to the field of neural networks, and more particularly, to an integrated circuit chip device and related products.

Background

Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neuron network of the human brain from an information-processing perspective, builds a simple model of it, and forms different networks according to different connection modes. In engineering and academia it is often simply called a neural network or neural-like network. A neural network is an operational model formed by interconnecting a large number of nodes (or neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), which involves a large amount of calculation and high power consumption.

Disclosure of Invention

The embodiment of the disclosure provides an integrated circuit chip device and related products, which can improve the processing speed of a computing device and improve the efficiency.

In a first aspect, there is provided an integrated circuit chip device comprising: a main processing circuit, k branch circuits and k groups of basic processing circuits, wherein the main processing circuit is connected to each of the k branch circuits, each of the k branch circuits corresponds to one group of the k groups of basic processing circuits, and each group of basic processing circuits comprises at least one basic processing circuit;

the branch circuit includes: a data type operation circuit for performing conversion between floating-point type data and fixed-point type data;

the main processing circuit is used for executing each continuous operation in the neural network operation and transmitting data with the k branch circuits connected to it;

the k branch circuits are used for forwarding the transmission data between the main processing circuit and the k groups of basic processing circuits, and controlling, according to the operation on the transmission data, whether to start the data type operation circuit to convert the type of the transmission data;

the k groups of basic processing circuits are used for executing operations in the neural network in parallel according to the transmission data or the converted transmission data, and transmitting the operation results to the main processing circuit through the branch circuits connected to the main processing circuit.

In a second aspect, a neural network computing device is provided, the neural network computing device comprising one or more of the integrated circuit chip devices provided in the first aspect.

In a third aspect, there is provided a combination processing apparatus including: the neural network operation device provided in the second aspect, a general interconnection interface and a general processing device;

the neural network operation device is connected with the general processing device through the general interconnection interface.

In a fourth aspect, there is provided a chip integrating the apparatus of the first aspect, the apparatus of the second aspect or the apparatus of the third aspect.

In a fifth aspect, an electronic device is provided, the electronic device comprising the chip of the fourth aspect.

In a sixth aspect, there is provided a method of operating a neural network, the method being applied within an integrated circuit chip device, the integrated circuit chip device being the integrated circuit chip device of the first aspect, which is used for performing operations of a neural network.

It can be seen that, according to the embodiments of the disclosure, a data type conversion operation circuit is provided to convert the type of a data block before the operation is performed on it, which saves transmission resources and calculation resources; the solution therefore has the advantages of low power consumption and a small amount of calculation.

Drawings

FIG. 1a is a schematic diagram of an integrated circuit chip device.

FIG. 1b is a schematic diagram of another integrated circuit chip device.

FIG. 1c is a schematic diagram of a basic processing circuit.

FIG. 1d is a schematic block diagram of a fixed point data type.

FIG. 2 is a schematic diagram of a matrix-by-vector flow.

FIG. 2a is a schematic diagram of a matrix multiplied by a vector.

FIG. 2b is a schematic diagram of a matrix-by-matrix flow.

FIG. 2c is a schematic diagram of the matrix Ai multiplied by the vector B.

FIG. 2d is a schematic diagram of matrix A multiplied by matrix B.

FIG. 2e is a schematic diagram of the matrix Ai multiplied by the matrix B.

FIG. 3a is a schematic diagram of neural network training.

FIG. 3b is a schematic diagram of a convolution operation.

FIG. 4a is a schematic diagram of a neural network forward operation.

FIG. 4b is a schematic diagram of a neural network reverse operation.

Detailed Description

In order that those skilled in the art will better understand the present disclosure, a more complete description of the same will be rendered by reference to the appended drawings, wherein it is to be understood that the embodiments are merely some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.

In the apparatus provided in the first aspect, the main processing circuit is configured to obtain a data block to be calculated and an operation instruction, and divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; split the distribution data block to obtain a plurality of basic data blocks, distribute the basic data blocks to the k branch circuits connected to the main processing circuit, and broadcast the broadcast data block to the k branch circuits connected to the main processing circuit;

the k branch circuits are used for receiving the basic data block and the broadcast data block, and starting the data type operation circuit to convert the basic data block and the broadcast data block into fixed-point data types; forwarding the basic data block and the broadcast data block to k groups of basic processing circuits in a fixed-point data type;

the basic processing circuit is used for performing inner product operation on the basic data block and the broadcast data block in a fixed-point data type to obtain an operation result, and sending the operation result to the k branch circuits;

the k branch circuits are used for converting the operation result into a floating point type operation result and sending the floating point type operation result to the main processing circuit;

and the main processing circuit is used for processing the floating-point operation result to obtain an instruction result of the data block to be calculated and the operation instruction.
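The data flow described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the disclosed hardware: the fractional width `FRAC_BITS`, the choice of matrix rows as basic data blocks, the round-robin assignment of blocks to branch circuits, and all function names are hypothetical and do not appear in the source.

```python
from typing import List

FRAC_BITS = 8  # hypothetical fixed-point fractional width; the source does not specify one


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Branch circuit: convert a floating-point value to fixed point (round to nearest)."""
    return round(x * (1 << frac_bits))


def to_float(q: int, frac_bits: int) -> float:
    """Branch circuit: convert a fixed-point value back to floating point."""
    return q / (1 << frac_bits)


def basic_inner_product(a_fixed: List[int], b_fixed: List[int]) -> int:
    """Basic processing circuit: integer inner product of two fixed-point vectors."""
    return sum(a * b for a, b in zip(a_fixed, b_fixed))


def matvec(matrix: List[List[float]], vector: List[float], k: int = 4) -> List[float]:
    """Main circuit: distribute rows (basic data blocks), broadcast the vector."""
    # Assumed distribution policy: round-robin over the k branch circuits.
    branches: List[list] = [[] for _ in range(k)]
    for idx, row in enumerate(matrix):
        branches[idx % k].append((idx, row))

    result = [0.0] * len(matrix)
    for branch in branches:
        # Each branch converts the broadcast block to fixed point before forwarding it.
        broadcast_fixed = [to_fixed(v) for v in vector]
        for idx, row in branch:
            row_fixed = [to_fixed(v) for v in row]
            acc = basic_inner_product(row_fixed, broadcast_fixed)
            # A product of two fixed-point values carries 2 * FRAC_BITS fractional bits.
            result[idx] = to_float(acc, 2 * FRAC_BITS)
    return result


print(matvec([[1.0, 2.0], [0.5, -1.0], [0.25, 0.0]], [2.0, 4.0], k=2))
```

In this model the main circuit only ever sees floating-point values, while all inner products are computed on integers, which mirrors the division of labor described in the text.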

In the apparatus provided in the first aspect, the main processing circuit is specifically configured to broadcast the broadcast data block to the k branch circuits at a time.

In the apparatus provided in the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the k branch circuits over multiple broadcasts.

In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block in a fixed-point type to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, send the partial operation result to the k branch circuits,

the k branch circuits are used for converting the partial operation result into floating point type data and sending the floating point type data to the main processing circuit.

In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to reuse the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks in a fixed-point data type to obtain n partial processing results of the fixed-point type, accumulating the n partial processing results of the fixed-point type respectively to obtain n partial operation results of the fixed-point type, and sending the n partial operation results of the fixed-point type to the branch circuit;

The branch circuit is configured to convert the n partial operation results of the fixed point type into n partial operation results of the floating point type, and send the n partial operation results of the floating point type to the main processing circuit, where n is an integer greater than or equal to 2.
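The reuse of one partial broadcast data block against n basic data blocks, with separate accumulation per basic data block, can be illustrated with the sketch below. It is a simplified model under assumptions: plain integers stand in for fixed-point values, the partial block length `part_len` is chosen by the caller, and the function names are hypothetical.

```python
from typing import List


def accumulate_partial(broadcast_part: List[int],
                       basic_parts: List[List[int]],
                       acc: List[int]) -> List[int]:
    """Reuse one partial broadcast block against n basic-block segments.

    Each segment's inner product is added to that block's own accumulator.
    """
    return [
        total + sum(a * b for a, b in zip(part, broadcast_part))
        for total, part in zip(acc, basic_parts)
    ]


def run(basic_blocks: List[List[int]], broadcast_block: List[int], part_len: int) -> List[int]:
    """Feed the broadcast block in pieces of part_len, as in the multi-broadcast mode."""
    acc = [0] * len(basic_blocks)
    for start in range(0, len(broadcast_block), part_len):
        broadcast_part = broadcast_block[start:start + part_len]
        basic_parts = [blk[start:start + part_len] for blk in basic_blocks]
        acc = accumulate_partial(broadcast_part, basic_parts, acc)
    return acc


print(run([[1, 1, 1, 1], [2, 0, 0, 1]], [1, 2, 3, 4], part_len=2))
```

Here n = 2: each partial broadcast block is received once and used twice, so the broadcast traffic does not grow with the number of basic data blocks held by the circuit.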

In the apparatus provided in the first aspect, the main processing circuit includes: a master register or master on-chip cache circuit;

or the branch circuit comprises: a basic register or basic on-chip cache circuit;

or the base processing circuit comprises: basic registers or basic on-chip cache circuits.

In the apparatus provided in the first aspect, the main processing circuit includes: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, a data type operation circuit, or a data rearrangement circuit.

In the apparatus provided in the first aspect, the data is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.

In the apparatus provided in the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that a multiplier data block is a broadcast data block and a multiplicand data block is a distribution data block;

If the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is a broadcast data block, and the convolution kernel is a distribution data block.
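The rule for choosing the broadcast and distribution data blocks can be written as a small lookup. The instruction strings and dictionary keys below are hypothetical names chosen for illustration; only the mapping itself comes from the text.

```python
from typing import Any, Dict


def assign_roles(instruction: str, operands: Dict[str, Any]) -> Dict[str, Any]:
    """Decide which operand is broadcast and which is distributed."""
    if instruction == "multiply":
        # Multiplier is broadcast; multiplicand is split and distributed.
        return {"broadcast": operands["multiplier"],
                "distribution": operands["multiplicand"]}
    if instruction == "convolution":
        # Input data is broadcast; the convolution kernel is distributed.
        return {"broadcast": operands["input"],
                "distribution": operands["kernel"]}
    raise ValueError(f"unsupported instruction: {instruction}")


roles = assign_roles("convolution", {"input": "feature_map", "kernel": "weights"})
print(roles)
```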

In the method provided in the sixth aspect, the operation of the neural network includes one or any combination of: a convolution operation, a matrix-times-matrix operation, a matrix-times-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation and an activation operation.

Referring to FIG. 1a, FIG. 1a is a schematic structural diagram of an integrated circuit chip device. As shown in FIG. 1a, the chip device includes a main processing circuit, basic processing circuits and branch processing circuits. Specifically, the integrated circuit chip device includes: a main processing circuit, k branch circuits (k=4 in FIG. 1a; other values such as 8 or 16 may be used in practical applications) and k groups of basic processing circuits, where the main processing circuit is connected to each of the k branch circuits, each of the k branch circuits corresponds to one group of the k groups of basic processing circuits, and each group of basic processing circuits includes at least one basic processing circuit. The branch circuit includes a data type operation circuit for performing conversion between floating-point type data and fixed-point type data. The main processing circuit is used for executing each continuous operation in the neural network operation and transmitting data with the k branch circuits connected to it. The k branch circuits are used for forwarding the transmission data between the main processing circuit and the k groups of basic processing circuits, and controlling, according to the operation on the transmission data, whether to start the data type operation circuit to convert the type of the transmission data. The k groups of basic processing circuits are used for executing operations in the neural network in parallel according to the transmission data or the converted transmission data, and transmitting the operation results to the main processing circuit through the branch circuits connected to the main processing circuit.
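As a reading aid, the topology of FIG. 1a can be modeled with a few data classes. This is a hypothetical sketch: the class names, the `has_type_converter` flag and the default group size of 4 are assumptions for illustration; the text only requires k branch circuits and at least one basic processing circuit per group.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BasicProcessingCircuit:
    ident: int


@dataclass
class BranchCircuit:
    ident: int
    group: List[BasicProcessingCircuit] = field(default_factory=list)
    has_type_converter: bool = True  # each branch circuit holds a data type operation circuit


@dataclass
class MainProcessingCircuit:
    branches: List[BranchCircuit] = field(default_factory=list)


def build_device(k: int = 4, group_size: int = 4) -> MainProcessingCircuit:
    """Build one main circuit, k branch circuits and k groups of basic circuits."""
    if k < 1 or group_size < 1:
        raise ValueError("need at least one branch and one basic circuit per group")
    main = MainProcessingCircuit()
    next_id = 0
    for b in range(k):
        branch = BranchCircuit(ident=b)
        for _ in range(group_size):
            branch.group.append(BasicProcessingCircuit(ident=next_id))
            next_id += 1
        main.branches.append(branch)
    return main


device = build_device(k=4, group_size=4)
print(len(device.branches), sum(len(b.group) for b in device.branches))
```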

The main processing circuit may include a register and/or an on-chip cache circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and the like. In practical applications, other circuits may also be added to the main processing circuit, such as a conversion circuit (for example, a matrix transpose circuit), a data rearrangement circuit or an activation circuit.

Optionally, the main processing circuit may include a data type conversion operation circuit, which may be used to convert received or transmitted data from floating-point data to fixed-point data and, in practical applications, may also convert fixed-point data to floating-point data. The present invention does not limit the specific form of the data type conversion operation circuit.

The main processing circuit also comprises a data transmitting circuit, a data receiving circuit or an interface. The data transmitting circuit may integrate a data distribution circuit and a data broadcasting circuit; in practical applications, the data distribution circuit and the data broadcasting circuit may also be provided separately, and the data transmitting circuit and the data receiving circuit may be integrated together to form a data transceiver circuit. Broadcast data is data that needs to be sent to every basic processing circuit. Distribution data is data that needs to be selectively sent to some of the basic processing circuits; the specific selection may be determined by the main processing circuit according to the load and the calculation mode. In the broadcast transmission mode, the broadcast data is transmitted to each basic processing circuit in broadcast form. (In practical applications, the broadcast data may be transmitted to each basic processing circuit in a single broadcast or in multiple broadcasts; the embodiments of the present application do not limit the number of broadcasts.)

When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits. The data may be the same or different: when data is sent by distribution, the data received by each receiving basic processing circuit may differ, although some basic processing circuits may receive the same data.

Specifically, when broadcasting data, the control circuit of the main processing circuit transmits the data to some or all of the basic processing circuits, and every basic processing circuit that receives data receives the same data.
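The difference between the two transmission modes can be shown with a minimal sketch. The function names are hypothetical, and the round-robin policy in `distribute` is an assumption: the text leaves the selection to the main processing circuit, based on load and calculation mode.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def broadcast(block: T, num_circuits: int) -> List[T]:
    """Every receiving basic processing circuit gets the same block."""
    return [block] * num_circuits


def distribute(blocks: Sequence[T], num_circuits: int) -> List[List[T]]:
    """Each basic processing circuit may get different blocks (round-robin here)."""
    inboxes: List[List[T]] = [[] for _ in range(num_circuits)]
    for i, block in enumerate(blocks):
        inboxes[i % num_circuits].append(block)
    return inboxes


print(broadcast("v", 3))
print(distribute(["a", "b", "c", "d", "e"], 2))
```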

Optionally, the vector operator circuit of the main processing circuit may perform vector operations, including but not limited to: addition, subtraction, multiplication and division of two vectors; addition, subtraction, multiplication and division of a vector and a constant; or any operation on each element of a vector. The continuous operation may specifically be addition, subtraction, multiplication or division of a vector and a constant, an activation operation, an accumulation operation, and the like.

Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner product operator circuit, the vector operator circuit and the accumulator circuit may be integrated together or may be provided separately.

The chip device may optionally further comprise one or more branch processing circuits. With branch processing circuits, the main processing circuit is connected to the branch processing circuits and the branch processing circuits are connected to the basic processing circuits; the inner product operator circuit of a basic processing circuit is used to perform inner product operations between data blocks; the control circuit of the main processing circuit controls the data receiving circuit or the data transmitting circuit to receive and transmit external data, and controls the data transmitting circuit to distribute the external data to the branch processing circuits; and the branch processing circuits are used to transmit and receive data of the main processing circuit or the basic processing circuits. The architecture shown in FIG. 1a is suitable for the computation of complex data: because the number of units that can be connected to the main processing circuit is limited, branch processing circuits need to be added between the main processing circuit and the basic processing circuits so that more basic processing circuits can be attached, thereby enabling the computation of complex data blocks. The connection structure of the branch processing circuits and the basic processing circuits may be arbitrary and is not limited to the H-type structure of FIG. 1a. Optionally, the path from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, and the path from the basic processing circuits to the main processing circuit is a gather structure. Broadcast, distribution and gather are defined as follows: in a distribution or broadcast structure, the number of basic processing circuits is greater than the number of main processing circuits, that is, one main processing circuit corresponds to a plurality of basic processing circuits, so the structure from the main processing circuit to the plurality of basic processing circuits is a broadcast or distribution structure, whereas the structure from the plurality of basic processing circuits to the main processing circuit may be a gather structure.

The basic processing circuit receives data distributed or broadcast by the main processing circuit and stores the data in an on-chip cache of the basic processing circuit, can perform operation to generate a result, and can send the data to the main processing circuit.

The data involved in the basic processing circuit may be of any data type, for example floating-point numbers of any bit width or fixed-point numbers of any bit width; all the arithmetic circuits and memory circuits involved may be arithmetic and memory circuits capable of handling any data type, for example floating-point arithmetic and memory circuits of any bit width or fixed-point arithmetic and memory circuits of any bit width.

Optionally, each basic processing circuit may include a data type conversion operation circuit, or a data type conversion operation circuit may be configured in part of the basic processing circuits; the data type conversion operation circuit may be used to convert received or transmitted data from floating point type data to fixed point type data, or may be used to convert fixed point type data to floating point type data. The present invention is not limited to the specific form of the data type conversion operation circuit described above.

Optionally, the vector arithmetic circuit of the basic processing circuit may perform vector arithmetic on the two vectors after the data type conversion, and of course, in practical application, the inner product arithmetic circuit of the basic processing circuit may perform inner product arithmetic on the two vectors after the data type conversion, and the accumulator circuit may also accumulate the results of the inner product arithmetic.

In one alternative, the two vectors may be stored in the on-chip cache and/or registers, and the basic processing circuit may fetch the two vectors to perform the operation as required by the actual computation. The operations include, but are not limited to: inner product operations, multiplication operations, addition operations, or other operations.

In one alternative, the results of the inner product operation may be accumulated onto an on-chip cache and/or register; the alternative scheme has the advantages of reducing the data transmission quantity between the basic processing circuit and the main processing circuit, improving the operation efficiency and reducing the data transmission power consumption.

In an alternative, the result of the inner product operation is not accumulated and is directly transmitted as a result; the technical scheme has the advantages that the operation amount in the basic processing circuit is reduced, and the operation efficiency of the basic processing circuit is improved.

In an alternative scheme, each basic processing circuit can execute inner product operation of multiple groups of two vectors, and can also respectively accumulate the results of the multiple groups of inner product operation;

in one alternative, multiple sets of two vector data may be stored in on-chip caches and/or registers;

in one alternative, the results of the multiple sets of inner-product operations may be accumulated in on-chip caches and/or registers, respectively;

in an alternative scheme, the results of the inner product operations of each group can be directly transmitted as the results without accumulation;

in one alternative, each basic processing circuit may perform inner product operations between one vector and each of a plurality of vectors (a "one-to-many" inner product, that is, one of the two vectors in each of the groups of inner products is shared), and accumulate the inner product result corresponding to each vector separately. With this technical solution, the same set of weights can be used to perform multiple calculations on different input data, which increases data reuse, reduces the amount of data transferred within the basic processing circuit, improves calculation efficiency and reduces power consumption.

Specifically, in the data used to calculate the inner products, the vector shared by the groups and the other vector of each group (that is, the vector that differs from group to group) may come from different sources:

In one alternative, when calculating the inner products, the vector shared by the groups is broadcast or distributed from the main processing circuit or a branch processing circuit;

in one alternative, when calculating the inner products, the vector shared by the groups comes from the on-chip cache;

in one alternative, when calculating the inner products, the vector shared by the groups comes from a register;

in one alternative, when calculating the inner products, the other, unshared vector of each group is broadcast or distributed from the main processing circuit or a branch processing circuit;

in one alternative, when calculating the inner products, the other, unshared vector of each group comes from the on-chip cache;

in one alternative, when calculating the inner products, the other, unshared vector of each group comes from a register;

in one alternative, when performing multiple groups of inner product operations, any number of copies of the shared vector may be retained in the on-chip cache and/or registers of the basic processing circuit;

in one alternative, one copy of the shared vector may be retained for each group of inner products;

in one alternative, only one copy of the shared vector may be retained;

specifically, the results of the multiple groups of inner product operations may be accumulated in the on-chip cache and/or registers, respectively;

specifically, the results of each group of inner product operations may be transmitted directly as results without accumulation.
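The "one-to-many" inner product with a single retained copy of the shared vector, and the choice between local accumulation and direct transmission, can be sketched as below. The class and method names are hypothetical, and plain integers stand in for fixed-point values; this is an illustrative model, not the disclosed circuit.

```python
from typing import List, Optional


class BasicCircuitModel:
    """Toy model of one basic processing circuit holding a shared vector."""

    def __init__(self, num_groups: int) -> None:
        self.shared: Optional[List[int]] = None  # single retained copy of the shared vector
        self.acc: List[int] = [0] * num_groups   # one accumulator per group

    def load_shared(self, vector: List[int]) -> None:
        """Store the shared vector once (for example, a set of weights)."""
        self.shared = list(vector)

    def one_to_many(self, others: List[List[int]], accumulate: bool = True) -> List[int]:
        """Inner product of the shared vector with each unshared vector."""
        if self.shared is None:
            raise RuntimeError("shared vector has not been loaded")
        products = [sum(s * o for s, o in zip(self.shared, other)) for other in others]
        if not accumulate:
            return products  # transmit directly, no local accumulation
        self.acc = [a + p for a, p in zip(self.acc, products)]
        return list(self.acc)


circuit = BasicCircuitModel(num_groups=2)
circuit.load_shared([1, 2])
print(circuit.one_to_many([[3, 4], [5, 6]]))
print(circuit.one_to_many([[1, 0], [0, 1]]))
```

With accumulation enabled, only the final sums need to travel back toward the main processing circuit; with it disabled, every inner product is sent, which trades extra transmission for less work inside the basic processing circuit.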

Referring to FIG. 1a, the structure includes a main processing circuit (which can perform vector operations) and multiple basic processing circuits (which can perform inner product operations). The advantage of this combination is that the device can use the basic processing circuits to perform matrix and vector multiplication and also use the main processing circuit to perform any other vector operation, so that the device can complete more operations more quickly with a limited hardware configuration, reducing the number of data transfers with the outside of the device, improving calculation efficiency and reducing power consumption. In addition, the chip may provide a data type conversion operation circuit in the basic processing circuits and/or the main processing circuit, so that floating-point data can be converted into fixed-point data, and fixed-point data into floating-point data, during neural network calculation. The chip can also dynamically allocate which circuit performs the data type conversion according to the operation amount (that is, the load) of each circuit (mainly the main processing circuit and the basic processing circuits), which reduces the complexity of the data calculation and reduces power consumption, and the dynamic allocation of data type conversion can be achieved without affecting the calculation efficiency of the chip. The manner of allocation includes, but is not limited to, load balancing, minimum-load allocation, and the like.

Referring to FIG. 1b, the apparatus shown in FIG. 1b is a computing apparatus in which branch processing circuits are individually connected to basic processing circuits. As shown in FIG. 1b, the apparatus includes a main processing circuit and N basic processing circuits, where the main processing circuit (its specific structure is shown in FIG. 1c) may be connected to the N basic processing circuits directly or indirectly. In the case of indirect connection, one alternative may include N/4 branch processing circuits as shown in FIG. 1a, each connected to 4 basic processing circuits; for the circuits included in the main processing circuit and the N basic processing circuits, refer to the description of FIG. 1a above. Basic processing circuits may also be arranged within the branch processing circuits, and the number of basic processing circuits connected to each branch processing circuit is not limited to 4; a manufacturer may configure this according to actual needs. The main processing circuit and/or the N basic processing circuits may include a data type conversion operation circuit. Specifically, the main processing circuit may include the data type conversion operation circuit, or the N basic processing circuits or some of them may include it, or both the main processing circuit and the N basic processing circuits or some of them may include it.

The main processing circuit may dynamically allocate the executor of the data type conversion step according to the neural network calculation instruction. Specifically, the main processing circuit may determine, according to its own load, whether to perform the data type conversion step on the received data. For example, the load value may be divided into a plurality of intervals, each interval corresponding to an executor of the data type conversion step. Taking 3 intervals as an example: in interval 1 the load value is low, and the data type conversion step may be performed by the main processing circuit alone; in interval 2 the load value lies between interval 1 and interval 3, and the data type conversion step may be performed by the main processing circuit or by the N basic processing circuits jointly; in interval 3 the load value is high, and the data type conversion step may be performed by the N basic processing circuits. The allocation may be made explicitly: for example, the main processing circuit may configure a special indication or instruction, and a basic processing circuit that receives it determines to perform the data type conversion step, while a basic processing circuit that does not receive it determines not to perform the step. The allocation may also be made implicitly: for example, when a basic processing circuit receives floating-point data and determines that an inner product operation needs to be performed, it converts the data to fixed-point data.
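The three-interval example can be expressed as a simple threshold function. The normalized load range of 0 to 1 and the thresholds `low` and `high` are hypothetical values chosen for illustration; the source gives no numeric boundaries.

```python
def conversion_executor(load: float, low: float = 0.3, high: float = 0.7) -> str:
    """Pick the executor of the data type conversion step from the main circuit's load.

    load is a normalized load value in [0, 1]; low and high are assumed
    interval boundaries, not values taken from the source.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be within [0, 1]")
    if load < low:
        return "main"            # interval 1: main processing circuit alone
    if load < high:
        return "main_or_basic"   # interval 2: main circuit or the basic circuits
    return "basic"               # interval 3: the N basic processing circuits


for value in (0.1, 0.5, 0.9):
    print(value, conversion_executor(value))
```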

The following provides a method for implementing computation using the apparatus shown in FIG. 1a. The computation may specifically be a neural network computation, for example a forward operation of the neural network or training of the neural network. In practical applications, the forward operation may perform matrix multiplication, convolution, activation, transformation and other operations according to different input data, all of which may be implemented using the apparatus shown in FIG. 1a.

The data conversion operation circuit of the main processing circuit may first convert the data type, after which the control circuit transmits the data to the basic processing circuits for operation. For example, the data conversion operation circuit of the main processing circuit may convert floating-point numbers into fixed-point numbers of lower bit width and then transmit them to the basic processing circuits.

If the data received by a basic processing circuit is floating-point data, the basic processing circuit may first convert the data type and then perform the calculation. For example, the basic processing circuit receives floating-point numbers transmitted by the main processing circuit, its data conversion operation circuit converts them into fixed-point numbers, and its inner product operation circuit, vector operation circuit or accumulator circuit then performs the operation, which improves operation efficiency and reduces power consumption.

After calculating a result, the basic processing circuit may first convert its data type and then transmit it to the main processing circuit. For example, a floating-point operation result calculated by the basic processing circuit may first be converted into a fixed-point number of low bit width and then transmitted to the main processing circuit. The advantages are a smaller data bit width during transmission, higher efficiency and lower power consumption.
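The conversion between floating-point and fixed-point data mentioned above can be illustrated as follows. The text does not specify a fixed-point format, so the 16-bit signed representation with 8 fractional bits, the saturation behaviour and the function names are all assumptions made for this sketch.

```python
def float_to_fixed(value, frac_bits=8, total_bits=16):
    """Quantise a float to a signed fixed-point integer, saturating on overflow.

    The format (total_bits wide, frac_bits fractional bits) is hypothetical.
    """
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def fixed_to_float(raw, frac_bits=8):
    """Recover the floating-point value represented by a fixed-point integer."""
    return raw / (1 << frac_bits)
```

For example, `float_to_fixed(1.5)` yields the integer 384, which `fixed_to_float` maps back to 1.5; values outside the representable range are clamped.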

The main processing circuit transmits the data to be calculated to all or some of the basic processing circuits. Taking matrix-times-vector calculation as an example, the control circuit of the main processing circuit may split the matrix data by column, each column serving as one piece of basic data. For example, an m×n matrix may be split into n vectors of m rows each, and the control circuit of the main processing circuit distributes the n split vectors to a plurality of basic processing circuits. For the vector, the control circuit of the main processing circuit may broadcast the vector as a whole to each of the basic processing circuits. If the value of m is relatively large, the control circuit may first split the m×n matrix into x×n vectors. Taking x=2 as an example, the matrix may be split into 2n vectors of m/2 rows each, i.e. each of the n vectors of m rows is divided equally into 2 vectors. For example, if the first of the n vectors has 1000 rows, its first 500 rows form a first vector and its last 500 rows form a second vector, and the control circuit broadcasts the 2 vectors to the plurality of basic processing circuits in 2 broadcasts.
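The splitting described above (an m×n matrix into n column vectors, then each column into x equal pieces) might look like the following sketch. The list-of-rows matrix representation and the function names are assumptions for illustration.

```python
def split_columns(matrix):
    """Split an m x n matrix (a list of m rows) into n column vectors of m elements."""
    return [list(column) for column in zip(*matrix)]


def split_vector(vector, parts):
    """Divide a vector equally into `parts` consecutive pieces."""
    if len(vector) % parts != 0:
        raise ValueError("vector length must be divisible by the number of parts")
    size = len(vector) // parts
    return [vector[i * size:(i + 1) * size] for i in range(parts)]
```

With x=2, a 1000-row column would be split by `split_vector(column, 2)` into its first 500 rows and its last 500 rows.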

The data transmission mode can be broadcasting or distributing, or any other possible transmission mode;

after receiving the data, the basic processing circuit executes operation to obtain an operation result;

the basic processing circuit transmits the operation result back to the main processing circuit;

the operation result may be an intermediate operation result or a final operation result.

A matrix-times-vector operation is performed using the apparatus shown in FIG. 1a.

(A matrix-times-vector operation takes the inner product of each row of the matrix with the vector and places the results into a vector in the order of the corresponding rows.)

The following describes the calculation of the product of a matrix S of M rows and L columns and a vector P of length L, as shown in FIG. 2a, where the neural network computing device has K basic processing circuits (each row of the matrix S has the same length as the vector P, and their data correspond one-to-one by position):

Referring to FIG. 2, FIG. 2 provides a method for implementing the matrix-times-vector operation, which may specifically include:

Step S201: the data conversion operation circuit of the main processing circuit converts each row of data in the matrix S into fixed-point data, the control circuit of the main processing circuit distributes each row to one of the K basic processing circuits, and the basic processing circuit stores the received distributed data in its on-chip cache and/or registers.

In an alternative, if the number of rows M of the matrix S satisfies M <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits.

In an alternative, if the number of rows M of the matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the matrix S to each basic processing circuit.

The set of rows of S distributed to the i-th basic processing circuit is denoted Ai and contains Mi rows in total; FIG. 2c shows the calculation to be performed on the i-th basic processing circuit.

In an alternative, in each basic processing circuit, for example the i-th basic processing circuit, the received distributed data, for example the matrix Ai, may be stored in the registers and/or on-chip cache of the i-th basic processing circuit. The advantages are a smaller volume of subsequently transmitted distributed data, higher calculation efficiency and lower power consumption.
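The distribution of the M rows of S into the sets Ai can be sketched as below. The text does not say which rows go to which circuit, so the round-robin policy used here is purely an assumption; any assignment that gives each circuit zero or more rows would fit the description.

```python
def distribute_rows(m, k):
    """Return, for each of the k basic processing circuits, the row indices in Ai.

    Round-robin assignment is an assumed policy, not one stated in the text.
    When m <= k, only the first m circuits receive a row.
    """
    groups = [[] for _ in range(k)]
    for row in range(m):
        groups[row % k].append(row)
    return groups
```

Here `len(groups[i])` corresponds to Mi, the number of rows held by the i-th basic processing circuit.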

Step S202: the data type operation circuit of the main processing circuit converts the vector P into fixed-point data, and the control circuit of the main processing circuit transmits each part of the fixed-point vector P to the K basic processing circuits by broadcast.

In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P only once to the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully multiplexes (reuses) the data of the vector P obtained this time to complete the inner product operations corresponding to each row of the matrix Ai. The advantages are that repeated transmission of the vector P from the main processing circuit to the basic processing circuits is avoided, execution efficiency is improved and transmission power consumption is reduced.

In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the registers or on-chip cache of each basic processing circuit multiple times, and the i-th basic processing circuit does not multiplex the data of the vector P obtained each time, completing the inner product operations corresponding to each row of the matrix Ai over multiple passes. The advantages are that the amount of vector P data transmitted within the basic processing circuit in a single transmission is reduced, the capacity of the cache and/or registers of the basic processing circuit can be reduced, execution efficiency is improved, and transmission power consumption and cost are reduced.

In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the registers or on-chip cache of each basic processing circuit multiple times, and the i-th basic processing circuit partially multiplexes the data of the vector P obtained each time to complete the inner product operations corresponding to each row of the matrix Ai. The advantages are that the amount of data transmitted from the main processing circuit to the basic processing circuits is reduced, the amount of data transmitted within the basic processing circuit is reduced, execution efficiency is improved and transmission power consumption is reduced.

Step S203: the inner product operation circuits of the K basic processing circuits calculate the inner products of the data of the matrix S and the vector P; for example, the i-th basic processing circuit calculates the inner products of the data of the matrix Ai and the data of the vector P.

Step S204: the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain accumulated results, and transmit the accumulated results back to the main processing circuit in fixed-point form.

In an alternative, the partial sum obtained each time the basic processing circuit performs an inner product operation may be transmitted back to the main processing circuit for accumulation (a partial sum is a part of the accumulated result, the accumulated result being, for example, f1×g1+f2×g2+f3×g3+f4×g4+f5×g5). The advantages are a smaller amount of computation within the basic processing circuit and higher operation efficiency of the basic processing circuit.

In an alternative, the partial sum obtained each time the basic processing circuit performs an inner product operation may be stored in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after accumulation is finished. The advantages are a smaller amount of data transmitted between the basic processing circuit and the main processing circuit, higher operation efficiency and lower data transmission power consumption.

In an alternative, the partial sum obtained each time the basic processing circuit performs an inner product operation is, in some cases, stored in the registers and/or on-chip cache of the basic processing circuit for accumulation and, in other cases, transmitted to the main processing circuit for accumulation, and the result is transmitted back to the main processing circuit after accumulation is finished. The advantages are a smaller amount of data transmitted between the basic processing circuit and the main processing circuit, higher operation efficiency, lower data transmission power consumption, a smaller amount of computation within the basic processing circuit and higher operation efficiency of the basic processing circuit.
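Steps S202 to S204 on a single basic processing circuit can be sketched as follows, using the alternative in which partial sums are accumulated locally and only the finished result is returned. The segment-wise broadcast of P, the `segment` parameter and the function names are assumptions for illustration; fixed-point conversion is omitted for brevity.

```python
def inner_product(a, b):
    """Multiply corresponding elements of two equal-length sequences and sum them."""
    return sum(x * y for x, y in zip(a, b))


def basic_circuit_rows_times_vector(rows_ai, p, segment):
    """Simulate one basic processing circuit that holds the rows Ai.

    The vector p arrives in broadcast parts of `segment` elements (an assumed
    granularity). Each part is reused for every row in Ai, and the partial
    sums are accumulated locally before being returned to the main circuit.
    """
    accumulators = [0] * len(rows_ai)
    for start in range(0, len(p), segment):
        part = p[start:start + segment]            # one broadcast part of P
        for j, row in enumerate(rows_ai):
            # Partial sum for row j over this part, added to the local accumulator.
            accumulators[j] += inner_product(row[start:start + segment], part)
    return accumulators
```

In the first alternative above, each partial sum would instead be sent to the main processing circuit as soon as it is computed, trading more transmission for less local accumulation.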

Referring to FIG. 2b, a matrix-times-matrix operation is performed using the apparatus shown in FIG. 1a.

The following describes the calculation of the product of a matrix S of M rows and L columns and a matrix P of L rows and N columns (each row of the matrix S has the same length as each column of the matrix P, as shown in FIG. 2d), where the neural network computing device has K basic processing circuits:

Step S201b: the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit stores the received data in its on-chip cache and/or registers.

In an alternative, if the number of rows M of S satisfies M <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits.

In an alternative, if the number of rows M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the matrix S to each basic processing circuit.

A total of Mi rows of S are distributed to the i-th basic processing circuit, and the set of these Mi rows is denoted Ai; FIG. 2e shows the calculation to be performed on the i-th basic processing circuit.

In an alternative, in each basic processing circuit, for example the i-th basic processing circuit:

the received matrix Ai distributed by the main processing circuit is stored in the registers and/or on-chip cache of the i-th basic processing circuit. The advantages are a smaller amount of subsequently transmitted data, higher calculation efficiency and lower power consumption.

Step S202b: the control circuit of the main processing circuit transmits each part of the matrix P to each basic processing circuit by broadcast.

In an alternative, each part of the matrix P may be broadcast only once to the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the matrix P obtained this time to complete the inner product operations corresponding to each row of the matrix Ai. Multiplexing in this embodiment specifically means that the basic processing circuit uses the same data repeatedly in the calculation; for example, multiplexing the data of the matrix P means using the data of the matrix P multiple times.

In an alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P to the registers or on-chip cache of each basic processing circuit multiple times, and the i-th basic processing circuit does not multiplex the data of the matrix P obtained each time, completing the inner product operations corresponding to each row of the matrix Ai over multiple passes.

In an alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P to the registers or on-chip cache of each basic processing circuit multiple times, and the i-th basic processing circuit partially multiplexes the data of the matrix P obtained each time to complete the inner product operations corresponding to each row of the matrix Ai.

In an alternative, each basic processing circuit, for example the i-th basic processing circuit, calculates the inner products of the data of the matrix Ai and the data of the matrix P.

Step S203b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits the accumulated result back to the main processing circuit.

In an alternative, the basic processing circuit may transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation.

In an alternative, the partial sum obtained each time the basic processing circuit performs an inner product operation may be stored in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after accumulation is finished.

In an alternative, the partial sum obtained each time the basic processing circuit performs an inner product operation is, in some cases, stored in the registers and/or on-chip cache of the basic processing circuit for accumulation and, in other cases, transmitted to the main processing circuit for accumulation, and the result is transmitted back to the main processing circuit after accumulation is finished.
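The matrix-times-matrix flow of steps S201b to S203b can be sketched as a simulation in which the rows of S are spread over K circuits and every circuit sees all of P. The round-robin row assignment and the function name are assumptions; the sketch uses the alternative in which P is broadcast once and fully reused.

```python
def matrix_times_matrix(s, p, k):
    """Multiply S (M x L) by P (L x N), simulating k basic processing circuits."""
    m, n = len(s), len(p[0])
    # The columns of P are what each row of S is inner-multiplied with;
    # they are broadcast once and reused by every circuit.
    columns = [list(column) for column in zip(*p)]
    result = [[0] * n for _ in range(m)]
    for i in range(k):                      # the i-th basic processing circuit
        for r in range(i, m, k):            # rows in Ai (assumed round-robin)
            for c, column in enumerate(columns):
                result[r][c] = sum(x * y for x, y in zip(s[r], column))
    return result
```

The result does not depend on `k`; changing it only changes how the rows are partitioned among the simulated circuits.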

Referring to FIG. 3a, a fully connected operation is performed using the apparatus shown in FIG. 1a:

If the input data of the fully connected layer is a vector (i.e. the input of the neural network is a single sample), the weight matrix of the fully connected layer is taken as the matrix S and the input vector as the vector P, and the matrix-times-vector operation shown in FIG. 2 is performed according to the method of using the device.

If the input data of the fully connected layer is a matrix (i.e. the input of the neural network is a batch of multiple samples), the weight matrix of the fully connected layer is taken as the matrix S and the input matrix as the matrix P, or the weight matrix of the fully connected layer is taken as the matrix P and the input matrix as the matrix S, and the matrix-times-matrix operation shown in FIG. 2c is performed according to the method of using the device.

Referring to FIG. 3b, a convolution operation is performed using the apparatus shown in FIG. 1a:

For one convolution layer, the number of convolution kernels is denoted M.

Step S301: the control circuit of the main processing circuit distributes the weights of each convolution kernel of the convolution layer to one of the K basic processing circuits, where they are stored in the on-chip cache and/or registers of that basic processing circuit.

In an alternative, if the number of convolution kernels M satisfies M <= K, the control circuit of the main processing circuit distributes the weights of one convolution kernel to each of M basic processing circuits.

In an alternative, if the number of convolution kernels M satisfies M > K, the control circuit of the main processing circuit distributes the weights of one or more convolution kernels to each basic processing circuit.

A total of Mi convolution kernels are distributed to the i-th basic processing circuit, and the set of these convolution kernel weights is denoted Ai.

In each basic processing circuit, the received convolution kernel weights Ai distributed by the main processing circuit are stored in its registers and/or on-chip cache.

step S302, the control circuit of the main processing circuit transmits all parts of the input data P to all basic processing circuits in a broadcasting mode;

in an alternative scheme, the control circuit of the main processing circuit may broadcast each part of the input data P to the register or on-chip buffer of each basic processing circuit only once, and the ith basic processing circuit sufficiently multiplexes the data of the input data P obtained this time to complete the inner product operation corresponding to each convolution kernel in Ai;

in an alternative scheme, the control circuit of the main processing circuit may broadcast each part of the input data P to the register or on-chip buffer of each basic processing circuit for multiple times, where the ith basic processing circuit does not multiplex the data of the input data P obtained each time, and completes the inner product operation corresponding to each convolution kernel in Ai in multiple times;

In an alternative scheme, the control circuit of the main processing circuit may broadcast each part of the input data P to the register or on-chip buffer of each basic processing circuit for multiple times, and the ith basic processing circuit performs partial multiplexing on the data of the input data P obtained each time, so as to complete inner product operation corresponding to each convolution kernel in Ai;

step S303, each basic processing circuit calculates the inner product of the convolution kernel and the data of the input data P, for example, the i-th basic processing circuit calculates the inner product of each convolution kernel of Ai and the data of the input data P;

Step S304: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits the result back to the main processing circuit:

In an alternative, the basic processing circuit may transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation.

In an alternative, the basic processing circuit may also store the partial sum obtained from each inner product operation in its registers and/or on-chip cache and transmit it back to the main processing circuit after accumulation is finished.

In an alternative, the basic processing circuit may also, in some cases, store the partial sum obtained from each inner product operation in its registers and/or on-chip cache for accumulation and, in other cases, transmit it to the main processing circuit for accumulation, and transmit the result back to the main processing circuit after accumulation is finished.
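Steps S301 to S304 express convolution as inner products between each kernel's weights and windows of the input data. The sketch below illustrates this for a deliberately simplified case: a single-channel 2-D input, stride 1 and no padding, all of which are assumptions not stated in the text, as are the function and variable names.

```python
def conv2d_by_inner_products(image, kernels):
    """Compute one output map per kernel using only inner products.

    Assumes a single-channel image (list of rows), equally sized 2-D kernels,
    stride 1 and no padding. Each kernel plays the role of weights held by
    one basic processing circuit; the image plays the role of broadcast data P.
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    outputs = []
    for kernel in kernels:
        flat_kernel = [w for row in kernel for w in row]
        feature_map = []
        for y in range(out_h):
            out_row = []
            for x in range(out_w):
                # Flatten the window in the same order as the kernel weights.
                window = [image[y + dy][x + dx] for dy in range(kh) for dx in range(kw)]
                out_row.append(sum(a * b for a, b in zip(flat_kernel, window)))
            feature_map.append(out_row)
        outputs.append(feature_map)
    return outputs
```

Each entry of a feature map is one accumulated inner product, which is the quantity the accumulator circuit returns to the main processing circuit.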

Method of updating weights using the device shown in fig. 1 a:

The function of updating the weights during neural network training is implemented using the vector operation circuit of the main processing circuit. Specifically, weight updating refers to updating the weights using the gradients of the weights.

In an alternative, the vector operation circuit of the main processing circuit performs addition or subtraction on the two vectors, namely the weights and the weight gradients, to obtain an operation result, which is the updated weights.

In an alternative, the vector operation circuit of the main processing circuit multiplies or divides the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values, and then performs addition or subtraction on the intermediate weights and the intermediate weight gradient values to obtain an operation result, which is the updated weights.

In an alternative, a set of momentum values is first calculated using the gradients of the weights, and the updated weights are then obtained by addition or subtraction of the momentum and the weights.
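The three weight update alternatives above can be sketched with element-wise vector operations. The text gives no formulas, so the specific forms below (subtracting the gradient, scaling factors `weight_scale` and `grad_scale`, and a classical momentum rule with coefficients `lr` and `mu`) are assumptions chosen as common examples.

```python
def update_plain(weights, grads):
    """Updated weight = weight - gradient (element-wise subtraction)."""
    return [w - g for w, g in zip(weights, grads)]


def update_scaled(weights, grads, weight_scale, grad_scale):
    """Scale weights and gradients by a number first, then subtract."""
    return [w * weight_scale - g * grad_scale for w, g in zip(weights, grads)]


def update_momentum(weights, grads, velocity, lr, mu):
    """Compute a momentum vector from the gradients, then add it to the weights.

    Uses the classical rule v = mu * v - lr * g (an assumed choice).
    """
    new_velocity = [mu * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

All three functions use only element-wise additions, subtractions and multiplications by a number, which are the operations the text attributes to the vector operation circuit.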

Method for implementing the reverse operation of the fully connected layer using the device shown in FIG. 1a:

The reverse operation of the fully connected layer can be divided into two parts. In FIG. 4a, the solid arrows represent the forward calculation process of the fully connected layer, and FIG. 4b shows the reverse calculation process of the fully connected layer.

The reverse operation of the fully connected layer shown in FIG. 4a and FIG. 4b can be completed with the apparatus shown in FIG. 1a using the matrix-times-matrix method shown in FIG. 2b.

Implementing the reverse operation of the convolution layer using the apparatus shown in FIG. 1a:

The reverse operation of the convolution layer can be divided into two parts. In FIG. 4a, the solid arrows represent the forward calculation process of the convolution layer, and FIG. 4b shows the reverse calculation process of the convolution layer.

The reverse operation of the convolution layer shown in FIG. 4a and FIG. 4b can be completed with the apparatus shown in FIG. 1a using the method shown in FIG. 3b.

Method for implementing BLAS (Basic Linear Algebra Subprograms) functions using the apparatus shown in FIG. 1a:

GEMM calculation refers to the matrix-matrix multiplication operation in the BLAS library. The general form of this operation is C = alpha × op(S) × op(P) + beta × C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, op denotes some operation on the matrix S or P, and some auxiliary integers are additionally used as parameters to describe the width and height of the matrices S and P.

The steps for implementing GEMM calculation using the apparatus shown in FIG. 1a include:

the data type conversion operation circuit of the main processing circuit may perform data type conversion on the matrix S and the matrix P;

the conversion circuit of the main processing circuit performs the respective op operations on the input matrix S and the matrix P;

in an alternative, op may be the transpose operation of a matrix, which may be implemented using the matrix transposition circuit of the main processing circuit;

in an alternative, after the op operations on the matrix S and the matrix P have been performed, the data type conversion may also be performed by the data conversion operation circuit of the main processing circuit, i.e. the data conversion operation circuit converts op(S) and op(P) from floating-point data to fixed-point data, after which the matrix multiplication shown in FIG. 2b is performed;

in an alternative, the op of a given matrix may be empty, in which case no op operation is performed;

the matrix multiplication between op(S) and op(P) is completed with the apparatus shown in FIG. 1a using the matrix-times-matrix calculation method described for FIG. 2b;

each value in the result of op(S) × op(P) is multiplied by alpha using the arithmetic logic unit of the main processing circuit;

in an alternative, when alpha is 1, the multiplication by alpha is not performed;

the operation beta × C is implemented using the arithmetic logic unit of the main processing circuit;

in an alternative, when beta is 1, the multiplication by beta is not performed;

the step of adding the corresponding positions of the matrices alpha × op(S) × op(P) and beta × C is implemented using the vector operation circuit of the main processing circuit, yielding the GEMM calculation result;

in an alternative, when beta is 0, this step is not performed.
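The GEMM steps above, including the shortcuts for alpha = 1, beta = 1 and beta = 0, can be sketched as follows. The list-of-rows matrix representation, the boolean transpose flags standing in for op, and the function names are assumptions for illustration; data type conversion is omitted.

```python
def transpose(matrix):
    """Return the transpose of a list-of-rows matrix."""
    return [list(row) for row in zip(*matrix)]


def matmul(a, b):
    """Matrix product computed as row-by-column inner products."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def gemm(alpha, s, p, beta, c, trans_s=False, trans_p=False):
    """Compute alpha * op(S) * op(P) + beta * C, where op is transpose or empty."""
    op_s = transpose(s) if trans_s else s        # op may be empty
    op_p = transpose(p) if trans_p else p
    product = matmul(op_s, op_p)
    if alpha != 1:                               # skip the multiply when alpha is 1
        product = [[alpha * v for v in row] for row in product]
    if beta == 0:                                # skip the addition when beta is 0
        return product
    scaled_c = c if beta == 1 else [[beta * v for v in row] for row in c]
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(product, scaled_c)]
```

Because the beta = 0 branch returns before C is read, `c` may be `None` in that case.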

GEMV calculation refers to the matrix-vector multiplication operation in the BLAS library. The general form of this operation is C = alpha × op(S) × P + beta × C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation on the matrix S.

The steps for implementing GEMV calculation using the apparatus shown in FIG. 1a are:

the data type conversion operation circuit of the main processing circuit may perform data type conversion on the input matrix S and the vector P;

the conversion circuit of the main processing circuit performs the corresponding op operation on the input matrix S;

in an alternative, op may be the transpose operation of the matrix, which is implemented using the conversion circuit of the main processing circuit;

in an alternative, the op of the matrix may be empty, in which case the transpose operation is not performed;

the matrix-vector multiplication between the matrix op(S) and the vector P is completed with the apparatus shown in FIG. 1a using the matrix-times-vector calculation method described for FIG. 2a;

each value in the result of op(S) × P is multiplied by alpha using the arithmetic logic unit of the main processing circuit;

the step of adding the corresponding positions of the vectors alpha × op(S) × P and beta × C is implemented using the vector operation circuit of the main processing circuit, yielding the GEMV result;

in an alternative, when beta is 0, the addition step is not performed.
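The GEMV steps can be sketched in the same spirit. The boolean `trans_s` flag standing in for op and the function name are assumptions; the beta = 0 shortcut follows the alternative stated above.

```python
def gemv(alpha, s, p, beta, c, trans_s=False):
    """Compute alpha * op(S) * P + beta * C for a matrix S and vectors P, C."""
    # op is either the transpose or empty (no operation).
    op_s = [list(row) for row in zip(*s)] if trans_s else s
    # Matrix-times-vector: one inner product per row of op(S).
    product = [sum(x * y for x, y in zip(row, p)) for row in op_s]
    scaled = [alpha * v for v in product]
    if beta == 0:                      # the addition step is skipped
        return scaled
    return [u + beta * w for u, w in zip(scaled, c)]
```

For instance, with S = [[1, 2], [3, 4]] and P = [1, 1], the product op(S) × P is [3, 7] without transposition and [4, 6] with it.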

Method for implementing an activation function using the device shown in FIG. 1a:

The activation circuit of the main processing circuit takes a vector as input and calculates the activation vector of that vector.

In an alternative, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (whose input is one numerical value and whose output is also one numerical value) and writes the result to the corresponding position of the output vector.

In an alternative, the activation function may be y=max(m, x), where x is the input value, y is the output value, and m is a constant.

In an alternative, the activation function may be y=tanh(x), where x is the input value and y is the output value.

In an alternative, the activation function may be y=sigmoid(x), where x is the input value and y is the output value.

In an alternative, the activation function may be a piecewise linear function.

In an alternative, the activation function may be any function that takes one number as input and outputs one number.
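The element-wise activation described above can be sketched as a function applied to each position of the input vector. The helper names are hypothetical, and `hard_tanh` is only one assumed example of a piecewise linear function; the text does not name a specific one.

```python
import math


def clamp_below(x, m=0.0):
    """y = max(m, x); with m = 0 this is the familiar ReLU."""
    return max(m, x)


def sigmoid(x):
    """y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_tanh(x):
    """A piecewise linear example: clip x to the range [-1, 1]."""
    return max(-1.0, min(1.0, x))


def activate(vector, fn):
    """Apply a scalar-to-scalar activation to every element of the input vector."""
    return [fn(v) for v in vector]
```

`activate(vector, math.tanh)` covers the y=tanh(x) alternative, and any other one-number-in, one-number-out function can be passed in the same way.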

The sources of the input vector include, but are not limited to:

an external data source of the device;

in an alternative, the result of a matrix-times-vector operation performed by the device;

in an alternative, the result of a matrix-times-matrix operation performed by the device;

in an alternative, a result calculated by the main processing circuit of the device;

in an alternative, the calculation result obtained after the main processing circuit of the device applies the bias.

It should be noted that the above activation operation may be implemented by the arithmetic logic circuit and the accumulator circuit in the main processing circuit, or a separate activation circuit may be added to the main processing circuit to implement it.

Biasing is implemented using the apparatus shown in FIG. 1a:

The vector operation circuit of the main processing circuit can be used to add two vectors or two matrices.

The vector operation circuit of the main processing circuit can also be used to add a vector to each row, or to each column, of a matrix.

In an alternative, the matrix may come from the result of a matrix-times-matrix operation performed by the device.

In an alternative, the matrix may come from the result of a matrix-times-vector operation performed by the device.

In an alternative, the matrix may be data received from outside by the main processing circuit of the device.

In an alternative, the vector may be data received from outside by the main processing circuit of the device.

The data sources include, but are not limited to, those listed above.
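The two bias forms described above, adding a vector to each row or to each column of a matrix, can be sketched as follows. The list-of-rows matrix representation and the function names are assumptions for illustration.

```python
def add_to_each_row(matrix, vector):
    """Add `vector` to every row; its length must equal the number of columns."""
    return [[x + b for x, b in zip(row, vector)] for row in matrix]


def add_to_each_column(matrix, vector):
    """Add `vector` to every column; its length must equal the number of rows."""
    return [[x + b for x in row] for row, b in zip(matrix, vector)]
```

Which form applies depends on whether the samples of a batch are laid out along the rows or the columns of the result matrix.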

Data type conversion is implemented using the apparatus shown in FIG. 1a:

The data type conversion operation circuit of the main processing circuit is used to convert the data type.

In an alternative, the data type conversion operation circuit of the main processing circuit converts the data type of a set of data.

In an alternative, the forms of data type conversion include, but are not limited to, floating-point number to fixed-point number and fixed-point number to floating-point number.

The present invention also provides a chip comprising a computing device, the computing device comprising:

a main processing circuit, in which the data involved may be of any data type and, in an alternative, may be floating-point numbers of any bit width or fixed-point numbers of any bit width; all of the arithmetic circuits and storage circuits involved may be arithmetic circuits and storage circuits for any data type and, in an alternative, may be arithmetic circuits and storage circuits for floating-point numbers of any bit width or for fixed-point numbers of any bit width.

In one alternative, the main processing circuit includes a data type conversion operation circuit;

in one alternative, the main processing circuit includes a vector operation unit that performs data type conversion;

Specifically, a data input interface that receives input data is included.

In an alternative, the source of the received data may be the outside of the neural network operation circuit device, or some or all of the basic processing circuits of the neural network operation circuit device.

In an alternative, there may be a plurality of such data input interfaces.

Specifically, a data output interface that outputs data may be included.

In an alternative, the destination of the output data may be the outside of the neural network operation circuit device, or some or all of the basic processing circuits of the neural network operation circuit device.

In an alternative, there may be a plurality of such data output interfaces.

in one alternative, the main processing circuitry includes on-chip caches and/or registers;

in an alternative, the main processing circuit includes an operation unit, and may perform data operation;

in one alternative, the main processing circuit includes an arithmetic operation unit therein;

In an alternative, the main processing circuit includes a vector operation unit that can perform an operation on a set of data at the same time. Specifically, the arithmetic and/or vector operations may be operations of any type, including but not limited to: addition, subtraction, multiplication and division of two numbers; addition, subtraction, multiplication and division of one number and a constant; exponential, power, logarithmic and various nonlinear operations on one number; and comparison and logic operations on two numbers. For vectors, they include: addition, subtraction, multiplication and division of two vectors; addition, subtraction, multiplication and division of each element of a vector and a constant; exponential, power, logarithmic and various nonlinear operations on each element of a vector; and comparison and logic operations on every two corresponding elements of the vectors.

In an alternative, the main processing circuit includes a data rearrangement unit for transmitting data to the basic processing circuits in a certain order, or for rearranging data in place in a certain order.

In an alternative, the order of data arrangement includes transforming the dimension order of a multidimensional data block; it may further include partitioning a data block so that the pieces can be transmitted to different basic processing circuits.
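A dimension-order transformation of a multidimensional data block can be illustrated as below. The text does not name specific layouts, so the choice of a three-dimensional block stored in channel, height, width order and rearranged into height, width, channel order is an assumption, as is the function name.

```python
def chw_to_hwc(block):
    """Reorder a 3-D block indexed [channel][row][col] into [row][col][channel]."""
    channels, height, width = len(block), len(block[0]), len(block[0][0])
    return [
        [[block[k][i][j] for k in range(channels)] for j in range(width)]
        for i in range(height)
    ]
```

After the rearrangement, all channel values for one spatial position are adjacent, which is one way a block could be laid out before being partitioned among basic processing circuits.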

The computing device further includes a plurality of base processing circuits: each basic processing circuit is used for calculating the inner product of two vectors, the calculation method is that the basic processing circuit receives two groups of numbers, elements in the two groups of numbers are correspondingly multiplied, and the multiplied results are accumulated; the result of the inner product is transmitted out, where it may be transmitted to other basic processing circuits, or directly to the main processing circuit, depending on the location of the basic processing circuit.

The data involved in the basic processing circuit may be of any data type; in one alternative, it may be represented by floating-point numbers of any bit width or fixed-point numbers of any bit width. All the arithmetic circuits and storage circuits involved may be arithmetic circuits and storage circuits for any data type; in one alternative, they may be arithmetic circuits and storage circuits for floating-point numbers of any bit width or for fixed-point numbers of any bit width.

In one alternative, the basic processing circuit includes a data type conversion operation circuit;

In one alternative, the basic processing circuit includes a vector operation unit that performs data type conversion;
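A conversion between floating-point and fixed-point data can be sketched in Python as follows. The format used here (signed 16-bit values with 8 fractional bits, rounding to the nearest value, and saturation on overflow) is an assumption chosen for illustration; this passage does not fix a bit width or rounding rule, and the names `float_to_fixed` and `fixed_to_float` are hypothetical.

```python
def float_to_fixed(value, frac_bits=8, total_bits=16):
    """Convert a float to a signed fixed-point integer, saturating on overflow."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = int(round(value * scale))
    return max(lowest, min(highest, raw))


def fixed_to_float(raw, frac_bits=8):
    """Convert a signed fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


raw = float_to_fixed(1.5)            # 1.5 * 2**8
restored = fixed_to_float(raw)       # exact, because 1.5 is representable
saturated = float_to_fixed(1000.0)   # exceeds the 16-bit range and is clamped
```

Values that are not exact multiples of 2^-8 lose precision in the round trip, which is the trade-off a narrower fixed-point format accepts in exchange for cheaper arithmetic.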

Specifically, the basic processing circuit includes a storage unit consisting of an on-chip cache and/or registers;

Specifically, it includes one or more data input interfaces for receiving data;

In one alternative, it includes two data input interfaces, from each of which one or more data items may be obtained at a time;

In one alternative, the basic processing circuit may store input data received from the data input interfaces in registers and/or the on-chip cache;

The source of the data received by the data input interface may be other basic processing circuits and/or the main processing circuit, namely:

the main processing circuit of the neural network operation circuit device;

other basic processing circuits of the neural network operation circuit device (the neural network operation circuit device has a plurality of basic processing circuits);

Specifically, it includes one or more data output interfaces for transmitting output data;

In one alternative, one or more data items may be transmitted from the data output interface;

Specifically, the data transmitted through the data output interface may be: data received from the data input interface, data stored in on-chip caches and/or registers, multiplier operation results, accumulator operation results, or inner product operation results, or any combination thereof.

In an alternative, the basic processing circuit includes three data output interfaces: two of them correspond respectively to the two data input interfaces, each outputting the data received from its data input interface at the previous layer, and the third is responsible for outputting operation results;

Specifically, the data output interface may transmit data to:

the main processing circuit of the neural network operation circuit device;

other basic processing circuits of the neural network operation circuit device (the neural network operation circuit device has a plurality of basic processing circuits);

The data sources above and the data destinations here determine the connection relationships of the basic processing circuits in the device.

Specifically, the basic processing circuit includes an arithmetic operation circuit, which may specifically be: one or more multiplier circuits, one or more accumulator circuits, one or more circuits that perform inner product operations on two groups of numbers, or any combination thereof.

In an alternative, a multiplication of two numbers may be performed, and the result may be stored in the on-chip cache and/or registers, or accumulated directly into the registers and/or on-chip cache;

In an alternative, an inner product operation on two groups of data may be performed, and the result may be stored in the on-chip cache and/or registers, or accumulated directly into the registers and/or on-chip cache;

In one alternative, an accumulation operation on data may be performed, accumulating the data into the on-chip cache and/or registers;

Specifically, the data accumulated by the accumulator circuit may be: data received from the data input interface, data stored in on-chip caches and/or registers, multiplier operation results, accumulator operation results, inner product operation results, or any combination thereof.
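The difference between storing a result and accumulating it directly into a register can be illustrated with a small software model. In the Python sketch below, the class `ArithmeticCircuit`, its dictionary of named registers, and the `accumulate` flag are all hypothetical constructs introduced for illustration; they are not the circuit structure of the disclosure.

```python
class ArithmeticCircuit:
    """Toy model of a multiplier, an inner product circuit, and an accumulator sharing registers."""

    def __init__(self):
        self.registers = {}  # stands in for registers and/or the on-chip cache

    def _write(self, dest, value, accumulate):
        # Either overwrite the destination or add the value to what is already stored.
        if accumulate:
            self.registers[dest] = self.registers.get(dest, 0) + value
        else:
            self.registers[dest] = value
        return self.registers[dest]

    def multiply(self, a, b, dest, accumulate=False):
        """Multiply two numbers; store the product or accumulate it into `dest`."""
        return self._write(dest, a * b, accumulate)

    def inner_product(self, xs, ys, dest, accumulate=False):
        """Inner product of two groups of data; store or accumulate the result."""
        if len(xs) != len(ys):
            raise ValueError("both groups must contain the same number of elements")
        return self._write(dest, sum(x * y for x, y in zip(xs, ys)), accumulate)

    def accumulate(self, value, dest):
        """Accumulate any value (input data, stored data, or an earlier result) into `dest`."""
        return self._write(dest, value, True)


unit = ArithmeticCircuit()
unit.multiply(2, 3, "r0")                      # r0 holds 6
unit.multiply(4, 5, "r0", accumulate=True)     # r0 holds 6 + 20
unit.inner_product([1, 2], [3, 4], "r1")       # r1 holds 1*3 + 2*4
unit.accumulate(unit.registers["r1"], "r0")    # r0 holds 26 + 11
```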

It should be noted that, as used in the above description of the basic processing circuit, "data input interface" and "data output interface" refer to the data input and output interfaces of each basic processing circuit, not the data input and output interfaces of the entire device.

The disclosure also discloses a neural network computing device, which comprises one or more chips as shown in FIG. 1a or FIG. 1b, and is used for acquiring data to be computed and control information from other processing devices, executing the specified neural network computation, and transmitting the execution result to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one such chip is included, the chips may be linked to one another and transmit data through a specific structure, for example interconnected through a PCIe bus, so as to support larger-scale neural network operations. In this case, the chips may share one control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection may use any interconnection topology.

The neural network operation device has high compatibility and can be connected to various types of servers through a PCIe interface.

Claims

1. An integrated circuit chip device, comprising: a main processing circuit, k branch processing circuits, and k groups of basic processing circuits, wherein the main processing circuit is connected to each of the k branch processing circuits, each of the k branch processing circuits corresponds to one group of the k groups of basic processing circuits, and each group of basic processing circuits comprises at least one basic processing circuit;

the branch processing circuit includes: a data type operation circuit for performing conversion between floating-point type data and fixed-point type data;

the main processing circuit is used for acquiring an input data block, a weight data block and a multiplication instruction, and, according to the multiplication instruction, classifying the input data block as a distribution data block and the weight data block as a broadcast data block; splitting the distribution data block to obtain a plurality of basic data blocks, distributing the plurality of basic data blocks to at least one of the k branch processing circuits, and broadcasting the broadcast data block to the k branch processing circuits;

the k branch processing circuits are used for converting the broadcast data block and the received basic data block into a fixed-point type broadcast data block and a fixed-point type received basic data block through the data type operation circuit, and forwarding the fixed-point type broadcast data block and the fixed-point type received basic data block to the basic processing circuits;

the k groups of basic processing circuits are used for executing operation on the fixed-point type broadcast data block and the fixed-point type received basic data block in a parallel mode to obtain fixed-point type operation results, and sending the operation results to the k branch processing circuits;

the k branch processing circuits are used for converting the fixed-point type operation result into a floating-point type operation result through the data type operation circuit and sending the floating-point type operation result to the main processing circuit;

the main processing circuit is used for processing the floating point type operation result to obtain an instruction result of the multiplication instruction;

the input data block is a vector or a matrix;

the weight data block is a vector or a matrix.

2. The integrated circuit chip device of claim 1, wherein,

the k groups of basic processing circuits are specifically configured to perform a multiplication operation on the broadcast data block and the received basic data block according to a fixed-point data type to obtain a product result of the fixed-point data type, and send the product result to the k branch processing circuits as an operation result;

the k branch processing circuits are used for converting the operation result into a floating point type operation result through the data type operation circuit and sending the floating point type operation result to the main processing circuit;

the main processing circuit is used for executing an accumulation operation on the floating-point type operation result to obtain an accumulation result, and sorting the accumulation result to obtain the instruction result.

3. The integrated circuit chip device of claim 1, wherein,

the k groups of basic processing circuits are specifically configured to perform an inner product operation on the broadcast data block and the received basic data block according to a fixed-point data type to obtain an inner product result of the fixed-point data type, and send the inner product result to the k branch processing circuits as an operation result;

the main processing circuit is used for sorting the floating-point type inner product result to obtain the instruction result.

4. An integrated circuit chip device according to any one of claims 1-3, wherein,

the main processing circuit is specifically configured to broadcast the broadcast data block to the k branch processing circuits in a single broadcast.

5. The integrated circuit chip device of claim 4, wherein,

the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the k branch processing circuits over a plurality of broadcasts.

6. The integrated circuit chip device of claim 5, wherein,

the k groups of basic processing circuits are specifically configured to perform inner product processing on the partial broadcast data block and the basic data block according to a fixed-point data type to obtain an inner product processing result, and send the inner product processing result to the main processing circuit.

7. The integrated circuit chip device of claim 6, wherein,

the k groups of basic processing circuits are specifically configured to reuse the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results, and send the n partial processing results to the main processing circuit, where n is an integer greater than or equal to 2.

8. The integrated circuit chip device of claim 1, wherein the multiplication instruction is a matrix-times-vector operation, and a control circuit of the main processing circuit is configured to send the data of one row of the matrix, one number or a portion of the numbers at a time, to one basic processing circuit;

or the control circuit of the main processing circuit is configured to send the data of several rows of the matrix, one number or a portion of the numbers of each row at a time, to one basic processing circuit.

9. The integrated circuit chip device of claim 1, wherein,

the main processing circuit includes: a master register or master on-chip cache circuit;

or the branch processing circuit comprises: a basic register or basic on-chip cache circuit;

or the basic processing circuit comprises: a basic register or basic on-chip cache circuit;

the main processing circuit includes: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, a data type operation circuit, or a data rearrangement circuit.

10. A neural network computing device, characterized in that it comprises one or more integrated circuit chip devices as claimed in any one of claims 1-9.
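
The data flow recited in claim 1 can be sketched in software as follows, purely as an illustration. The sketch assumes a matrix-times-vector multiplication in which the input matrix is the distribution data block (split into rows as basic data blocks) and the weight vector is the broadcast data block; it further assumes round-robin distribution of rows, a fixed-point format with 8 fractional bits, and inner product as the operation. The names `to_fixed`, `to_float`, and `matrix_times_vector` are hypothetical, and the loop over branches runs sequentially here although the claimed basic processing circuits operate in parallel.

```python
def to_fixed(value, frac_bits=8):
    """Floating point to fixed point (assumed format: integer scaled by 2**frac_bits)."""
    return int(round(value * (1 << frac_bits)))


def to_float(raw, frac_bits=8):
    """Fixed point back to floating point."""
    return raw / (1 << frac_bits)


def matrix_times_vector(matrix, vector, k=2, frac_bits=8):
    # Main processing circuit: split the distribution block into basic data blocks
    # (rows) and assign them round-robin to the k branch processing circuits.
    indexed_rows = list(enumerate(matrix))
    groups = [indexed_rows[i::k] for i in range(k)]

    results = {}
    for group in groups:  # one iteration stands in for one branch and its basic circuits
        # Branch processing circuit: convert the broadcast block to fixed point.
        fixed_vector = [to_fixed(v, frac_bits) for v in vector]
        for index, row in group:
            # Branch processing circuit: convert the received basic block to fixed point.
            fixed_row = [to_fixed(v, frac_bits) for v in row]
            # Basic processing circuit: inner product in integer (fixed-point) arithmetic.
            fixed_result = sum(a * b for a, b in zip(fixed_row, fixed_vector))
            # Branch processing circuit: a product of two fixed-point values carries
            # 2 * frac_bits fractional bits, so convert back with that scale.
            results[index] = to_float(fixed_result, 2 * frac_bits)

    # Main processing circuit: put the floating-point results back in row order.
    return [results[i] for i in range(len(matrix))]


out = matrix_times_vector([[1.0, 2.0], [0.5, -1.0], [3.0, 0.25]], [2.0, 4.0])
```

All operands in this usage example are exactly representable with 8 fractional bits, so the fixed-point path reproduces the floating-point result; other inputs would show a small quantization error.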