Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following describes, taking a CPU as an example, how a neural network operation is performed when the network makes heavy use of matrix-matrix multiplication; the operation of the CPU is described here by taking the multiplication of a matrix A and a matrix B as an example. Assume the result of multiplying matrix A by matrix B is C, i.e., C = A × B, where A = (aij) and B = (bij) are 3 × 3 matrices as shown below:

A = | a11 a12 a13 |    B = | b11 b12 b13 |
    | a21 a22 a23 |        | b21 b22 b23 |
    | a31 a32 a33 |        | b31 b32 b33 |

For the CPU, the steps adopted to obtain C by calculation may be to complete the calculation of the first row, then the second row, and finally the third row; that is, the CPU does not begin the calculation of the next row of data until the current row is finished. Taking the above formula as an example, the CPU first completes the calculation of the first row, i.e., a11*b11+a12*b21+a13*b31, a11*b12+a12*b22+a13*b32, and a11*b13+a12*b23+a13*b33; after that, it calculates a21*b11+a22*b21+a23*b31, a21*b12+a22*b22+a23*b32, and a21*b13+a22*b23+a23*b33; finally, it calculates a31*b11+a32*b21+a33*b31, a31*b12+a32*b22+a33*b32, and a31*b13+a32*b23+a33*b33.
Therefore, a CPU or a GPU computes one row at a time: only after the first row is calculated is the second row calculated, then the third row, until all rows are calculated.
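The row-by-row procedure described above can be sketched in Python (an illustrative reconstruction, not code from the embodiment):

```python
# A minimal sketch of how a CPU computes C = A x B row by row:
# each output row is finished completely before the next row begins.
def matmul_row_by_row(A, B):
    n, k = len(A), len(B[0])
    C = []
    for i in range(n):  # rows are computed strictly one after another
        row = [sum(A[i][p] * B[p][j] for p in range(len(B)))
               for j in range(k)]
        C.append(row)
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(matmul_row_by_row(A, B))  # multiplying by the identity returns A
```

The outer loop is inherently sequential here, which is exactly the bottleneck the chip device removes by assigning rows to separate basic units.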
Referring to fig. 1b, fig. 1b is a schematic structural diagram of a chip device. As shown in fig. 1b, the chip device includes: a main unit circuit, basic unit circuits, and branch unit circuits. The main unit may include a register and/or an on-chip cache circuit, and may further include one or any combination of: a vector operator circuit, an arithmetic and logic unit (ALU) circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access (DMA) circuit, a data rearrangement circuit, and the like. Each basic unit may include a basic register and/or a basic on-chip cache circuit, and may further include any combination of: an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. All of these circuits may be integrated circuits. If branch units are provided, the main unit is connected to the branch units and the branch units are connected to the basic units; the basic units are used for performing inner product operations between data blocks, the main unit is used for transceiving external data and distributing external data to the branch units, and the branch units are used for transceiving data of the main unit or the basic units. The structure shown in fig. 1b is suitable for the computation of complex data: because the number of units that can be connected directly to the main unit is limited, branch units are added between the main unit and the basic units to allow more basic units to be attached, thereby enabling the computation of complex data blocks.
The connection structure between the branch units and the basic units may be arbitrary and is not limited to the H-shaped structure of fig. 1b. Optionally, the path from the main unit to the basic units is a broadcast or distribution structure, and the path from the basic units to the main unit is a gather structure. Broadcast, distribution, and collection are defined as follows:
The data transfer modes from the main unit to the basic units may include the following:
The main unit is connected to a plurality of branch units respectively, and each branch unit is connected to a plurality of basic units respectively.
The main unit is connected to one branch unit, that branch unit is connected to a further branch unit, and so on, so that a plurality of branch units are connected in series; each branch unit is then connected to a plurality of basic units respectively.
The main unit is connected to a plurality of branch units respectively, and each branch unit is connected in series with a plurality of basic units.
The main unit is connected to one branch unit, that branch unit is connected to a further branch unit, and so on, so that a plurality of branch units are connected in series; each branch unit is then connected in series with a plurality of basic units.
When distributing data, the main unit transmits data to some or all of the basic units, and the data received by each receiving basic unit may be different;
when broadcasting data, the main unit transmits data to some or all of the basic units, and each receiving basic unit receives the same data.
When collecting data, some or all of the base units transmit the data to the master unit. It should be noted that the chip device shown in fig. 1a or fig. 1b may be a single physical chip, and of course, in practical applications, the chip device may also be integrated into other chips (e.g., CPU, GPU).
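The three transfer modes just defined can be illustrated with a small software sketch; the unit classes and method names below are hypothetical, not taken from the embodiment:

```python
# Illustrative model of distribution, broadcast, and collection between
# a main unit and basic units. Class and method names are assumptions.
class BaseUnit:
    def __init__(self):
        self.received = []

    def receive(self, data):
        self.received.append(data)

class MainUnit:
    def __init__(self, base_units):
        self.base_units = base_units

    def distribute(self, blocks):
        # distribution: each receiving basic unit may get DIFFERENT data
        for unit, block in zip(self.base_units, blocks):
            unit.receive(block)

    def broadcast(self, block):
        # broadcast: every receiving basic unit gets the SAME data
        for unit in self.base_units:
            unit.receive(block)

    def gather(self):
        # collection: basic units send their data back to the main unit
        return [unit.received for unit in self.base_units]

units = [BaseUnit() for _ in range(4)]
main = MainUnit(units)
main.distribute(["blk0", "blk1", "blk2", "blk3"])
main.broadcast("shared")
print(main.gather())  # each unit holds its own block plus the shared one
```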
Referring to fig. 1c, fig. 1c is a schematic data distribution diagram of a chip device, as shown by an arrow in fig. 1c, the arrow is a distribution direction of data, as shown in fig. 1c, after receiving external data, a main unit splits the external data and distributes the split external data to a plurality of branch units, and the branch units send the split data to a base unit.
Referring to fig. 1d, fig. 1d is a schematic diagram of data return of a chip device, as shown by an arrow in fig. 1d, the arrow is a data return direction, as shown in fig. 1d, a basic unit returns data (e.g., inner product calculation result) to a branch unit, and the branch unit returns the data to a main unit.
Referring to fig. 1a, fig. 1a is a schematic structural diagram of another chip device, which includes: a main unit and a base unit, the main unit being connected to the base unit. The configuration shown in fig. 1a has a limited number of connected base units, which is suitable for simple data calculation, since the base units are physically connected directly to the main unit.
Referring to fig. 2, fig. 2 provides a method for performing a neural network operation using the chip device, where the method is performed by using the chip device shown in fig. 1a or fig. 1b, and the method shown in fig. 2 includes the following steps:
step S201, the main unit of the chip device obtains a data block to be calculated and an operation instruction.
The data block to be calculated in step S201 may specifically be a matrix, a vector, three-dimensional data, four-dimensional data, multidimensional data, or the like; the embodiments of the present disclosure do not limit the specific representation of the data block. The operation instruction may specifically be a multiplication instruction, a convolution instruction, an addition instruction, a subtraction instruction, a BLAS (Basic Linear Algebra Subprograms) function, an activation function, or the like.
Step S202, the main unit divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction.
The implementation method of the step S202 may specifically be:
if the operation instruction is a multiplication instruction, the multiplier data block is determined to be a broadcast data block, and the multiplicand data block is determined to be a distribution data block.
And if the operation instruction is a convolution instruction, determining the input data block as a broadcast data block and the convolution kernel as a distribution data block.
Step S2031, the main unit splits the distribution data block to obtain a plurality of basic data blocks and distributes the plurality of basic data blocks to a plurality of basic units.
in step S2032, the main unit broadcasts the broadcast data block to a plurality of base units.
Optionally, steps S2031 and S2032 may also be performed in a loop. When the amount of data is large, the main unit splits the distribution data block to obtain a plurality of basic data blocks, splits each basic data block into m basic data sub-blocks, and splits the broadcast data block into m broadcast data sub-blocks; the main unit then distributes one basic data sub-block and broadcasts one broadcast data sub-block at a time, where the basic data sub-block and the broadcast data sub-block are data blocks on which parallel neural network computation can be performed. For example, taking a 1000 × 1000 matrix A multiplied by a 1000 × 1000 matrix B as an example, a basic data block may be the z-th row of matrix A, a basic data sub-block may be the first 20 columns of the z-th row of matrix A, and a broadcast data sub-block may be the first 20 rows of the z-th column of matrix B.
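The sub-block splitting in steps S2031/S2032 can be sketched on a small example (an 8-element row split into 4 sub-blocks stands in for the 1000 × 1000 case; sizes are illustrative):

```python
# Sketch of the loop-splitting: one basic data block (a row of A) and the
# matching broadcast data (a column of B) are cut into m aligned sub-blocks,
# each pair supporting an independent partial inner product.
def split_row(row, m):
    """Split one basic data block into m equal sub-blocks."""
    step = len(row) // m
    return [row[i * step:(i + 1) * step] for i in range(m)]

row_a = list(range(8))        # the z-th row of matrix A
col_b = [1] * 8               # the z-th column of matrix B
m = 4
a_subs = split_row(row_a, m)  # basic data sub-blocks
b_subs = split_row(col_b, m)  # broadcast data sub-blocks

# Accumulating the partial results reproduces the full inner product.
partial = [sum(x * y for x, y in zip(a, b)) for a, b in zip(a_subs, b_subs)]
full = sum(x * y for x, y in zip(row_a, col_b))
print(partial, sum(partial), full)
```

Because each (sub-block, sub-block) pair is independent, the m partial inner products can be computed in parallel and summed afterwards.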
The basic data block in step S203 may specifically be the smallest data block on which an inner product operation can be performed. Taking matrix multiplication as an example, the basic data block may be a row of data of the matrix; taking convolution as an example, the basic data block may be the weights of a convolution kernel.
The distribution manner in step S203 may refer to the description of the following embodiment, which is not described herein again, and the method for broadcasting the broadcast data block may also refer to the description of the following embodiment, which is not described herein again.
In step S2041, the basic unit of the chip device performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result (possibly an intermediate result).
Step S2042, if the operation result is not an intermediate result, the operation result is returned to the main unit.
The foregoing returning manner in step S204 may refer to the following description of the embodiments, and is not described herein again.
In step S205, the main unit processes the operation result to obtain the instruction result of the operation instruction for the data block to be calculated.
The processing manner in step S205 may be accumulation, sorting, or the like, and may further include nonlinear transformation and the like; the disclosure does not limit the specific processing manner, which needs to be configured according to the particular operation instruction.
According to the above technical solution, when an operation is executed, the main unit receives external data comprising a data block to be calculated and an operation instruction; it determines the distribution data block and the broadcast data block of the data block to be calculated according to the operation instruction, splits the distribution data block into a plurality of basic data blocks, broadcasts the broadcast data block to the plurality of basic units, and distributes the basic data blocks to the basic units. The basic units each perform an inner product operation on their basic data blocks and the broadcast data block to obtain operation results and return the operation results to the main unit, and the main unit obtains the instruction result of the operation instruction from the returned operation results. The technical point of this solution is that, for a neural network, the bulk of the computation is inner product operations between data blocks; these have a large overhead and a long calculation time, so the disclosed embodiment first uses the operation instruction to distinguish, within the data block to be calculated, the broadcast data block (the data block that every inner product operation must use) from the distribution data block (the data block that can be split across inner product operations). Taking matrix multiplication as an example, the data blocks to be calculated are matrix A and matrix B, and the operation instruction is a multiplication instruction (A × B). According to the rules of matrix multiplication, matrix A is determined to be the distribution data block, which can be split into a plurality of basic data blocks, and the multiplier matrix B is determined to be the broadcast data block. By the definition of matrix multiplication, each row of data of the multiplicand matrix A needs an inner product operation with the multiplier matrix B, so the technical scheme of the application splits matrix A into M basic data blocks, where each of the M basic data blocks may be one row of data of matrix A. The relatively time-consuming computation is thus carried out by the plurality of basic units separately, so that in the inner product calculation the plurality of basic units can compute results quickly in parallel. This reduces the calculation time, and the shorter calculation time in turn reduces the working time of the chip device and therefore its power consumption.
The effect of the technical solutions provided by the present disclosure is illustrated below by a practical example. As shown in fig. 2a, which is a schematic diagram of a matrix A multiplied by a vector B, matrix A has M rows and L columns and vector B has L rows. Assume the time an operator needs to compute the inner product of one row of matrix A with vector B is t1. A CPU or GPU must finish one row before moving on to the next, so the time T0 of the CPU/GPU calculation method is T0 = M × t1. With the solution provided by the embodiment of the present disclosure, assuming there are M basic units, matrix A is split into M basic data blocks, each being one row of data of matrix A, and the M basic units perform their inner product operations simultaneously, so the computation time is t1; the total time required is T1 = t1 + t2 + t3, where t2 may be the time for the main unit to split the data and t3 may be the time required to process the operation results of the inner product operations to obtain the instruction result. Since the computational cost of splitting the data and processing the results is very small, these times are very small, so T0 >> T1, and the solution of the disclosed embodiment can significantly reduce the computation time. As for the power consumption generated by the data to be operated on: because T0 >> T1, the chip device provided by the present disclosure runs for a much shorter time, and experiments show that a much shorter operating time of the chip device gives a much lower power consumption than a long operating time.
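The timing comparison above can be made concrete with a worked instance (all numbers below are illustrative, not measured values):

```python
# Worked instance of the timing model: T0 = M * t1 for sequential
# row-by-row computation versus T1 = t1 + t2 + t3 for M parallel units.
M = 1000        # rows of matrix A
t1 = 1.0        # time for one row-times-vector inner product
t2 = 0.5        # time for the main unit to split the data (assumed small)
t3 = 0.5        # time to assemble the instruction result (assumed small)

T0 = M * t1          # CPU/GPU: rows computed one after another
T1 = t1 + t2 + t3    # M basic units computing their rows in parallel
print(T0, T1)        # T0 is far larger than T1, hence the claimed speedup
```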
There are various implementations of the main unit broadcasting the broadcast data block to the multiple basic units in step S203, and specifically, the implementation may be:
the first mode is to broadcast the broadcast data block to the plurality of basic units by one time. (the broadcast refers to performing "one-to-many" data transmission, i.e., transmitting the same data block to a plurality (all or a part of) basic units by the master unit at the same time.) for example, the matrix a is the matrix B, where the matrix B is a broadcast data block, and the matrix B is broadcast to the plurality of basic units at one time, and for example, in convolution, the input data is a broadcast data block, and the input data block is broadcast to the plurality of basic units at one time. This has the advantage that the amount of data transmission between the master unit and the base unit can be saved, i.e. all broadcast data can be transmitted to a plurality of base units via only one broadcast.
Manner B: divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the partial broadcast data blocks to the plurality of basic units over multiple broadcasts; for example, matrix B is broadcast to the plurality of basic units over multiple broadcasts, specifically N columns of matrix B at a time. The advantage is that the configuration of the basic units can be reduced. The storage space of the register configured in a basic unit cannot be large; if a matrix B with a large data size were issued to the basic units at one time, each basic unit would need a larger register space to store it, and because the basic units are numerous, increasing the register space would inevitably have a great impact on cost. Broadcasting the broadcast data block over multiple broadcasts means each basic unit only needs to store the part of the broadcast data block received in each broadcast, thereby reducing cost.
It should be noted that manner A or manner B above may likewise be adopted to distribute the plurality of basic data blocks to the plurality of basic units in step S203; the only differences are that the transmission is unicast and the transmitted data are basic data blocks.
The implementation method of the step S204 may specifically be:
if the broadcast data block is broadcast in the mode a and the basic data block is distributed in the mode a (as shown in fig. 3 a), the basic unit performs the inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, that is, one row of inner product operation is performed at a time, and the inner product processing result (one of the operation results) is sent to the main unit, and the main unit accumulates the inner product processing result. The above-mentioned method can reduce the data transmission quantity between main unit and basic unit, and then raise the speed of calculation.
If the broadcast data block is broadcast by manner B, then in one optional technical solution, each time a basic unit receives a partial broadcast data block, it performs one partial inner product operation of its basic data block with the partial broadcast data block to obtain a partial processing result, sends the result to the main unit, and the main unit accumulates the results. In another optional solution, if a basic unit has received n basic data blocks, it multiplexes the partial broadcast data block to perform the inner product operation of the partial broadcast data block with each of the n basic data blocks, obtaining n partial processing results; the basic unit sends the n results to the main unit, and the main unit accumulates the n results respectively. Of course, the above accumulation can also be performed within the basic unit.
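The second option above, multiplexing one partial broadcast data block across the n basic data blocks a unit holds, can be sketched as follows (the function name and data are illustrative):

```python
# Sketch of manner B with multiplexing: a basic unit holding n basic data
# blocks reuses each arriving partial broadcast data block n times.
def base_unit_step(basic_blocks, partial_broadcast):
    """One arrival of a partial broadcast block -> n partial inner products."""
    results = []
    for block in basic_blocks:  # multiplex the broadcast block n times
        results.append(sum(x * y for x, y in zip(block, partial_broadcast)))
    return results

basic_blocks = [[1, 2], [3, 4]]  # n = 2 basic data blocks (rows of A)
partial_b = [10, 100]            # one partial broadcast block (part of B)
print(base_unit_step(basic_blocks, partial_b))  # one result per basic block
```

Multiplexing means the partial broadcast block is read once from the unit's register and used n times, which is what saves transmission and storage compared with re-broadcasting.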
The above case covers situations in which the data amount of the broadcast data block is very large and the distribution data block is also large. For the chip device, since it is a hardware configuration, the number of basic units configured is theoretically unbounded but in practice limited, generally a few dozen, and the number may keep changing, e.g., increasing, as technology develops. However, in the matrix-by-matrix operations of a neural network, matrix A may have thousands of rows and matrix B may have thousands of columns, so issuing matrix B to the basic units in one broadcast is not feasible. An implementation is therefore to broadcast part of the data of matrix B at a time, for example the first 5 columns of data, and a similar approach may be adopted for matrix A. The basic unit may then perform a partial inner product calculation each time, store the result of the partial inner product calculation in a register, and after all the partial inner product operations of a row are finished, accumulate the results of all the partial inner product calculations of that row to obtain an operation result, which is sent to the main unit. This approach has the advantage of increasing the calculation speed.
Referring to fig. 3, fig. 3 provides a calculation method of a neural network. In this embodiment the calculation is described as matrix A × matrix B, which may be the schematic matrices shown in fig. 3a. For convenience of description, the calculation method of the neural network shown in fig. 3 is performed in the chip device shown in fig. 1b, which has 16 basic units; for convenience of description and allocation, the value of M shown in fig. 3a is set to 32, the value of N to 15, and the value of L to 20. It will of course be appreciated that the computing device may have any number of basic units. The method, as shown in fig. 3, includes the following steps:
step S301, the main unit receives the matrix A, the matrix B and the multiplication instruction A and B.
Step S302, the main unit determines, according to the multiplication instruction A × B, that matrix B is the broadcast data block and matrix A is the distribution data block, and splits matrix A into 32 basic data blocks, each of which is one row of data of matrix A.
Step S303, the main unit uniformly allocates the 32 basic data blocks to the 16 basic units, i.e., each basic unit receives 2 basic data blocks; the blocks may be allocated in any non-repeating order.
The allocation in step S303 may adopt other manners; for example, when the number of data blocks cannot be evenly allocated to every basic unit, the data blocks may be allocated unevenly; some of the data blocks that cannot be evenly divided may also be split first and then distributed evenly. The embodiments of the present disclosure do not limit how the basic data blocks are distributed to the plurality of basic units.
Step S304, the main unit extracts the partial data of the first few columns (for example, the first 5 columns) of matrix B and broadcasts the partial data of the first 5 columns of matrix B to the 16 basic units.
Step S305, the 16 basic units multiplex the partial data of the first 5 columns twice, performing inner product and accumulation operations with their 2 basic data blocks to obtain 32 × 5 pre-processing results, and send the 32 × 5 pre-processing results to the main unit.
Step S306, the main unit extracts the partial data of the middle 5 columns of matrix B and broadcasts the partial data of the middle 5 columns of matrix B to the 16 basic units.
Step S307, the 16 basic units multiplex the partial data of the middle 5 columns twice, performing inner product and accumulation operations with their 2 basic data blocks to obtain 32 × 5 intermediate processing results, and send the 32 × 5 intermediate processing results to the main unit.
Step S308, the main unit extracts the partial data of the last 5 columns of matrix B and broadcasts the partial data of the last 5 columns of matrix B to the 16 basic units.
Step S309, the 16 basic units multiplex the partial data of the last 5 columns twice, performing inner product and accumulation operations with their 2 basic data blocks to obtain 32 × 5 post-processing results, and send the 32 × 5 post-processing results to the main unit.
Step S310, the main unit combines the 32 × 5 pre-processing results, the 32 × 5 intermediate processing results, and the 32 × 5 post-processing results in that column order to obtain a 32 × 15 matrix C, which is the instruction result of matrix A × matrix B.
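Steps S301-S310 can be modeled end to end in software (this is a sketch of the data flow with made-up matrix contents, not the hardware itself):

```python
# End-to-end model of the fig. 3 scheme: a 32x20 matrix A times a 20x15
# matrix B, 16 basic units each holding 2 rows of A, and B broadcast in
# three 5-column batches. Matrix values here are arbitrary test data.
M, L, N, UNITS, COLS = 32, 20, 15, 16, 5

A = [[(i * L + j) % 7 for j in range(L)] for i in range(M)]
B = [[(i * N + j) % 5 for j in range(N)] for i in range(L)]

# S302/S303: split A into 32 basic data blocks (rows), 2 per basic unit.
unit_rows = [list(range(u * 2, u * 2 + 2)) for u in range(UNITS)]

C = [[0] * N for _ in range(M)]
for start in range(0, N, COLS):        # S304/S306/S308: broadcast 5 columns
    cols = range(start, start + COLS)
    for rows in unit_rows:             # S305/S307/S309: each unit computes
        for i in rows:                 # inner products for its 2 rows
            for j in cols:
                C[i][j] = sum(A[i][p] * B[p][j] for p in range(L))

# S310: C now equals the reference product of A and B.
ref = [[sum(A[i][p] * B[p][j] for p in range(L)) for j in range(N)]
       for i in range(M)]
print(C == ref)
```

The model confirms that combining the three 5-column batches reproduces the full 32 × 15 product; in hardware, the per-unit loops run concurrently rather than sequentially.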
The technical scheme shown in fig. 3 splits matrix A into 32 basic data blocks and then broadcasts matrix B in batches, so that the basic units can produce their results in batches; and since the inner product computations are divided among the 16 basic units, the calculation time can be greatly reduced. The method therefore has the advantages of short calculation time and low energy consumption.
Referring to fig. 1a, fig. 1a is a chip apparatus provided by the present disclosure, the chip apparatus including a main unit and a plurality of basic units, where both the main unit and the basic units are hardware chip units;
the main unit is used for executing each continuous operation in the neural network operation and transmitting data with the basic unit;
and the basic unit is used for executing parallel acceleration operation in a neural network according to the data transmitted by the main unit and transmitting an operation result to the main unit.
The parallel accelerated operations include, but are not limited to, large-scale, parallelizable operations between data blocks such as multiplication and convolution.
Each of the above successive operations includes, but is not limited to: accumulation operation, matrix transposition operation, data sorting operation, and the like.
The chip apparatus comprises a main unit and a plurality of basic units. The main unit is used for acquiring a data block to be calculated and an operation instruction, and dividing the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; splitting the distribution data block to obtain a plurality of basic data blocks, distributing the plurality of basic data blocks to the plurality of basic units, and broadcasting the broadcast data block to the plurality of basic units. Each basic unit is used for performing an inner product operation on its basic data blocks and the broadcast data block to obtain an operation result and sending the operation result to the main unit. The main unit is further used for processing the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
Optionally, the chip device further includes: a branch unit disposed between the main unit and the base unit; the branch unit is used for forwarding data.
Optionally, the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single broadcast.
Optionally, the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and send the operation result to the main unit.
Optionally, the main unit is configured to, when the operation results are inner product processing results, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the operation instruction for the data block to be calculated.
Optionally, the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
Optionally, the basic unit is specifically configured to perform one inner product processing of a partial broadcast data block with a basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main unit.
Optionally, the basic unit is specifically configured to multiplex a partial broadcast data block n times, performing the inner product operation of the partial broadcast data block with each of n basic data blocks to obtain n partial processing results, accumulate the n partial processing results respectively to obtain n partial operation results, and send the n partial operation results to the main unit, where n is an integer greater than or equal to 2.
The present disclosure also provides an application method of the chip apparatus shown in fig. 1a, which can specifically be used for performing one or any combination of matrix-times-matrix, matrix-times-vector, convolution, or fully-connected operations.
Specifically, the master unit may also perform a pooling operation, a regularization operation, and a neural network operation step such as batch normalization, lrn.
The present embodiments also provide a chip comprising a chip arrangement as shown in fig. 1a or 1 b.
The specific embodiments of the present application also provide a smart device, which includes the above chip, and the chip integrates a chip apparatus as shown in fig. 1a or fig. 1b. Such smart devices include, but are not limited to, smart phones, tablet computers, personal digital assistants, smart watches, smart cameras, smart televisions, smart refrigerators, and the like; these devices are merely examples, and the embodiments of the present application are not limited to any specific form of device.
The matrix-by-matrix operation described above is described in the embodiment shown in fig. 3 and will not be detailed again here.
Performing a fully connected operation using the chip device
If the input data of the fully connected layer is a vector of length L (such as the vector B in "fig. 3a fully connected 1-single sample"), i.e. the input of the neural network is a single sample, then the output of the fully connected layer is a vector of length M, and the weight of the fully connected layer is an M×L matrix (such as the matrix A in "fig. 3b fully connected 1-single sample"). The weight matrix of the fully connected layer is taken as the matrix A (i.e. the distribution data block), the input data is taken as the vector B (i.e. the broadcast data block), and the operation is performed according to the first method shown in fig. 2. The specific operation method may be:
If the input data of the fully connected layer is a matrix, i.e. the input of the neural network is a batch of samples operated on together: the input data represents N input samples, each sample being a vector of length L, so the input data is represented by an L×N matrix (such as the matrix B in "fig. 3b fully connected 1-multiple samples"); the output of the fully connected layer is a vector of length M for each sample, so the output data of the fully connected layer is an M×N matrix (such as the result matrix in "fig. 3a fully connected 1-multiple samples"); and the weight of the fully connected layer is an M×L matrix (such as the matrix A in "fig. 3a fully connected 1-multiple samples"). In this case, the weight matrix of the fully connected layer is taken as the matrix A (i.e. the distribution data block) and the input data matrix is taken as the matrix B (i.e. the broadcast data block); or the weight matrix of the fully connected layer is taken as the matrix B (i.e. the broadcast data block) and the input data is taken as the matrix A (i.e. the distribution data block); and the operation is performed according to the first method shown in fig. 2.
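The shape relationships of the fully connected cases above can be checked with a small numpy sketch (shapes only; the distribution/broadcast mechanics of fig. 2 are not modeled, and the names are illustrative):

```python
import numpy as np

M, L_dim, N = 4, 6, 3
A = np.random.rand(M, L_dim)    # weight matrix (distribution data block), M x L
B = np.random.rand(L_dim, N)    # batch input, one length-L sample per column (broadcast data block)

out = A @ B                     # M x N output: one length-M vector per sample
assert out.shape == (M, N)

# Single-sample case: the input is a length-L vector, the output a length-M vector
b = np.random.rand(L_dim)
assert (A @ b).shape == (M,)
```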
Chip device
When the chip apparatus is used to perform an artificial neural network operation, the input data of a convolutional layer, a pooling layer, or a regularization layer (also called a normalization layer, such as BN (batch normalization) or LRN (local response normalization)) in the neural network is shown as "fig. 3d convolution 2-input data" (for clarity, C = 5, H = 10 and W = 12 are used as an example of the three-dimensional data block representing each sample; in actual use, the sizes of N, C, H and W are not limited to the values shown in fig. 3d). Each three-dimensional data block in fig. 3d represents the input data corresponding to one sample of the layer; the three dimensions of each three-dimensional data block are C, H and W, and there are N such three-dimensional data blocks in total.
When the calculation of these neural network layers is carried out, after the main unit receives the input data, the data rearrangement circuit of the main unit arranges each sample of the input data in a certain order, and the order may be any order;
optionally, the order may place the input data such that the coordinate of the C dimension varies fastest, e.g. NHWC or NWHC, where C is the innermost dimension of the data block, N is the outermost dimension, and H and W are the middle dimensions. This has the effect that the values along C are adjacent in memory, which makes it easy to improve the parallelism of the operation and to operate on multiple feature maps (Feature maps) in parallel.
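The rearrangement just described (making C the innermost, fastest-varying dimension) corresponds to an NCHW-to-NHWC transpose, which can be sketched as follows (illustrative variable names):

```python
import numpy as np

N, C, H, W = 2, 5, 10, 12           # sizes from the fig. 3d example
x_nchw = np.arange(N * C * H * W).reshape(N, C, H, W)
x_nhwc = x_nchw.transpose(0, 2, 3, 1)   # C becomes the innermost dimension

# In linear memory order, the C values of one (n, h, w) position are now
# contiguous, which is what enables parallel work across feature maps.
flat = np.ascontiguousarray(x_nhwc).ravel()
assert np.array_equal(flat[:C], x_nchw[0, :, 0, 0])
```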
The following explains how C, H and W are to be understood for different neural network operations. For convolution and pooling, H and W are the dimensions along which the operation window slides (an example of sliding the window in the W dimension is shown in "fig. 3e convolution 3-slide a" and "fig. 3f convolution 3-slide b", and sliding the window in the H dimension is shown in fig. 3g). The size of the operation window coincides with the size of one of the M convolution kernels: M convolution kernels are shown in fig. 3c, and each convolution kernel is a 5×3×3 three-dimensional data block, so the operation window is also a 5×3×3 three-dimensional data block. For the M convolution kernels shown in fig. 3c, KH corresponds to the H dimension of the input data and KW corresponds to the W dimension. The gray blocks in figs. 3e, 3f and 3g are the data used by the operation window in each slide; sliding may be performed first along W and then along H, or first along H and then along W. Specifically, for convolution, the operation at each sliding window position is an inner product of the data block represented by the gray blocks in the figure with each of the M convolution kernel data blocks shown as "fig. 3c convolution 1-convolution kernel"; the convolution outputs one value per convolution kernel for each window position, i.e. there are M output values for each sliding window. For pooling, the operation at each sliding window position selects the maximum value, computes the average value, etc., over the H and W dimensions of the data block represented by the gray blocks (in the example of the figure, over the 9 numbers on the same plane of the gray data block); pooling outputs C values for each sliding window position.
C is the remaining dimension of a single sample's three-dimensional data block besides H and W, and N indicates that a total of N samples are operated on simultaneously in this layer. For LRN among the regularization algorithms, the C dimension is defined as follows: each basic LRN operation selects a continuous data block along the C dimension (namely a Y×1×1 data block), where Y in the Y×1×1 block is a count along the C dimension, Y is less than or equal to the maximum value of the C dimension, the first 1 refers to the H dimension, and the second 1 refers to the W dimension; the remaining two dimensions are the H and W dimensions. That is, each LRN regularization operation on a sample's three-dimensional data block is performed over a continuous run of data at different C coordinates with the same W coordinate and the same H coordinate. For the regularization algorithm BN, the mean and the variance (or the standard deviation) are computed over all values sharing the same C-dimension coordinate in the three-dimensional data blocks of the N samples.
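The BN and LRN indexing just described can be sketched in numpy (axis conventions taken from the text; this is an indexing illustration, not the device's implementation):

```python
import numpy as np

N, C, H, W = 2, 5, 10, 12
x = np.random.rand(N, C, H, W)

# BN: for each C coordinate, mean and variance over all N, H, W positions
# of the N three-dimensional sample blocks.
mean = x.mean(axis=(0, 2, 3))       # one mean per channel C
var = x.var(axis=(0, 2, 3))         # one variance per channel C
assert mean.shape == (C,) and var.shape == (C,)

# LRN, by contrast, works along C within one sample: a Y x 1 x 1 block is
# a run of Y consecutive C coordinates at a fixed (h, w) position.
Y, h, w = 3, 0, 0
lrn_block = x[0, 0:Y, h, w]
assert lrn_block.shape == (Y,)
```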
In "fig. 3c-fig. 3g" above, each small square represents one numerical value, which may also be referred to as a weight; the numbers used in the diagrams are only for illustration, and in practice each dimension may take any size (including the case where some dimension is 1, in which case the four-dimensional data block automatically becomes three-dimensional; for example, when the number of samples computed simultaneously is 1, the input data is a three-dimensional data block, and when the number of convolution kernels is 1, the convolution kernel data is a three-dimensional data block).
Performing a convolution operation between input data B and a convolution kernel A using the chip device
For a convolutional layer, the weight (all the convolution kernels) is shown as "fig. 3c convolution 1-convolution kernel". The number of convolution kernels is denoted M, and each convolution kernel consists of C matrices of KH rows and KW columns, so the weight of the convolutional layer can be represented as a four-dimensional data block with the four dimensions M, C, KH and KW. The input data of the convolutional layer is a four-dimensional data block composed of N three-dimensional data blocks, each of which consists of C feature matrices of H rows and W columns (i.e. a data block whose four dimensions are N, C, H and W), as shown in "fig. 3d convolution 2-input data". The weight of each of the M convolution kernels is distributed from the main unit to one of the K basic units and stored in the on-chip cache and/or register of that basic unit (here the M convolution kernels are the distribution data blocks, and each convolution kernel may be one basic data block; in practical applications, the basic data block may also be made smaller, for example one planar matrix of a convolution kernel). The specific distribution method may be: if the number M of convolution kernels is less than K, a weight of one convolution kernel is distributed to each of M basic units; if the number M of convolution kernels is greater than K, the weights of one or more convolution kernels are distributed to each basic unit. (The set of convolution kernel weights distributed to the ith basic unit is denoted Ai, comprising Mi convolution kernels in total.) Each basic unit, e.g. the ith basic unit, stores the received convolution kernel weight Ai distributed by the main unit in its own register and/or on-chip cache. Each part of the input data (i.e. an operation window as shown in fig. 3e, fig. 3f or fig. 3g) is transmitted to every basic unit in a broadcast manner (the broadcast may use either the first or the second manner above). In broadcasting, the data of an operation window may be broadcast to all basic units over multiple transmissions, each transmission carrying a part of the window; for example, a matrix of one plane is broadcast each time. Taking fig. 3e as an example, a KH×KW matrix of one C plane may be broadcast each time; in practical applications, the first n rows or the first n columns of a KH×KW matrix of one C plane may also be broadcast at a time. The transmission manner of partial data and the arrangement manner of partial data are not limited here. The input data may be rearranged into any dimension order, and then the parts of the input data are broadcast to the basic units sequentially in that order. Optionally, the distribution data, i.e. the convolution kernels, may also be sent in a manner similar to the operation windows of the input data, which is not repeated here. Optionally, the input data is rearranged into a loop with C as the innermost dimension; this has the effect that the values along C are adjacent, which improves the parallelism of the convolution operation and makes it easy to operate on multiple feature maps (Feature maps) in parallel. Optionally, the input data is rearranged into the dimension order NHWC or NWHC. Each basic unit, e.g. the ith basic unit, computes the inner product of the convolution kernels in the weight Ai with the corresponding part (i.e. the operation window) of the received broadcast data; the corresponding part of the weight Ai can be read directly from the on-chip cache for use, or can first be read into a register for multiplexing.
The results of the inner product operations of each basic unit are accumulated and transmitted back to the main unit. The partial sum obtained by each inner product operation of a basic unit may be transmitted back to the main unit for accumulation; or the partial sum obtained by each inner product operation may be stored in the register and/or on-chip cache of the basic unit and transmitted back to the main unit after the accumulation is finished; or, in some cases, the partial sums may be stored in the register and/or on-chip cache of the basic unit for accumulation, in other cases transmitted to the main unit for accumulation, and transmitted back to the main unit after the accumulation is finished.
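The kernel-distribution and per-window inner-product scheme above can be modeled in a few lines of numpy. This is a hedged software sketch with illustrative names (assignment, window); the broadcast transport and accumulation variants are not modeled.

```python
import numpy as np

M, K = 7, 4                         # M convolution kernels, K basic units
C, KH, KW = 5, 3, 3
kernels = np.random.rand(M, C, KH, KW)

# Main unit distributes kernel i to basic unit i % K (one or more kernels
# per unit, since M > K here).
assignment = [[i for i in range(M) if i % K == u] for u in range(K)]

window = np.random.rand(C, KH, KW)  # one broadcast operation window
outputs = np.zeros(M)
for u in range(K):                  # each basic unit handles its own kernels
    for i in assignment[u]:
        # inner product of kernel i with the operation window
        outputs[i] = np.sum(kernels[i] * window)

# Reference: the M output values of this window position in one shot
expected = np.tensordot(kernels, window, axes=3)
assert np.allclose(outputs, expected)
```

Each window position yields M output values, one per kernel, matching the "M output values per sliding window" statement in the text.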
Method for implementing BLAS (Basic Linear Algebra Subprograms) functions using the chip device
GEMM: a GEMM calculation refers to the matrix-matrix multiplication operation in the BLAS library. The general representation of this operation is: C = alpha*op(A)*op(B) + beta*C, where A and B are the two input matrices, C is the output matrix, alpha and beta are scalars, op represents some operation on matrix A or B, and there are additional integer parameters describing the width and height of matrices A and B;
the steps of implementing a GEMM calculation using the device are as follows:
performing the corresponding op operations on the input matrix A and matrix B; the op operation may be a matrix transposition, but may of course be other operations, such as a non-linear function operation, pooling, etc. The matrix op operation is realized using the vector operation function of the main unit; the op of a certain matrix may also be null, in which case the main unit performs no operation on that matrix;
completing the matrix multiplication between op(A) and op(B) using the method shown in fig. 2;
multiplying each value in the result of op(A)*op(B) by alpha, using the vector operation function of the main unit;
adding corresponding positions of the matrices alpha*op(A)*op(B) and beta*C, using the vector operation function of the main unit.
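The four GEMM steps above can be sketched in numpy (op taken as transposition here, one of the options the text allows; an illustration, not the device's implementation):

```python
import numpy as np

# GEMM: C = alpha * op(A) * op(B) + beta * C
alpha, beta = 2.0, 0.5
A = np.random.rand(6, 4)            # op(A) = A^T is 4 x 6
B = np.random.rand(7, 6)            # op(B) = B^T is 6 x 7
C = np.random.rand(4, 7)

opA, opB = A.T, B.T                 # step 1: apply op to A and B
prod = opA @ opB                    # step 2: matrix multiply op(A) * op(B)
scaled = alpha * prod               # step 3: scale every value by alpha
C_out = scaled + beta * C           # step 4: add beta * C at corresponding positions
assert C_out.shape == (4, 7)
```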
GEMV
A GEMV calculation refers to the matrix-vector multiplication operation in the BLAS library. The general representation of this operation is: C = alpha*op(A)*B + beta*C, where A is the input matrix, B is the input vector, C is the output vector, alpha and beta are scalars, and op represents some operation on the matrix A;
the steps of implementing a GEMV calculation using the device are as follows:
performing the corresponding op operation on the input matrix A; the chip device then completes the matrix-vector multiplication between the matrix op(A) and the vector B using the method shown in fig. 2; multiplies each value in the result of op(A)*B by alpha, using the vector operation function of the main unit; and adds corresponding positions of the vectors alpha*op(A)*B and beta*C, using the vector operation function of the main unit.
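The GEMV steps can be sketched the same way (op again taken as transposition; illustrative only):

```python
import numpy as np

# GEMV: C = alpha * op(A) * B + beta * C
alpha, beta = 1.5, 0.25
A = np.random.rand(6, 4)            # op(A) = A^T is 4 x 6
B = np.random.rand(6)               # input vector of length 6
C = np.random.rand(4)               # output vector of length 4

C_out = alpha * (A.T @ B) + beta * C
assert C_out.shape == (4,)
```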
Method for implementing an activation function using the chip device
An activation function generally performs a non-linear operation on each number in a block of data (which may be a vector or a multi-dimensional matrix). For example, the activation function may be y = max(m, x), where x is the input value, y is the output value, and m is a constant; the activation function may also be y = tanh(x), where x is the input value and y is the output value; the activation function may also be y = sigmoid(x), where x is the input value and y is the output value; the activation function may also be a piecewise linear function; in general, the activation function may be any function that takes one number as input and outputs one number.
When implementing an activation function, the chip device inputs a vector and computes the activation vector of that vector using the vector calculation function of the main unit; the main unit passes each value of the input vector through the activation function (whose input is one value and whose output is also one value) to compute the value at the corresponding position of the output vector;
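The elementwise mapping just described can be sketched as follows (function names are illustrative; each input value maps through the activation function to one output value):

```python
import numpy as np

def relu(x, m=0.0):
    """y = max(m, x), applied per element."""
    return np.maximum(m, x)

def sigmoid(x):
    """y = sigmoid(x), applied per element."""
    return 1.0 / (1.0 + np.exp(-x))

v = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
assert np.array_equal(relu(v), np.array([0.0, 0.0, 0.0, 1.0, 3.0]))
assert sigmoid(0.0) == 0.5
out = np.tanh(v)                    # y = tanh(x), applied per element
assert out.shape == v.shape         # one output value per input value
```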
Sources of the above input vector include, but are not limited to: external data of the chip device, and the calculation result data of the basic units forwarded by the branch units of the chip device.
The calculation result data may specifically be the result of a matrix-multiply-vector operation; the calculation result data may also be the result of a matrix-multiply-matrix operation; the input data may also be the result of a bias operation performed by the main unit.
Implementing a bias operation using the chip device
The function of adding two vectors or two matrices can be realized by the main unit; the function of adding a vector to each row, or to each column, of a matrix can also be realized by the main unit.
Optionally, the matrix may come from the result of a matrix-multiply-matrix operation performed by the device; the matrix may come from the result of a matrix-multiply-vector operation performed by the device; the matrix may come from data received externally by the main unit of the device. The vector may likewise come from data received externally by the main unit of the device.
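The bias variants above can be sketched with numpy broadcasting standing in for the main unit's vector function (illustrative names):

```python
import numpy as np

Mtx = np.zeros((3, 4))                       # matrix to be biased
row_bias = np.array([1.0, 2.0, 3.0, 4.0])    # length matches the columns
col_bias = np.array([10.0, 20.0, 30.0])      # length matches the rows

per_row = Mtx + row_bias                     # vector added to each row
per_col = Mtx + col_bias[:, None]            # vector added to each column
assert np.array_equal(per_row[2], row_bias)
assert np.array_equal(per_col[:, 1], col_bias)
```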
The input data and the calculation result data are only examples, and in practical applications, the input data and the calculation result data may also be other types or sources of data.
It is noted that, while for simplicity of explanation the foregoing method embodiments have been described as a series of acts or combinations of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may, in accordance with the present disclosure, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments, and that the acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative; for instance, the division of the units is only one type of logical-function division, and there may be other divisions in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated units/modules are all implemented in hardware. For example, the hardware may be circuitry, including digital circuitry, analog circuitry, and so forth. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The computing module in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.
The illustrated elements may or may not be physically separate, may be located in one place, or may be distributed across multiple network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description only; it is exemplary and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. For a person skilled in the art, there may be variations in the specific embodiments and the application scope based on the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.