CN111860838B - Full connection layer calculation method and device of neural network - Google Patents

Full connection layer calculation method and device of neural network

Info

Publication number
CN111860838B
CN111860838B (application CN202010725384.9A)
Authority
CN
China
Prior art keywords
input data
data
interface
matrix
tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010725384.9A
Other languages
Chinese (zh)
Other versions
CN111860838A (en)
Inventor
Song Xiaomei (宋小妹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010725384.9A priority Critical patent/CN111860838B/en
Publication of CN111860838A publication Critical patent/CN111860838A/en
Application granted granted Critical
Publication of CN111860838B publication Critical patent/CN111860838B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method and a device for calculating a full connection layer of a neural network, wherein the method comprises the following steps: receiving input data in the format of 8-bit integer variables, and judging whether the input dimension, output dimension and single sample of the input data meet the shape requirement of the calling interface of a tensor core; defining a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; scheduling the input data, the weights and the bias based on the calculation rule, and calling a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and outputting the two-dimensional tensor to the next layer as the calculation result of the full connection layer. The method can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, which greatly improves calculation performance.

Description

Full connection layer calculation method and device of neural network
Technical Field
The present invention relates to the field of neural networks, and more particularly, to a method and an apparatus for computing a full connection layer of a neural network.
Background
TVM is an open-source project developed primarily by the SAMPL group at the University of Washington. The deep learning community is very active at present, and researchers propose new ops (layers) every day in the hope of better model accuracy. Meanwhile, as more and more manufacturers begin to build hardware, more and more backend devices can be chosen when a neural network is trained.
This is a headache for framework developers, who must both try to support emerging ops and implement existing ops on emerging backend devices. The TVM project was therefore created, with the goal that a researcher writes an op only once, after which TVM automatically generates code for various backend devices with appreciable performance. By its official definition, TVM is a complete stack that includes neural network graph optimization (e.g., op fusion) and single-op optimization. TVM is positioned below the neural network graph and above the underlying hardware.
TensorCore (tensor core) is the most important feature of the Volta architecture. It is a special unit designed for deep learning applications and is essentially a matrix multiply-accumulate calculation unit. Matrix multiply-accumulate computation is the most important and most time-consuming part of deep learning network layers, such as the convolutional layer and the dense (fully-connected) layer. TVM currently provides TensorCore support for float16, but this is not compatible with int8 input; int8 can only use the dp4a scheme, and the dp4a instruction lags behind the TensorCore instruction in performance.
For the prior-art problem that performance is reduced because the dense layer cannot call TensorCore for int8 input, there is currently no effective solution.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for computing a fully connected layer of a neural network, which can call the TensorCore computing unit to execute convolution for int8 input at the dense layer, thereby greatly improving computing performance.
In view of the above object, a first aspect of the embodiments of the present invention provides a full connection layer calculation method for a neural network, including the following steps:
receiving input data in the format of 8-bit integer variables, and judging whether the input dimension, the output dimension, and the single sample (i.e., the batch size) of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
In some embodiments, determining whether the input dimensions, the output dimensions, and the single-time samples of the input data satisfy the shape requirements of the invocation interface of the tensor core comprises:
determining a shape of the input data, a shape of the weight, and a shape of the bias based on the input data;
determining input dimensions, output dimensions, and single sample values of the input data based on the shape of the input data, the shape of the weights, and the biased shape;
the determination to meet the shape requirement is made in response to the input dimension, the output dimension, and the single sample values all being divisible by 16, or the input dimension value being divisible by 16, the output dimension value being divisible by 8, the single sample value being divisible by 32, or the input dimension value being divisible by 16, the output dimension value being divisible by 32, the single sample value being divisible by 8.
In some embodiments, the method further comprises: invoking a calculation unit that uses the DP4A instruction set to directly calculate the two-dimensional tensor in response to the shape requirement not being satisfied, and outputting it to the next layer as the calculation result of the fully-connected layer.
In some embodiments, scheduling the input data, the weights, and the biases to perform a matrix multiplication operation by invoking a computation unit of the tensor core based on the computation rule to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading input data and weight which are subjected to data division from a shared memory to a matrix memory scope of a calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the method further comprises: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
In some embodiments, the method further comprises: before calling the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface, performing a tensorize operation based on step-size parameters on the data-partitioned input data, weights, and two-dimensional tensor, so that they can be recognized by the computing platform used by the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface.
In some embodiments, performing data partitioning on the input data and the weights to load from the global memory to the shared memory further comprises: double-buffering optimization is performed using a function provided by a compiler of a neural network.
A second aspect of an embodiment of the present invention provides a full connection layer computing apparatus for a neural network, including:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
In some embodiments, scheduling the input data, the weights, and the bias based on the calculation rule to perform a matrix multiplication operation by invoking the computation unit of the tensor core through the calling interface to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the steps further comprise: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
The invention has the following beneficial technical effects: the full connection layer computing method and device of the neural network provided by the embodiments of the present invention receive input data in the format of 8-bit integer variables and judge whether the input dimension, output dimension and single sample of the input data meet the shape requirement of the calling interface of the tensor core; define a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; schedule the input data, the weights and the bias based on the calculation rule, and call a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and output the two-dimensional tensor to the next layer as the calculation result of the full connection layer. This technical scheme can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, greatly improving calculation performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart of a full connection layer calculation method of a neural network according to the present invention;
FIG. 2 is a detailed flowchart of the full connection layer calculation method of a neural network according to the present invention;
FIG. 3 is a detailed flowchart of the C stage of the full connection layer calculation method of a neural network according to the present invention;
FIG. 4 is a detailed flowchart of the CS stage of the full connection layer calculation method of a neural network according to the present invention;
FIG. 5 is a detailed flowchart of the CF stage of the full connection layer calculation method of the neural network provided by the present invention;
FIG. 6 is a detailed flowchart of the AF stage of the full connection layer calculation method of the neural network provided by the present invention;
FIG. 7 is a detailed flowchart of the BF stage of the full connection layer calculation method of the neural network provided by the present invention;
FIG. 8 is a detailed flowchart of the AS/BS stage of the full connection layer calculation method of the neural network provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two non-identical entities with the same name or two non-identical parameters; "first" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention, and this will not be repeated in the following embodiments.
In view of the above object, a first aspect of embodiments of the present invention proposes an embodiment of a fully-connected layer calculation method capable of invoking a TensorCore calculation unit at the dense layer for int8 input to perform convolution. Fig. 1 is a schematic flow chart of a full connection layer calculation method of a neural network provided by the present invention.
The method for calculating the full connection layer of the neural network, as shown in fig. 1, includes the following steps:
step S101: receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
step S103: defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
step S105: scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
step S107: and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
The present invention mainly provides support for a TensorCore int8 scheme for the dense operation in TVM, filling the gap that dense could not use TensorCore in the int8 case and bringing a large performance improvement. According to the existing rules of TVM, the computation and scheduling of dense based on TensorCore are designed; according to the characteristics of int8 data, the related data partitioning, wmma (warp-level matrix multiply-accumulate) interface calls, and the like are defined; bank-conflict reduction is added; data are read from the global memory to the shared memory in a vectorized way and written back from the shared memory to the global memory in a vectorized way; and double-buffer optimization is performed. This ensures the application of the scheme in neural networks such as resnet50 and improves network performance. Tests show that the TensorCore int8 scheme improves performance by about a factor of two compared with the dp4a scheme.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by instructing relevant hardware by a computer program, and the program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.
In some implementations, determining whether the input dimensions, the output dimensions, and the single-time samples of the input data satisfy the shape requirements of the invocation interface of the tensor kernel includes:
determining a shape of the input data, a shape of the weight, and a shape of the bias based on the input data;
determining an input dimension, an output dimension, and a value of a single sample of the input data based on the shape of the input data, the shape of the weight, and the biased shape;
the determination to meet the shape requirement is made in response to the input dimension, the output dimension, and the single sample values all being divisible by 16, or the input dimension value being divisible by 16, the output dimension value being divisible by 8, the single sample value being divisible by 32, or the input dimension value being divisible by 16, the output dimension value being divisible by 32, the single sample value being divisible by 8.
In some embodiments, the method further comprises: invoking the calculation unit that uses the DP4A instruction set to directly calculate the two-dimensional tensor in response to the shape requirement not being satisfied, and outputting it to the next layer as the calculation result of the fully-connected layer.
In some embodiments, scheduling the input data, the weights, and the biases to perform a matrix multiplication operation by invoking a computation unit of the tensor core based on the computation rule to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the method further comprises: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
In some embodiments, the method further comprises: before calling the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface, performing a tensorize operation based on step-size parameters on the data-partitioned input data, weights, and two-dimensional tensor, so that they can be recognized by the computing platform used by the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface.
In some embodiments, performing data partitioning on the input data and the weights to load from the global memory to the shared memory further comprises: double-buffering optimization is performed using a function provided by a compiler of the neural network.
The following further illustrates embodiments of the invention in accordance with the specific example shown in fig. 2.
The calling criteria of dense_tensorcore_int8 involve three inputs: the input data, with shape (batch, in_dim); the weight, with shape (out_dim, in_dim); and the bias, with shape (out_dim,). Here in_dim is the input dimension of dense and out_dim is the output dimension of dense. The embodiment of the present invention directly calls the wmma interface to invoke the underlying TensorCore calculation unit.
The core of the wmma interface consists of:
the wmma::load_matrix_sync interface (for loading a matrix);
the wmma::mma_sync interface (for the matrix multiplication calculation);
the wmma::store_matrix_sync interface (for storing the matrix after the matrix multiplication is completed).
These interfaces involve three important dimensions, <wmma_m, wmma_n, wmma_k>: calling the wmma interface once completes a matrix multiplication of a wmma_m × wmma_k matrix and a wmma_n × wmma_k matrix. The wmma interface currently supports inputs of several data types: fp16, int8, and int4 (4-bit int). For int8 input, only the three shapes <8,32,16>, <32,8,16> and <16,16,16> are supported; that is, if the wmma interface with shape <8,32,16> is called, one call completes the matrix multiplication of an 8 × 16 matrix and a 32 × 16 matrix and yields an 8 × 32 result. wmma_m, wmma_n and wmma_k correspond to batch, out_dim and in_dim in the dense algorithm, respectively, so to adopt TensorCore, <batch, out_dim, in_dim> must be divisible by one of the three wmma shapes in order to enter the dense_tensorcore_int8 solution. For example, if the input data dimension of dense is (8, 256), the weight dimension is (128, 256), and the bias dimension is (128), then batch = 8 (corresponding to wmma_m), in_dim = 256 (corresponding to wmma_k), and out_dim = 128 (corresponding to wmma_n); the calling case <wmma_m = 8, wmma_n = 32, wmma_k = 16> is satisfied, so the newly added solution dense_tensorcore_int8 can be called.
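For illustration, a minimal Python sketch of this divisibility check (the helper name pick_wmma_shape is illustrative and not part of TVM) is:

    def pick_wmma_shape(batch, out_dim, in_dim):
        # The int8 wmma shapes <wmma_m, wmma_n, wmma_k>; m maps to batch,
        # n to out_dim and k to in_dim.
        for m, n, k in ((8, 32, 16), (32, 8, 16), (16, 16, 16)):
            if batch % m == 0 and out_dim % n == 0 and in_dim % k == 0:
                return (m, n, k)
        return None  # no wmma shape divides the problem: fall back to the dp4a scheme

    # The example above: data (8, 256), weight (128, 256), bias (128,)
    print(pick_wmma_shape(8, 128, 256))  # -> (8, 32, 16), so dense_tensorcore_int8 can be used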
When adding the new dense_tensorcore_int8 solution to TVM, the most important parts are the definitions of the computation rule and the schedule. Before defining them, the name of the solution needs to be registered so that TVM can later find the corresponding computation rule and schedule. The definition of the computation rule of dense using the int8 TensorCore interface is not exactly the same as the dp4a dense computation: both the input data and the weight are of int8 type, and because the result of multiplying two int8 matrices through the TensorCore interface is of int32 (32-bit int) type, the bias that is added is of int32 type.
matmul(i, j) = Σ_k ( int32(data(i, k)) × int32(weight(j, k)) ), to which the int32 bias(j) is subsequently added
Where i represents batch, j represents the output dimension out_dim of dense, and k represents the input dimension in_dim of dense. An identifier, for example dense_tensorcore_int8, is added to the output of the computation rule, and TVM then enters the corresponding scheduling part according to this identifier.
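A minimal sketch of such a computation rule, written with TVM's tensor-expression API and using the example shapes above (the variable names are illustrative), might look like:

    from tvm import te

    batch, in_dim, out_dim = 8, 256, 128
    data = te.placeholder((batch, in_dim), dtype="int8", name="data")
    weight = te.placeholder((out_dim, in_dim), dtype="int8", name="weight")
    bias = te.placeholder((out_dim,), dtype="int32", name="bias")

    k = te.reduce_axis((0, in_dim), name="k")
    # i is the batch axis, j the output dimension; operands are widened to int32
    # because the TensorCore int8 x int8 matrix product is accumulated in int32.
    matmul = te.compute(
        (batch, out_dim),
        lambda i, j: te.sum(data[i, k].astype("int32") * weight[j, k].astype("int32"), axis=k),
        name="T_dense",
        tag="dense_tensorcore_int8",  # identifier used to pick the matching schedule
    )
    C = te.compute((batch, out_dim), lambda i, j: matmul[i, j] + bias[j], name="T_bias")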
Declare the data matrix as A, the weight matrix as B, and the bias as bias. The schedule of the bias option is already defined in TVM, so it does not need to be handled here. The incoming A and B matrices originally reside in the global memory, whose access speed is very low, while reading data from and writing data back to the shared memory takes very little time; since matrix multiplication requires frequent data accesses, A and B are defined to be read from the global memory into the shared memory ("shared"), denoted AS and BS respectively. Because of the wmma interface, data must be loaded from the shared memory into the memory scope "wmma::fragment". There are three such memory scopes: "wmma.matrix_a", "wmma.matrix_b", and "wmma.accumulator". AS is loaded into wmma.matrix_a, BS is loaded into wmma.matrix_b, and the result computed through the wmma interface is stored in wmma.accumulator, denoted CF. The scheduling of the bias is well defined as described above, but if wmma is used to compute the matrix multiplication, the bias must be added to the final result in the shared memory and cannot be added directly in wmma.accumulator; the result in wmma.accumulator is therefore first stored into the shared memory (the CS stage), where the bias is added. Finally, the data with the bias applied are written from the shared memory back to the global memory.
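The staging described above can be sketched with TVM schedule primitives roughly as follows, assuming matmul, data, weight and the bias-add stage C from the previous sketch; this reflects the general cache_read/cache_write pattern rather than the exact schedule:

    s = te.create_schedule(C.op)

    AS = s.cache_read(data, "shared", [matmul])        # A: global memory -> shared memory
    BS = s.cache_read(weight, "shared", [matmul])      # B: global memory -> shared memory
    AF = s.cache_read(AS, "wmma.matrix_a", [matmul])   # AS -> wmma.matrix_a fragments
    BF = s.cache_read(BS, "wmma.matrix_b", [matmul])   # BS -> wmma.matrix_b fragments
    CF = s.cache_write(matmul, "wmma.accumulator")     # matmul result kept in the accumulator scope
    CS = s.cache_read(CF, "shared", [matmul])          # accumulator staged through shared memory,
                                                       # where the int32 bias is then added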
Before defining the schedule, first define several parameters that are key to data partitioning: wmma_m, wmma_n, wmma_k, block_row_warps, block_col_warps, warp_row_tiles, warp_col_tiles, offset, offsetCS, vec, vec_out and chunk.
For wmma_m, wmma_n and wmma_k, which have been defined previously: if wmma_m = 8, then wmma_n is naturally 32 and wmma_k is 16. This set of parameters essentially determines which of the <8,32,16>, <32,8,16> and <16,16,16> interfaces is invoked.
block_row_warps and block_col_warps are bound to threadIdx.y and threadIdx.z respectively. Because wmma is executed in units of a warp (32 threads), the range of threadIdx.x is fixed to 32, so threadIdx.y and threadIdx.z have no further restriction and only need to be set to multiples of 2; for example, they can be set to 2 and 4.
warp_row_tiles and warp_col_tiles are, respectively, the numbers of wmma_m and wmma_n tiles computed per warp within one block.
Offset is the offset set to reduce shared-memory bank conflicts for the input data and the weight data (int8); bank conflicts in the shared memory can greatly affect performance. OffsetCS is the offset used when storing the matrix multiplication result (int32) into the shared memory, again to mitigate bank conflicts. Because of the difference between the int8 and int32 types, these two offsets need to be set separately.
Vec is the vectorization parameter of the input data and weight data (int8), and vec_out is the vectorization parameter of the matrix multiplication result. The vectorization methods currently supported are int2 (two int32 processed together) and int4 (four int32 processed together), so for the input data and the weight data the vec parameter can be set to 1 (a single int8), 4 (4 int8 make up a complete int32), 8 (8 int8, i.e. int2 vectorization), or 16 (16 int8, i.e. int4 vectorization). Similarly, the vec_out parameter can be set to 1, 2, or 4. Note that the 8 and 32 in int8 and int32 refer to 8-bit and 32-bit, while the 2 and 4 in int2 and int4 refer to how many int32 values are processed at a time.
Chunk is the number of wmma_k processed at a time. For example, chunk = 2 means two wmma_k are processed at a time; since wmma_k is always 16 for int8, this corresponds to the in_dim (input dimension) axis of the data, that is, 32 elements of the input dimension are processed at a time.
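For concreteness, one possible assignment of these key parameters is shown below; the particular values are illustrative assumptions, not values mandated by the method, and would normally be tuned:

    wmma_m, wmma_n, wmma_k = 8, 32, 16       # which int8 wmma shape is called
    block_row_warps, block_col_warps = 2, 4  # bound to threadIdx.y and threadIdx.z
    warp_row_tiles, warp_col_tiles = 2, 4    # wmma_m / wmma_n tiles computed per warp
    offset, offsetCS = 16, 8                 # shared-memory offsets to ease bank conflicts
    vec, vec_out = 16, 4                     # vectorization of int8 inputs / int32 results
    chunk = 2                                # wmma_k blocks per step: 2 * 16 = 32 in_dim elements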
Let the layout of the input data be (batch, in_dim) and the layout of the weight matrix be (out_dim, in_dim); the input data has two axes, batch and in_dim, and the weight matrix also has two axes, out_dim and in_dim. The scheduling of the C stage is detailed in fig. 3. The dense computation stage C mainly involves two axes, the batch axis and the output-dimension axis, and in TVM we perform data partitioning on these two axes separately. The batch axis is split into blockIdx.x and wmma_m * warp_row_tiles * block_row_warps; the output-dimension axis is split into blockIdx.y and wmma_n * warp_col_tiles * block_col_warps, and a reorder operation is performed, so that the overall partition order becomes blockIdx.x, blockIdx.y, wmma_m * warp_row_tiles * block_row_warps, wmma_n * warp_col_tiles * block_col_warps; after further splitting and thread binding, the final form of the division is blockIdx.x, blockIdx.y, threadIdx.z, threadIdx.y, threadIdx.x, and the vectorization parameter.
The order of this division is the order of the finally generated loops: dimensions further back in the sequence are inner dimensions of the whole computation loop. For example, threadIdx.x followed by the vectorization parameter means that the vectorization parameter is the innermost dimension and threadIdx.x lies one level outside it, corresponding roughly to
for (int j = 0; j < extent_of_threadIdx_x; j++)
    for (int i = 0; i < vec; i++)
        ...;
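Continuing the schedule sketch above (s and C as defined there), a condensed sketch of this partitioning of the C stage, using the standard TVM primitives split, reorder, fuse, bind and vectorize together with the key parameters defined earlier (the factor 32 is the fixed threadIdx.x range), might read:

    block_x = te.thread_axis("blockIdx.x")
    block_y = te.thread_axis("blockIdx.y")
    thread_x = te.thread_axis("threadIdx.x")
    thread_y = te.thread_axis("threadIdx.y")
    thread_z = te.thread_axis("threadIdx.z")

    block_factor_b = wmma_m * warp_row_tiles * block_row_warps  # tile of the batch axis
    block_factor_o = wmma_n * warp_col_tiles * block_col_warps  # tile of the output axis

    b, o = C.op.axis
    block_i, bc = s[C].split(b, factor=block_factor_b)
    block_j, oc = s[C].split(o, factor=block_factor_o)
    s[C].reorder(block_i, block_j, bc, oc)
    t = s[C].fuse(bc, oc)
    t, vi = s[C].split(t, factor=vec_out)          # innermost: vectorized write-back
    t, tx = s[C].split(t, factor=32)               # threadIdx.x fixed to one warp of 32
    t, ty = s[C].split(t, factor=block_row_warps)
    t, tz = s[C].split(t, factor=block_col_warps)
    s[C].bind(block_i, block_x)
    s[C].bind(block_j, block_y)
    s[C].bind(tz, thread_z)
    s[C].bind(ty, thread_y)
    s[C].bind(tx, thread_x)
    s[C].vectorize(vi)                             # blockIdx.x/y, threadIdx.z/y/x, vector lane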
The scheduling of the CS procedure is detailed in fig. 4, similar to C.
The scheduling of CF is detailed in fig. 5. The data loaded into wmma.matrix_a and wmma.matrix_b are computed through the wmma::mma_sync interface and the result is placed into wmma.accumulator. The CF computation is mainly placed at the part of CS after block_col_warps, and CF follows the earlier division of the other parts of CS.
The scheduling of AF is detailed in fig. 6. The input data are loaded from the shared memory into the memory scope wmma.matrix_a. The AF computation is mainly placed at the part of CF after chunk, and AF follows the earlier division of the other parts of CF.
The scheduling of BF is detailed in fig. 7. The weight data are loaded from the shared memory into the memory scope wmma.matrix_b. The BF computation is mainly placed at the part of CF after chunk, and BF follows the earlier division of the other parts of CF.
The scheduling of AS (BS) is detailed in fig. 8 and covers the scheduling of the input data and the weight matrix, which share one schedule, identified as AS (BS). The computation is mainly placed at the part of CF after chunk, and AS (BS) follows the division of the other parts of CF. In this stage, double-buffer optimization can be added, that is, the data required for the next computation are fetched into the shared memory in advance; the double_buffer() function can be called in TVM to do this.
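A sketch of the shared-memory load stage for AS (BS is scheduled the same way) with vectorized access and double buffering, assuming ko is the outer reduction (chunk) axis of CF obtained from a split not shown here, is:

    s[AS].compute_at(s[CF], ko)                # ko: outer reduction (chunk) axis of CF, assumed defined
    xo, yo = AS.op.axis
    t = s[AS].fuse(xo, yo)
    t, vi = s[AS].split(t, factor=vec)         # vectorized int8 loads
    t, tx = s[AS].split(t, factor=32)          # threadIdx.x: one warp
    t, ty = s[AS].split(t, factor=block_row_warps)
    _, tz = s[AS].split(t, factor=block_col_warps)
    s[AS].bind(tz, thread_z)
    s[AS].bind(ty, thread_y)
    s[AS].bind(tx, thread_x)
    s[AS].vectorize(vi)
    s[AS].double_buffer()                      # prefetch the next tile while the current one is computed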
In addition, in order to replace the ordinary code with the wmma interface, accurate step-size (stride) parameters must be provided to perform the tensorize operation before calling the wmma interface.
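As a skeleton only (the intrin_* objects stand for intrinsics that would be declared with te.decl_tensor_intrin and carry the accurate stride parameters; the axis names come from the per-fragment splits of extent wmma_m / wmma_n / wmma_k), the tensorize calls look like:

    s[AF].tensorize(af_b_ii, intrin_load_a)   # lowered to wmma::load_matrix_sync for matrix_a
    s[BF].tensorize(bf_o_ii, intrin_load_b)   # lowered to wmma::load_matrix_sync for matrix_b
    s[CF].tensorize(cf_b_ii, intrin_mma)      # lowered to wmma::mma_sync
    s[CS].tensorize(cs_b_ii, intrin_store)    # lowered to wmma::store_matrix_sync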
It can be seen from the foregoing embodiment that the full connection layer calculation method of a neural network provided by the embodiment of the present invention receives input data in the format of 8-bit integer variables and determines whether the input dimension, output dimension, and single sample of the input data satisfy the shape requirement of the calling interface of the tensor core; defines a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; schedules the input data, the weights and the bias based on the calculation rule, and calls a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and outputs the two-dimensional tensor to the next layer as the calculation result of the full connection layer. This technical scheme can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, greatly improving calculation performance.
It should be particularly noted that, the steps in the embodiments of the fully-connected layer calculation method for a neural network described above may be mutually intersected, replaced, added, and deleted, and therefore, the fully-connected layer calculation method for a neural network transformed by these reasonable permutations and combinations shall also fall within the scope of the present invention, and shall not limit the scope of the present invention to the described embodiments.
In view of the above objects, a second aspect of the embodiments of the present invention proposes an embodiment of a fully-connected layer computing apparatus capable of calling a TensorCore computation unit at the dense layer for int8 input to perform convolution. The fully-connected layer computing device of the neural network comprises:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
In some embodiments, scheduling the input data, the weights, and the biases to perform a matrix multiplication operation by invoking a computation unit of the tensor core based on the computation rule to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading input data and weight which are subjected to data division from a shared memory to a matrix memory scope of a calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the steps further comprise: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
As can be seen from the foregoing embodiments, the fully-connected layer computing device of a neural network according to the embodiments of the present invention receives input data in the format of 8-bit integer variables and determines whether the input dimension, output dimension, and single sample of the input data satisfy the shape requirement of the calling interface of the tensor core; defines a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; schedules the input data, the weights and the bias based on the calculation rule, and calls a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and outputs the two-dimensional tensor to the next layer as the calculation result of the full connection layer. This technical scheme can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, greatly improving calculation performance.
It should be noted that the embodiment of the fully-connected layer computing apparatus of the neural network described above uses the embodiment of the fully-connected layer computing method of the neural network to describe the working process of each module in detail, and those skilled in the art can readily appreciate that these modules may be applied to other embodiments of the fully-connected layer computing method of the neural network. Of course, since the steps in the embodiment of the fully-connected layer computing method of the neural network can be mutually intersected, replaced, added, or deleted, fully-connected layer computing devices of the neural network obtained by such reasonable permutations, combinations and transformations shall also belong to the scope of the present invention, and the scope of the present invention shall not be limited to the described embodiments.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit or scope of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (8)

1. A full connection layer calculation method of a neural network is characterized by comprising the following steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on the input data for an 8-bit integer variable, a weight for the 8-bit integer variable, and a bias for the 32-bit integer variable in response to the shape requirement being satisfied;
scheduling the input data, the weight and the bias based on the calculation rule so as to call a calculation unit of the tensor core through the calling interface to perform matrix multiplication operation to obtain a two-dimensional tensor;
outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer;
wherein scheduling the input data, the weights, and the biases based on the computation rule to invoke the computation unit of the tensor core through the invocation interface to perform matrix multiplication, and obtaining a two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights to load the input data and the weights from the global memory to a shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through a matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and the weight based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
2. The method of claim 1, wherein determining whether the input dimensions, output dimensions, and single-time samples of the input data satisfy shape requirements of a calling interface of a tensor kernel comprises:
determining a shape of the input data, a shape of the weights, and a shape of the bias based on the input data;
determining values of the input dimension, the output dimension, and the single sample of the input data based on a shape of the input data, a shape of the weight, and a shape of the bias;
determining that the shape requirement is satisfied in response to: the values of the input dimension, the output dimension, and the single sample all being divisible by 16; or the input dimension being divisible by 16, the output dimension by 8, and the single sample by 32; or the input dimension being divisible by 16, the output dimension by 32, and the single sample by 8.
3. The method of claim 2, further comprising: invoking a computational unit that uses a DP4A instruction set to directly compute the two-dimensional tensor in response to the shape requirement not being satisfied, and outputting it to the next layer as the result of computation of the fully-connected layer.
4. The method of claim 1, further comprising: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through a matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
5. The method of claim 1, further comprising: before calling the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface of the calling interface, performing a tensorize operation based on step-size parameters on the data-partitioned input data, weights, and two-dimensional tensor, so that they can be recognized by the computing platform used by the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface.
6. The method of claim 1, wherein performing data partitioning on the input data and the weights to load from the global memory to the shared memory further comprises: performing double-buffering optimization using a function provided by a compiler of the neural network.
7. A fully-connected layer computing device for a neural network, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and biases of 32-bit integer variables in response to satisfying the shape requirement;
scheduling the input data, the weight and the bias based on the calculation rule so as to call a calculation unit of the tensor core through the calling interface to perform matrix multiplication operation to obtain a two-dimensional tensor;
outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer;
wherein scheduling the input data, the weight, and the bias based on the calculation rule to call the calculation unit of the tensor core through the calling interface to perform matrix multiplication, and obtaining a two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights to load the input data and the weights from the global memory to a shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through a matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and the weight based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
8. The apparatus of claim 7, wherein the steps further comprise: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through a matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
CN202010725384.9A 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network Active CN111860838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010725384.9A CN111860838B (en) 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010725384.9A CN111860838B (en) 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network

Publications (2)

Publication Number Publication Date
CN111860838A CN111860838A (en) 2020-10-30
CN111860838B true CN111860838B (en) 2022-12-20

Family

ID=72950071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010725384.9A Active CN111860838B (en) 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network

Country Status (1)

Country Link
CN (1) CN111860838B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232496A (en) * 2020-09-17 2021-01-15 苏州浪潮智能科技有限公司 Method, system, equipment and medium for processing int4 data type based on Tenscorore

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181503A1 (en) * 2015-08-02 2018-06-28 Wave Computing, Inc. Data flow computation using fifos
CN109740739A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Neural computing device, neural computing method and Related product
CN110059733A (en) * 2019-04-01 2019-07-26 苏州科达科技股份有限公司 The optimization and fast target detection method, device of convolutional neural networks
CN110807479A (en) * 2019-10-23 2020-02-18 中国人民解放军国防科技大学 Neural network convolution calculation acceleration method based on Kmeans algorithm

Also Published As

Publication number Publication date
CN111860838A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
US20210319284A1 (en) System and architecture including processor and neural network accelerator
CN107844826B (en) Neural network processing unit and processing system comprising same
US20220222531A1 (en) Asynchronous neural network training
US20210350214A1 (en) Convolutional neural network computing method and system based on weight kneading
US11093682B2 (en) Language and compiler that generate synchronous digital circuits that maintain thread execution order
CN115437795B (en) Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
US20210182682A1 (en) Learning task compiling method of artificial intelligence processor and related products
CN111860838B (en) Full connection layer calculation method and device of neural network
CN110633785A (en) Method and system for calculating convolutional neural network
CN113379070A (en) Deep learning framework conversion method, system, storage medium and equipment
CN112001491A (en) Search method and device for determining neural network architecture for processor
Cohen et al. An adaptive robust optimization model for parallel machine scheduling
US8826199B2 (en) System and method for development of a system architecture
US8346704B2 (en) Controlled constraint sharing in parallel problem solvers
EP3754559A1 (en) Optimization system and control method for optimization system
EP3745319A1 (en) Optimization apparatus and optimization method
WO2022057459A1 (en) Tensorcore-based int4 data type processing method and system, device, and medium
CN112990461B (en) Method, device, computer equipment and storage medium for constructing neural network model
CN115983429B (en) Construction strategy optimization method, system, terminal and medium based on BIM model
CN114358253A (en) Time estimation method of neural network model and related product
US20230078203A1 (en) Configurable nonlinear activation function circuits
CN115496181A (en) Chip adaptation method, device, chip and medium of deep learning model
CN115480919A (en) Convolution optimization operation method and device, computer equipment and storage medium
CN114912570A (en) Method, device and equipment for accelerating neural network model optimization and readable medium
CN112214718A (en) Method for solving bounded knapsack problem based on improved dynamic programming algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant