CN111860838B - Full connection layer calculation method and device of neural network - Google Patents

Full connection layer calculation method and device of neural network

Info

Publication number
CN111860838B
CN111860838B (application CN202010725384.9A)
Authority
CN
China
Prior art keywords
input data
data
interface
matrix
tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010725384.9A
Other languages
Chinese (zh)
Other versions
CN111860838A (en)
Inventor
Song Xiaomei (宋小妹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010725384.9A priority Critical patent/CN111860838B/en
Publication of CN111860838A publication Critical patent/CN111860838A/en
Application granted granted Critical
Publication of CN111860838B publication Critical patent/CN111860838B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method and a device for calculating a full connection layer of a neural network, wherein the method comprises the following steps: receiving input data in the format of 8-bit integer variables, and judging whether the input dimension, output dimension and single sample of the input data meet the shape requirement of the calling interface of a tensor core; defining a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; scheduling the input data, the weights and the bias based on the calculation rule, and calling a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and outputting the two-dimensional tensor to the next layer as the calculation result of the full connection layer. The method can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, which greatly improves calculation performance.

Description

Full connection layer calculation method and device of neural network
Technical Field
The present invention relates to the field of neural networks, and more particularly, to a method and an apparatus for computing a full connection layer of a neural network.
Background
TVM is an open-source project developed primarily by the SAMPL group at the University of Washington. The deep learning community is very active at present, and researchers propose new ops (layers) every day in the hope of better model accuracy. Meanwhile, as more and more manufacturers begin to build hardware, more and more backend devices can be chosen when a neural network is trained.
This is a headache for framework developers, who must both try to support emerging ops and implement existing ops on emerging backend devices. The TVM project was therefore created, with the goal that a researcher writes an op only once, after which TVM automatically generates code for various backend devices with appreciable performance. By its official definition, TVM is a complete stack that includes neural network graph optimization (e.g., op fusion) and single-op optimization. TVM is positioned below the neural network graph and above the underlying hardware.
TensorCore (tensor core) is the most important feature of the Volta architecture. It is a special unit designed for deep learning applications and is essentially a matrix multiply-accumulate calculation unit. Matrix multiply-accumulate computation is the most important and most time-consuming part of deep learning network layers, such as the convolutional layer and the dense (fully-connected) layer. TVM currently provides TensorCore support for float16, but this is not compatible with int8 input; int8 can only use the dp4a scheme, and the dp4a instruction lags behind the TensorCore instruction in performance.
For the prior-art problem that performance is reduced because the dense layer cannot call TensorCore for int8 input, there is currently no effective solution.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for computing a fully connected layer of a neural network, which can call the TensorCore computing unit to execute convolution for int8 input at the dense layer, thereby greatly improving computing performance.
In view of the above object, a first aspect of the embodiments of the present invention provides a full connection layer calculation method for a neural network, including the following steps:
receiving input data in the format of 8-bit integer variables, and judging whether the input dimension, the output dimension, and the single sample (i.e., the batch size) of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
In some embodiments, determining whether the input dimensions, the output dimensions, and the single-time samples of the input data satisfy the shape requirements of the invocation interface of the tensor core comprises:
determining a shape of the input data, a shape of the weight, and a shape of the bias based on the input data;
determining input dimensions, output dimensions, and single sample values of the input data based on the shape of the input data, the shape of the weights, and the biased shape;
the determination to meet the shape requirement is made in response to the input dimension, the output dimension, and the single sample values all being divisible by 16, or the input dimension value being divisible by 16, the output dimension value being divisible by 8, the single sample value being divisible by 32, or the input dimension value being divisible by 16, the output dimension value being divisible by 32, the single sample value being divisible by 8.
In some embodiments, the method further comprises: invoking a calculation unit that uses the DP4A instruction set to directly calculate the two-dimensional tensor in response to the shape requirement not being satisfied, and outputting it to the next layer as the calculation result of the fully-connected layer.
In some embodiments, scheduling the input data, the weights, and the biases to perform a matrix multiplication operation by invoking a computation unit of the tensor core based on the computation rule to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading input data and weight which are subjected to data division from a shared memory to a matrix memory scope of a calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the method further comprises: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
In some embodiments, the method further comprises: before calling the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface, performing a tensorize operation based on step-size parameters on the data-partitioned input data, weights, and two-dimensional tensor, so that they can be recognized by the computing platform used by the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface.
In some embodiments, performing data partitioning on the input data and the weights to load from the global memory to the shared memory further comprises: double-buffering optimization is performed using a function provided by a compiler of a neural network.
A second aspect of an embodiment of the present invention provides a full connection layer computing apparatus for a neural network, including:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
In some embodiments, scheduling the input data, the weights, and the bias based on the calculation rule to perform a matrix multiplication operation by invoking the computation unit of the tensor core through the calling interface to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the steps further comprise: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
The invention has the following beneficial technical effects: the full connection layer computing method and device of the neural network provided by the embodiments of the present invention receive input data in the format of 8-bit integer variables and judge whether the input dimension, output dimension and single sample of the input data meet the shape requirement of the calling interface of the tensor core; define a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; schedule the input data, the weights and the bias based on the calculation rule, and call a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and output the two-dimensional tensor to the next layer as the calculation result of the full connection layer. This technical scheme can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, greatly improving calculation performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart of a full connection layer calculation method of a neural network according to the present invention;
FIG. 2 is a detailed flowchart of the full connection layer calculation method of a neural network according to the present invention;
FIG. 3 is a detailed flowchart of the C stage of the full connection layer calculation method of a neural network according to the present invention;
FIG. 4 is a detailed flowchart of the CS stage of the full connection layer calculation method of a neural network according to the present invention;
FIG. 5 is a detailed flowchart of the CF stage of the full connection layer calculation method of the neural network provided by the present invention;
FIG. 6 is a detailed flowchart of the AF stage of the full connection layer calculation method of the neural network provided by the present invention;
FIG. 7 is a detailed flowchart of the BF stage of the full connection layer calculation method of the neural network provided by the present invention;
FIG. 8 is a detailed flowchart of the AS/BS stage of the full connection layer calculation method of the neural network provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two non-identical entities with the same name or two non-identical parameters; "first" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention, and this will not be repeated in the following embodiments.
In view of the above object, a first aspect of embodiments of the present invention proposes an embodiment of a fully-connected layer calculation method capable of invoking a TensorCore calculation unit at the dense layer for int8 input to perform convolution. Fig. 1 is a schematic flow chart of a full connection layer calculation method of a neural network provided by the present invention.
The method for calculating the full connection layer of the neural network, as shown in fig. 1, includes the following steps:
step S101: receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
step S103: defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
step S105: scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
step S107: and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
The present invention mainly provides support for a TensorCore int8 scheme for the dense operation in TVM, filling the gap that dense could not use TensorCore in the int8 case and bringing a large performance improvement. According to the existing rules of TVM, the computation and scheduling of dense based on TensorCore are designed; according to the characteristics of int8 data, the related data partitioning, wmma (warp-level matrix multiply-accumulate) interface calls, and the like are defined; bank-conflict reduction is added; data are read from the global memory to the shared memory in a vectorized way and written back from the shared memory to the global memory in a vectorized way; and double-buffer optimization is performed. This ensures the application of the scheme in neural networks such as resnet50 and improves network performance. Tests show that the TensorCore int8 scheme improves performance by about a factor of two compared with the dp4a scheme.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by instructing relevant hardware by a computer program, and the program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.
In some implementations, determining whether the input dimensions, the output dimensions, and the single-time samples of the input data satisfy the shape requirements of the invocation interface of the tensor kernel includes:
determining a shape of the input data, a shape of the weight, and a shape of the bias based on the input data;
determining an input dimension, an output dimension, and a value of a single sample of the input data based on the shape of the input data, the shape of the weight, and the biased shape;
the determination to meet the shape requirement is made in response to the input dimension, the output dimension, and the single sample values all being divisible by 16, or the input dimension value being divisible by 16, the output dimension value being divisible by 8, the single sample value being divisible by 32, or the input dimension value being divisible by 16, the output dimension value being divisible by 32, the single sample value being divisible by 8.
In some embodiments, the method further comprises: invoking the calculation unit that uses the DP4A instruction set to directly calculate the two-dimensional tensor in response to the shape requirement not being satisfied, and outputting it to the next layer as the calculation result of the fully-connected layer.
In some embodiments, scheduling the input data, the weights, and the biases to perform a matrix multiplication operation by invoking a computation unit of the tensor core based on the computation rule to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the method further comprises: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
In some embodiments, the method further comprises: before calling the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface, performing a tensorize operation based on step-size parameters on the data-partitioned input data, weights, and two-dimensional tensor, so that they can be recognized by the computing platform used by the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface.
In some embodiments, performing data partitioning on the input data and the weights to load from the global memory to the shared memory further comprises: double-buffering optimization is performed using a function provided by a compiler of the neural network.
The following further illustrates embodiments of the invention in accordance with the specific example shown in fig. 2.
The calling criteria of dense_tensorcore_int8 involve three inputs: the input data, with shape (batch, in_dim); the weight, with shape (out_dim, in_dim); and the bias, with shape (out_dim,). Here in_dim is the input dimension of dense and out_dim is the output dimension of dense. The embodiment of the present invention directly calls the wmma interface to invoke the underlying TensorCore calculation unit.
The core of the wmma interface consists of:
the wmma::load_matrix_sync interface (for loading a matrix);
the wmma::mma_sync interface (for the matrix multiplication calculation);
the wmma::store_matrix_sync interface (for storing the matrix after the matrix multiplication is completed).
These interfaces involve three important dimensions, <wmma_m, wmma_n, wmma_k>: calling the wmma interface once completes a matrix multiplication of a wmma_m × wmma_k matrix and a wmma_n × wmma_k matrix. The wmma interface currently supports inputs of several data types: fp16, int8, and int4 (4-bit int). For int8 input, only the three shapes <8,32,16>, <32,8,16> and <16,16,16> are supported; that is, if the wmma interface with shape <8,32,16> is called, one call completes the matrix multiplication of an 8 × 16 matrix and a 32 × 16 matrix and yields an 8 × 32 result. wmma_m, wmma_n and wmma_k correspond to batch, out_dim and in_dim in the dense algorithm, respectively, so to adopt TensorCore, <batch, out_dim, in_dim> must be divisible by one of the three wmma shapes in order to enter the dense_tensorcore_int8 solution. For example, if the input data dimension of dense is (8, 256), the weight dimension is (128, 256), and the bias dimension is (128), then batch = 8 (corresponding to wmma_m), in_dim = 256 (corresponding to wmma_k), and out_dim = 128 (corresponding to wmma_n); the calling case <wmma_m = 8, wmma_n = 32, wmma_k = 16> is satisfied, so the newly added solution dense_tensorcore_int8 can be called.
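For illustration, a minimal Python sketch of this divisibility check (the helper name pick_wmma_shape is illustrative and not part of TVM) is:

    def pick_wmma_shape(batch, out_dim, in_dim):
        # The int8 wmma shapes <wmma_m, wmma_n, wmma_k>; m maps to batch,
        # n to out_dim and k to in_dim.
        for m, n, k in ((8, 32, 16), (32, 8, 16), (16, 16, 16)):
            if batch % m == 0 and out_dim % n == 0 and in_dim % k == 0:
                return (m, n, k)
        return None  # no wmma shape divides the problem: fall back to the dp4a scheme

    # The example above: data (8, 256), weight (128, 256), bias (128,)
    print(pick_wmma_shape(8, 128, 256))  # -> (8, 32, 16), so dense_tensorcore_int8 can be used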
When adding the new dense_tensorcore_int8 solution to TVM, the most important parts are the definitions of the computation rule and the schedule. Before defining them, the name of the solution needs to be registered so that TVM can later find the corresponding computation rule and schedule. The definition of the computation rule of dense using the int8 TensorCore interface is not exactly the same as the dp4a dense computation: both the input data and the weight are of int8 type, and because the result of multiplying two int8 matrices through the TensorCore interface is of int32 (32-bit int) type, the bias that is added is of int32 type.
matmul(i, j) = Σ_k ( int32(data(i, k)) × int32(weight(j, k)) ), to which the int32 bias(j) is subsequently added
Where i represents batch, j represents the output dimension out_dim of dense, and k represents the input dimension in_dim of dense. An identifier, for example dense_tensorcore_int8, is added to the output of the computation rule, and TVM then enters the corresponding scheduling part according to this identifier.
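A minimal sketch of such a computation rule, written with TVM's tensor-expression API and using the example shapes above (the variable names are illustrative), might look like:

    from tvm import te

    batch, in_dim, out_dim = 8, 256, 128
    data = te.placeholder((batch, in_dim), dtype="int8", name="data")
    weight = te.placeholder((out_dim, in_dim), dtype="int8", name="weight")
    bias = te.placeholder((out_dim,), dtype="int32", name="bias")

    k = te.reduce_axis((0, in_dim), name="k")
    # i is the batch axis, j the output dimension; operands are widened to int32
    # because the TensorCore int8 x int8 matrix product is accumulated in int32.
    matmul = te.compute(
        (batch, out_dim),
        lambda i, j: te.sum(data[i, k].astype("int32") * weight[j, k].astype("int32"), axis=k),
        name="T_dense",
        tag="dense_tensorcore_int8",  # identifier used to pick the matching schedule
    )
    C = te.compute((batch, out_dim), lambda i, j: matmul[i, j] + bias[j], name="T_bias")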
Declare the data matrix as A, the weight matrix as B, and the bias as bias. The schedule of the bias option is already defined in TVM, so it does not need to be handled here. The incoming A and B matrices originally reside in the global memory, whose access speed is very low, while reading data from and writing data back to the shared memory takes very little time; since matrix multiplication requires frequent data accesses, A and B are defined to be read from the global memory into the shared memory ("shared"), denoted AS and BS respectively. Because of the wmma interface, data must be loaded from the shared memory into the memory scope "wmma::fragment". There are three such memory scopes: "wmma.matrix_a", "wmma.matrix_b", and "wmma.accumulator". AS is loaded into wmma.matrix_a, BS is loaded into wmma.matrix_b, and the result computed through the wmma interface is stored in wmma.accumulator, denoted CF. The scheduling of the bias is well defined as described above, but if wmma is used to compute the matrix multiplication, the bias must be added to the final result in the shared memory and cannot be added directly in wmma.accumulator; the result in wmma.accumulator is therefore first stored into the shared memory (the CS stage), where the bias is added. Finally, the data with the bias applied are written from the shared memory back to the global memory.
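The staging described above can be sketched with TVM schedule primitives roughly as follows, assuming matmul, data, weight and the bias-add stage C from the previous sketch; this reflects the general cache_read/cache_write pattern rather than the exact schedule:

    s = te.create_schedule(C.op)

    AS = s.cache_read(data, "shared", [matmul])        # A: global memory -> shared memory
    BS = s.cache_read(weight, "shared", [matmul])      # B: global memory -> shared memory
    AF = s.cache_read(AS, "wmma.matrix_a", [matmul])   # AS -> wmma.matrix_a fragments
    BF = s.cache_read(BS, "wmma.matrix_b", [matmul])   # BS -> wmma.matrix_b fragments
    CF = s.cache_write(matmul, "wmma.accumulator")     # matmul result kept in the accumulator scope
    CS = s.cache_read(CF, "shared", [matmul])          # accumulator staged through shared memory,
                                                       # where the int32 bias is then added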
Before defining the schedule, first define several parameters that are key to data partitioning: wmma_m, wmma_n, wmma_k, block_row_warps, block_col_warps, warp_row_tiles, warp_col_tiles, offset, offsetCS, vec, vec_out and chunk.
For wmma_m, wmma_n and wmma_k, which have been defined previously: if wmma_m = 8, then wmma_n is naturally 32 and wmma_k is 16. This set of parameters essentially determines which of the <8,32,16>, <32,8,16> and <16,16,16> interfaces is invoked.
block_row_warps and block_col_warps are bound to threadIdx.y and threadIdx.z respectively. Because wmma is executed in units of a warp (32 threads), the range of threadIdx.x is fixed to 32, so threadIdx.y and threadIdx.z have no further restriction and only need to be set to multiples of 2; for example, they can be set to 2 and 4.
warp_row_tiles and warp_col_tiles are, respectively, the numbers of wmma_m and wmma_n tiles computed per warp within one block.
Offset is the offset set to reduce shared-memory bank conflicts for the input data and the weight data (int8); bank conflicts in the shared memory can greatly affect performance. OffsetCS is the offset used when storing the matrix multiplication result (int32) into the shared memory, again to mitigate bank conflicts. Because of the difference between the int8 and int32 types, these two offsets need to be set separately.
Vec is the vectorization parameter of the input data and weight data (int8), and vec_out is the vectorization parameter of the matrix multiplication result. The vectorization methods currently supported are int2 (two int32 processed together) and int4 (four int32 processed together), so for the input data and the weight data the vec parameter can be set to 1 (a single int8), 4 (4 int8 make up a complete int32), 8 (8 int8, i.e. int2 vectorization), or 16 (16 int8, i.e. int4 vectorization). Similarly, the vec_out parameter can be set to 1, 2, or 4. Note that the 8 and 32 in int8 and int32 refer to 8-bit and 32-bit, while the 2 and 4 in int2 and int4 refer to how many int32 values are processed at a time.
Chunk is the number of wmma_k processed at a time. For example, chunk = 2 means two wmma_k are processed at a time; since wmma_k is always 16 for int8, this corresponds to the in_dim (input dimension) axis of the data, that is, 32 elements of the input dimension are processed at a time.
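For concreteness, one possible assignment of these key parameters is shown below; the particular values are illustrative assumptions, not values mandated by the method, and would normally be tuned:

    wmma_m, wmma_n, wmma_k = 8, 32, 16       # which int8 wmma shape is called
    block_row_warps, block_col_warps = 2, 4  # bound to threadIdx.y and threadIdx.z
    warp_row_tiles, warp_col_tiles = 2, 4    # wmma_m / wmma_n tiles computed per warp
    offset, offsetCS = 16, 8                 # shared-memory offsets to ease bank conflicts
    vec, vec_out = 16, 4                     # vectorization of int8 inputs / int32 results
    chunk = 2                                # wmma_k blocks per step: 2 * 16 = 32 in_dim elements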
Let the layout of the input data be (batch, in_dim) and the layout of the weight matrix be (out_dim, in_dim); the input data has two axes, batch and in_dim, and the weight matrix also has two axes, out_dim and in_dim. The scheduling of the C stage is detailed in fig. 3. The dense computation stage C mainly involves two axes, the batch axis and the output-dimension axis, and in TVM we perform data partitioning on these two axes separately. The batch axis is split into blockIdx.x and wmma_m * warp_row_tiles * block_row_warps; the output-dimension axis is split into blockIdx.y and wmma_n * warp_col_tiles * block_col_warps, and a reorder operation is performed, so that the overall partition order becomes blockIdx.x, blockIdx.y, wmma_m * warp_row_tiles * block_row_warps, wmma_n * warp_col_tiles * block_col_warps; after further splitting and thread binding, the final form of the division is blockIdx.x, blockIdx.y, threadIdx.z, threadIdx.y, threadIdx.x, and the vectorization parameter.
The order of this division is the order of the finally generated loops: dimensions further back in the sequence are inner dimensions of the whole computation loop. For example, threadIdx.x followed by the vectorization parameter means that the vectorization parameter is the innermost dimension and threadIdx.x lies one level outside it, corresponding roughly to
for (int j = 0; j < extent_of_threadIdx_x; j++)
    for (int i = 0; i < vec; i++)
        ...;
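Continuing the schedule sketch above (s and C as defined there), a condensed sketch of this partitioning of the C stage, using the standard TVM primitives split, reorder, fuse, bind and vectorize together with the key parameters defined earlier (the factor 32 is the fixed threadIdx.x range), might read:

    block_x = te.thread_axis("blockIdx.x")
    block_y = te.thread_axis("blockIdx.y")
    thread_x = te.thread_axis("threadIdx.x")
    thread_y = te.thread_axis("threadIdx.y")
    thread_z = te.thread_axis("threadIdx.z")

    block_factor_b = wmma_m * warp_row_tiles * block_row_warps  # tile of the batch axis
    block_factor_o = wmma_n * warp_col_tiles * block_col_warps  # tile of the output axis

    b, o = C.op.axis
    block_i, bc = s[C].split(b, factor=block_factor_b)
    block_j, oc = s[C].split(o, factor=block_factor_o)
    s[C].reorder(block_i, block_j, bc, oc)
    t = s[C].fuse(bc, oc)
    t, vi = s[C].split(t, factor=vec_out)          # innermost: vectorized write-back
    t, tx = s[C].split(t, factor=32)               # threadIdx.x fixed to one warp of 32
    t, ty = s[C].split(t, factor=block_row_warps)
    t, tz = s[C].split(t, factor=block_col_warps)
    s[C].bind(block_i, block_x)
    s[C].bind(block_j, block_y)
    s[C].bind(tz, thread_z)
    s[C].bind(ty, thread_y)
    s[C].bind(tx, thread_x)
    s[C].vectorize(vi)                             # blockIdx.x/y, threadIdx.z/y/x, vector lane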
The scheduling of the CS procedure is detailed in fig. 4, similar to C.
The scheduling of CF is detailed in fig. 5. The data loaded into wmma.matrix_a and wmma.matrix_b are computed through the wmma::mma_sync interface and the result is placed into wmma.accumulator. The CF computation is mainly placed at the part of CS after block_col_warps, and CF follows the earlier division of the other parts of CS.
The scheduling of AF is detailed in fig. 6. The input data are loaded from the shared memory into the memory scope wmma.matrix_a. The AF computation is mainly placed at the part of CF after chunk, and AF follows the earlier division of the other parts of CF.
The scheduling of BF is detailed in fig. 7. The weight data are loaded from the shared memory into the memory scope wmma.matrix_b. The BF computation is mainly placed at the part of CF after chunk, and BF follows the earlier division of the other parts of CF.
The scheduling of AS (BS) is detailed in fig. 8 and covers the scheduling of the input data and the weight matrix, which share one schedule, identified as AS (BS). The computation is mainly placed at the part of CF after chunk, and AS (BS) follows the division of the other parts of CF. In this stage, double-buffer optimization can be added, that is, the data required for the next computation are fetched into the shared memory in advance; the double_buffer() function can be called in TVM to do this.
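A sketch of the shared-memory load stage for AS (BS is scheduled the same way) with vectorized access and double buffering, assuming ko is the outer reduction (chunk) axis of CF obtained from a split not shown here, is:

    s[AS].compute_at(s[CF], ko)                # ko: outer reduction (chunk) axis of CF, assumed defined
    xo, yo = AS.op.axis
    t = s[AS].fuse(xo, yo)
    t, vi = s[AS].split(t, factor=vec)         # vectorized int8 loads
    t, tx = s[AS].split(t, factor=32)          # threadIdx.x: one warp
    t, ty = s[AS].split(t, factor=block_row_warps)
    _, tz = s[AS].split(t, factor=block_col_warps)
    s[AS].bind(tz, thread_z)
    s[AS].bind(ty, thread_y)
    s[AS].bind(tx, thread_x)
    s[AS].vectorize(vi)
    s[AS].double_buffer()                      # prefetch the next tile while the current one is computed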
In addition, in order to replace the ordinary code with the wmma interface, accurate step-size (stride) parameters must be provided to perform the tensorize operation before calling the wmma interface.
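As a skeleton only (the intrin_* objects stand for intrinsics that would be declared with te.decl_tensor_intrin and carry the accurate stride parameters; the axis names come from the per-fragment splits of extent wmma_m / wmma_n / wmma_k), the tensorize calls look like:

    s[AF].tensorize(af_b_ii, intrin_load_a)   # lowered to wmma::load_matrix_sync for matrix_a
    s[BF].tensorize(bf_o_ii, intrin_load_b)   # lowered to wmma::load_matrix_sync for matrix_b
    s[CF].tensorize(cf_b_ii, intrin_mma)      # lowered to wmma::mma_sync
    s[CS].tensorize(cs_b_ii, intrin_store)    # lowered to wmma::store_matrix_sync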
It can be seen from the foregoing embodiment that the full connection layer calculation method of a neural network provided by the embodiment of the present invention receives input data in the format of 8-bit integer variables and determines whether the input dimension, output dimension, and single sample of the input data satisfy the shape requirement of the calling interface of the tensor core; defines a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; schedules the input data, the weights and the bias based on the calculation rule, and calls a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and outputs the two-dimensional tensor to the next layer as the calculation result of the full connection layer. This technical scheme can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, greatly improving calculation performance.
It should be particularly noted that, the steps in the embodiments of the fully-connected layer calculation method for a neural network described above may be mutually intersected, replaced, added, and deleted, and therefore, the fully-connected layer calculation method for a neural network transformed by these reasonable permutations and combinations shall also fall within the scope of the present invention, and shall not limit the scope of the present invention to the described embodiments.
In view of the above objects, a second aspect of the embodiments of the present invention proposes an embodiment of a fully-connected layer computing apparatus capable of calling a TensorCore computation unit at the dense layer for int8 input to perform convolution. The fully-connected layer computing device of the neural network comprises:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on input data of the 8-bit integer variable, a weight of the 8-bit integer variable, and a bias of the 32-bit integer variable in response to satisfying the shape requirement;
scheduling input data, weight and bias based on a calculation rule, and calling a calculation unit of a tensor core through a calling interface to perform matrix multiplication to obtain a two-dimensional tensor;
and outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer.
In some embodiments, scheduling the input data, the weights, and the biases to perform a matrix multiplication operation by invoking a computation unit of the tensor core based on the computation rule to obtain the two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights so as to load the input data and the weights from the global memory to the shared memory;
loading input data and weight which are subjected to data division from a shared memory to a matrix memory scope of a calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through the matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and weights based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
In some embodiments, the steps further comprise: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through the matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than through the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
As can be seen from the foregoing embodiments, the fully-connected layer computing device of a neural network according to the embodiments of the present invention receives input data in the format of 8-bit integer variables and determines whether the input dimension, output dimension, and single sample of the input data satisfy the shape requirement of the calling interface of the tensor core; defines a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and a bias of 32-bit integer variables in response to the shape requirement being satisfied; schedules the input data, the weights and the bias based on the calculation rule, and calls a calculation unit of the tensor core through the calling interface to perform matrix multiplication to obtain a two-dimensional tensor; and outputs the two-dimensional tensor to the next layer as the calculation result of the full connection layer. This technical scheme can call the TensorCore calculation unit to execute convolution for int8 input in the dense layer, greatly improving calculation performance.
It should be noted that the embodiment of the fully-connected layer computing apparatus of the neural network described above uses the embodiment of the fully-connected layer computing method of the neural network to describe the working process of each module in detail, and those skilled in the art can readily appreciate that these modules may be applied to other embodiments of the fully-connected layer computing method of the neural network. Of course, since the steps in the embodiment of the fully-connected layer computing method of the neural network can be mutually intersected, replaced, added, or deleted, fully-connected layer computing devices of the neural network obtained by such reasonable permutations, combinations and transformations shall also belong to the scope of the present invention, and the scope of the present invention shall not be limited to the described embodiments.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit or scope of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (8)

1. A full connection layer calculation method of a neural network is characterized by comprising the following steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on the input data for an 8-bit integer variable, a weight for the 8-bit integer variable, and a bias for the 32-bit integer variable in response to the shape requirement being satisfied;
scheduling the input data, the weight and the bias based on the calculation rule so as to call a calculation unit of the tensor core through the calling interface to perform matrix multiplication operation to obtain a two-dimensional tensor;
outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer;
wherein scheduling the input data, the weights, and the biases based on the computation rule to invoke the computation unit of the tensor core through the invocation interface to perform matrix multiplication, and obtaining a two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights to load the input data and the weights from the global memory to a shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through a matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and the weight based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
2. The method of claim 1, wherein determining whether the input dimensions, output dimensions, and single-time samples of the input data satisfy shape requirements of a calling interface of a tensor kernel comprises:
determining a shape of the input data, a shape of the weights, and a shape of the bias based on the input data;
determining values of the input dimension, the output dimension, and the single sample of the input data based on a shape of the input data, a shape of the weight, and a shape of the bias;
determining that the shape requirement is satisfied in response to: the values of the input dimension, the output dimension, and the single sample all being divisible by 16; or the input dimension being divisible by 16, the output dimension by 8, and the single sample by 32; or the input dimension being divisible by 16, the output dimension by 32, and the single sample by 8.
3. The method of claim 2, further comprising: invoking a computational unit that uses a DP4A instruction set to directly compute the two-dimensional tensor in response to the shape requirement not being satisfied, and outputting it to the next layer as the result of computation of the fully-connected layer.
4. The method of claim 1, further comprising: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through a matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
5. The method of claim 1, further comprising: before calling the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface of the calling interface, performing a tensorize operation based on step-size parameters on the data-partitioned input data, weights, and two-dimensional tensor, so that they can be recognized by the computing platform used by the matrix loading interface, the matrix multiplication calculation interface, or the matrix storage interface.
6. The method of claim 1, wherein performing data partitioning on the input data and the weights to load from the global memory to the shared memory further comprises: performing double-buffering optimization using a function provided by a compiler of the neural network.
7. A fully-connected layer computing device for a neural network, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
receiving input data with a format of 8-bit integer variables, and judging whether input dimensionality, output dimensionality and single sample of the input data meet the shape requirement of a calling interface of a tensor core;
defining a calculation rule based on the input data of 8-bit integer variables, weights of 8-bit integer variables, and biases of 32-bit integer variables in response to satisfying the shape requirement;
scheduling the input data, the weight and the bias based on the calculation rule so as to call a calculation unit of the tensor core through the calling interface to perform matrix multiplication operation to obtain a two-dimensional tensor;
outputting the two-dimensional tensor to the next layer as a calculation result of the full connection layer;
wherein scheduling the input data, the weight, and the bias based on the calculation rule to call the calculation unit of the tensor core through the calling interface to perform matrix multiplication, and obtaining a two-dimensional tensor comprises: defining key parameters used for executing scheduling, and executing the following steps based on the key parameters:
performing data partitioning on the input data and the weights to load the input data and the weights from the global memory to a shared memory;
loading the input data and the weight which are subjected to data division from the shared memory to a matrix memory scope of the calling interface through a matrix loading interface of the calling interface;
and in the matrix memory scope, performing, by the calculation unit of the tensor core through a matrix multiplication calculation interface of the calling interface, matrix multiplication on the data-partitioned input data and the weight based on the calculation rule to obtain a data-partitioned unbiased two-dimensional tensor.
8. The apparatus of claim 7, wherein the steps further comprise: additionally performing the following steps based on the key parameters:
loading the data-partitioned unbiased two-dimensional tensor from the matrix memory scope to the shared memory through a matrix storage interface of the calling interface, and adding the bias to the data-partitioned unbiased two-dimensional tensor in the shared memory based on the calculation rule to obtain a data-partitioned two-dimensional tensor, wherein adding the bias is performed using a preset scheduling mode rather than the calling interface;
and loading the data-partitioned two-dimensional tensors from the shared memory to the global memory, where they are spliced into the two-dimensional tensor.
CN202010725384.9A 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network Active CN111860838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010725384.9A CN111860838B (en) 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010725384.9A CN111860838B (en) 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network

Publications (2)

Publication Number Publication Date
CN111860838A CN111860838A (en) 2020-10-30
CN111860838B true CN111860838B (en) 2022-12-20

Family

ID=72950071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010725384.9A Active CN111860838B (en) 2020-07-24 2020-07-24 Full connection layer calculation method and device of neural network

Country Status (1)

Country Link
CN (1) CN111860838B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232496A (en) * 2020-09-17 2021-01-15 苏州浪潮智能科技有限公司 Method, system, equipment and medium for processing int4 data type based on Tenscorore

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181503A1 (en) * 2015-08-02 2018-06-28 Wave Computing, Inc. Data flow computation using fifos
CN109740739A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Neural computing device, neural computing method and Related product
CN110059733A (en) * 2019-04-01 2019-07-26 苏州科达科技股份有限公司 The optimization and fast target detection method, device of convolutional neural networks
CN110807479A (en) * 2019-10-23 2020-02-18 中国人民解放军国防科技大学 Neural network convolution calculation acceleration method based on Kmeans algorithm

Also Published As

Publication number Publication date
CN111860838A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
US20210319284A1 (en) System and architecture including processor and neural network accelerator
CN107844826B (en) Neural network processing unit and processing system comprising same
US20220222531A1 (en) Asynchronous neural network training
US20210350214A1 (en) Convolutional neural network computing method and system based on weight kneading
US11093682B2 (en) Language and compiler that generate synchronous digital circuits that maintain thread execution order
CN115437795B (en) Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
US20210182682A1 (en) Learning task compiling method of artificial intelligence processor and related products
CN111860838B (en) Full connection layer calculation method and device of neural network
CN110633785A (en) Method and system for calculating convolutional neural network
CN113379070A (en) Deep learning framework conversion method, system, storage medium and equipment
CN112001491A (en) Search method and device for determining neural network architecture for processor
Cohen et al. An adaptive robust optimization model for parallel machine scheduling
US8826199B2 (en) System and method for development of a system architecture
US8346704B2 (en) Controlled constraint sharing in parallel problem solvers
EP3754559A1 (en) Optimization system and control method for optimization system
EP3745319A1 (en) Optimization apparatus and optimization method
WO2022057459A1 (en) Tensorcore-based int4 data type processing method and system, device, and medium
CN112990461B (en) Method, device, computer equipment and storage medium for constructing neural network model
CN115983429B (en) Construction strategy optimization method, system, terminal and medium based on BIM model
CN114358253A (en) Time estimation method of neural network model and related product
US20230078203A1 (en) Configurable nonlinear activation function circuits
CN115496181A (en) Chip adaptation method, device, chip and medium of deep learning model
CN115480919A (en) Convolution optimization operation method and device, computer equipment and storage medium
CN114912570A (en) Method, device and equipment for accelerating neural network model optimization and readable medium
CN112214718A (en) Method for solving bounded knapsack problem based on improved dynamic programming algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant