CN112732630A - Floating-point matrix multiplier many-core parallel optimization method for deep learning - Google Patents

Floating-point matrix multiplier many-core parallel optimization method for deep learning

Info

Publication number
CN112732630A
Authority
CN
China
Prior art keywords
matrix
data
slave core
row
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910975075.4A
Other languages
Chinese (zh)
Inventor
刘沙
刘鑫
黄则强
陈德训
朱传家
彭超
裴阳
陆旭峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910975075.4A
Publication of CN112732630A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17318Parallel communications techniques, e.g. gather, scatter, reduce, broadcast, multicast, all to all
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/542Event management; Broadcasting; Multicasting; Notifications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, which comprises expanding an input matrix and an output matrix and computing on a slave-core array based on block matrix multiplication. Expanding the input and output matrices comprises the following steps: S1, pre-applying for the space of the expansion matrix; S2, arranging the input and output matrices in the pre-applied space; S3, for the part where the row direction M is not exactly divided, reading the N data of each row whose row index is less than M, zero-padding the non-divided columns to N_size data, and writing them back to the corresponding position of the expansion matrix; S4, for the part where the column direction N is not exactly divided, first reading the right-hand non-divided data from the original matrix into local memory with strided reads, then zero-padding columns N+1 to N_size in reverse order. The invention reduces memory-access overhead, widens the applicable range of the algorithm, and supports matrices of arbitrary dimensions well.

Description

Floating-point matrix multiplier many-core parallel optimization method for deep learning
Technical Field
The invention relates to a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, and belongs to the technical field of deep learning.
Background
In deep learning, data propagate forward and backward through the different neural network layers as tensors, and the parameters are optimized through gradient updates to achieve accurate prediction. This process involves a large number of matrix computations; matrix products and convolutions account for the largest share of the computation, and the computational core of both is matrix multiplication. The performance of the matrix multiplication operator directly determines the performance of matrix multiplication and convolution and therefore has a large influence on the overall performance of deep learning computation, so optimizing the related algorithms is of lasting significance to the field.
Matrix multiplication is a typical operator in deep learning. On the one hand its computation is highly regular; on the other hand, the matrix shapes that arise in actual computation are arbitrary, so all possible matrix sizes must be considered. Optimizing matrix multiplication therefore needs to fully exploit the independence of its computations while remaining applicable to matrices of any shape.
Current matrix multiplication optimization algorithms adopt a uniform blocking strategy and exploit the regularity and independence of the computation, but they have two main problems: (1) data are read and written repeatedly, with no data sharing; (2) blocking implementations for matrices of different shapes, especially when one dimension is small, are lacking, although such computations are very common in deep learning.
Disclosure of Invention
The invention aims to provide a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, which reduces memory-access overhead, widens the applicable range of the algorithm, and supports matrices of arbitrary dimensions well.
In order to achieve this purpose, the invention adopts the following technical scheme: a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning comprises expanding an input matrix and an output matrix and computing on a slave-core array based on block matrix multiplication;
Expanding the input matrix and the output matrix comprises the following steps:
S1, pre-applying for the space of the expansion matrix for the input matrix and the output matrix: for an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions M_size and N_size are determined and memory is allocated, where M_size = MAX(M × 2, blkM), N_size = MAX(N × 2, blkN);
S2, arranging the input matrix and the output matrix in the pre-applied space: matrix blocks for which M and N satisfy the exact-division condition are kept in the original matrix; for matrix blocks for which M or N does not satisfy the exact-division condition, 32 slave cores are assigned to the M direction and to the N direction respectively, two padding strategies are used for the two directions, and the input and output matrices are written to the corresponding positions of the expansion matrix through many-core reads and write-backs;
S3, for the part where the row direction M is not exactly divided: rows whose row index is less than M are read (N data per row), the non-divided columns are zero-padded to N_size data, and each row is written back to the corresponding position of the expansion matrix according to its row index; rows whose row index is not less than M are not read, a row of N_size zeros is generated in the row direction and written back to the corresponding position of the expansion matrix;
S4, for the part where the column direction N is not exactly divided: the right-hand non-divided data are first read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order, and the expanded data are then written back to the corresponding position of the expansion matrix with strided write-backs according to the row and column indices of the read data;
Computing on the slave-core array based on block matrix multiplication comprises the following steps:
S5, dividing the expansion matrices of the input and output matrices evenly into 8 × 8 data blocks according to the structure of the slave-core array, so as to map the data blocks onto the slave-core array;
S6, each slave core reads one data block of the input and output matrices from main memory through DMA according to its row and column coordinates in the 8 × 8 slave-core array;
S7, each slave core computes only one of the 8 × 8 output matrix blocks; based on block matrix multiplication and using inter-core row-broadcast and column-broadcast communication, the whole computation is completed in 8 steps. Taking slave core (i, j) and input matrices W and Di as an example, the specific steps are as follows:
S71, the slave core with column coordinate 0 broadcasts and shares its local W data along its row, and the slave core with row coordinate 0 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 0) and Di(0, j) data blocks, completes the computation of W(i, 0) × Di(0, j), and accumulates the result into its output matrix block;
S72, the slave core with column coordinate 1 broadcasts and shares its local W data along its row, and the slave core with row coordinate 1 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 1) and Di(1, j) data blocks, completes the computation of W(i, 1) × Di(1, j), and accumulates the result into its output matrix block;
S73, the slave core with column coordinate 2 broadcasts and shares its local W data along its row, and the slave core with row coordinate 2 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 2) and Di(2, j) data blocks, completes the computation of W(i, 2) × Di(2, j), and accumulates the result into its output matrix block;
S74, the slave core with column coordinate 3 broadcasts and shares its local W data along its row, and the slave core with row coordinate 3 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 3) and Di(3, j) data blocks, completes the computation of W(i, 3) × Di(3, j), and accumulates the result into its output matrix block;
S75, the slave core with column coordinate 4 broadcasts and shares its local W data along its row, and the slave core with row coordinate 4 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 4) and Di(4, j) data blocks, completes the computation of W(i, 4) × Di(4, j), and accumulates the result into its output matrix block;
S76, the slave core with column coordinate 5 broadcasts and shares its local W data along its row, and the slave core with row coordinate 5 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 5) and Di(5, j) data blocks, completes the computation of W(i, 5) × Di(5, j), and accumulates the result into its output matrix block;
S77, the slave core with column coordinate 6 broadcasts and shares its local W data along its row, and the slave core with row coordinate 6 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 6) and Di(6, j) data blocks, completes the computation of W(i, 6) × Di(6, j), and accumulates the result into its output matrix block;
S78, the slave core with column coordinate 7 broadcasts and shares its local W data along its row, and the slave core with row coordinate 7 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 7) and Di(7, j) data blocks, completes the computation of W(i, 7) × Di(7, j), and accumulates the result into its output matrix block.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
the invention relates to a many-core parallel optimization method for a floating-point matrix multiplier for deep learning, which performs code-level optimization on a matrix multiplier aiming at the calculation characteristic of deep learning, has better support on matrix multiplication calculation in the deep learning, can accelerate by using many cores no matter the dimension of the matrix, and improves the solving efficiency of the problems, thereby improving the overall operation efficiency and the calculation performance of deep learning application; meanwhile, the method has universality and a certain acceleration effect on matrix calculation in other fields, can reduce the access cost, can expand the application range of the algorithm and can well support any dimensionality.
Drawings
FIG. 1 is a schematic diagram of a slave core performing computations through inter-core communication;
FIG. 2 is a schematic diagram of matrix expansion;
FIG. 3 is a flow chart of a floating point matrix multiplier many-core parallel optimization method for deep learning according to the present invention.
Detailed Description
Embodiment: a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning comprises expanding an input matrix and an output matrix and computing on a slave-core array based on block matrix multiplication;
Expanding the input matrix and the output matrix comprises the following steps:
S1, pre-applying for the space of the expansion matrix for the input matrix and the output matrix: for an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions M_size and N_size are determined and memory is allocated, where M_size = MAX(M × 2, blkM), N_size = MAX(N × 2, blkN);
S2, arranging the input matrix and the output matrix in the pre-applied space: matrix blocks for which M and N satisfy the exact-division condition are kept in the original matrix; for matrix blocks for which M or N does not satisfy the exact-division condition, 32 slave cores are assigned to the M direction and to the N direction (the row direction and the column direction) respectively, two padding strategies are used for the two directions, and the input and output matrices are written to the corresponding positions of the expansion matrix through many-core reads and write-backs;
S3, for the part where the row direction M is not exactly divided (the bottom part): rows whose row index is less than M are read (N data per row), the non-divided columns are zero-padded to N_size data, and each row is written back to the corresponding position of the expansion matrix according to its row index; rows whose row index is not less than M are not read, a row of N_size zeros is generated in the row direction and written back to the corresponding position of the expansion matrix;
S4, for the part where the column direction N is not exactly divided (the right-hand part): the right-hand non-divided data are first read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order, and the expanded data are then written back to the corresponding position of the expansion matrix with strided write-backs according to the row and column indices of the read data;
Computing on the slave-core array based on block matrix multiplication comprises the following steps:
S5, dividing the expansion matrices of the input and output matrices evenly into 8 × 8 data blocks according to the structure of the slave-core array, so as to map the data blocks onto the slave-core array;
S6, each slave core reads one data block of the input and output matrices from main memory through DMA according to its row and column coordinates in the 8 × 8 slave-core array;
S7, each slave core computes only one of the 8 × 8 output matrix blocks; based on block matrix multiplication and using inter-core row-broadcast and column-broadcast communication, the whole computation is completed in 8 steps. Taking slave core (i, j) and input matrices W and Di as an example, the specific steps are as follows:
S71, the slave core with column coordinate 0 broadcasts and shares its local W data along its row, and the slave core with row coordinate 0 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 0) and Di(0, j) data blocks, completes the computation of W(i, 0) × Di(0, j), and accumulates the result into its output matrix block;
S72, the slave core with column coordinate 1 broadcasts and shares its local W data along its row, and the slave core with row coordinate 1 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 1) and Di(1, j) data blocks, completes the computation of W(i, 1) × Di(1, j), and accumulates the result into its output matrix block;
S73, the slave core with column coordinate 2 broadcasts and shares its local W data along its row, and the slave core with row coordinate 2 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 2) and Di(2, j) data blocks, completes the computation of W(i, 2) × Di(2, j), and accumulates the result into its output matrix block;
S74, the slave core with column coordinate 3 broadcasts and shares its local W data along its row, and the slave core with row coordinate 3 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 3) and Di(3, j) data blocks, completes the computation of W(i, 3) × Di(3, j), and accumulates the result into its output matrix block;
S75, the slave core with column coordinate 4 broadcasts and shares its local W data along its row, and the slave core with row coordinate 4 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 4) and Di(4, j) data blocks, completes the computation of W(i, 4) × Di(4, j), and accumulates the result into its output matrix block;
S76, the slave core with column coordinate 5 broadcasts and shares its local W data along its row, and the slave core with row coordinate 5 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 5) and Di(5, j) data blocks, completes the computation of W(i, 5) × Di(5, j), and accumulates the result into its output matrix block;
S77, the slave core with column coordinate 6 broadcasts and shares its local W data along its row, and the slave core with row coordinate 6 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 6) and Di(6, j) data blocks, completes the computation of W(i, 6) × Di(6, j), and accumulates the result into its output matrix block;
S78, the slave core with column coordinate 7 broadcasts and shares its local W data along its row, and the slave core with row coordinate 7 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 7) and Di(7, j) data blocks, completes the computation of W(i, 7) × Di(7, j), and accumulates the result into its output matrix block.
The examples are further explained below:
A non-transposed matrix multiplication algorithm is basically implemented as follows:
/* Baseline GEMM: Do = alpha * W * Di + beta * Do, with W of size N x K,
   Di of size K x M and Do of size N x M. */
for (cN = 0; cN < N; cN++)
    for (cM = 0; cM < M; cM++) {
        tmp = 0;
        for (cK = 0; cK < K; cK++) {
            tmp += W[cN][cK] * Di[cK][cM];
        }
        Do[cN][cM] = alpha * tmp + beta * Do[cN][cM];
    }
According to this implementation, the computations of different output elements access overlapping data. When the computation is parallelized on the many cores, if blocking is done by the output matrix and the slave-core array is treated as a whole, with each slave core computing its own part, different slave cores will perform redundant memory accesses.
In order to reduce the memory-access overhead caused by repeated reads and writes and to make full use of the two input matrices, the following optimization method is proposed, based on block matrix multiplication:
mapping the input matrix and the output matrix onto an 8 × 8 slave-core array in units of blocks;
each slave core reading one data block from main memory through DMA according to its row and column coordinates;
each slave core computing only one of the 8 × 8 output matrix blocks, the block matrix multiplication being completed in 8 steps, as sketched in the code after this list.
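A minimal C-style sketch of this stepped scheme for one slave core is given below. It assumes the 8 × 8 array described above; row_broadcast/row_receive, col_broadcast/col_receive and block_gemm_acc are hypothetical stand-ins for the platform's inter-core broadcast interfaces and a local block multiply-accumulate kernel, and are not named in this text.
/* Hypothetical inter-core communication and local GEMM primitives; stand-ins
   for the platform's register-communication interfaces (not from this text). */
extern void row_broadcast(const float *buf, int n);  /* share buf with own row     */
extern void row_receive(float *buf, int n);          /* receive a row broadcast    */
extern void col_broadcast(const float *buf, int n);  /* share buf with own column  */
extern void col_receive(float *buf, int n);          /* receive a column broadcast */
extern void block_gemm_acc(float *C, const float *A, const float *B,
                           int m, int k, int n);     /* C += A(m x k) * B(k x n)   */

#define P 8  /* the slave-core array is 8 x 8 */

/* Slave core (i, j): accumulate its Do block in P broadcast steps. */
void slave_block_matmul(int i, int j,
                        const float *W_loc, const float *Di_loc, float *Do_loc,
                        int bm, int bk, int bn)
{
    float W_recv[bm * bk], Di_recv[bk * bn];   /* C99 VLAs in local memory */

    for (int step = 0; step < P; step++) {
        /* Step k: the core with column coordinate k shares its W block along its
           row, and the core with row coordinate k shares its Di block along its
           column (steps S71-S78 above). */
        if (j == step) row_broadcast(W_loc, bm * bk);
        else           row_receive(W_recv, bm * bk);
        if (i == step) col_broadcast(Di_loc, bk * bn);
        else           col_receive(Di_recv, bk * bn);

        const float *Wk = (j == step) ? W_loc  : W_recv;
        const float *Dk = (i == step) ? Di_loc : Di_recv;

        /* Do(i, j) += W(i, step) * Di(step, j) */
        block_gemm_acc(Do_loc, Wk, Dk, bm, bk, bn);
    }
}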
For more intuition, take a 4 × 4 slave-core array as an example: the A, B and C matrices are read into the slave cores in uniformly sized blocks, as shown in FIG. 1, with slave core (0, 0) at the upper left and (3, 3) at the lower right.
Taking slave core (2, 1) as an example, four steps are needed to compute Do(2, 1):
a) Time step Time0:
the slave core with row coordinate 0 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(0, 1) block;
the slave core with column coordinate 0 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 0) block;
the block multiply-add of W(2, 0) and Di(0, 1) is performed on slave core (2, 1) to obtain the first partial result of Do(2, 1).
b) Time step Time1:
the slave core with row coordinate 1 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(1, 1) block;
the slave core with column coordinate 1 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 1) block;
the block multiply-add of W(2, 1) and Di(1, 1) is performed on slave core (2, 1) to obtain the second partial result of Do(2, 1).
c) Time step Time2:
the slave core with row coordinate 2 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(2, 1) block;
the slave core with column coordinate 2 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 2) block;
the block multiply-add of W(2, 2) and Di(2, 1) is performed on slave core (2, 1) to obtain the third partial result of Do(2, 1).
d) Time step Time3:
the slave core with row coordinate 3 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(3, 1) block;
the slave core with column coordinate 3 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 3) block;
the block multiply-add of W(2, 3) and Di(3, 1) is performed on slave core (2, 1) to obtain the final result of Do(2, 1).
In this way, data sharing across the slave-core array is achieved through inter-core communication when block matrix multiplication is performed: each slave core only needs to read one block of data through DMA and obtains the remaining data through the more efficient inter-core communication, which avoids repeated DMA reads and writes and reduces memory-access overhead.
To guarantee the feasibility of full-array block matrix multiplication, the matrix shape is constrained in two ways: the shape must satisfy the exact-division condition, and the matrices involved in the computation must be large enough to be blocked at least once.
To widen the applicable range of the algorithm, support for the non-exactly-divided case and for matrices with a small dimension must be added. The applicable range of the block matrix multiplication method on the slave-core array is extended by zero-padding the matrices, and the memory-access overhead introduced by the expansion is reduced by the following processing.
Pre-application space:
For an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions are determined as follows and memory is allocated:
M_size = MAX(M*2,blkM)
N_size = MAX(N*2,blkN)
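Read literally, this sizing rule can be sketched as follows; alloc_expansion and the use of calloc are illustrative assumptions rather than interfaces named in this text (the real code would use the platform's main-memory allocator, and zero-filling is done by the slave cores as described below).
#include <stdlib.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Pre-apply for the expansion-matrix space of an M x N matrix, following
   M_size = MAX(M*2, blkM) and N_size = MAX(N*2, blkN) as stated above. */
float *alloc_expansion(long M, long N, long blkM, long blkN,
                       long *M_size, long *N_size)
{
    *M_size = MAX(M * 2, blkM);
    *N_size = MAX(N * 2, blkN);
    /* calloc is used only for illustration; it also happens to zero the space. */
    return (float *)calloc((size_t)(*M_size) * (size_t)(*N_size), sizeof(float));
}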
The data are arranged in the pre-applied space, as shown in FIG. 2: the unstriped blocks, the cross-striped block and the vertically striped block belong to the original matrix, which is turned into the expansion matrix by adding the grid part; the unstriped part remains stored in the original matrix, while the striped parts are stored at the corresponding positions in the expansion matrix;
a) matrix blocks that satisfy the exact-division condition are kept in the original matrix;
b) for matrix blocks that do not satisfy the exact-division condition, 32 slave cores are assigned to each direction, and the blocks are expanded to the corresponding positions of the pre-applied space through many-core reads and write-backs, using two strategies depending on the position (a serial sketch of both strategies follows this list):
in the part where the row direction M is not exactly divided: rows whose row index is less than M are read in, the non-divided columns are zero-padded, and the row is written back to the corresponding position of the expansion matrix;
rows whose row index is not less than M are not read in; the expanded row is filled with zeros and written back to the corresponding position of the expansion matrix;
in the part where the column direction N is not exactly divided (the right-hand side): the right-hand non-divided data are read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order in units of M, and the expanded data are written back to the right-hand, non-bottom part of the expansion matrix with strided write-backs.
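The following is a minimal serial sketch of the two padding strategies. It is a simplification: it copies the whole M × N matrix into the expansion space row by row, whereas the method above keeps exactly divided blocks in the original matrix and lets 32 slave cores per direction handle only the non-divided edges; row-major layouts are assumed for illustration.
#include <string.h>

/* Serial sketch of the padding: rows 0..M-1 are read and zero-padded from
   column N to N_size; rows M..M_size-1 are not read and are filled with zeros. */
void expand_matrix(const float *orig, long M, long N,
                   float *ext, long M_size, long N_size)
{
    for (long r = 0; r < M_size; r++) {
        float *dst = ext + r * N_size;
        if (r < M) {
            memcpy(dst, orig + r * N, (size_t)N * sizeof(float));      /* read the row    */
            memset(dst + N, 0, (size_t)(N_size - N) * sizeof(float));  /* pad the columns */
        } else {
            memset(dst, 0, (size_t)N_size * sizeof(float));            /* padding row     */
        }
    }
}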
Through this expansion, the exactly divided blocks remain in the original matrix while the non-divided blocks are placed, zero-padded, in the expansion matrix. Block matrix multiplication is then performed on this basis by traversing the original matrix first and the expansion matrix afterwards; when some dimension is small, all data are written into the expansion matrix and the block traversal starts directly from the expansion matrix. For the computed results, the data written into the expansion matrix must be reverse-expanded and written back to the original matrix, as sketched below.
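A correspondingly simplified sketch of the reverse expansion, assuming the whole result was computed in a row-major M_size × N_size expansion buffer (the small-dimension case above); only the valid M × N region is copied back.
#include <string.h>

/* Copy the valid M x N region of the expansion-matrix result back into the
   original (row-major) output matrix. */
void write_back_result(const float *ext, long N_size,
                       float *orig, long M, long N)
{
    for (long r = 0; r < M; r++)
        memcpy(orig + r * N, ext + r * N_size, (size_t)N * sizeof(float));
}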
Matrix expansion widens the applicable range of the blocking algorithm, which can then block normally for different shapes, in particular matrices with a small dimension.
With this many-core parallel optimization method for the floating-point matrix multiplication operator for deep learning, the matrix multiplication operator is optimized at the code level for the computational characteristics of deep learning. Matrix multiplication in deep learning is supported well and can be accelerated with many cores regardless of the matrix dimensions, which improves the efficiency of solving such problems and thereby improves the overall operating efficiency and computational performance of deep learning applications. The method is also general and provides a certain acceleration for matrix computations in other fields; it reduces memory-access overhead, widens the applicable range of the algorithm, and supports matrices of arbitrary dimensions well.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (1)

1. A many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, characterized by comprising the following steps: expanding an input matrix and an output matrix, and computing on a slave-core array based on block matrix multiplication;
expanding the input matrix and the output matrix comprises the following steps:
S1, pre-applying for the space of the expansion matrix for the input matrix and the output matrix: for an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions M_size and N_size are determined and memory is allocated, where M_size = MAX(M × 2, blkM), N_size = MAX(N × 2, blkN);
S2, arranging the input matrix and the output matrix in the pre-applied space: matrix blocks for which M and N satisfy the exact-division condition are kept in the original matrix; for matrix blocks for which M or N does not satisfy the exact-division condition, 32 slave cores are assigned to the M direction and to the N direction respectively, two padding strategies are used for the two directions, and the input and output matrices are written to the corresponding positions of the expansion matrix through many-core reads and write-backs;
S3, for the part where the row direction M is not exactly divided: rows whose row index is less than M are read (N data per row), the non-divided columns are zero-padded to N_size data, and each row is written back to the corresponding position of the expansion matrix according to its row index; rows whose row index is not less than M are not read, a row of N_size zeros is generated in the row direction and written back to the corresponding position of the expansion matrix;
S4, for the part where the column direction N is not exactly divided: the right-hand non-divided data are first read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order, and the expanded data are then written back to the corresponding position of the expansion matrix with strided write-backs according to the row and column indices of the read data;
computing on the slave-core array based on block matrix multiplication comprises the following steps:
S5, dividing the expansion matrices of the input and output matrices evenly into 8 × 8 data blocks according to the structure of the slave-core array, so as to map the data blocks onto the slave-core array;
S6, each slave core reads one data block of the input and output matrices from main memory through DMA according to its row and column coordinates in the 8 × 8 slave-core array;
S7, each slave core computes only one of the 8 × 8 output matrix blocks; based on block matrix multiplication and using inter-core row-broadcast and column-broadcast communication, the whole computation is completed in 8 steps. Taking slave core (i, j) and input matrices W and Di as an example, the specific steps are as follows:
S71, the slave core with column coordinate 0 broadcasts and shares its local W data along its row, and the slave core with row coordinate 0 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 0) and Di(0, j) data blocks, completes the computation of W(i, 0) × Di(0, j), and accumulates the result into its output matrix block;
S72, the slave core with column coordinate 1 broadcasts and shares its local W data along its row, and the slave core with row coordinate 1 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 1) and Di(1, j) data blocks, completes the computation of W(i, 1) × Di(1, j), and accumulates the result into its output matrix block;
S73, the slave core with column coordinate 2 broadcasts and shares its local W data along its row, and the slave core with row coordinate 2 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 2) and Di(2, j) data blocks, completes the computation of W(i, 2) × Di(2, j), and accumulates the result into its output matrix block;
S74, the slave core with column coordinate 3 broadcasts and shares its local W data along its row, and the slave core with row coordinate 3 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 3) and Di(3, j) data blocks, completes the computation of W(i, 3) × Di(3, j), and accumulates the result into its output matrix block;
S75, the slave core with column coordinate 4 broadcasts and shares its local W data along its row, and the slave core with row coordinate 4 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 4) and Di(4, j) data blocks, completes the computation of W(i, 4) × Di(4, j), and accumulates the result into its output matrix block;
S76, the slave core with column coordinate 5 broadcasts and shares its local W data along its row, and the slave core with row coordinate 5 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 5) and Di(5, j) data blocks, completes the computation of W(i, 5) × Di(5, j), and accumulates the result into its output matrix block;
S77, the slave core with column coordinate 6 broadcasts and shares its local W data along its row, and the slave core with row coordinate 6 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 6) and Di(6, j) data blocks, completes the computation of W(i, 6) × Di(6, j), and accumulates the result into its output matrix block;
S78, the slave core with column coordinate 7 broadcasts and shares its local W data along its row, and the slave core with row coordinate 7 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 7) and Di(7, j) data blocks, completes the computation of W(i, 7) × Di(7, j), and accumulates the result into its output matrix block.
CN201910975075.4A 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning Withdrawn CN112732630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910975075.4A CN112732630A (en) 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910975075.4A CN112732630A (en) 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning

Publications (1)

Publication Number Publication Date
CN112732630A true CN112732630A (en) 2021-04-30

Family

ID=75588627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910975075.4A Withdrawn CN112732630A (en) 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning

Country Status (1)

Country Link
CN (1) CN112732630A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114675829A (en) * 2022-01-30 2022-06-28 华东师范大学 Performance optimization method for self-adaptive elimination of redundant computation and communication in distributed matrix computing system
CN114675829B (en) * 2022-01-30 2023-07-14 华东师范大学 Performance optimization method for adaptively eliminating redundant calculation and communication in distributed matrix computing system
WO2024027039A1 (en) * 2022-08-03 2024-02-08 北京登临科技有限公司 Data processing method and apparatus, and device and readable storage medium
CN117472448B (en) * 2023-12-28 2024-03-26 山东省计算中心(国家超级计算济南中心) Parallel acceleration method, device and medium for secondary core cluster of Shenwei many-core processor

Similar Documents

Publication Publication Date Title
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN113424201A (en) Neural network processor
CN112732630A (en) Floating-point matrix multiplier many-core parallel optimization method for deep learning
JP3639323B2 (en) Simultaneous linear equation calculation processing method and computer using memory distributed parallel computer
KR20230113851A (en) Spatial locality transform of matrices
CN107992943A (en) Addressed for convolutional neural networks
CN109948774A (en) Neural network accelerator and its implementation based on network layer binding operation
CN112748956A (en) High throughput matrix processor supporting simultaneous processing of multiple matrices
JP3675537B2 (en) Memory distributed parallel computer performing fast Fourier transform and method thereof
CN103336758A (en) Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
CN110399591B (en) Data processing method and device based on convolutional neural network
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN102968503A (en) Data processing method for database system, and database system
CN116303111A (en) Hardware double buffering using special purpose computing units
US10872038B1 (en) Memory organization for matrix processing
CN114281755B (en) Vector processor-oriented semi-precision vectorization convolution method and system
CN114565501A (en) Data loading method and device for convolution operation
CN109446478B (en) Complex covariance matrix calculation system based on iteration and reconfigurable mode
Rhe et al. VW-SDK: Efficient convolutional weight mapping using variable windows for processing-in-memory architectures
CN113312285B (en) Convolutional neural network accelerator and working method thereof
CN114330656A (en) Convolution operation hardware accelerator and data processing method
CN116881618B (en) General matrix multiplication calculation optimization method, device and processor
CN106484532A (en) GPGPU parallel calculating method towards SPH fluid simulation
CN111610963A (en) Chip structure and multiply-add calculation engine thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210430

WW01 Invention patent application withdrawn after publication