CN112732630A - Floating-point matrix multiplier many-core parallel optimization method for deep learning - Google Patents

Floating-point matrix multiplier many-core parallel optimization method for deep learning

Info

Publication number
CN112732630A
Authority
CN
China
Prior art keywords
matrix
data
slave core
row
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910975075.4A
Other languages
Chinese (zh)
Inventor
刘沙
刘鑫
黄则强
陈德训
朱传家
彭超
裴阳
陆旭峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910975075.4A
Publication of CN112732630A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17318Parallel communications techniques, e.g. gather, scatter, reduce, broadcast, multicast, all to all
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/542Event management; Broadcasting; Multicasting; Notifications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, which comprises expanding an input matrix and an output matrix and computing on a slave-core array based on block matrix multiplication. Expanding the input and output matrices comprises the following steps: S1, pre-applying for the space of the expansion matrix; S2, arranging the input and output matrices in the pre-applied space; S3, for the part where the row direction M is not exactly divided, reading the N data of each row whose row index is less than M, zero-padding the non-divided columns to N_size data, and writing them back to the corresponding position of the expansion matrix; S4, for the part where the column direction N is not exactly divided, first reading the right-hand non-divided data from the original matrix into local memory with strided reads, then zero-padding columns N+1 to N_size in reverse order. The invention reduces memory-access overhead, widens the applicable range of the algorithm, and supports matrices of arbitrary dimensions well.

Description

Floating-point matrix multiplier many-core parallel optimization method for deep learning
Technical Field
The invention relates to a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, and belongs to the technical field of deep learning.
Background
In deep learning, data propagate forward and backward through the different neural network layers as tensors, and the parameters are optimized through gradient updates to achieve accurate prediction. This process involves a large number of matrix computations; matrix products and convolutions account for the largest share of the computation, and the computational core of both is matrix multiplication. The performance of the matrix multiplication operator directly determines the performance of matrix multiplication and convolution and therefore has a large influence on the overall performance of deep learning computation, so optimizing the related algorithms is of lasting significance to the field.
Matrix multiplication is a typical operator in deep learning. On the one hand its computation is highly regular; on the other hand, the matrix shapes that arise in actual computation are arbitrary, so all possible matrix sizes must be considered. Optimizing matrix multiplication therefore needs to fully exploit the independence of its computations while remaining applicable to matrices of any shape.
Current matrix multiplication optimization algorithms adopt a uniform blocking strategy and exploit the regularity and independence of the computation, but they have two main problems: (1) data are read and written repeatedly, with no data sharing; (2) blocking implementations for matrices of different shapes, especially when one dimension is small, are lacking, although such computations are very common in deep learning.
Disclosure of Invention
The invention aims to provide a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, which reduces memory-access overhead, widens the applicable range of the algorithm, and supports matrices of arbitrary dimensions well.
In order to achieve this purpose, the invention adopts the following technical scheme: a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning comprises expanding an input matrix and an output matrix and computing on a slave-core array based on block matrix multiplication;
Expanding the input matrix and the output matrix comprises the following steps:
S1, pre-applying for the space of the expansion matrix for the input matrix and the output matrix: for an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions M_size and N_size are determined and memory is allocated, where M_size = MAX(M × 2, blkM), N_size = MAX(N × 2, blkN);
S2, arranging the input matrix and the output matrix in the pre-applied space: matrix blocks for which M and N satisfy the exact-division condition are kept in the original matrix; for matrix blocks for which M or N does not satisfy the exact-division condition, 32 slave cores are assigned to the M direction and to the N direction respectively, two padding strategies are used for the two directions, and the input and output matrices are written to the corresponding positions of the expansion matrix through many-core reads and write-backs;
S3, for the part where the row direction M is not exactly divided: rows whose row index is less than M are read (N data per row), the non-divided columns are zero-padded to N_size data, and each row is written back to the corresponding position of the expansion matrix according to its row index; rows whose row index is not less than M are not read, a row of N_size zeros is generated in the row direction and written back to the corresponding position of the expansion matrix;
S4, for the part where the column direction N is not exactly divided: the right-hand non-divided data are first read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order, and the expanded data are then written back to the corresponding position of the expansion matrix with strided write-backs according to the row and column indices of the read data;
Computing on the slave-core array based on block matrix multiplication comprises the following steps:
S5, dividing the expansion matrices of the input and output matrices evenly into 8 × 8 data blocks according to the structure of the slave-core array, so as to map the data blocks onto the slave-core array;
S6, each slave core reads one data block of the input and output matrices from main memory through DMA according to its row and column coordinates in the 8 × 8 slave-core array;
S7, each slave core computes only one of the 8 × 8 output matrix blocks; based on block matrix multiplication and using inter-core row-broadcast and column-broadcast communication, the whole computation is completed in 8 steps. Taking slave core (i, j) and input matrices W and Di as an example, the specific steps are as follows:
S71, the slave core with column coordinate 0 broadcasts and shares its local W data along its row, and the slave core with row coordinate 0 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 0) and Di(0, j) data blocks, completes the computation of W(i, 0) × Di(0, j), and accumulates the result into its output matrix block;
S72, the slave core with column coordinate 1 broadcasts and shares its local W data along its row, and the slave core with row coordinate 1 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 1) and Di(1, j) data blocks, completes the computation of W(i, 1) × Di(1, j), and accumulates the result into its output matrix block;
S73, the slave core with column coordinate 2 broadcasts and shares its local W data along its row, and the slave core with row coordinate 2 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 2) and Di(2, j) data blocks, completes the computation of W(i, 2) × Di(2, j), and accumulates the result into its output matrix block;
S74, the slave core with column coordinate 3 broadcasts and shares its local W data along its row, and the slave core with row coordinate 3 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 3) and Di(3, j) data blocks, completes the computation of W(i, 3) × Di(3, j), and accumulates the result into its output matrix block;
S75, the slave core with column coordinate 4 broadcasts and shares its local W data along its row, and the slave core with row coordinate 4 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 4) and Di(4, j) data blocks, completes the computation of W(i, 4) × Di(4, j), and accumulates the result into its output matrix block;
S76, the slave core with column coordinate 5 broadcasts and shares its local W data along its row, and the slave core with row coordinate 5 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 5) and Di(5, j) data blocks, completes the computation of W(i, 5) × Di(5, j), and accumulates the result into its output matrix block;
S77, the slave core with column coordinate 6 broadcasts and shares its local W data along its row, and the slave core with row coordinate 6 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 6) and Di(6, j) data blocks, completes the computation of W(i, 6) × Di(6, j), and accumulates the result into its output matrix block;
S78, the slave core with column coordinate 7 broadcasts and shares its local W data along its row, and the slave core with row coordinate 7 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 7) and Di(7, j) data blocks, completes the computation of W(i, 7) × Di(7, j), and accumulates the result into its output matrix block.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
the invention relates to a many-core parallel optimization method for a floating-point matrix multiplier for deep learning, which performs code-level optimization on a matrix multiplier aiming at the calculation characteristic of deep learning, has better support on matrix multiplication calculation in the deep learning, can accelerate by using many cores no matter the dimension of the matrix, and improves the solving efficiency of the problems, thereby improving the overall operation efficiency and the calculation performance of deep learning application; meanwhile, the method has universality and a certain acceleration effect on matrix calculation in other fields, can reduce the access cost, can expand the application range of the algorithm and can well support any dimensionality.
Drawings
FIG. 1 is a schematic diagram of a slave core performing computations through inter-core communication;
FIG. 2 is a schematic diagram of matrix expansion;
FIG. 3 is a flow chart of a floating point matrix multiplier many-core parallel optimization method for deep learning according to the present invention.
Detailed Description
Embodiment: a many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning comprises expanding an input matrix and an output matrix and computing on a slave-core array based on block matrix multiplication;
Expanding the input matrix and the output matrix comprises the following steps:
S1, pre-applying for the space of the expansion matrix for the input matrix and the output matrix: for an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions M_size and N_size are determined and memory is allocated, where M_size = MAX(M × 2, blkM), N_size = MAX(N × 2, blkN);
S2, arranging the input matrix and the output matrix in the pre-applied space: matrix blocks for which M and N satisfy the exact-division condition are kept in the original matrix; for matrix blocks for which M or N does not satisfy the exact-division condition, 32 slave cores are assigned to the M direction and to the N direction (the row direction and the column direction) respectively, two padding strategies are used for the two directions, and the input and output matrices are written to the corresponding positions of the expansion matrix through many-core reads and write-backs;
S3, for the part where the row direction M is not exactly divided (the bottom part): rows whose row index is less than M are read (N data per row), the non-divided columns are zero-padded to N_size data, and each row is written back to the corresponding position of the expansion matrix according to its row index; rows whose row index is not less than M are not read, a row of N_size zeros is generated in the row direction and written back to the corresponding position of the expansion matrix;
S4, for the part where the column direction N is not exactly divided (the right-hand part): the right-hand non-divided data are first read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order, and the expanded data are then written back to the corresponding position of the expansion matrix with strided write-backs according to the row and column indices of the read data;
Computing on the slave-core array based on block matrix multiplication comprises the following steps:
S5, dividing the expansion matrices of the input and output matrices evenly into 8 × 8 data blocks according to the structure of the slave-core array, so as to map the data blocks onto the slave-core array;
S6, each slave core reads one data block of the input and output matrices from main memory through DMA according to its row and column coordinates in the 8 × 8 slave-core array;
S7, each slave core computes only one of the 8 × 8 output matrix blocks; based on block matrix multiplication and using inter-core row-broadcast and column-broadcast communication, the whole computation is completed in 8 steps. Taking slave core (i, j) and input matrices W and Di as an example, the specific steps are as follows:
S71, the slave core with column coordinate 0 broadcasts and shares its local W data along its row, and the slave core with row coordinate 0 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 0) and Di(0, j) data blocks, completes the computation of W(i, 0) × Di(0, j), and accumulates the result into its output matrix block;
S72, the slave core with column coordinate 1 broadcasts and shares its local W data along its row, and the slave core with row coordinate 1 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 1) and Di(1, j) data blocks, completes the computation of W(i, 1) × Di(1, j), and accumulates the result into its output matrix block;
S73, the slave core with column coordinate 2 broadcasts and shares its local W data along its row, and the slave core with row coordinate 2 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 2) and Di(2, j) data blocks, completes the computation of W(i, 2) × Di(2, j), and accumulates the result into its output matrix block;
S74, the slave core with column coordinate 3 broadcasts and shares its local W data along its row, and the slave core with row coordinate 3 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 3) and Di(3, j) data blocks, completes the computation of W(i, 3) × Di(3, j), and accumulates the result into its output matrix block;
S75, the slave core with column coordinate 4 broadcasts and shares its local W data along its row, and the slave core with row coordinate 4 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 4) and Di(4, j) data blocks, completes the computation of W(i, 4) × Di(4, j), and accumulates the result into its output matrix block;
S76, the slave core with column coordinate 5 broadcasts and shares its local W data along its row, and the slave core with row coordinate 5 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 5) and Di(5, j) data blocks, completes the computation of W(i, 5) × Di(5, j), and accumulates the result into its output matrix block;
S77, the slave core with column coordinate 6 broadcasts and shares its local W data along its row, and the slave core with row coordinate 6 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 6) and Di(6, j) data blocks, completes the computation of W(i, 6) × Di(6, j), and accumulates the result into its output matrix block;
S78, the slave core with column coordinate 7 broadcasts and shares its local W data along its row, and the slave core with row coordinate 7 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 7) and Di(7, j) data blocks, completes the computation of W(i, 7) × Di(7, j), and accumulates the result into its output matrix block.
The examples are further explained below:
A non-transposed matrix multiplication algorithm is basically implemented as follows:
/* Baseline GEMM: Do = alpha * W * Di + beta * Do, with W of size N x K,
   Di of size K x M and Do of size N x M. */
for (cN = 0; cN < N; cN++)
    for (cM = 0; cM < M; cM++) {
        tmp = 0;
        for (cK = 0; cK < K; cK++) {
            tmp += W[cN][cK] * Di[cK][cM];
        }
        Do[cN][cM] = alpha * tmp + beta * Do[cN][cM];
    }
According to this implementation, the computations of different output elements access overlapping data. When the computation is parallelized on the many cores, if blocking is done by the output matrix and the slave-core array is treated as a whole, with each slave core computing its own part, different slave cores will perform redundant memory accesses.
In order to reduce the memory-access overhead caused by repeated reads and writes and to make full use of the two input matrices, the following optimization method is proposed, based on block matrix multiplication:
mapping the input matrix and the output matrix onto an 8 × 8 slave-core array in units of blocks;
each slave core reading one data block from main memory through DMA according to its row and column coordinates;
each slave core computing only one of the 8 × 8 output matrix blocks, the block matrix multiplication being completed in 8 steps, as sketched in the code after this list.
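A minimal C-style sketch of this stepped scheme for one slave core is given below. It assumes the 8 × 8 array described above; row_broadcast/row_receive, col_broadcast/col_receive and block_gemm_acc are hypothetical stand-ins for the platform's inter-core broadcast interfaces and a local block multiply-accumulate kernel, and are not named in this text.
/* Hypothetical inter-core communication and local GEMM primitives; stand-ins
   for the platform's register-communication interfaces (not from this text). */
extern void row_broadcast(const float *buf, int n);  /* share buf with own row     */
extern void row_receive(float *buf, int n);          /* receive a row broadcast    */
extern void col_broadcast(const float *buf, int n);  /* share buf with own column  */
extern void col_receive(float *buf, int n);          /* receive a column broadcast */
extern void block_gemm_acc(float *C, const float *A, const float *B,
                           int m, int k, int n);     /* C += A(m x k) * B(k x n)   */

#define P 8  /* the slave-core array is 8 x 8 */

/* Slave core (i, j): accumulate its Do block in P broadcast steps. */
void slave_block_matmul(int i, int j,
                        const float *W_loc, const float *Di_loc, float *Do_loc,
                        int bm, int bk, int bn)
{
    float W_recv[bm * bk], Di_recv[bk * bn];   /* C99 VLAs in local memory */

    for (int step = 0; step < P; step++) {
        /* Step k: the core with column coordinate k shares its W block along its
           row, and the core with row coordinate k shares its Di block along its
           column (steps S71-S78 above). */
        if (j == step) row_broadcast(W_loc, bm * bk);
        else           row_receive(W_recv, bm * bk);
        if (i == step) col_broadcast(Di_loc, bk * bn);
        else           col_receive(Di_recv, bk * bn);

        const float *Wk = (j == step) ? W_loc  : W_recv;
        const float *Dk = (i == step) ? Di_loc : Di_recv;

        /* Do(i, j) += W(i, step) * Di(step, j) */
        block_gemm_acc(Do_loc, Wk, Dk, bm, bk, bn);
    }
}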
For more intuition, take a 4 × 4 slave-core array as an example: the A, B and C matrices are read into the slave cores in uniformly sized blocks, as shown in FIG. 1, with slave core (0, 0) at the upper left and (3, 3) at the lower right.
Taking slave core (2, 1) as an example, four steps are needed to compute Do(2, 1):
a) Time step Time0:
the slave core with row coordinate 0 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(0, 1) block;
the slave core with column coordinate 0 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 0) block;
the block multiply-add of W(2, 0) and Di(0, 1) is performed on slave core (2, 1) to obtain the first partial result of Do(2, 1).
b) Time step Time1:
the slave core with row coordinate 1 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(1, 1) block;
the slave core with column coordinate 1 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 1) block;
the block multiply-add of W(2, 1) and Di(1, 1) is performed on slave core (2, 1) to obtain the second partial result of Do(2, 1).
c) Time step Time2:
the slave core with row coordinate 2 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(2, 1) block;
the slave core with column coordinate 2 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 2) block;
the block multiply-add of W(2, 2) and Di(2, 1) is performed on slave core (2, 1) to obtain the third partial result of Do(2, 1).
d) Time step Time3:
the slave core with row coordinate 3 broadcasts its own Di data to the slave cores in the same column, so slave core (2, 1) obtains the Di(3, 1) block;
the slave core with column coordinate 3 broadcasts its own W data to the slave cores in the same row, so slave core (2, 1) obtains the W(2, 3) block;
the block multiply-add of W(2, 3) and Di(3, 1) is performed on slave core (2, 1) to obtain the final result of Do(2, 1).
In this way, data sharing across the slave-core array is achieved through inter-core communication when block matrix multiplication is performed: each slave core only needs to read one block of data through DMA and obtains the remaining data through the more efficient inter-core communication, which avoids repeated DMA reads and writes and reduces memory-access overhead.
To guarantee the feasibility of full-array block matrix multiplication, the matrix shape is constrained in two ways: the shape must satisfy the exact-division condition, and the matrices involved in the computation must be large enough to be blocked at least once.
To widen the applicable range of the algorithm, support for the non-exactly-divided case and for matrices with a small dimension must be added. The applicable range of the block matrix multiplication method on the slave-core array is extended by zero-padding the matrices, and the memory-access overhead introduced by the expansion is reduced by the following processing.
Pre-application space:
For an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions are determined as follows and memory is allocated:
M_size = MAX(M*2,blkM)
N_size = MAX(N*2,blkN)
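Read literally, this sizing rule can be sketched as follows; alloc_expansion and the use of calloc are illustrative assumptions rather than interfaces named in this text (the real code would use the platform's main-memory allocator, and zero-filling is done by the slave cores as described below).
#include <stdlib.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Pre-apply for the expansion-matrix space of an M x N matrix, following
   M_size = MAX(M*2, blkM) and N_size = MAX(N*2, blkN) as stated above. */
float *alloc_expansion(long M, long N, long blkM, long blkN,
                       long *M_size, long *N_size)
{
    *M_size = MAX(M * 2, blkM);
    *N_size = MAX(N * 2, blkN);
    /* calloc is used only for illustration; it also happens to zero the space. */
    return (float *)calloc((size_t)(*M_size) * (size_t)(*N_size), sizeof(float));
}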
The data are arranged in the pre-applied space, as shown in FIG. 2: the unstriped blocks, the cross-striped block and the vertically striped block belong to the original matrix, which is turned into the expansion matrix by adding the grid part; the unstriped part remains stored in the original matrix, while the striped parts are stored at the corresponding positions in the expansion matrix;
a) matrix blocks that satisfy the exact-division condition are kept in the original matrix;
b) for matrix blocks that do not satisfy the exact-division condition, 32 slave cores are assigned to each direction, and the blocks are expanded to the corresponding positions of the pre-applied space through many-core reads and write-backs, using two strategies depending on the position (a serial sketch of both strategies follows this list):
in the part where the row direction M is not exactly divided: rows whose row index is less than M are read in, the non-divided columns are zero-padded, and the row is written back to the corresponding position of the expansion matrix;
rows whose row index is not less than M are not read in; the expanded row is filled with zeros and written back to the corresponding position of the expansion matrix;
in the part where the column direction N is not exactly divided (the right-hand side): the right-hand non-divided data are read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order in units of M, and the expanded data are written back to the right-hand, non-bottom part of the expansion matrix with strided write-backs.
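The following is a minimal serial sketch of the two padding strategies. It is a simplification: it copies the whole M × N matrix into the expansion space row by row, whereas the method above keeps exactly divided blocks in the original matrix and lets 32 slave cores per direction handle only the non-divided edges; row-major layouts are assumed for illustration.
#include <string.h>

/* Serial sketch of the padding: rows 0..M-1 are read and zero-padded from
   column N to N_size; rows M..M_size-1 are not read and are filled with zeros. */
void expand_matrix(const float *orig, long M, long N,
                   float *ext, long M_size, long N_size)
{
    for (long r = 0; r < M_size; r++) {
        float *dst = ext + r * N_size;
        if (r < M) {
            memcpy(dst, orig + r * N, (size_t)N * sizeof(float));      /* read the row    */
            memset(dst + N, 0, (size_t)(N_size - N) * sizeof(float));  /* pad the columns */
        } else {
            memset(dst, 0, (size_t)N_size * sizeof(float));            /* padding row     */
        }
    }
}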
Through this expansion, the exactly divided blocks remain in the original matrix while the non-divided blocks are placed, zero-padded, in the expansion matrix. Block matrix multiplication is then performed on this basis by traversing the original matrix first and the expansion matrix afterwards; when some dimension is small, all data are written into the expansion matrix and the block traversal starts directly from the expansion matrix. For the computed results, the data written into the expansion matrix must be reverse-expanded and written back to the original matrix, as sketched below.
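A correspondingly simplified sketch of the reverse expansion, assuming the whole result was computed in a row-major M_size × N_size expansion buffer (the small-dimension case above); only the valid M × N region is copied back.
#include <string.h>

/* Copy the valid M x N region of the expansion-matrix result back into the
   original (row-major) output matrix. */
void write_back_result(const float *ext, long N_size,
                       float *orig, long M, long N)
{
    for (long r = 0; r < M; r++)
        memcpy(orig + r * N, ext + r * N_size, (size_t)N * sizeof(float));
}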
Matrix expansion widens the applicable range of the blocking algorithm, which can then block normally for different shapes, in particular matrices with a small dimension.
With this many-core parallel optimization method for the floating-point matrix multiplication operator for deep learning, the matrix multiplication operator is optimized at the code level for the computational characteristics of deep learning. Matrix multiplication in deep learning is supported well and can be accelerated with many cores regardless of the matrix dimensions, which improves the efficiency of solving such problems and thereby improves the overall operating efficiency and computational performance of deep learning applications. The method is also general and provides a certain acceleration for matrix computations in other fields; it reduces memory-access overhead, widens the applicable range of the algorithm, and supports matrices of arbitrary dimensions well.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (1)

1. A many-core parallel optimization method for a floating-point matrix multiplication operator for deep learning, characterized by comprising the following steps: expanding an input matrix and an output matrix, and computing on a slave-core array based on block matrix multiplication;
expanding the input matrix and the output matrix comprises the following steps:
S1, pre-applying for the space of the expansion matrix for the input matrix and the output matrix: for an M × N matrix with minimum block sizes blkM and blkN, the expansion dimensions M_size and N_size are determined and memory is allocated, where M_size = MAX(M × 2, blkM), N_size = MAX(N × 2, blkN);
S2, arranging the input matrix and the output matrix in the pre-applied space: matrix blocks for which M and N satisfy the exact-division condition are kept in the original matrix; for matrix blocks for which M or N does not satisfy the exact-division condition, 32 slave cores are assigned to the M direction and to the N direction respectively, two padding strategies are used for the two directions, and the input and output matrices are written to the corresponding positions of the expansion matrix through many-core reads and write-backs;
S3, for the part where the row direction M is not exactly divided: rows whose row index is less than M are read (N data per row), the non-divided columns are zero-padded to N_size data, and each row is written back to the corresponding position of the expansion matrix according to its row index; rows whose row index is not less than M are not read, a row of N_size zeros is generated in the row direction and written back to the corresponding position of the expansion matrix;
S4, for the part where the column direction N is not exactly divided: the right-hand non-divided data are first read from the original matrix into local memory with strided reads, columns N+1 to N_size are zero-padded in reverse order, and the expanded data are then written back to the corresponding position of the expansion matrix with strided write-backs according to the row and column indices of the read data;
computing on the slave-core array based on block matrix multiplication comprises the following steps:
S5, dividing the expansion matrices of the input and output matrices evenly into 8 × 8 data blocks according to the structure of the slave-core array, so as to map the data blocks onto the slave-core array;
S6, each slave core reads one data block of the input and output matrices from main memory through DMA according to its row and column coordinates in the 8 × 8 slave-core array;
S7, each slave core computes only one of the 8 × 8 output matrix blocks; based on block matrix multiplication and using inter-core row-broadcast and column-broadcast communication, the whole computation is completed in 8 steps. Taking slave core (i, j) and input matrices W and Di as an example, the specific steps are as follows:
S71, the slave core with column coordinate 0 broadcasts and shares its local W data along its row, and the slave core with row coordinate 0 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 0) and Di(0, j) data blocks, completes the computation of W(i, 0) × Di(0, j), and accumulates the result into its output matrix block;
S72, the slave core with column coordinate 1 broadcasts and shares its local W data along its row, and the slave core with row coordinate 1 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 1) and Di(1, j) data blocks, completes the computation of W(i, 1) × Di(1, j), and accumulates the result into its output matrix block;
S73, the slave core with column coordinate 2 broadcasts and shares its local W data along its row, and the slave core with row coordinate 2 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 2) and Di(2, j) data blocks, completes the computation of W(i, 2) × Di(2, j), and accumulates the result into its output matrix block;
S74, the slave core with column coordinate 3 broadcasts and shares its local W data along its row, and the slave core with row coordinate 3 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 3) and Di(3, j) data blocks, completes the computation of W(i, 3) × Di(3, j), and accumulates the result into its output matrix block;
S75, the slave core with column coordinate 4 broadcasts and shares its local W data along its row, and the slave core with row coordinate 4 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 4) and Di(4, j) data blocks, completes the computation of W(i, 4) × Di(4, j), and accumulates the result into its output matrix block;
S76, the slave core with column coordinate 5 broadcasts and shares its local W data along its row, and the slave core with row coordinate 5 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 5) and Di(5, j) data blocks, completes the computation of W(i, 5) × Di(5, j), and accumulates the result into its output matrix block;
S77, the slave core with column coordinate 6 broadcasts and shares its local W data along its row, and the slave core with row coordinate 6 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 6) and Di(6, j) data blocks, completes the computation of W(i, 6) × Di(6, j), and accumulates the result into its output matrix block;
S78, the slave core with column coordinate 7 broadcasts and shares its local W data along its row, and the slave core with row coordinate 7 broadcasts and shares its local Di data along its column; slave core (i, j) receives the W(i, 7) and Di(7, j) data blocks, completes the computation of W(i, 7) × Di(7, j), and accumulates the result into its output matrix block.
CN201910975075.4A 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning Withdrawn CN112732630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910975075.4A CN112732630A (en) 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910975075.4A CN112732630A (en) 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning

Publications (1)

Publication Number Publication Date
CN112732630A true CN112732630A (en) 2021-04-30

Family

ID=75588627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910975075.4A Withdrawn CN112732630A (en) 2019-10-14 2019-10-14 Floating-point matrix multiplier many-core parallel optimization method for deep learning

Country Status (1)

Country Link
CN (1) CN112732630A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114675829A (en) * 2022-01-30 2022-06-28 华东师范大学 Performance optimization method for self-adaptive elimination of redundant computation and communication in distributed matrix computing system
CN114675829B (en) * 2022-01-30 2023-07-14 华东师范大学 Performance optimization method for adaptively eliminating redundant calculation and communication in distributed matrix computing system
WO2024027039A1 (en) * 2022-08-03 2024-02-08 北京登临科技有限公司 Data processing method and apparatus, and device and readable storage medium
CN117472448B (en) * 2023-12-28 2024-03-26 山东省计算中心(国家超级计算济南中心) Parallel acceleration method, device and medium for secondary core cluster of Shenwei many-core processor

Similar Documents

Publication Publication Date Title
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN113424201A (en) Neural network processor
CN112732630A (en) Floating-point matrix multiplier many-core parallel optimization method for deep learning
JP3639323B2 (en) Simultaneous linear equation calculation processing method and computer using memory distributed parallel computer
KR20230113851A (en) Spatial locality transform of matrices
CN107992943A (en) Addressed for convolutional neural networks
CN109948774A (en) Neural network accelerator and its implementation based on network layer binding operation
CN112748956A (en) High throughput matrix processor supporting simultaneous processing of multiple matrices
JP3675537B2 (en) Memory distributed parallel computer performing fast Fourier transform and method thereof
CN103336758A (en) Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
CN110399591B (en) Data processing method and device based on convolutional neural network
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN102968503A (en) Data processing method for database system, and database system
CN116303111A (en) Hardware double buffering using special purpose computing units
US10872038B1 (en) Memory organization for matrix processing
CN114281755B (en) Vector processor-oriented semi-precision vectorization convolution method and system
CN114565501A (en) Data loading method and device for convolution operation
CN109446478B (en) Complex covariance matrix calculation system based on iteration and reconfigurable mode
Rhe et al. VW-SDK: Efficient convolutional weight mapping using variable windows for processing-in-memory architectures
CN113312285B (en) Convolutional neural network accelerator and working method thereof
CN114330656A (en) Convolution operation hardware accelerator and data processing method
CN116881618B (en) General matrix multiplication calculation optimization method, device and processor
CN106484532A (en) GPGPU parallel calculating method towards SPH fluid simulation
CN111610963A (en) Chip structure and multiply-add calculation engine thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210430

WW01 Invention patent application withdrawn after publication