CN108845795B - GPDSP-based dense matrix multiplication vectorization assembly code generation method - Google Patents
- Publication number
- CN108845795B CN201810530676.XA
- Authority
- CN
- China
- Prior art keywords
- module
- parameter
- dma
- assembly code
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/37—Compiler construction; Parser generation
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
The invention discloses a GPDSP-based dense matrix multiplication vectorization assembly code generation method, which comprises the following steps: S1, constructing a plurality of core templates for realizing different tasks; S2, constructing a framework template for dense matrix multiplication, in which each core module is used to realize the dense matrix multiplication; and S3, converting each core module in the framework template into assembly code using a pre-constructed assembly code generation module Gen_GEMM, finally generating the required dense matrix multiplication assembly code. The method has the advantages of a simple realization principle, simple and convenient operation, flexible use, automatic generation of dense matrix multiplication vectorization assembly code, and high generation efficiency and performance.
Description
Technical field
The present invention relates to the technical field of GPDSP (General-Purpose Digital Signal Processor), and more particularly to a GPDSP-based dense matrix multiplication vectorization assembly code generation method.
Background art
The Basic Linear Algebra Subroutines (BLAS) library is one of the most commonly used core mathematical algorithm libraries in all kinds of scientific computing, and industry vendors have all provided highly optimized BLAS implementations for their respective processor platforms, such as IBM ESSL, Intel MKL, and AMD ACML. Among them, dense matrix multiplication (General Matrix-Matrix Multiplication, GEMM) is the core algorithm of the BLAS library. GEMM is a typical computation-intensive and memory-access-intensive application that places very high demands on the processor's computational capability, memory bandwidth, and latency; GEMM calculation generally accounts for 90% or more of the operations of the High Performance Linpack (HPL) benchmark program. Therefore, studying GEMM optimization methods for a processor's architectural characteristics has very important reference value for evaluating the processor's computational efficiency, exploiting the processor's computational advantages, and increasing the running speed of application programs.
Although a compiler is in theory the optimal solution for performance optimization, because source code need not be rewritten, the technological progress of compilers cannot keep up with the development speed of hardware. Even for simple computational problems, compilers often generate inefficient code; for example, dense matrix multiplication code generated by a compiler is several times slower than hand-optimized code. High-performance library functions are therefore usually optimized meticulously in hand-written assembly. However, the update and release of new processor platforms mean that hand-written code needs to be re-implemented and re-optimized, which still requires a large amount of work, is complex, and is costly.
GPDSP is a heterogeneous multi-core processor comprising CPU core units and DSP core units. The CPU core units are mainly responsible for general transaction management, including storage management, file control, process scheduling, and interrupt management, and provide complete support for a general-purpose operating system. The DSP core units contain several 64-bit vector processing arrays with powerful computing ability for supporting the solution of compute-intensive tasks; a DSP core includes a complicated multi-level storage organization comprising scalar and vector register files, scalar L1D, vector array storage, on-chip shared storage, and external DDR storage. This complicated architecture brings huge challenges to the generation of efficient code: library function assembly code generated by a compiler can hardly realize efficient data access and transfer among the storage levels, and traditional Cache-oriented blocked matrix multiplication methods are also unsuitable for GPDSP's non-Cache vector array storage access mode and the parallel-vector-processing architectural features of the vector processing arrays, making it difficult to exploit GPDSP's vector computing advantages.
The high-performance library functions called by application systems with high real-time requirements are currently almost all meticulously optimized in hand-written assembly. How to quickly generate efficient library function assembly code for GPDSP's complicated architectural features is a huge challenge currently faced, and among these, the generation of dense matrix multiplication vectorization assembly code for the GPDSP architecture is a problem to be solved urgently.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, the present invention provides a GPDSP-based dense matrix multiplication vectorization assembly code generation method that has a simple realization principle, is easy to operate and flexible to use, can realize automatic code generation, and has high generation efficiency and performance.
In order to solve the above technical problems, the technical solution proposed by the present invention is as follows:

A GPDSP-based dense matrix multiplication vectorization assembly code generation method, the steps of which include:

S1. Construct multiple core templates for realizing different tasks, including a GEMM_kernel module for realizing sub-block matrix multiplication, a DMA_Translate module for realizing data transmission, a DMA_POLL module for detecting the register flag bit that indicates whether a DMA data transfer has finished, and a LOOP module for executing a cyclic process; each template includes a parameter list of its required parameters.

S2. Construct the framework template of dense matrix multiplication, in which the GEMM_kernel module, DMA_Translate module, DMA_POLL module, and LOOP module are respectively used to realize the dense matrix multiplication.

S3. Convert each core module in the framework template into assembly code using the pre-constructed assembly code generation module Gen_GEMM, finally generating the required dense matrix multiplication assembly code.
As a further improvement of the present invention: the parameter list of the GEMM_kernel module includes 4 input parameters, where the 1st parameter is the return address after the call ends, the 2nd parameter is the A sub-block matrix data address, the 3rd parameter is the B sub-block matrix data address, and the 4th parameter is the C sub-block matrix data address. When the GEMM_kernel module is constructed, assembly code is set up with the above parameters as input, realizing the sub-block matrix multiplication task corresponding to the dense matrix multiplication, and jumping to the program address passed in the 1st parameter after the task is completed.
As a further improvement of the present invention: the parameter list of the DMA_Translate module includes 11 input parameters, where the 1st parameter is the return address after the call ends, the 2nd parameter is the DMA logical channel number, the 3rd parameter is the logical channel priority, the 4th parameter is transmission mode control parameter word 1, the 5th parameter is transmission mode control parameter word 2, the 6th parameter is the source address, the 7th parameter is the source count, the 8th parameter is the destination address, the 9th parameter is the destination count, the 10th parameter is the source/destination index, and the 11th parameter is the block index. When the DMA_Translate module is constructed, assembly code is set up with the above parameters as input, realizing the data transfer task from the source address to the destination address, and jumping to the program address passed in the 1st parameter after the task is completed.
As a further improvement of the present invention: the parameter list of the DMA_POLL module includes 2 input parameters, where the 1st parameter is the return address after the call ends and the 2nd parameter is the DMA logical channel number. When the DMA_POLL module is constructed, assembly code is set up with the above parameters as input, realizing the task of detecting the register flag bit that indicates whether the data transfer on the DMA logical channel given by the 2nd parameter has finished, and jumping to the program address passed in the 1st parameter after the task is completed.
As a further improvement of the present invention: the parameter list of the LOOP module includes 3 input parameters, where the 1st parameter is the register used for the cycle count, the 2nd parameter is the cycle count initial value, and the 3rd parameter is the count change value of each cycle.
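The four core templates and their parameter lists described above can be summarized as plain data. The following Python sketch is an illustrative assumption — the dataclass representation and the English field names are not from the patent; it only records each template's name and ordered parameter list:

```python
# Hypothetical data model for the four core templates; parameter names are
# English paraphrases of the parameter descriptions in the text.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CoreTemplate:
    name: str                 # module name, e.g. "GEMM_kernel"
    params: Tuple[str, ...]   # ordered input parameter list

GEMM_KERNEL = CoreTemplate("GEMM_kernel", (
    "return_addr", "A_subblock_addr", "B_subblock_addr", "C_subblock_addr"))

DMA_TRANSLATE = CoreTemplate("DMA_Translate", (
    "return_addr", "dma_channel", "channel_priority",
    "mode_ctrl_word1", "mode_ctrl_word2",
    "src_addr", "src_count", "dst_addr", "dst_count",
    "src_dst_index", "block_index"))

DMA_POLL = CoreTemplate("DMA_POLL", ("return_addr", "dma_channel"))

LOOP = CoreTemplate("LOOP", ("counter_reg", "initial_count", "count_step"))

# Parameter counts match the text: 4, 11, 2 and 3 respectively.
assert [len(t.params) for t in
        (GEMM_KERNEL, DMA_TRANSLATE, DMA_POLL, LOOP)] == [4, 11, 2, 3]
```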
As a further improvement of the present invention, when the dense matrix multiplication computed on the GPDSP architecture is C=C-A*B, where A is an M×K matrix, B is a K×N matrix, C is an M×N matrix, the block sizes corresponding to the three dimensions M, K, N of the matrices are denoted MB, KB, NB respectively, and mi=M/MB, ki=K/KB, ni=N/NB, the framework template of dense matrix multiplication is constructed in step S2 specifically by the following steps:

Step 1: Open the K-dimension blocked loop calculation; set counter register R0 with counter initial value ki; the counter decreases by 1 in each cycle until the counter value is 0.

Step 2: Open the M-dimension blocked loop calculation; set counter register R1 with counter initial value mi; the counter decreases by 1 in each cycle until the counter value is 0. In the loop calculation, the LOOP module is used to execute the cyclic process, the DMA_Translate module is used to transfer the A, B, and C matrices to the data buffers, the DMA_POLL module is used to wait for the data transfers to finish, and the GEMM_kernel module is used to perform the calculation on the A, B, C matrix sub-blocks.

Step 3: Judge whether counter R1 is 0; if not, return to step 2; if so, proceed to step 4.

Step 4: Judge whether counter R0 is 0; if not, return to step 1; if so, the current calculation is completed.
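The four-step loop structure above amounts to a three-level blocked computation of C=C-A*B. The following Python sketch models only the blocking and loop nesting (K-dimension outermost, then M, then N); DMA transfers, buffers, and the actual GPDSP kernel are omitted, and the function name and concrete sizes are illustrative assumptions:

```python
# Pure-Python model of the blocked computation C = C - A*B implemented by
# the framework template. Only the loop structure is modelled here.
def blocked_gemm(A, B, C, M, K, N, MB, KB, NB):
    for k0 in range(0, K, KB):          # K-dimension blocking (outer loop, counter R0)
        for m0 in range(0, M, MB):      # M-dimension blocking (counter R1)
            for n0 in range(0, N, NB):  # N-dimension blocking (counter R2)
                # stand-in for GEMM_kernel on one (MBxKB)*(KBxNB) sub-block pair
                for i in range(m0, m0 + MB):
                    for j in range(n0, n0 + NB):
                        for k in range(k0, k0 + KB):
                            C[i][j] -= A[i][k] * B[k][j]
    return C

# Check against the unblocked definition on small illustrative sizes.
M = K = N = 4
MB = KB = NB = 2
A = [[i + j for j in range(K)] for i in range(M)]
B = [[i * j + 1 for j in range(N)] for i in range(K)]
C = [[1.0] * N for _ in range(M)]
ref = [[C[i][j] - sum(A[i][k] * B[k][j] for k in range(K))
        for j in range(N)] for i in range(M)]
out = blocked_gemm(A, B, [row[:] for row in C], M, K, N, MB, KB, NB)
assert out == ref
```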
As a further improvement of the present invention, the specific steps of step 2 are as follows:

Step 2.1: Use the DMA_Translate module to transfer an MBxKB sub-block of the A matrix to the data buffer Abuf of the scalar L1D.

Step 2.2: Use the DMA_POLL module to wait for the data transfer to buffer Abuf to finish.

Step 2.3: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf0 and Cbuf0 of the vector array storage, respectively.

Step 2.4: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf1 and Cbuf1 of the vector array storage, respectively.

Step 2.5: Open the N-dimension blocked loop calculation; set counter register R2 with counter initial value ni-2; the counter decreases by 2 in each cycle until the counter value is 0.

Step 2.6: Use the DMA_POLL module to wait for the data transfers of buffers Bbuf0, Cbuf0, and Out0 to finish.

Step 2.7: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in data buffers Abuf, Bbuf0, and Cbuf0.

Step 2.8: Use the DMA_Translate module to transfer the above calculation result to external memory area Out0.

Step 2.9: Use the DMA_POLL module to wait for the data transfers of buffers Bbuf1, Cbuf1, and Out1 to finish.

Step 2.10: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in data buffers Abuf, Bbuf1, and Cbuf1.

Step 2.11: Use the DMA_Translate module to transfer the above calculation result to external memory area Out1.
As a further improvement of the present invention, the specific steps of step 2.5 are as follows:

Step 2.5.1: Use the DMA_POLL module to wait for the data transfers of buffers Bbuf0, Cbuf0, and Out0 to finish.

Step 2.5.2: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in data buffers Abuf, Bbuf0, and Cbuf0.

Step 2.5.3: Use the DMA_Translate module to transfer the above calculation result to external memory area Out0.

Step 2.5.4: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf0 and Cbuf0 of the vector array storage, respectively.

Step 2.5.5: Use the DMA_POLL module to wait for the data transfers of buffers Bbuf1, Cbuf1, and Out1 to finish.

Step 2.5.6: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in data buffers Abuf, Bbuf1, and Cbuf1.

Step 2.5.7: Use the DMA_Translate module to transfer the above calculation result to external memory area Out1.

Step 2.5.8: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf1 and Cbuf1 of the vector array storage, respectively.

Step 2.5.9: Judge whether counter R2 is 0; if not, return to step 2.5.1; if so, proceed to step 2.6.
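Steps 2.5.1 through 2.5.9 describe a ping-pong (double-buffered) pipeline over the Bbuf0/Cbuf0 and Bbuf1/Cbuf1 buffer pairs: while the kernel computes on one buffer pair, DMA refills the other. The following Python sketch models only the buffer rotation, with DMA modelled as an immediate operation; the function and all names are illustrative assumptions:

```python
# Model of the double-buffered N-dimension loop: buffers 0 and 1 are
# pre-filled (steps 2.3/2.4), the loop body computes on one buffer while
# refilling it with the next tile, and steps 2.6-2.11 drain the last two.
def n_loop(tiles):
    ni = len(tiles)                  # ni is even and >= 2, per the text's assumption
    buf0, buf1 = tiles[0], tiles[1]  # steps 2.3/2.4: pre-fill both buffer pairs
    out, nxt = [], 2
    for _ in range((ni - 2) // 2):   # counter R2: initial value ni-2, minus 2 per pass
        out.append(compute(buf0)); buf0 = tiles[nxt]; nxt += 1  # steps 2.5.2-2.5.4
        out.append(compute(buf1)); buf1 = tiles[nxt]; nxt += 1  # steps 2.5.6-2.5.8
    out.append(compute(buf0))        # steps 2.6-2.8: last buffer-0 tile
    out.append(compute(buf1))        # steps 2.9-2.11: last buffer-1 tile
    return out

compute = lambda t: t * 10           # stand-in for GEMM_kernel on one tile
assert n_loop([1, 2, 3, 4, 5, 6]) == [10, 20, 30, 40, 50, 60]
assert n_loop([1, 2]) == [10, 20]    # smallest case: loop body runs zero times
```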
As a further improvement of the present invention: the assembly code generation module Gen_GEMM generates the assembly code of a target core template according to the type of the target core template and its corresponding parameter list.
Compared with the prior art, the advantages of the present invention are as follows:

1. Based on the architectural features of GPDSP, the present invention constructs multiple core modules for realizing different tasks and, on the basis of these core modules, constructs a framework template of dense matrix multiplication that contains the representations and corresponding parameter lists of the core modules, with each core module realizing its corresponding task. Based on this template, the pre-constructed assembly code generation module Gen_GEMM then performs assembly code conversion: according to the core module representations and corresponding input parameter lists contained in the framework template, Gen_GEMM automatically generates the corresponding assembly code, finally producing the required vectorization assembly code and thereby realizing the automatic generation of GPDSP-based dense matrix multiplication vectorization assembly code.

2. The realization principle of the present invention is simple and its operation convenient. It can quickly obtain highly optimized high-performance dense matrix multiplication library function assembly code suitable for the GPDSP architecture, realizing core assembly code optimizations such as software pipelining, instruction-level parallelism, and vectorization as parameterized templates without requiring attention to the underlying hardware implementation. It also facilitates the maintenance of library function assembly code: when an update is needed, only the core module representations and corresponding parameter lists contained in the template need to be updated, which avoids the library function assembly code all needing re-vectorization and re-optimization when the number of vector computing cores involved in subsequent IP versions must adapt to different applications.

3. The present invention can quickly obtain highly optimized high-performance dense matrix multiplication library function assembly code and give full play to the powerful vectorization computing capability of the DSP cores, while shielding the underlying hardware implementation details from the user, greatly alleviating the programmer's burden of mastering the underlying hardware and maintaining library function assembly code. When a kernel update requires optimizing a core module function or extending a new core module function, only the template needs to be updated and regeneration automatically yields newly optimized library function assembly code, thereby avoiding problems such as the library function assembly code all needing re-vectorization and re-optimization when the number of vector computing cores involved in subsequent IP versions must adapt to different applications.
Description of the drawings
Fig. 1 is a schematic diagram of the implementation flow of the GPDSP-based dense matrix multiplication vectorization assembly code generation method of the present embodiment.

Fig. 2 is a schematic diagram of the realization principle of dense matrix multiplication vectorization assembly code generation in the present embodiment.

Fig. 3 is a schematic diagram of the implementation flow by which the dense matrix multiplication framework template realizes the dense matrix multiplication calculation in the present embodiment.

Fig. 4 is a schematic diagram of the realization principle of generating assembly code for the LOOP module in a concrete application embodiment of the present invention.

Fig. 5 is a schematic diagram of the realization principle of generating assembly code for the DMA_Translate core module in a concrete application embodiment of the present invention.

Fig. 6 is a schematic diagram of the realization principle of generating assembly code for the DMA_POLL core module in a concrete application embodiment of the present invention.

Fig. 7 is a schematic diagram of the realization principle of generating assembly code for the GEMM_kernel core module in a concrete application embodiment of the present invention.
Specific embodiment
The invention will be further described below in conjunction with the drawings of the specification and specific preferred embodiments, but the scope of the invention is not thereby limited.
As shown in Figs. 1 and 2, the GPDSP-based dense matrix multiplication vectorization assembly code generation method of the present embodiment comprises the steps of:

S1. Construct multiple core templates for realizing different tasks, including a GEMM_kernel module for realizing sub-block matrix multiplication, a DMA_Translate module for realizing data transmission, a DMA_POLL module for detecting the register flag bit that indicates whether a DMA data transfer has finished, and a LOOP module for executing a cyclic process; each template includes a parameter list of its required parameters.

S2. Construct the framework template of dense matrix multiplication, in which the GEMM_kernel module, DMA_Translate module, DMA_POLL module, and LOOP module are respectively used to realize the dense matrix multiplication.

S3. Convert each core module in the framework template into assembly code using the pre-constructed assembly code generation module Gen_GEMM, finally generating the required dense matrix multiplication assembly code.
Based on the architectural features of GPDSP, the present embodiment constructs multiple core modules for realizing different tasks and, on the basis of these core modules, constructs a framework template of dense matrix multiplication containing the representations and corresponding parameter lists of the core modules, with each core module realizing its corresponding task. Based on this template, the pre-constructed assembly code generation module Gen_GEMM then performs assembly code conversion: according to the core module representations and corresponding input parameter lists contained in the framework template, Gen_GEMM automatically generates the corresponding assembly code, finally producing the required vectorization assembly code and realizing the automatic generation of GPDSP-based dense matrix multiplication vectorization assembly code.

The above code generation method of the present embodiment has a simple principle and convenient operation, is suitable for the GPDSP architecture, and quickly obtains highly optimized high-performance dense matrix multiplication library function assembly code without requiring attention to the underlying hardware implementation. It also facilitates the maintenance of library function assembly code: when an update is needed, only the core module representations and corresponding parameter lists contained in the template need to be updated, which avoids problems such as the library function assembly code all needing re-vectorization and re-optimization when the number of vector computing cores involved in subsequent IP versions must adapt to different applications.
The dense matrix multiplication based on the GPDSP architecture to be realized in the present embodiment is C=C-A*B, where A is an M×K matrix, B is a K×N matrix, and C is an M×N matrix. The block sizes corresponding to the three dimensions M, K, N of the matrices are denoted MB, KB, NB respectively; let mi=M/MB, ki=K/KB, ni=N/NB, and assume that mi and ki are integers and ni is even.
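As a small worked example of these blocking counts and the stated assumptions (the concrete sizes below are illustrative, not from the patent):

```python
# Worked example of mi = M/MB, ki = K/KB, ni = N/NB. The sizes are
# hypothetical; the patent only requires mi, ki integer and ni even.
M, K, N = 1024, 512, 768
MB, KB, NB = 128, 128, 96

mi, ki, ni = M // MB, K // KB, N // NB
assert M % MB == 0 and K % KB == 0 and N % NB == 0  # mi, ki, ni exact
assert ni % 2 == 0          # ni even: required by the two-buffer N-dimension loop
assert (mi, ki, ni) == (8, 4, 8)
```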
To realize the above dense matrix multiplication C=C-A*B, the present embodiment first constructs the sub-block matrix multiplication core module, represented by the GEMM_kernel module. The parameter list corresponding to the GEMM_kernel module includes 4 input parameters, where the 1st parameter is the return address after the call ends, the 2nd parameter is the A sub-block matrix data address, the 3rd parameter is the B sub-block matrix data address, and the 4th parameter is the C sub-block matrix data address. When the GEMM_kernel module is constructed, assembly code is set up with the above parameters as input, realizing the sub-block matrix multiplication task corresponding to the dense matrix multiplication C=C-A*B and jumping to the program address passed in the 1st parameter after the task is completed. The realized GEMM_kernel core module assembly code is saved as an independent file GEMM_kernel.s.
The data transmission core module is constructed, represented by the DMA_Translate module. The parameter list corresponding to the DMA_Translate core module includes 11 input parameters, where the 1st parameter is the return address after the call ends, the 2nd parameter is the DMA logical channel number, the 3rd parameter is the logical channel priority, the 4th parameter is transmission mode control parameter word 1, the 5th parameter is transmission mode control parameter word 2, the 6th parameter is the source address, the 7th parameter is the source count, the 8th parameter is the destination address, the 9th parameter is the destination count, the 10th parameter is the source/destination index, and the 11th parameter is the block index. When the DMA_Translate module is constructed, assembly code is set up with the above parameters as input, realizing the data transfer task from the source address to the destination address by DMA and jumping to the program address passed in the 1st parameter after the task is completed. The realized DMA_Translate core module assembly code is saved as an independent file DMA_Translate.s.
The data-transfer flag bit detection core module is constructed, represented by the DMA_POLL module. The parameter list corresponding to the DMA_POLL module includes 2 input parameters, where the 1st parameter is the return address after the call ends and the 2nd parameter is the DMA logical channel number. When the DMA_POLL module is constructed, assembly code is set up with the above parameters as input, realizing the task of detecting the register flag bit that indicates whether the data transfer on the DMA logical channel given by the 2nd parameter has finished, and jumping to the program address passed in the 1st parameter after the task is completed. The realized DMA_POLL core module assembly code is saved as an independent file DMA_POLL.s.
The loop module is constructed, represented by a paired LOOP and END, where LOOP indicates the start of a loop and the paired END indicates its end; this is the LOOP module. The parameter list corresponding to the LOOP module includes 3 input parameters, where the 1st parameter is the register used for the cycle count, the 2nd parameter is the cycle count initial value, and the 3rd parameter is the count change value of each cycle.
After the above core modules are constructed, the dense matrix multiplication framework template based on the GPDSP architecture can be jointly constructed from them. According to the multi-level storage architecture features of GPDSP, step S2 of the present embodiment combines the core modules into a blocked dense matrix multiplication computation framework template according to the execution path of dense matrix multiplication; the template includes the function representations and parameter lists of the core modules, so as to realize efficient blocked dense matrix multiplication calculation. Step S3 then uses the assembly code generation module Gen_GEMM to automatically generate the dense matrix multiplication assembly code according to the core module representations contained in the template, their input parameter lists, and the corresponding core module assembly code files, completing the automatic generation of the vector code of dense matrix multiplication.
As shown in Fig. 3, in step S2 of the present embodiment, the framework template of dense matrix multiplication is specifically constructed by the following steps:

Step 1: Open the K-dimension blocked loop calculation; set counter register R0 with counter initial value ki; the counter decreases by 1 in each cycle until the counter value is 0.

Step 2: Open the M-dimension blocked loop calculation; set counter register R1 with counter initial value mi; the counter decreases by 1 in each cycle until the counter value is 0. In the loop calculation, the LOOP module is used to execute the cyclic process, the DMA_Translate module is used to transfer the A, B, and C matrices to the data buffers, the DMA_POLL module is used to wait for the data transfers to finish, and the GEMM_kernel module is used to perform the calculation on the A, B, C matrix sub-blocks; here the DMA_Translate module represents the data transmission core module, the DMA_POLL module represents the data-transfer flag bit detection core module, and the GEMM_kernel module represents the sub-block matrix multiplication core module.

Step 3: Judge whether counter R1 is 0; if not, return to step 2; if so, proceed to step 4.

Step 4: Judge whether counter R0 is 0; if not, return to step 1; if so, the current calculation is completed.
In the present embodiment, the specific steps by which step 2 realizes the M-dimension blocked loop calculation are as follows:

Step 2.1: Use the DMA_Translate module to transfer an MBxKB sub-block of the A matrix to the data buffer Abuf of the scalar L1D.

Step 2.2: Use the DMA_POLL module to wait for the data transfer to Abuf to finish.

Step 2.3: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf0 and Cbuf0 of the vector array storage, respectively.

Step 2.4: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf1 and Cbuf1 of the vector array storage, respectively.

Step 2.5: Open the N-dimension blocked loop calculation; set counter register R2 with counter initial value ni-2; the counter decreases by 2 in each cycle until the counter value is 0.

Step 2.6: Use the DMA_POLL module to wait for the Bbuf0, Cbuf0, and Out0 data transfers to finish.

Step 2.7: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in Abuf, Bbuf0, and Cbuf0.

Step 2.8: Use the DMA_Translate module to transfer the above calculation result to external memory area Out0.

Step 2.9: Use the DMA_POLL module to wait for the Bbuf1, Cbuf1, and Out1 data transfers to finish.

Step 2.10: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in Abuf, Bbuf1, and Cbuf1.

Step 2.11: Use the DMA_Translate module to transfer the above calculation result to external memory area Out1.
In the present embodiment, the specific steps by which step 2.5 executes the N-dimension blocked loop calculation are as follows:

Step 2.5.1: Use the DMA_POLL module to wait for the Bbuf0, Cbuf0, and Out0 data transfers to finish.

Step 2.5.2: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in Abuf, Bbuf0, and Cbuf0.

Step 2.5.3: Use the DMA_Translate module to transfer the above calculation result to external memory area Out0.

Step 2.5.4: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf0 and Cbuf0 of the vector array storage, respectively.

Step 2.5.5: Use the DMA_POLL module to wait for the Bbuf1, Cbuf1, and Out1 data transfers to finish.

Step 2.5.6: Use the GEMM_kernel module to calculate the A, B, C matrix sub-blocks in Abuf, Bbuf1, and Cbuf1.

Step 2.5.7: Use the DMA_Translate module to transfer the above calculation result to external memory area Out1.

Step 2.5.8: Use the DMA_Translate module to transfer a KBxNB sub-block of the B matrix and an MBxNB sub-block of the C matrix to the data buffers Bbuf1 and Cbuf1 of the vector array storage, respectively.

Step 2.5.9: Judge whether counter R2 is 0; if not, return to step 2.5.1; if so, proceed to step 2.6.
Through the above framework template of dense matrix multiplication, efficient blocked dense matrix multiplication calculation can be realized in combination with the architectural features of GPDSP: different tasks are realized by the constructed core modules used in the template, and subsequently, combined with the assembly code generation module Gen_GEMM, highly optimized high-performance dense matrix multiplication library function assembly code can be obtained quickly without attention to the underlying hardware implementation details. When optimization or extension updates are needed, only the template needs to be updated to automatically obtain the updated optimized library function assembly code.

In the present embodiment, the assembly code generation module Gen_GEMM generates the assembly code of a target core template specifically according to the type of the target core template and its corresponding parameter list.
As shown in Fig. 4, in a specific application embodiment of the invention, the assembly code generation module Gen_GEMM generates assembly code for the LOOP module as follows.
The LOOP module and its input parameter list are expressed in the template as:
LOOP(Ri,count,len)
……
END
Gen_GEMM then generates the following assembly code for the above LOOP module:
LOOP_Ri:
SMOVI count, Ri
……
[Ri] SBR LOOP_Ri
[Ri] SSUB len, Ri, Ri
SNOP 4
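A minimal Python sketch of how Gen_GEMM might expand the LOOP template into the assembly above. The function and its argument handling are illustrative assumptions, not the actual generator; the sketch also places the counter initialization before the LOOP_Ri label, on the assumption that the backward branch must not re-run the SMOVI each iteration.

```python
def gen_loop(ri, count, length, body_lines):
    """Expand LOOP(Ri, count, len) ... END into assembly text.
    Mnemonics (SMOVI, SBR, SSUB, SNOP) follow the listing above."""
    lines = [f"    SMOVI {count}, {ri}",               # initialize loop counter
             f"LOOP_{ri}:"]                            # backward-branch target
    lines += [f"    {b}" for b in body_lines]          # loop body (the '……' part)
    lines += [f"    [{ri}] SBR LOOP_{ri}",             # predicated branch back while Ri != 0
              f"    [{ri}] SSUB {length}, {ri}, {ri}", # decrement counter by 'len'
              "    SNOP 4"]                            # fill the branch delay slots
    return "\n".join(lines)
```

For example, `gen_loop("R2", "ni-2", 2, [...])` would emit the skeleton of the N-dimension loop of Step 2.5.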
As shown in Fig. 5, in a specific application embodiment of the invention, the assembly code generation module Gen_GEMM generates assembly code for the DMA_Translate core module as follows.
The DMA_Translate module and its input parameter list are expressed in the template as:
DMA_Translate(para1, para2, para3, para4, para5, para6, para7, para8, para9, para10, para11)
Gen_GEMM then generates the following assembly code for the above DMA_Translate module:
SBR DMA_Translate
SMOVI para1, R63
| SMOVI.M1 para2, R62
| SMOVI.M2 para3, R61
SMOVI para4, R60
| SMOVI.M1 para5, R59
| SMOVI.M2 para6, R58
SMOVI para7, R57
| SMOVI.M1 para8, R56
| SMOVI.M2 para9, R55
SMOVI para10,R54
SMOVI.M1 para11,R53
If this is the first occurrence of a DMA_Translate expression in the template, the code of the DMA_Translate.s file is inserted at the tail of the assembly file to be generated.
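The parameter-to-register mapping of the listing above can be sketched as follows. This is an inferred, illustrative reconstruction: registers descend from R63 to R53 (which is also why para7 is assumed to land in R57), the `.M1`/`.M2` suffixes and the `|` parallel-issue bars follow the listing, and the formatting of the final, partial packet is approximated.

```python
def gen_dma_translate(params):
    """Emit the call site for DMA_Translate(para1..para11): a branch to
    the routine followed by moves of the 11 parameters into R63..R53."""
    assert len(params) == 11
    lines = ["SBR DMA_Translate"]              # branch to the DMA_Translate routine
    units = ["", ".M1", ".M2"]                 # three moves per execute packet
    for i, p in enumerate(params):
        reg = f"R{63 - i}"                     # R63 down to R53
        bar = "| " if i % 3 else ""            # '|' marks parallel issue within a packet
        lines.append(f"{bar}SMOVI{units[i % 3]} {p}, {reg}")
    return lines
```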
As shown in Fig. 6, in a specific embodiment of the invention, the assembly code generation module Gen_GEMM generates assembly code for the DMA_POLL core module as follows.
The DMA_POLL module and its input parameter list are expressed in the template as:
DMA_POLL(para1, para2)
Gen_GEMM then generates the following assembly code for the above DMA_POLL module:
SBR DMA_POLL
SMOVI para1, R63
SMOVI para2, R62
SNOP 4
If this is the first occurrence of a DMA_POLL expression in the template, the code of the DMA_POLL.s file is inserted at the tail of the assembly file to be generated.
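Behaviorally, the DMA_POLL routine busy-waits on the completion flag bit of the given DMA logical channel. The following Python model is illustrative only: `read_flags` is a stand-in for reading the hardware DMA flag register, which the real module polls in assembly.

```python
def dma_poll(read_flags, channel):
    """Spin until the completion flag bit for `channel` reads as set.
    Returns the number of polls taken before completion (for illustration)."""
    spins = 0
    while not (read_flags() >> channel) & 1:   # test the channel's completion bit
        spins += 1
    return spins
```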
As shown in Fig. 7, in a specific application embodiment of the invention, the assembly code generation module Gen_GEMM generates assembly code for the GEMM_kernel core module as follows.
The GEMM_kernel module and its input parameter list are expressed in the template as:
GEMM_kernel(para1, para2, para3, para4)
Gen_GEMM then generates the following assembly code for the above GEMM_kernel module:
SBR GEMM_kernel
SMOVI para1, R63
SMOVI para2, R62
SMOVI para3, R61
SMOVI para4, R60
SNOP 2
If this is the first occurrence of a GEMM_kernel expression in the template, the code of the GEMM_kernel.s file is inserted at the tail of the assembly file to be generated.
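The "insert the module's .s file once" rule shared by DMA_Translate, DMA_POLL and GEMM_kernel can be sketched as follows. The function name and the `module_sources` mapping are illustrative; the real Gen_GEMM also emits the per-call parameter moves, which this sketch omits.

```python
def expand_template(template_calls, module_sources):
    """Emit a call site for each module occurrence; append a module's
    .s source to the output tail only on its first occurrence."""
    body, tail, seen = [], [], set()
    for name, _args in template_calls:
        body.append(f"SBR {name}")        # call site: branch to the module routine
        if name not in seen:              # first occurrence in the template?
            seen.add(name)
            tail.append(module_sources[name])  # insert the .s file at the tail
    return "\n".join(body + tail)
```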
The above are merely preferred embodiments of the present invention and are not intended to limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any simple modification, equivalent substitution or improvement made to the above embodiments according to the technical spirit of the invention, without departing from the content of the technical solution of the invention, shall fall within the scope of protection of the technical solution of the invention.
Claims (8)
1. A GPDSP-based dense matrix multiplication vectorization assembly code generation method, characterized in that the steps comprise:
S1. constructing multiple core templates for realizing different tasks, including a GEMM_kernel module for realizing sub-block matrix multiplication, a DMA_Translate module for realizing data transfer, a DMA_POLL module for detecting whether the corresponding register flag bit indicates that a DMA data move has finished, and a LOOP module for executing a loop process, each template including a parameter list of the parameters it requires;
S2. constructing a framework template for dense matrix multiplication, the framework template using the GEMM_kernel module, the DMA_Translate module, the DMA_POLL module and the LOOP module respectively to realize dense matrix multiplication;
S3. converting each core module in the framework template into assembly code using a pre-constructed assembly code generation module Gen_GEMM, to ultimately generate the required dense matrix multiplication assembly code;
wherein, when the dense matrix multiplication operation computed on the GPDSP architecture is C = C - A*B, where A is a matrix of order M×K, B is a matrix of order K×N and C is a matrix of order M×N, the block sizes corresponding to the three matrix dimensions M, K, N are denoted MB, KB, NB respectively, and mi = M/MB, ki = K/KB, ni = N/NB, the framework template for dense matrix multiplication is constructed in said step S2 as follows:
Step 1: open the K-dimension blocked loop computation; set counter register R0 with counter initial value ki; the counter decreases by 1 each iteration until its value is 0;
Step 2: open the M-dimension blocked loop computation; set counter register R1 with counter initial value mi; the counter decreases by 1 each iteration until its value is 0; within the loop computation, the LOOP module is used to execute the loop process, the DMA_Translate module is used to transfer the A, B, C matrices to the data buffers, the DMA_POLL module is used to wait for the data transfers to finish, and the GEMM_kernel module is used to compute on the A, B, C matrix sub-blocks;
Step 3: judge whether counter R1 is 0; if not, return to Step 2; if so, proceed to Step 4;
Step 4: judge whether counter R0 is 0; if not, return to Step 1; if so, the current computation is complete.
2. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 1, characterized in that: the parameter list of the GEMM_kernel module includes 4 input parameters, wherein the 1st parameter is the return address after the call ends, the 2nd parameter is the A sub-block matrix data address, the 3rd parameter is the B sub-block matrix data address, and the 4th parameter is the C sub-block matrix data address; when constructing the GEMM_kernel module, the assembly code is set up with the above parameters as input, realizing the sub-block matrix multiplication task corresponding to dense matrix multiplication, and jumping, after the task is completed, to the program address passed in the 1st parameter.
3. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 2, characterized in that: the parameter list of the DMA_Translate module includes 11 input parameters, wherein the 1st parameter is the return address after the call ends, the 2nd parameter is the DMA logical channel number, the 3rd parameter is the logical channel priority, the 4th parameter is transfer mode control parameter word 1, the 5th parameter is transfer mode control parameter word 2, the 6th parameter is the source address, the 7th parameter is the source count, the 8th parameter is the destination address, the 9th parameter is the destination count, the 10th parameter is the source/destination index, and the 11th parameter is the block index; when constructing the DMA_Translate module, the assembly code is set up with the above parameters as input, realizing the data transfer task from the source address to the destination address, and jumping, after the task is completed, to the program address passed in the 1st parameter.
4. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 3, characterized in that: the parameter list of the DMA_POLL module includes 2 input parameters, wherein the 1st parameter is the return address after the call ends, and the 2nd parameter is the DMA logical channel number; when constructing the DMA_POLL module, the assembly code is set up with the above parameters as input, realizing the task of detecting whether the corresponding register flag bit indicates that the data move on the DMA logical channel given by the 2nd parameter has finished, and jumping, after the task is completed, to the program address passed in the 1st parameter.
5. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 4, characterized in that: the parameter list of the LOOP module includes 3 input parameters, wherein the 1st parameter is the register used for the loop count, the 2nd parameter is the initial value of the loop count, and the 3rd parameter is the count change value per iteration.
6. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 1, characterized in that the specific steps of said Step 2 are:
Step 2.1: use the DMA_Translate module to transfer one MBxKB sub-block of the A matrix in turn to the data buffer Abuf of the scalar L1D;
Step 2.2: use the DMA_POLL module to wait for the data transfer to buffer Abuf to finish;
Step 2.3: use the DMA_Translate module to transfer one KBxNB sub-block of the B matrix and one MBxNB sub-block of the C matrix, respectively, to the data buffers Bbuf0 and Cbuf0 in vector array memory;
Step 2.4: use the DMA_Translate module to transfer one KBxNB sub-block of the B matrix and one MBxNB sub-block of the C matrix, respectively, to the data buffers Bbuf1 and Cbuf1 in vector array memory;
Step 2.5: open the N-dimension blocked loop computation; set counter register R2 with counter initial value ni-2; the counter decreases by 2 each iteration until its value is 0;
Step 2.6: use the DMA_POLL module to wait, respectively, for the Bbuf0, Cbuf0 and Out0 data transfers to finish;
Step 2.7: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks in the data buffers Abuf, Bbuf0 and Cbuf0;
Step 2.8: use the DMA_Translate module to transfer the above computed result to external memory area Out0;
Step 2.9: use the DMA_POLL module to wait, respectively, for the Bbuf1, Cbuf1 and Out1 data transfers to finish;
Step 2.10: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks in the data buffers Abuf, Bbuf1 and Cbuf1;
Step 2.11: use the DMA_Translate module to transfer the above computed result to external memory area Out1.
7. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 6, characterized in that the specific steps of said Step 2.5 are:
Step 2.5.1: use the DMA_POLL module to wait, respectively, for the data transfers to Bbuf0, Cbuf0 and Out0 to finish;
Step 2.5.2: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks in the data buffers Abuf, Bbuf0 and Cbuf0;
Step 2.5.3: use the DMA_Translate module to transfer the above computed result to external memory area Out0;
Step 2.5.4: use the DMA_Translate module to transfer one KBxNB sub-block of the B matrix and one MBxNB sub-block of the C matrix, respectively, to the data buffers Bbuf0 and Cbuf0 in vector array memory;
Step 2.5.5: use the DMA_POLL module to wait, respectively, for the data transfers to Bbuf1, Cbuf1 and Out1 to finish;
Step 2.5.6: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks in the data buffers Abuf, Bbuf1 and Cbuf1;
Step 2.5.7: use the DMA_Translate module to transfer the above computed result to external memory area Out1;
Step 2.5.8: use the DMA_Translate module to transfer one KBxNB sub-block of the B matrix and one MBxNB sub-block of the C matrix, respectively, to the data buffers Bbuf1 and Cbuf1 in vector array memory;
Step 2.5.9: judge whether counter R2 is 0; if not, return to Step 2.5.1; if so, proceed to Step 2.6.
8. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to any one of claims 1 to 5, characterized in that the assembly code generation module Gen_GEMM generates the assembly code of a target core template according to the type of the target core template and its corresponding parameter list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810530676.XA CN108845795B (en) | 2018-05-29 | 2018-05-29 | GPDSP-based dense matrix multiplication vectorization assembly code generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108845795A CN108845795A (en) | 2018-11-20 |
CN108845795B true CN108845795B (en) | 2019-06-14 |
Family
ID=64211068
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810530676.XA Active CN108845795B (en) | 2018-05-29 | 2018-05-29 | GPDSP-based dense matrix multiplication vectorization assembly code generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108845795B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704193B (en) * | 2019-10-12 | 2022-12-16 | 中国电子科技集团公司第三十八研究所 | Method and device for realizing multi-core software architecture suitable for vector processing |
CN113721899B (en) * | 2021-09-02 | 2023-08-15 | 中国人民解放军国防科技大学 | GPDSP-oriented lightweight high-efficiency assembly code programming method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103984522A (en) * | 2014-05-27 | 2014-08-13 | 中国人民解放军国防科学技术大学 | Method for achieving fixed point and floating point mixed division in general-purpose digital signal processor (GPDSP) |
CN104142886A (en) * | 2013-05-10 | 2014-11-12 | 华为软件技术有限公司 | ARM (advanced RISC machines) assembly code debugging and processing method and device |
CN104636315A (en) * | 2015-02-06 | 2015-05-20 | 中国人民解放军国防科学技术大学 | GPDSP-oriented matrix LU decomposition vectorization calculation method |
CN104679691A (en) * | 2015-01-22 | 2015-06-03 | 中国人民解放军国防科学技术大学 | Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP and adopting host counting |
Non-Patent Citations (1)
Title |
---|
"Design and Implementation of a High-Performance DMA Transfer Scheme for GPDSP Scientific Computing" (in Chinese); Wang Zhanli; China Master's Theses Full-text Database, Information Science and Technology Series; 2017-03-15 (No. 03); pp. I137-120 |
Also Published As
Publication number | Publication date |
---|---|
CN108845795A (en) | 2018-11-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11307873B2 (en) | Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging | |
EP3726389B1 (en) | Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator | |
CN109522254B (en) | Arithmetic device and method | |
US10515046B2 (en) | Processors, methods, and systems with a configurable spatial accelerator | |
US20190205284A1 (en) | Apparatus, methods, and systems for memory consistency in a configurable spatial accelerator | |
US20190007332A1 (en) | Processors and methods with configurable network-based dataflow operator circuits | |
US10678724B1 (en) | Apparatuses, methods, and systems for in-network storage in a configurable spatial accelerator | |
CN101763245B (en) | Method and apparatus for programming direct memory access engine | |
CN108845795B (en) | GPDSP-based dense matrix multiplication vectorization assembly code generation method | |
CN110427337B (en) | Processor core based on field programmable gate array and operation method thereof | |
US9239732B2 (en) | Unrolling aggregation operations in asynchronous programming code having multiple levels in hierarchy | |
CN102135950A (en) | On-chip heterogeneous multi-core system based on star type interconnection structure, and communication method thereof | |
Elteir et al. | Performance characterization and optimization of atomic operations on amd gpus | |
CN101763246A (en) | Method and apparatus for programming direct memory access engine | |
US11907713B2 (en) | Apparatuses, methods, and systems for fused operations using sign modification in a processing element of a configurable spatial accelerator | |
Wun et al. | Exploiting coarse-grained parallelism to accelerate protein motif finding with a network processor | |
US20170269931A1 (en) | Method and Computing System for Handling Instruction Execution Using Affine Register File on Graphic Processing Unit | |
CN108776586B (en) | Large-point-number FFT vectorization assembly code generation method based on GPDSP | |
US11941440B2 (en) | System and method for queuing commands in a deep learning processor | |
Ltaief et al. | Hybrid multicore cholesky factorization with multiple gpu accelerators | |
JP2023544911A (en) | Method and apparatus for parallel quantum computing | |
Dongarra et al. | Batched BLAS (basic linear algebra subprograms) 2018 specification | |
D'Hollander et al. | Calling hardware procedures in a reconfigurable accelerator using RPC-FPGA | |
Sharma et al. | Stash: A comprehensive stall-centric characterization of public cloud VMs for distributed deep learning | |
Yi | Dynamic binary translation cache optimization algorithm in cloud computing environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||