CN108845795B - GPDSP-based dense matrix multiplication vectorization assembly code generation method - Google Patents

Publication number
CN108845795B
Authority
CN
China
Prior art keywords
module
parameter
dma
assembly code
matrix
Prior art date
Legal status
Active
Application number
CN201810530676.XA
Other languages
Chinese (zh)
Other versions
CN108845795A (en)
Inventor
刘仲
田希
陈海燕
郭阳
扈啸
陈跃跃
孙永节
刘胜
雷元武
吴家铸
王丽萍
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201810530676.XA
Publication of CN108845795A
Application granted
Publication of CN108845795B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/37 Compiler construction; Parser generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a GPDSP-based dense matrix multiplication vectorization assembly code generation method comprising the following steps: S1, constructing a plurality of core templates for different tasks; S2, constructing a framework template for dense matrix multiplication, in which the core modules are used to implement the dense matrix multiplication; and S3, converting each core module in the framework template into assembly code using a pre-built assembly code generation module Gen_GEMM, finally generating the required dense matrix multiplication assembly code. The method is simple in principle, easy to operate and flexible to use, generates vectorized dense matrix multiplication assembly code automatically, and offers high generation efficiency and high performance.

Description

GPDSP-based dense matrix multiplication vectorization assembly code generation method
Technical field
The present invention relates to the technical field of GPDSPs (General-Purpose Digital Signal Processors), and in particular to a GPDSP-based method for generating vectorized assembly code for dense matrix multiplication.
Background art
The Basic Linear Algebra Subprograms (BLAS) library is one of the core mathematical libraries most commonly used in scientific computing, and vendors have released highly optimized BLAS implementations for their own processor platforms, such as IBM ESSL, Intel MKL and AMD ACML. General matrix-matrix multiplication (GEMM) is the core algorithm of the BLAS library. GEMM is a typical compute-intensive and memory-intensive application that places very high demands on a processor's computing capability, memory bandwidth and latency, and GEMM typically accounts for more than 90% of the operations in the High Performance Linpack (HPL) benchmark. Studying GEMM optimization methods tailored to a processor's architecture is therefore of considerable reference value for evaluating the processor's computational efficiency, exploiting its computing strengths and speeding up applications.
In theory, a compiler is the ideal route to performance optimization because the source code does not need to be rewritten. In practice, compiler technology lags far behind hardware development, and compilers often generate inefficient code even for simple computational problems; for example, compiler-generated dense matrix multiplication code can be several times slower than hand-optimized code. High-performance library functions are therefore usually hand-tuned in assembly. However, every time a processor platform is updated or released, the handwritten code has to be re-implemented and re-optimized, which still takes a great deal of work and is both complex and costly.
A GPDSP is a heterogeneous multi-core processor comprising CPU core units and DSP core units. The CPU core unit is mainly responsible for general-purpose transaction management, including storage management, file control, process scheduling and interrupt management, and provides full support for a general-purpose operating system. The DSP core unit contains several 64-bit vector processing arrays with powerful computing capability to support compute-dense tasks. The DSP core has a complex multi-level storage hierarchy that includes scalar and vector register files, scalar L1D, vector array memory, on-chip shared memory and off-chip DDR memory. This complex architecture makes generating efficient code very challenging: library-function assembly code generated by a compiler can hardly achieve efficient data access and transfer between the storage levels, and traditional blocked matrix multiplication methods designed for cache structures suit neither the GPDSP's non-cache vector array memory access pattern nor the parallel vector processing of its vector processing arrays, so the vector computing strengths of the GPDSP are difficult to exploit.
At present, the high-performance library functions called by application systems with strict real-time requirements are usually hand-optimized in assembly. How to quickly generate efficient library-function assembly code for the complex architecture of the GPDSP is a major challenge, and generating vectorized assembly code for dense matrix multiplication on the GPDSP architecture is a problem that urgently needs to be solved.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, the present invention provides a GPDSP-based dense matrix multiplication vectorization assembly code generation method that is simple in principle, easy to operate and flexible to use, generates code automatically, and offers high generation efficiency and high performance.
To solve the above technical problem, the present invention proposes the following technical solution:
A GPDSP-based dense matrix multiplication vectorization assembly code generation method, comprising the steps of:
S1. constructing a plurality of core templates for different tasks, including a GEMM_kernel module for sub-block matrix multiplication, a DMA_Translate module for data transfer, a DMA_POLL module for checking the register flag bit that indicates whether a DMA data move has finished, and a LOOP module for executing loops, each template including a parameter list of the parameters it requires;
S2. constructing a framework template for dense matrix multiplication, in which the GEMM_kernel module, the DMA_Translate module, the DMA_POLL module and the LOOP module are used to implement the dense matrix multiplication;
S3. converting each core module in the framework template into assembly code using a pre-built assembly code generation module Gen_GEMM, finally generating the required dense matrix multiplication assembly code.
As a further improvement of the present invention, the parameter list of the GEMM_kernel module includes 4 input parameters: the 1st parameter is the return address used when the call ends, the 2nd parameter is the data address of the A sub-block matrix, the 3rd parameter is the data address of the B sub-block matrix, and the 4th parameter is the data address of the C sub-block matrix. When the GEMM_kernel module is constructed, its assembly code is written to take these parameters as input, perform the sub-block matrix multiplication task of the dense matrix multiplication, and jump to the program address passed in the 1st parameter once the task is complete.
As a further improvement of the present invention, the parameter list of the DMA_Translate module includes 11 input parameters: the 1st parameter is the return address used when the call ends, the 2nd is the DMA logical channel number, the 3rd is the logical channel priority, the 4th is transfer mode control word 1, the 5th is transfer mode control word 2, the 6th is the source address, the 7th is the source count, the 8th is the destination address, the 9th is the destination count, the 10th is the source/destination index, and the 11th is the block index. When the DMA_Translate module is constructed, its assembly code is written to take these parameters as input, transfer data from the source address to the destination address, and jump to the program address passed in the 1st parameter once the task is complete.
As a further improvement of the present invention, the parameter list of the DMA_POLL module includes 2 input parameters: the 1st parameter is the return address used when the call ends, and the 2nd parameter is the DMA logical channel number. When the DMA_POLL module is constructed, its assembly code is written to take these parameters as input, check the register flag bit that indicates whether the data move on the DMA logical channel given by the 2nd parameter has finished, and jump to the program address passed in the 1st parameter once the task is complete.
As a further improvement of the present invention, the parameter list of the LOOP module includes 3 input parameters: the 1st parameter is the register used for loop counting, the 2nd parameter is the initial loop count, and the 3rd parameter is the amount by which the count changes on each iteration.
As a further improvement of the present invention, the dense matrix multiplication computed on the GPDSP architecture is C=C-A*B, where A is an M × K matrix, B is a K × N matrix and C is an M × N matrix. The block sizes for the three dimensions M, K and N are denoted MB, KB and NB respectively, with mi=M/MB, ki=K/KB and ni=N/NB. In step S2, the framework template for dense matrix multiplication is constructed as follows:
Step 1: start the blocked loop over the K dimension: set the counter register R0 to the initial value ki and decrement it by 1 on each iteration until it reaches 0;
Step 2: start the blocked loop over the M dimension: set the counter register R1 to the initial value mi and decrement it by 1 on each iteration until it reaches 0. The loop is executed with the LOOP module; inside the loop, the DMA_Translate module transfers the A, B and C matrices to data buffers, the DMA_POLL module waits for the data transfers to finish, and the GEMM_kernel module computes on the A, B and C matrix sub-blocks;
Step 3: check whether counter R1 is 0; if not, return to step 2; if so, go to step 4;
Step 4: check whether counter R0 is 0; if not, return to step 1; if so, the computation is complete.
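As a point of reference for the loop structure in steps 1 to 4, the following Python sketch performs the same blocked update C=C-A*B on plain nested lists. It is only an illustration under stated assumptions: the function name `blocked_gemm_update` is hypothetical, the innermost loops stand in for the GEMM_kernel module, DMA transfers and double buffering are omitted, and C is assumed to be updated in place across the K-dimension blocks.

```python
def blocked_gemm_update(A, B, C, MB, KB, NB):
    """Compute C = C - A*B in place, visiting blocks in K -> M -> N order.

    A is M x K, B is K x N, C is M x N (lists of lists).
    MB, KB, NB are the block sizes and must divide M, K, N exactly.
    """
    M, K, N = len(A), len(B), len(B[0])
    if M % MB or K % KB or N % NB:
        raise ValueError("block sizes must divide the matrix dimensions")

    for k0 in range(0, K, KB):            # step 1: K-dimension loop (ki iterations)
        for m0 in range(0, M, MB):        # step 2: M-dimension loop (mi iterations)
            for n0 in range(0, N, NB):    # N-dimension loop inside step 2 (ni iterations)
                # Stand-in for GEMM_kernel on one (MB x KB) * (KB x NB) sub-block.
                for i in range(m0, m0 + MB):
                    for j in range(n0, n0 + NB):
                        acc = 0
                        for k in range(k0, k0 + KB):
                            acc += A[i][k] * B[k][j]
                        C[i][j] -= acc
    return C
```

Because each K-dimension block only contributes a partial product, the result is complete only after the outermost loop has finished.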
As a further improvement of the present invention, step 2 specifically comprises:
Step 2.1: use the DMA_Translate module to transfer one MB×KB sub-block of matrix A to the data buffer Abuf in scalar L1D;
Step 2.2: use the DMA_POLL module to wait for the transfer to data buffer Abuf to finish;
Step 2.3: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf0 and Cbuf0 in vector array memory, respectively;
Step 2.4: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf1 and Cbuf1 in vector array memory, respectively;
Step 2.5: start the blocked loop over the N dimension: set the counter register R2 to the initial value ni-2 and decrement it by 2 on each iteration until it reaches 0;
Step 2.6: use the DMA_POLL module to wait for the data transfers of Bbuf0, Cbuf0 and Out0 to finish;
Step 2.7: use the GEMM_kernel module to compute on the A, B and C matrix sub-blocks in data buffers Abuf, Bbuf0 and Cbuf0;
Step 2.8: use the DMA_Translate module to transfer the above result to the external memory area Out0;
Step 2.9: use the DMA_POLL module to wait for the data transfers of Bbuf1, Cbuf1 and Out1 to finish;
Step 2.10: use the GEMM_kernel module to compute on the A, B and C matrix sub-blocks in data buffers Abuf, Bbuf1 and Cbuf1;
Step 2.11: use the DMA_Translate module to transfer the above result to the external memory area Out1.
As a further improvement of the present invention, step 2.5 specifically comprises:
Step 2.5.1: use the DMA_POLL module to wait for the data transfers of Bbuf0, Cbuf0 and Out0 to finish;
Step 2.5.2: use the GEMM_kernel module to compute on the A, B and C matrix sub-blocks in data buffers Abuf, Bbuf0 and Cbuf0;
Step 2.5.3: use the DMA_Translate module to transfer the above result to the external memory area Out0;
Step 2.5.4: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf0 and Cbuf0 in vector array memory, respectively;
Step 2.5.5: use the DMA_POLL module to wait for the data transfers of Bbuf1, Cbuf1 and Out1 to finish;
Step 2.5.6: use the GEMM_kernel module to compute on the A, B and C matrix sub-blocks in data buffers Abuf, Bbuf1 and Cbuf1;
Step 2.5.7: use the DMA_Translate module to transfer the above result to the external memory area Out1;
Step 2.5.8: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf1 and Cbuf1 in vector array memory, respectively;
Step 2.5.9: check whether counter R2 is 0; if not, return to step 2.5.1; if so, go to step 2.6.
As a further improvement of the present invention, the assembly code generation module Gen_GEMM generates the assembly code of a target core template according to the type of the target core template and its corresponding parameter list.
Compared with the prior art, the advantages of the present invention are:
1. Based on the architectural features of the GPDSP, the present invention constructs a plurality of core modules for different tasks and builds a framework template for dense matrix multiplication from them. The template contains the representations of the core modules and their parameter lists, and each core module performs its corresponding task. The pre-built assembly code generation module Gen_GEMM then converts the template into assembly code: it automatically generates the corresponding assembly code from the core module representations and input parameter lists contained in the framework template, finally producing the required vectorized assembly code. Vectorized dense matrix multiplication assembly code for the GPDSP architecture can thus be generated automatically.
2. The present invention is simple in principle and easy to operate. It can quickly produce highly optimized, high-performance dense matrix multiplication library-function assembly code for the GPDSP architecture, parameterizing core assembly optimizations such as software pipelining, instruction-level parallelism and vectorization without requiring attention to the underlying hardware implementation. It also eases maintenance of the library-function assembly code: when an update is needed, only the core module representations and parameter lists in the template have to be updated. This avoids having to re-vectorize and re-optimize all of the library-function assembly code when the design is later turned into IP and the number of vector computing cores has to be adapted to different applications.
3. The present invention quickly produces highly optimized, high-performance dense matrix multiplication library-function assembly code that makes full use of the powerful vector computing capability of the DSP cores, while hiding the underlying hardware implementation details from the user. This greatly reduces the burden on programmers of having to master the underlying hardware implementation and of maintaining the library-function assembly code. When the kernel is updated and the function of a core module has to be optimized or a new core module function added, only the template needs to be updated and regenerated to obtain new optimized library-function assembly code automatically, which avoids problems such as having to re-vectorize and re-optimize all of the library-function assembly code when the design is later turned into IP and the number of vector computing cores has to be adapted to different applications.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the GPDSP-based dense matrix multiplication vectorization assembly code generation method of this embodiment.
Fig. 2 is a schematic diagram of the principle of dense matrix multiplication vectorization assembly code generation in this embodiment.
Fig. 3 is a schematic flow chart of how the dense matrix multiplication framework template carries out the dense matrix multiplication in this embodiment.
Fig. 4 is a schematic diagram of the principle of generating assembly code for the loop module in a specific application embodiment of the present invention.
Fig. 5 is a schematic diagram of the principle of generating assembly code for the DMA_Translate core module in a specific application embodiment of the present invention.
Fig. 6 is a schematic diagram of the principle of generating assembly code for the DMA_POLL core module in a specific application embodiment of the present invention.
Fig. 7 is a schematic diagram of the principle of generating assembly code for the GEMM_kernel core module in a specific application embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and specific preferred embodiments, which do not limit the scope of protection of the invention.
As shown in Figs. 1 and 2, the GPDSP-based dense matrix multiplication vectorization assembly code generation method of this embodiment comprises the steps of:
S1. constructing a plurality of core templates for different tasks, including a GEMM_kernel module for sub-block matrix multiplication, a DMA_Translate module for data transfer, a DMA_POLL module for checking the register flag bit that indicates whether a DMA data move has finished, and a LOOP module for executing loops, each template including a parameter list of the parameters it requires;
S2. constructing a framework template for dense matrix multiplication, in which the GEMM_kernel module, the DMA_Translate module, the DMA_POLL module and the LOOP module are used to implement the dense matrix multiplication;
S3. converting each core module in the framework template into assembly code using a pre-built assembly code generation module Gen_GEMM, finally generating the required dense matrix multiplication assembly code.
Based on the architectural features of the GPDSP, this embodiment constructs a plurality of core modules for different tasks and builds a framework template for dense matrix multiplication from them. The template contains the representations of the core modules and their parameter lists, and each core module performs its corresponding task. The pre-built assembly code generation module Gen_GEMM then converts the template into assembly code: it automatically generates the corresponding assembly code from the core module representations and input parameter lists contained in the framework template, finally producing the required vectorized assembly code, so that vectorized dense matrix multiplication assembly code for the GPDSP architecture is generated automatically.
The code generation method of this embodiment is simple in principle and easy to operate. It can quickly produce highly optimized, high-performance dense matrix multiplication library-function assembly code for the GPDSP architecture without requiring attention to the underlying hardware implementation, and it eases maintenance of the library-function assembly code: when an update is needed, only the core module representations and parameter lists in the template have to be updated. This avoids problems such as having to re-vectorize and re-optimize all of the library-function assembly code when the design is later turned into IP and the number of vector computing cores has to be adapted to different applications.
The dense matrix multiplication to be implemented on the GPDSP architecture in this embodiment is C=C-A*B, where A is an M × K matrix, B is a K × N matrix and C is an M × N matrix. The block sizes for the three dimensions M, K and N are denoted MB, KB and NB respectively, with mi=M/MB, ki=K/KB and ni=N/NB; mi and ki are assumed to be integers and ni is assumed to be even.
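The block counts and the assumptions placed on them can be checked with a few lines of Python. This is a minimal sketch: the helper name `block_counts` and the example sizes are hypothetical, and only the relations mi=M/MB, ki=K/KB, ni=N/NB and the even-ni requirement come from the text.

```python
def block_counts(M, K, N, MB, KB, NB):
    """Return (mi, ki, ni) and enforce the assumptions stated for the embodiment."""
    for name, dim, blk in (("M", M, MB), ("K", K, KB), ("N", N, NB)):
        if dim % blk != 0:
            raise ValueError(f"{name}={dim} is not a multiple of its block size {blk}")
    mi, ki, ni = M // MB, K // KB, N // NB
    if ni % 2 != 0:
        # The N-dimension loop processes two buffered sub-blocks per iteration.
        raise ValueError(f"ni={ni} must be even")
    return mi, ki, ni


# Hypothetical sizes, chosen only to illustrate the arithmetic.
print(block_counts(12, 8, 16, 6, 4, 4))  # (2, 2, 4)
```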
To implement the dense matrix multiplication C=C-A*B, this embodiment first constructs the sub-block matrix multiplication core module, represented as the GEMM_kernel module. Its parameter list includes 4 input parameters: the 1st parameter is the return address used when the call ends, the 2nd parameter is the data address of the A sub-block matrix, the 3rd parameter is the data address of the B sub-block matrix, and the 4th parameter is the data address of the C sub-block matrix. When the GEMM_kernel module is constructed, its assembly code is written to take these parameters as input, perform the sub-block matrix multiplication task of the dense matrix multiplication C=C-A*B, and jump to the program address passed in the 1st parameter once the task is complete. The resulting GEMM_kernel core module assembly code is saved as a separate file, GEMM_kernel.s.
Next, the data transfer core module is constructed, represented as the DMA_Translate module. Its parameter list includes 11 input parameters: the 1st parameter is the return address used when the call ends, the 2nd is the DMA logical channel number, the 3rd is the logical channel priority, the 4th is transfer mode control word 1, the 5th is transfer mode control word 2, the 6th is the source address, the 7th is the source count, the 8th is the destination address, the 9th is the destination count, the 10th is the source/destination index, and the 11th is the block index. When the DMA_Translate module is constructed, its assembly code is written to take these parameters as input, transfer data from the source address to the destination address via DMA, and jump to the program address passed in the 1st parameter once the task is complete. The resulting DMA_Translate core module assembly code is saved as a separate file, DMA_Translate.s.
The data-move flag-bit detection core module is then constructed, represented as the DMA_POLL module. Its parameter list includes 2 input parameters: the 1st parameter is the return address used when the call ends, and the 2nd parameter is the DMA logical channel number. When the DMA_POLL module is constructed, its assembly code is written to take these parameters as input, check the register flag bit that indicates whether the data move on the DMA logical channel given by the 2nd parameter has finished, and jump to the program address passed in the 1st parameter once the task is complete. The resulting DMA_POLL core module assembly code is saved as a separate file, DMA_POLL.s.
Finally, the loop module is constructed and represented by a matched LOOP/END pair, where LOOP marks the start of a loop and END marks the end of the matching loop. The parameter list of the LOOP module includes 3 input parameters: the 1st parameter is the register used for loop counting, the 2nd parameter is the initial loop count, and the 3rd parameter is the amount by which the count changes on each iteration.
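One way to picture the four module representations and their parameter lists is as small records that a generator can validate before emitting code. The sketch below is an assumption for illustration only: the `ModuleCall` class, the `PARAM_COUNTS` table and the `render` format are hypothetical; only the module names and the parameter counts (4, 11, 2 and 3) come from the text.

```python
from dataclasses import dataclass

# Number of input parameters per core module, as listed in the description.
PARAM_COUNTS = {
    "GEMM_kernel": 4,     # return address, A address, B address, C address
    "DMA_Translate": 11,  # return address, channel, priority, 2 control words, ...
    "DMA_POLL": 2,        # return address, DMA logical channel number
    "LOOP": 3,            # counter register, initial count, change per iteration
}


@dataclass(frozen=True)
class ModuleCall:
    """One core-module representation inside a framework template (hypothetical)."""
    name: str
    params: tuple

    def __post_init__(self):
        expected = PARAM_COUNTS.get(self.name)
        if expected is None:
            raise ValueError(f"unknown core module: {self.name}")
        if len(self.params) != expected:
            raise ValueError(
                f"{self.name} takes {expected} parameters, got {len(self.params)}"
            )

    def render(self):
        """Return the template text, e.g. LOOP(R0,ki,1)."""
        return f"{self.name}({','.join(str(p) for p in self.params)})"


print(ModuleCall("LOOP", ("R0", "ki", 1)).render())  # LOOP(R0,ki,1)
```

Validating the parameter count at construction time means a malformed template is rejected before any assembly text is produced.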
Once the above core modules have been constructed, they can be combined into a dense matrix multiplication framework template for the GPDSP architecture. In this embodiment, based on the multi-level storage architecture of the GPDSP, step S2 combines the core modules along the execution path of dense matrix multiplication to form a blocked dense matrix multiplication framework template; the template contains the representations and parameter lists of the core modules so that blocked dense matrix multiplication can be computed efficiently. Step S3 then uses the assembly code generation module Gen_GEMM to automatically generate the dense matrix multiplication assembly code from the core module representations in the template, their input parameter lists and the corresponding core module assembly code files, completing the automatic generation of vectorized code for dense matrix multiplication.
As shown in Fig. 3, in step S2 of this embodiment the framework template for dense matrix multiplication is constructed as follows:
Step 1: start the blocked loop over the K dimension: set the counter register R0 to the initial value ki and decrement it by 1 on each iteration until it reaches 0;
Step 2: start the blocked loop over the M dimension: set the counter register R1 to the initial value mi and decrement it by 1 on each iteration until it reaches 0. The loop is executed with the LOOP module; inside the loop, the DMA_Translate module transfers the A, B and C matrices to data buffers, the DMA_POLL module waits for the data transfers to finish, and the GEMM_kernel module computes on the A, B and C matrix sub-blocks. Here the DMA_Translate module denotes the data transfer core module, the DMA_POLL module denotes the data-move flag-bit detection core module, and the GEMM_kernel module denotes the sub-block matrix multiplication core module;
Step 3: check whether counter R1 is 0; if not, return to step 2; if so, go to step 4;
Step 4: check whether counter R0 is 0; if not, return to step 1; if so, the computation is complete.
In this embodiment, step 2 carries out the blocked loop over the M dimension as follows:
Step 2.1: use the DMA_Translate module to transfer one MB×KB sub-block of matrix A to the data buffer Abuf in scalar L1D;
Step 2.2: use the DMA_POLL module to wait for the Abuf data transfer to finish;
Step 2.3: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf0 and Cbuf0 in vector array memory, respectively;
Step 2.4: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf1 and Cbuf1 in vector array memory, respectively;
Step 2.5: start the blocked loop over the N dimension: set the counter register R2 to the initial value ni-2 and decrement it by 2 on each iteration until it reaches 0;
Step 2.6: use the DMA_POLL module to wait for the Bbuf0, Cbuf0 and Out0 data transfers to finish;
Step 2.7: use the GEMM_kernel module to compute on the A, B and C matrix sub-blocks in Abuf, Bbuf0 and Cbuf0;
Step 2.8: use the DMA_Translate module to transfer the above result to the external memory area Out0;
Step 2.9: use the DMA_POLL module to wait for the Bbuf1, Cbuf1 and Out1 data transfers to finish;
Step 2.10: use the GEMM_kernel module to compute on the A, B and C matrix sub-blocks in Abuf, Bbuf1 and Cbuf1;
Step 2.11: use the DMA_Translate module to transfer the above result to the external memory area Out1.
In the present embodiment, step 2.5 executes the specific steps that N-dimensional Circulant Block calculates are as follows:
Step 2.5.1: Bbuf0, Cbuf0, Out0 data end of transmission are waited using DMA_POLL module respectively;
Step 2.5.2: using GEMM_kernel module to Abuf, A, B, C matrix sub block of Bbuf0, Cbuf0 are counted It calculates;
Step 2.5.3: the above-mentioned calculated result of DMA_Translate module transfer to external memory area Out0 is used;
Step 2.5.4: respectively extremely using sub-block a KBxNB, MBxNB of DMA_Translate module transfer B, C matrix The data buffer zone Bbuf0, Cbuf0 of vector array storage;
Step 2.5.5: Bbuf1, Cbuf1, Out1 data end of transmission are waited using DMA_POLL module respectively;
Step 2.5.6: using GEMM_kernel module to Abuf, A, B, C matrix sub block of Bbuf1, Cbuf1 are counted It calculates;
Step 2.5.7: the above-mentioned calculated result of DMA_Translate module transfer to external memory area Out1 is used;
Step 2.5.8: respectively extremely using sub-block a KBxNB, MBxNB of DMA_Translate module transfer B, C matrix The data buffer zone Bbuf1, Cbuf1 of vector array storage;
Step 2.5.9: judging whether counter R2 is 0, if not 0 returns to step 2.5.1, is transferred to execution if 0 Step 2.6.
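The ordering of module calls in steps 2.3 to 2.11 can be illustrated with a small Python sketch. It is not part of the patented method. It only lists the calls in order, assuming that ni is even and at least 2 and that the loop body of step 2.5 runs (ni-2)/2 times, so that the loop and the final two sub-blocks together cover all ni sub-blocks. The function name `n_loop_schedule` and the text of each entry are hypothetical.

```python
def n_loop_schedule(ni):
    """List the module calls of steps 2.3 to 2.11 for ni sub-blocks along N.

    Assumption of this sketch: ni is even and >= 2, and the body of
    step 2.5 runs (ni - 2) / 2 times.
    """
    if ni < 2 or ni % 2 != 0:
        raise ValueError("this sketch assumes an even ni >= 2")
    ops = []
    in_buf = {}  # buffer index (0 or 1) -> sub-block currently held

    def load(buf, block):
        # DMA_Translate: bring one B sub-block and one C sub-block on chip.
        ops.append(("DMA_Translate", f"load block {block} -> Bbuf{buf}/Cbuf{buf}"))
        in_buf[buf] = block

    def process(buf):
        # Wait for pending transfers, compute, then write the result out.
        ops.append(("DMA_POLL", f"wait Bbuf{buf}/Cbuf{buf}/Out{buf}"))
        ops.append(("GEMM_kernel", f"compute block {in_buf[buf]} from Abuf/Bbuf{buf}/Cbuf{buf}"))
        ops.append(("DMA_Translate", f"store block {in_buf[buf]} -> Out{buf}"))

    load(0, 0)             # step 2.3
    load(1, 1)             # step 2.4
    next_block = 2
    r2 = ni - 2            # step 2.5: counter R2
    while r2 != 0:
        for buf in (0, 1):  # steps 2.5.1 to 2.5.8
            process(buf)
            load(buf, next_block)
            next_block += 1
        r2 -= 2
    for buf in (0, 1):      # steps 2.6 to 2.11
        process(buf)
    return ops
```

Calling `n_loop_schedule(4)` yields 16 entries in which the GEMM_kernel calls handle sub-blocks 0, 1, 2, 3 in that order, alternating between the two buffer sets so that a load into one set can overlap with computation on the other.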
With the above framework template for dense matrix multiplication, efficient blocked dense matrix multiplication can be realized in combination with the architectural features of the GPDSP. Different tasks are carried out by the core modules built into the template, and, combined with the assembly code generation module Gen_GEMM, highly optimized, high-performance dense matrix multiplication library function assembly code can be obtained quickly without attention to low-level hardware implementation details. When optimization updates or extensions are needed, only the template has to be updated to automatically obtain updated, optimized library function assembly code.
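As a purely numerical illustration of the blocking, the following Python sketch computes C = C - A*B one sub-block at a time, with the K dimension as the outer loop, M as the middle loop, and N as the inner loop. It assumes that MB, KB, and NB divide M, K, and N exactly and uses plain nested lists; the innermost triple loop is only a stand-in for what GEMM_kernel does on one sub-block, and the function name `blocked_gemm_sub` is hypothetical.

```python
def blocked_gemm_sub(A, B, C, MB, KB, NB):
    """Update C in place to C - A*B, block by block, and return C.

    A is M x K, B is K x N, C is M x N, all as nested lists.
    Assumption of this sketch: MB, KB, NB divide M, K, N exactly.
    """
    M, K, N = len(A), len(B), len(B[0])
    if M % MB or K % KB or N % NB:
        raise ValueError("block sizes must divide the matrix dimensions")
    for k0 in range(0, K, KB):            # K-dimension loop (counter R0)
        for m0 in range(0, M, MB):        # M-dimension loop (counter R1)
            for n0 in range(0, N, NB):    # N-dimension loop (counter R2)
                # Stand-in for GEMM_kernel on one (MB x KB, KB x NB, MB x NB) triple.
                for i in range(m0, m0 + MB):
                    for j in range(n0, n0 + NB):
                        acc = 0
                        for k in range(k0, k0 + KB):
                            acc += A[i][k] * B[k][j]
                        C[i][j] -= acc
    return C
```

Because each C sub-block is updated once per K block, the partial products accumulate across the outer loop and the final C equals the unblocked result.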
In this embodiment, the assembly code generation module Gen_GEMM generates the assembly code of a target core template according to the type of the target core template and its corresponding parameter list.
As shown in Fig. 4, in a specific application embodiment of the present invention, the assembly code generation module Gen_GEMM generates assembly code for the LOOP module as follows:
The LOOP module and its input parameter list are expressed in the template as follows:
LOOP(Ri,count,len)
……
END
Gen_GEMM then generates the following assembly code for the above LOOP module:
LOOP_Ri:
SMOVI count, Ri
……
[Ri] SBR LOOP_Ri
[Ri] SSUB len, Ri, Ri
SNOP 4
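The expansion of a LOOP expression can be sketched as a small text-generating function in Python. This is an illustration only, not the actual Gen_GEMM implementation: it reproduces the textual layout of the listing above and does not model instruction semantics. The function name `gen_loop` and the four-space indentation are assumptions.

```python
def gen_loop(reg, count, step, body_lines):
    """Expand LOOP(reg, count, step) ... END into a list of assembly text lines.

    reg:        counter register name, e.g. "R2"
    count:      initial counter value (text or number)
    step:       amount subtracted from the counter per iteration
    body_lines: already generated assembly lines of the loop body
    """
    label = f"LOOP_{reg}"
    lines = [f"{label}:", f"    SMOVI {count}, {reg}"]
    lines += [f"    {line}" for line in body_lines]
    lines += [
        f"    [{reg}] SBR {label}",                 # branch line as in the listing
        f"    [{reg}] SSUB {step}, {reg}, {reg}",   # counter update line
        "    SNOP 4",
    ]
    return lines
```

For example, `gen_loop("R2", "ni-2", 2, [...])` produces a block labelled `LOOP_R2` whose counter update line reads `[R2] SSUB 2, R2, R2`.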
As shown in Fig. 5, in a specific application embodiment of the present invention, the assembly code generation module Gen_GEMM generates assembly code for the DMA_Translate core module as follows:
The DMA_Translate module and its input parameter list are expressed in the template as follows:
DMA_Translate (para1,para2,para3,para4,para5,para6,para7,para8,para9, para10,para11)
Gen_GEMM then generates the following assembly code for the above DMA_Translate module:
SBR DMA_Translate
SMOVI para1, R63
| SMOVI.M1 para2, R62
| SMOVI.M2 para3, R61
SMOVI para4, R60
| SMOVI.M1 para5, R59
| SMOVI.M2 para6, R58
SMOVI para7, R57
| SMOVI.M1 para8, R56
| SMOVI.M2 para9, R55
SMOVI para10, R54
SMOVI.M1 para11, R53
If this is the first occurrence of a DMA_Translate expression in the template, the code in the DMA_Translate.s file is inserted at the end of the assembly file being generated.
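The parameter-passing pattern of the DMA_Translate call can be sketched in Python as follows. The sketch assumes that the eleven parameters are assigned to registers in strictly descending order from R63 to R53, and it copies the `|` markers and the `.M1`/`.M2` suffixes from the listing for the first nine moves. The meaning of those markers is not modeled, and the function name `gen_dma_translate` is hypothetical.

```python
UNITS = ["", ".M1", ".M2"]  # suffixes as they appear in the listing

def gen_dma_translate(params, grouped=9):
    """Return the text lines of one DMA_Translate call.

    Assumption of this sketch: parameter i (0-based) goes to register
    R(63 - i), and only the first `grouped` moves form 3-wide groups
    whose second and third members are prefixed with '|'.
    """
    if len(params) != 11:
        raise ValueError("DMA_Translate takes 11 parameters")
    lines = ["SBR DMA_Translate"]
    for i, param in enumerate(params):
        reg = f"R{63 - i}"
        unit = UNITS[i % 3]
        bar = "| " if (i < grouped and i % 3 != 0) else ""
        lines.append(f"{bar}SMOVI{unit} {param}, {reg}")
    return lines
```

With `params = ["para1", ..., "para11"]` the output starts with `SBR DMA_Translate`, followed by `SMOVI para1, R63` and `| SMOVI.M1 para2, R62`, and ends with `SMOVI.M1 para11, R53`.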
As shown in Fig. 6, in a specific embodiment of the present invention, the assembly code generation module Gen_GEMM generates assembly code for the DMA_POLL core module as follows:
The DMA_POLL module and its input parameter list are expressed in the template as follows:
DMA_POLL (para1,para2)
Gen_GEMM then generates the following assembly code for the above DMA_POLL module:
SBR DMA_POLL
SMOVI para1, R63
SMOVI para2, R62
SNOP 4
If this is the first occurrence of a DMA_POLL expression in the template, the code in the DMA_POLL.s file is inserted at the end of the assembly file being generated.
As shown in Fig. 7, in a specific application embodiment of the present invention, the assembly code generation module Gen_GEMM generates assembly code for the GEMM_kernel core module as follows:
The GEMM_kernel module and its input parameter list are expressed in the template as follows:
GEMM_kernel (para1,para2, para3,para4)
Gen_GEMM then generates the following assembly code for the above GEMM_kernel module:
SBR GEMM_kernel
SMOVI para1, R63
SMOVI para2, R62
SMOVI para3, R61
SMOVI para4, R60
SNOP 2
If this is the first occurrence of a GEMM_kernel expression in the template, the code in the GEMM_kernel.s file is inserted at the end of the assembly file being generated.
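The two rules shared by the three core-module expansions, a call sequence generated by module type and parameter list and the one-time insertion of the module body at the tail of the file, can be sketched in Python as follows. This is a simplified illustration rather than the actual Gen_GEMM: every parameter move is emitted on its own line (the grouped form of the DMA_Translate listing is omitted), and `module_sources` is a hypothetical dictionary standing in for the contents of the `.s` files.

```python
# Trailing SNOP counts copied from the listings; None means none is shown.
CALL_SNOP = {"DMA_Translate": None, "DMA_POLL": 4, "GEMM_kernel": 2}

def gen_gemm(calls, module_sources):
    """Generate assembly text for a sequence of core-module calls.

    calls:          list of (module_name, [param, ...]) in template order
    module_sources: module_name -> text of its .s file (hypothetical stand-in)
    """
    body, tail, seen = [], [], set()
    for name, params in calls:
        body.append(f"SBR {name}")
        for i, param in enumerate(params):
            body.append(f"SMOVI {param}, R{63 - i}")  # assumed: R63 downward
        if CALL_SNOP[name] is not None:
            body.append(f"SNOP {CALL_SNOP[name]}")
        if name not in seen:
            # First occurrence in the template: queue the module body once.
            seen.add(name)
            tail.append(module_sources[name])
    return "\n".join(body + tail)
```

Calling a module several times therefore emits several call sequences but only one copy of its body, placed after all generated calls.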
The above is merely a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of a preferred embodiment, this is not intended to limit the invention. Therefore, any simple modification, equivalent change, or adaptation made to the above embodiment in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the scope of protection of the technical solution of the present invention.

Claims (8)

1. A GPDSP-based dense matrix multiplication vectorization assembly code generation method, characterized in that the steps comprise:
S1. constructing multiple core templates that implement different tasks, including a GEMM_kernel module for sub-block matrix multiplication, a DMA_Translate module for data transfer, a DMA_POLL module for checking the register flag bit that indicates whether a DMA data move has completed, and a LOOP module for executing loops, each template including a parameter list of its required parameters;
S2. constructing a framework template for dense matrix multiplication, in which the GEMM_kernel module, DMA_Translate module, DMA_POLL module, and LOOP module are used to implement dense matrix multiplication;
S3. using a pre-built assembly code generation module Gen_GEMM to convert each core module in the framework template into assembly code, finally generating the required dense matrix multiplication assembly code;
when the dense matrix multiplication to be computed on the GPDSP architecture is C=C-A*B, where A is an M × K matrix, B is a K × N matrix, and C is an M × N matrix, the block sizes for the three dimensions M, K, N are denoted MB, KB, NB respectively, and mi=M/MB, ki=K/KB, ni=N/NB, the framework template for dense matrix multiplication is constructed in step S2 as follows:
Step 1: start the K-dimension blocked loop computation; set counter register R0 with initial value ki; the counter is decremented by 1 on each iteration until it reaches 0;
Step 2: start the M-dimension blocked loop computation; set counter register R1 with initial value mi; the counter is decremented by 1 on each iteration until it reaches 0; within the loop computation, the LOOP module is used to execute the loop, the DMA_Translate module is used to transfer the A, B, C matrices to data buffers, the DMA_POLL module is used to wait for data transfers to complete, and the GEMM_kernel module is used to compute on the A, B, C matrix sub-blocks;
Step 3: determine whether counter R1 is 0; if not, return to step 2; if so, proceed to step 4;
Step 4: determine whether counter R0 is 0; if not, return to step 1; if so, the current computation is complete.
2. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 1, characterized in that: the parameter list of the GEMM_kernel module includes 4 input parameters, where the 1st parameter is the return address used when the call ends, the 2nd parameter is the data address of the A sub-block matrix, the 3rd parameter is the data address of the B sub-block matrix, and the 4th parameter is the data address of the C sub-block matrix; when the GEMM_kernel module is constructed, its assembly code is written with the above parameters as input, implements the sub-block matrix multiplication task of the dense matrix multiplication, and jumps to the program address passed in the 1st parameter after the task completes.
3. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 2, characterized in that: the parameter list of the DMA_Translate module includes 11 input parameters, where the 1st parameter is the return address used when the call ends, the 2nd parameter is the DMA logical channel number, the 3rd parameter is the logical channel priority, the 4th parameter is transfer mode control word 1, the 5th parameter is transfer mode control word 2, the 6th parameter is the source address, the 7th parameter is the source count, the 8th parameter is the destination address, the 9th parameter is the destination count, the 10th parameter is the source/destination index, and the 11th parameter is the block index; when the DMA_Translate module is constructed, its assembly code is written with the above parameters as input, implements the data transfer task from the source address to the destination address, and jumps to the program address passed in the 1st parameter after the task completes.
4. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 3, characterized in that: the parameter list of the DMA_POLL module includes 2 input parameters, where the 1st parameter is the return address used when the call ends and the 2nd parameter is the DMA logical channel number; when the DMA_POLL module is constructed, its assembly code is written with the above parameters as input, implements the task of checking the register flag bit that indicates whether the data move on the DMA logical channel given by the 2nd parameter has completed, and jumps to the program address passed in the 1st parameter after the task completes.
5. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 4, characterized in that: the parameter list of the LOOP module includes 3 input parameters, where the 1st parameter is the register used for the loop count, the 2nd parameter is the initial value of the loop count, and the 3rd parameter is the amount by which the count changes on each iteration.
6. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 1, characterized in that the specific steps of step 2 are:
Step 2.1: use the DMA_Translate module to transfer, in turn, one MB×KB sub-block of matrix A to the data buffer Abuf in the scalar L1D;
Step 2.2: use the DMA_POLL module to wait for the data transfer of data buffer Abuf to complete;
Step 2.3: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf0 and Cbuf0, respectively, in vector array memory;
Step 2.4: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf1 and Cbuf1, respectively, in vector array memory;
Step 2.5: start the N-dimension blocked loop computation; set counter register R2 with initial value ni-2; the counter is decremented by 2 on each iteration until it reaches 0;
Step 2.6: use the DMA_POLL module to wait for the data transfers of data buffers Bbuf0, Cbuf0, and Out0 to complete;
Step 2.7: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks held in data buffers Abuf, Bbuf0, and Cbuf0;
Step 2.8: use the DMA_Translate module to transfer the above result to the external memory area Out0;
Step 2.9: use the DMA_POLL module to wait for the data transfers of data buffers Bbuf1, Cbuf1, and Out1 to complete;
Step 2.10: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks held in data buffers Abuf, Bbuf1, and Cbuf1;
Step 2.11: use the DMA_Translate module to transfer the above result to the external memory area Out1.
7. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to claim 6, characterized in that the specific steps of step 2.5 are:
Step 2.5.1: use the DMA_POLL module to wait for the data transfers of data buffers Bbuf0, Cbuf0, and Out0 to complete;
Step 2.5.2: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks held in data buffers Abuf, Bbuf0, and Cbuf0;
Step 2.5.3: use the DMA_Translate module to transfer the above result to the external memory area Out0;
Step 2.5.4: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf0 and Cbuf0, respectively, in vector array memory;
Step 2.5.5: use the DMA_POLL module to wait for the data transfers of data buffers Bbuf1, Cbuf1, and Out1 to complete;
Step 2.5.6: use the GEMM_kernel module to compute on the A, B, C matrix sub-blocks held in data buffers Abuf, Bbuf1, and Cbuf1;
Step 2.5.7: use the DMA_Translate module to transfer the above result to the external memory area Out1;
Step 2.5.8: use the DMA_Translate module to transfer one KB×NB sub-block of matrix B and one MB×NB sub-block of matrix C to the data buffers Bbuf1 and Cbuf1, respectively, in vector array memory;
Step 2.5.9: determine whether counter R2 is 0; if not, return to step 2.5.1; if so, proceed to step 2.6.
8. The GPDSP-based dense matrix multiplication vectorization assembly code generation method according to any one of claims 1 to 5, characterized in that the assembly code generation module Gen_GEMM generates the assembly code of a target core template according to the type of the target core template and its corresponding parameter list.
CN201810530676.XA 2018-05-29 2018-05-29 GPDSP-based dense matrix multiplication vectorization assembly code generation method Active CN108845795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810530676.XA CN108845795B (en) 2018-05-29 2018-05-29 GPDSP-based dense matrix multiplication vectorization assembly code generation method

Publications (2)

Publication Number Publication Date
CN108845795A CN108845795A (en) 2018-11-20
CN108845795B true CN108845795B (en) 2019-06-14

Family

ID=64211068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810530676.XA Active CN108845795B (en) 2018-05-29 2018-05-29 GPDSP-based dense matrix multiplication vectorization assembly code generation method

Country Status (1)

Country Link
CN (1) CN108845795B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant