CN111291320A - Double-precision floating-point complex matrix operation optimization method based on HXDSP chip

Double-precision floating-point complex matrix operation optimization method based on HXDSP chip

Info

Publication number
CN111291320A
Authority
CN
China
Prior art keywords
double-precision floating-point
complex matrix
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010047045.XA
Other languages
Chinese (zh)
Other versions
CN111291320B (en)
Inventor
苏涛
张丽
朱晨曦
张晓杰
桂宪满
陈琛
董浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010047045.XA
Publication of CN111291320A
Application granted
Publication of CN111291320B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip, belonging to the field of bottom-layer function optimization for digital signal processors. The double-precision floating-point complex matrix operation comprises an inner loop, formed by repeatedly multiplying one element of a first double-precision floating-point complex matrix with four elements in the corresponding row of the first four columns of a second double-precision floating-point complex matrix and adding the products to the corresponding elements of a third double-precision floating-point complex matrix, and an outer loop, formed by repeating this process for one row of the first matrix against every group of four columns of the second matrix. By optimizing the matrix multiplication, the invention avoids the pipeline stalls caused by read conflicts in the memory, makes full use of the hardware resources, and improves operation efficiency.

Description

Double-precision floating-point complex matrix operation optimization method based on HXDSP chip
Technical Field
The invention belongs to the field of bottom-layer function optimization for digital signal processors, and particularly relates to a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip.
Background
With the rapid development of digital signal processing technology, DSP processors are widely used in image processing, communication, radar detection, speech processing, network control, instrumentation, and home appliances. Independent development of DSP processor chips has gradually become an important subject for the development of digital signal processing technology in China. Against this background, a domestic research institute has introduced the high-performance HXDSP1042 DSP processor, whose architecture, instruction system, design implementation and matching software and hardware development environment are completely independently developed, and whose performance even exceeds that of some international products. It has broad application prospects in national defense, public security, the Internet of Things, communication and other industries, and its success breaks the foreign monopoly on high-end digital signal processing chips in China.
The HXDSP1042 integrates two processor cores (eC104+) and adopts a 16-issue (up to 16 instructions per instruction cycle), single-instruction-stream multiple-data-stream architecture. Each eC104+ core contains four execution macros; each execution macro internally contains 2 sets of registers (each set holding 64 registers of 32-bit binary data), 8 arithmetic logic units (ALUs), 8 multipliers, 4 shifters, and 1 super calculator (SPU). The development language is C or assembly. To improve the efficiency of the function operations, assembly language is chosen to implement the library functions: although C is easy to read and highly portable, it is inconvenient for directly controlling the hardware and cannot fully exploit the characteristics of the DSP chip, whereas assembly ensures maximum utilization of the hardware resources. However, assembly code compiled directly from C does not take the hardware characteristics of the DSP chip into account and has certain defects; in particular, double-precision floating-point complex multiplication suffers from severe Bank conflicts.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip. The optimization method multiplies two double-precision floating-point complex matrices and adds the product to another double-precision floating-point complex matrix; it avoids the pipeline stalls caused by Bank conflicts during the multiplication and improves the operation efficiency of double-precision floating-point complex matrices. In addition, the hardware resources of the HXDSP chip are fully utilized, so that the processing efficiency of the bottom-layer function is greatly improved.
In order to achieve the above object, the present invention adopts the following technical solutions.
The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip comprises the following steps:
the double-precision floating-point complex matrix operation is that the first double-precision floating-point complex matrix is multiplied by the second double-precision floating-point complex matrix and then added with the third double-precision floating-point complex matrix; the specification of the first double-precision floating point complex matrix is m x n, the specification of the second double-precision floating point complex matrix is n x k, and the specification of the third double-precision floating point complex matrix is m x k;
the operation of multiplying the first double-precision floating-point complex matrix by the second double-precision floating-point complex matrix and adding the third double-precision floating-point complex matrix comprises an inner loop, formed by repeatedly multiplying one element of the first double-precision floating-point complex matrix with the four elements in the corresponding row of the first four columns of the second double-precision floating-point complex matrix and adding the products to the corresponding elements of the third double-precision floating-point complex matrix, and an outer loop, formed by repeating this process for one row of the first double-precision floating-point complex matrix against the corresponding four elements of every group of four columns of the second double-precision floating-point complex matrix;
the process in which one element of the first double-precision floating-point complex matrix is multiplied and accumulated with four elements of the corresponding row of the second double-precision floating-point complex matrix and then added to the corresponding elements of the third double-precision floating-point complex matrix involves optimization based on HXDSP chip instruction parallelism, optimization based on loop unrolling and optimization based on software pipelining;
the optimization based on HXDSP chip instruction parallelism is an optimization in which the same instruction simultaneously controls several arithmetic units to execute the same operation; the optimization based on loop unrolling is an optimization in which the same loop body is unrolled several times within one loop; the optimization based on software pipelining is an optimization in which the multiple identical loop iterations are executed in parallel and interleaved.
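For reference, the operation being optimized can be written as the following minimal scalar C sketch (C = A×B + C for double-precision complex matrices); the row-major layout and the function name zgemm_ref are illustrative assumptions, and the sketch only defines the mathematical result rather than the optimized HXDSP assembly implementation described below.

#include <complex.h>
#include <stddef.h>

/* Reference (unoptimized) double-precision complex matrix operation
 * C = A*B + C, with A: m x n, B: n x k, C: m x k, all row-major.
 * This scalar sketch only defines the expected result; the invention
 * implements the same computation with HXDSP assembly. */
void zgemm_ref(size_t m, size_t n, size_t k,
               const double complex *A,
               const double complex *B,
               double complex *C)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < k; ++j) {
            double complex acc = C[i * k + j];
            for (size_t p = 0; p < n; ++p) {
                acc += A[i * n + p] * B[p * k + j];
            }
            C[i * k + j] = acc;
        }
    }
}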
Furthermore, the HXDSP chip comprises 4 execution macros, each of which contains a 64-word general register file; the registers of the general register file are divided into A-side and B-side registers, the A-side registers being numbered 0 to 63, with registers 0-19 forming the first register group, 20-39 the second register group and 40-59 the third register group, while B-side registers 40-59 form the fourth register group; the four execution macros can operate in parallel; each execution macro also comprises 8 multipliers, 8 arithmetic logic units (ALUs), 4 shifters and one super calculator; the HXDSP chip is provided with three address registers, each of which comprises 16 registers;
the optimization based on HXDSP chip instruction parallelism specifically comprises reading 8 data of the first double-precision floating-point complex matrix and 8 data of the second double-precision floating-point complex matrix; the 8 data of the first double-precision floating-point complex matrix are four identical copies of the 2 data corresponding to one element of the first double-precision floating-point complex matrix, and the 8 data of the second double-precision floating-point complex matrix are four elements of the same row of the second double-precision floating-point complex matrix; the reading of the 8 data of the first double-precision floating-point complex matrix and the reading of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
before the loop begins, the 8 data of the third double-precision floating-point complex matrix are read and stored;
the 8 data of the first double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the first and second registers of the four register groups of the HXDSP chip; the 8 data of the second double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the third and fourth registers of the four register groups of the HXDSP chip; the 8 data of the third double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the fifth and sixth registers of the four register groups of the HXDSP chip; the storing of the 8 data of the first double-precision floating-point complex matrix and the storing of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
in one instruction, the first multiplier of each execution macro of the HXDSP chip multiplies the data in the first register by the data in the third register and stores the product in the seventh register, while the second multiplier multiplies the data in the second register by the data in the fourth register and stores the product in the eighth register; the multiplications of the first multiplier and the second multiplier are performed in parallel.
Further, the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is executed several times within one inner loop.
Further, the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is performed four times within one inner loop.
Further, the specific process of the double-precision floating-point complex multiplication in the inner loop is as follows:
first, the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; the second element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the second-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until all data of the first four columns of the second double-precision floating-point complex matrix have been calculated;
then the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the next four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until the first row of the first double-precision floating-point complex matrix has been multiplied with all columns of the second double-precision floating-point complex matrix and added to the first row of the third double-precision floating-point complex matrix, giving the final result of the first row of data;
the specific process of the outer loop is as follows: the inner loop is performed in turn for each row of the first double-precision floating-point complex matrix, from the first row to the last row.
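The inner/outer loop order described above can be sketched in scalar C as follows; the row-major layout, the name zgemm_blocked and the assumption that k is a multiple of 4 are simplifications for illustration, while the actual implementation keeps the four accumulators in registers and is written in HXDSP assembly.

#include <complex.h>
#include <stddef.h>

/* Loop order used by the optimization (scalar sketch):
 * one element A[i][p] is multiplied with four elements of row p of B
 * (columns j..j+3) and accumulated into four elements of row i of C.
 * Assumes k is a multiple of 4 for brevity. */
void zgemm_blocked(size_t m, size_t n, size_t k,
                   const double complex *A,
                   const double complex *B,
                   double complex *C)
{
    for (size_t i = 0; i < m; ++i) {              /* outer loop: rows of A      */
        for (size_t j = 0; j < k; j += 4) {       /* blocks of four columns     */
            double complex c0 = C[i * k + j + 0];
            double complex c1 = C[i * k + j + 1];
            double complex c2 = C[i * k + j + 2];
            double complex c3 = C[i * k + j + 3];
            for (size_t p = 0; p < n; ++p) {      /* inner loop                 */
                double complex a = A[i * n + p];  /* one element of A, reused 4x */
                c0 += a * B[p * k + j + 0];       /* four consecutive elements   */
                c1 += a * B[p * k + j + 1];       /* of row p of B: contiguous   */
                c2 += a * B[p * k + j + 2];       /* access, no column stride    */
                c3 += a * B[p * k + j + 3];
            }
            C[i * k + j + 0] = c0;
            C[i * k + j + 1] = c1;
            C[i * k + j + 2] = c2;
            C[i * k + j + 3] = c3;
        }
    }
}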
Further, the optimization based on software pipelining specifically comprises:
the inner loop after software pipelining optimization comprises a loop start-up period and a normal loop period; the instructions executed by the inner loop are, in order, read instructions, inter-macro transfer instructions, complex multiplication instructions, complex addition/subtraction instructions and accumulation instructions;
during the start-up period, the four register groups carry out the read instruction, inter-macro transfer instruction, complex multiplication instruction, complex addition/subtraction instruction and accumulation instruction in sequence;
during the normal loop period, the four register groups each carry out read, inter-macro transfer, complex multiplication, complex addition/subtraction and accumulation instructions, and at any given moment the instructions of any two register groups differ.
Compared with the prior art, the invention has the beneficial effects that:
(1) Different from the traditional matrix multiplication method of multiplying one row by one column, the invention stores the first element of the first double-precision floating-point complex matrix in eight registers and multiplies and accumulates it with the four elements of the corresponding row of the first four columns of the second double-precision floating-point complex matrix, thereby avoiding the pipeline stalls caused by memory bank conflicts in traditional matrix multiplication.
(2) During the inner loop, in order to make full use of resources, instruction parallelism is applied: independent iterations in the loop body are unrolled, which reduces the number of loop tests and jumps, reduces pipeline stalls and improves execution efficiency; through software pipelining, the different instructions are used in a staggered manner across loop iterations, which improves instruction utilization.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1(a) is a schematic diagram of memory data arrangement of an HXDSP1042 chip according to an embodiment of the present invention;
FIG. 1(b) is a schematic diagram of data storage according to an embodiment of the present invention;
FIG. 2(a) is a matrix A in a conventional complex matrix multiplication method;
FIG. 2(B) shows a matrix B in a conventional complex matrix multiplication method;
FIG. 3(a) shows a matrix A in a complex matrix multiplication method according to an embodiment of the present invention;
FIG. 3(B) shows a matrix B in a complex matrix multiplication method according to an embodiment of the present invention;
FIG. 3(C) shows a matrix C in a complex matrix multiplication method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a circular flow of a double-precision floating-point complex matrix multiplication and addition operation according to an embodiment of the present invention;
FIG. 5 is a parallel schematic of the innermost loop of FIG. 4;
FIG. 6(a) is a diagram illustrating data storage of an A-plane register according to an embodiment of the present invention;
FIG. 6(B) is a diagram illustrating data storage of a B-plane register according to an embodiment of the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
The development of bottom-layer functions mainly uses the resources inside the execution macros. The execution units of the processor are contained in four execution macros whose external interfaces and internal structures are completely identical; the four execution macros obtain operation commands from the decoder, fetch operands from the data memory and perform the various specific operations.
The HXDSP chip of the invention comprises 4 execution macros (x, y, z and t). Each execution macro contains a general register file of 64 32-bit registers, numbered 0 to 63, and the operations of the four macros can be performed in parallel. Each execution macro also includes 8 multipliers, 8 arithmetic logic units (ALUs, which also perform the addition operations), 4 shifters and one super calculator. In addition, the chip has three address register files (U, V, W), each containing 16 registers (u0-u15, v0-v15, w0-w15).
The address, multiplication and addition instructions used by the invention are described below.
The three address register files U, V and W are used in the same way; the following description takes the U address registers as an example:
U-unit doubleword addressing instruction: during a doubleword operation, each address and its adjacent address form an address pair, and the other data addresses read from the memory are determined by the address offset Uk. Because doubleword reads are used here, the address offset is multiplied by 2 to obtain the actual increment, which gives the 8 addresses [Un] and [Un+1], [Un+2Uk] and [Un+2Uk+1], [Un+2*2Uk] and [Un+2*2Uk+1], [Un+3*2Uk] and [Un+3*2Uk+1]; the 8 data read from these addresses are sent to the operation macro units, and each of the macros {x, y, z, t} receives 2 data.
The internal memory of the HXDSP1042 chip of the invention comprises a program memory and a data memory. Each block of the data memory is subdivided into 8 banks; the arrangement of the memory data is shown in Fig. 1(a). Each address of the memory in Fig. 1(a) can store a 32-bit binary number, and the addresses and data shown in the figure are only used to illustrate the division into banks.
During doubleword addressing, the address generator produces several effective addresses, and the eight addresses [Un] and [Un+1], [Un+2Uk] and [Un+2Uk+1], [Un+2*2Uk] and [Un+2*2Uk+1], [Un+3*2Uk] and [Un+3*2Uk+1] are accessed simultaneously. Because the size of Uk is not fixed, if two or more of these addresses fall in the same memory bank, a bank conflict occurs; once a bank conflict occurs, the whole pipeline must stall until all data have been correctly read or written, after which the normal pipeline resumes.
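As an illustration of the situation just described, the following C sketch enumerates the eight effective addresses of one doubleword access and checks whether two of them fall into the same bank; the mapping bank = address mod 8 is only an assumption made for this example and may differ from the actual HXDSP bank mapping.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 8u

/* Illustrative bank mapping: consecutive word addresses rotate over the
 * 8 banks.  The exact mapping of the HXDSP data memory may differ; this
 * assumption is only used to show when a conflict can occur. */
static unsigned bank_of(uint32_t addr) { return addr % NUM_BANKS; }

/* The eight effective addresses produced by one U-unit doubleword access
 * with base Un and offset Uk (offset scaled by 2 for doubleword data):
 * [Un], [Un+1], [Un+2Uk], [Un+2Uk+1], [Un+2*2Uk], [Un+2*2Uk+1],
 * [Un+3*2Uk], [Un+3*2Uk+1]. */
static void effective_addrs(uint32_t Un, uint32_t Uk, uint32_t addr[8])
{
    for (unsigned g = 0; g < 4; ++g) {
        addr[2 * g]     = Un + g * 2u * Uk;
        addr[2 * g + 1] = Un + g * 2u * Uk + 1u;
    }
}

/* True if two or more of the eight addresses hit the same bank. */
static bool has_bank_conflict(const uint32_t addr[8])
{
    unsigned used = 0;                 /* bitmask of banks already touched */
    for (unsigned i = 0; i < 8; ++i) {
        unsigned b = bank_of(addr[i]);
        if (used & (1u << b))
            return true;
        used |= 1u << b;
    }
    return false;
}

int main(void)
{
    uint32_t a[8];
    effective_addrs(0, 4, a);          /* Uk = 4: strides of 8 words       */
    printf("Uk=4: conflict=%d\n", has_bank_conflict(a));  /* same banks -> 1 */
    effective_addrs(0, 1, a);          /* Uk = 1: contiguous doublewords   */
    printf("Uk=1: conflict=%d\n", has_bank_conflict(a));  /* distinct   -> 0 */
    return 0;
}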
In conventional matrix multiplication, let the specification of the first double-precision floating-point complex matrix A be m×n and that of the second double-precision floating-point complex matrix B be n×k, and take m = 6, n = 6, k = 6 as an example.
The conventional matrix multiplication process is as follows: matrix A is fetched through the doubleword addressing instruction {x,y,z,t}Rs+1:s=[Un+=2,8], which takes out four rows and places them into registers xRs+1:s, yRs+1:s, zRs+1:s, tRs+1:s; matrix B is fetched through the doubleword addressing instruction {x,y,z,t}Rm+1:m=[Un+=4*2*4,4*2], which takes four numbers by columns and places them into xRm+1:m, yRm+1:m, zRm+1:m, tRm+1:m; Rs+1:s and Rm+1:m are then multiplied as double-precision values. As shown in Fig. 2(a) and Fig. 2(b), conventional matrix multiplication takes the first four elements of the first row of A and multiplies them with the first column of matrix B. In the operation of multiplying the two matrices and then adding another matrix, i.e. A×B+C, the elements of matrix A are a_k×i+b_k (where k is the index), the elements of matrix B are c_k×i+d_k, and the elements of matrix C are e_k×i+f_k.
When the conventional matrix multiplication method is used, the operation is easy to implement, but bank conflicts easily arise when matrix B is accessed. Matrix B is arranged in memory as shown in Fig. 1(b); the storage rule places the imaginary part first and the real part second. When four numbers of one column are fetched at the same time, a bank conflict and the resulting pipeline stall occur if any of them lie in the same bank; for example, in Fig. 1(b), when four numbers of the same column are fetched by doubleword addressing, a bank conflict results if c1 d1 and c5 d5 are in the same bank.
To solve the bank conflict problem in the matrix multiplication process, referring to Figs. 3(a)-3(c), the present invention provides a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip, which specifically comprises:
matrix A is fetched through the doubleword addressing instruction {x,y,z,t}Rs+1:s=[Un+=0,2], which takes the data out and places it into registers xRs+1:s, yRs+1:s, zRs+1:s, tRs+1:s; matrix B is fetched through the doubleword addressing instruction {x,y,z,t}Rm+1:m=[Un+=8,2], which takes four numbers by columns and places them into xRm+1:m, yRm+1:m, zRm+1:m, tRm+1:m; Rs+1:s and Rm+1:m are then multiplied as double-precision values. Fig. 3(a) shows the storage of matrix A, Fig. 3(b) the storage of matrix B and Fig. 3(c) the storage of matrix C. When matrices A and B are multiplied, the shaded portion in the figures is calculated first: the first element of matrix A is fetched and placed in the registers xRs+1:s, yRs+1:s, zRs+1:s, tRs+1:s; it is then multiplied with the first four elements of the first row of matrix B held in xRm+1:m, yRm+1:m, zRm+1:m, tRm+1:m, the products are added to the four elements of matrix C stored in the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k, and the final results are stored back in the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k.
The next element of matrix A is then taken and multiplied with the second-row elements of the first four columns of matrix B, and the results continue to be added to the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k; this is repeated until the first four numbers are obtained and can be output.
Next, the remaining elements of the current row of matrix A are multiplied in turn with the remaining elements of the first four columns of matrix B, and the products are added to the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k, at which point the first four numbers are obtained and can be output. Then, in the same way, the first-row data of matrix A are multiplied and accumulated with the data of columns 5-8 of matrix B, and so on, until the calculation of the first row of matrix A with all columns of matrix B is finished; the operation of the first row of matrix A with all columns of matrix B is then complete.
The above process is also performed on the remaining rows of matrix A until all rows of matrix A have been calculated; the specific flow is shown in Fig. 4. In the innermost loop, in order to make full use of resources, an instruction-parallel method is adopted; the organization of the innermost loop is shown in Fig. 5.
Specifically, writing an element of matrix A as a×i+b and an element of matrix B as c×i+d, the complex multiplication of the invention is A×B = (a×i+b)×(c×i+d) = (a×d+b×c)×i + (b×d-a×c).
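A short C sketch of this decomposition is given below; the struct name dcplx and the (imaginary, real) field order are illustrative and simply mirror the a×i+b convention above, and the four partial products a×d, b×c, b×d and a×c are the ones later assigned to the hardware multipliers.

/* One double-precision complex multiply-accumulate step, written with the
 * same four partial products (a*d, b*c, b*d, a*c) that the hardware
 * multipliers compute.  Elements are stored as (imaginary, real) pairs,
 * matching the a*i + b convention of the description. */
typedef struct { double im, re; } dcplx;   /* illustrative type */

static void cmul_acc(dcplx a_el, dcplx b_el, dcplx *c_el)
{
    double a = a_el.im, b = a_el.re;   /* A element: a*i + b */
    double c = b_el.im, d = b_el.re;   /* B element: c*i + d */

    double ad = a * d, bc = b * c;     /* imaginary-part products */
    double bd = b * d, ac = a * c;     /* real-part products      */

    c_el->im += ad + bc;               /* e += a*d + b*c */
    c_el->re += bd - ac;               /* f += b*d - a*c */
}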
Each execution macro of the chip internally contains 2 sets of registers, each set consisting of 64 registers (32-bit binary data); the two sets are called the A-plane and B-plane registers respectively, as shown in Fig. 6(a) and Fig. 6(b).
Let the element of matrix A be a_k×i+b_k (where k is the index), the element of matrix B be c_k×i+d_k, and the element of matrix C be e_k×i+f_k.
Each register exists in each of the four macros x, y, z and t. Taking the x macro of the A-plane registers as an example, the data storage during the operation is shown in Fig. 6(a). In the calculation, 20 registers are used as one group: A-plane registers 0-19 form one group, 20-39 a second group, 40-59 a third group, and B-plane registers 40-59 a fourth group (shown in Fig. 6(b)). A-plane registers 60-63 initially store the values taken from matrix C; the multiplication results are accumulated into them, and the final operation results are stored there.
In Fig. 5, one group of numbers enters per cycle, and one register group is assigned to each group of numbers; the loop start-up period is as shown in Table 1:
TABLE 1 operation of four sets of numbers in adjacent cycles in the start of the cycle
As can be seen from the shaded portion of Table 1, the different operations on the first group of numbers are performed one after another, which avoids hardware conflicts.
After entering the normal loop period, the four groups of data perform their respective operations simultaneously in an interleaved manner, as shown in Table 2:
Table 2. Operations of the four groups of numbers in adjacent cycles after the loop body enters the normal loop period
As can be seen from Table 2, the four groups of data use the different operation instructions in a staggered manner, making full use of the hardware resources.
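As a conceptual illustration of this interleaving only (not the beat-accurate schedule of Tables 1 and 2), the following small C program prints a schedule in which each of the four register groups starts one stage later than the previous one and then cycles through the five instruction types.

#include <stdio.h>

/* Conceptual sketch of the software pipeline of the inner loop: four
 * register groups each cycle through the five instruction types, with
 * each group offset by one stage, so that in the steady state the
 * groups execute different instructions in the same time slot.  This
 * reproduces only the interleaving idea, not the exact beat timing. */
int main(void)
{
    static const char *stage[5] = {
        "read", "inter-macro move", "complex multiply",
        "complex add/sub", "accumulate"
    };

    for (int t = 0; t < 9; ++t) {                 /* time slots */
        printf("slot %d:", t);
        for (int g = 0; g < 4; ++g) {             /* register groups 1..4 */
            int s = t - g;                        /* group g starts g slots late */
            if (s >= 0)
                printf("  G%d:%-16s", g + 1, stage[s % 5]);
        }
        printf("\n");
    }
    return 0;
}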
The operation instructions used by the invention for the double-precision floating-point complex matrix are introduced below:
(1) the double multiply instruction is:
QMACCH=DFRm+1:m*DFRn+1:n
DFRs+1:s=QMACCH
or
QMACCL=DFRm+1:m*DFRn+1:n
DFRs+1:s=QMACCL
The above two instructions are used consecutively and perform a 64-bit floating-point multiplication. Each such instruction occupies four multipliers. The 8 multipliers are numbered 0-7 and are grouped as the lower four or the upper four, i.e. multiplier group 0-3 or multiplier group 4-7.
Here DFRm+1:m represents a 64-bit floating-point number, where Rm+1 stores the upper 32 bits and Rm the lower 32 bits.
As described above, each macro contains 8 multipliers, and each group performs the four multiplications a×d, b×c, b×d and a×c. QMACCH=DFRm+1:m*DFRn+1:n is the complex multiplication instruction and DFRs+1:s=QMACCH is the assignment instruction; each occupies four multipliers, and a multiplication and its assignment cannot execute in parallel, so a×c and b×d can be computed in the same beat. The multiplications therefore take two beats, and each group of double-precision multiplications requires four beats to complete.
(2) The Double addition instruction is:
DFHACCk=DFHRm+DFHRn
DFLACCk=DFLRm+DFLRn
DFHRs=DFHACCk
DFLRs=DFLACCk
The above four instructions are used consecutively and perform a 64-bit floating-point addition or subtraction.
The HRm and LRm registers store the upper 32 bits and the lower 32 bits of a 64-bit floating-point number respectively and together represent one 64-bit floating-point number; the index numbers of the two general registers must be consecutive and are written DFRm+1:m, where Rm+1 stores the upper 32 bits and Rm the lower 32 bits. The ALU selection k of the above 4 micro-operations must be the same, so that one 64-bit floating-point addition or subtraction is completed while occupying one ALU.
Within one loop body, a double-precision addition itself requires four beats, and one loop body needs to compute the four groups of additions a×d+b×c, b×d-a×c, e+Σ(a×d+b×c) and f+Σ(b×d-a×c); each macro uses 4 ALUs, and although each macro has 8 ALUs, four beats are still required.
A group of data operations performed simultaneously, shown in Table 3, is selected for illustration:
Table 3. A selected group of simultaneous data operations
For this group of data, the matrix operation of first multiplying and then adding requires four beats; the instructions of each beat are as follows:
first beat
r23:22=[v1+=2,0]||xyr25:24=[w1+=4,1]||QMACCL=DFR53:52*DFR55:54||QM
ACCH=DFR43:42*DFR45:44||DFHACC0=DFHR27-DFHR29||DFHACC1=DFH
R26-DFHR28
||DFHACC4=DFHR0+DFHR60||DFHACC5=DFHR1+DFHR61
Second beat
r33:32=[v1+=2,0]||ztr25:24=[w1+=4,1]||DFR47:46=QMACCL||DFR49:48=QM
ACCH||DFHR21=DFHACC0||DFHR20=DFHACC1||DFHR60=DFHACC4||DFH
R61=DFHACC5
Third beat
xyr35:34=[w1+=4,1]||xr15:14=yr5:4||yr5:4=xr15:14||QMACCL=DFR53:52*DF
R45:44||QMACCH=DFR43:42*DFR55:54||DFHACC2=DFHR37+DFHR39
||DFHACC3=DFHR36+DFHR38||DFHACC6=DFHR10+DFHR62||DFHACC7=DFHR11+DFHR63
The fourth beat
ztr35:34=[w1+=4,1]||zr15:14=tr5:4||tr5:4=zr15:14||DFR57:56=QMACCL||DFR
59:58=QMACCH||DFHR31=DFHACC2||DFHR30=DFHACC3||DFHR62=DFHACC6||DFHR63=DFHACC7
The meaning of each beat of instructions is as follows:
|| denotes instructions executed in parallel. r23:22=[v1+=2,0] and r33:32=[v1+=2,0] read data from matrix A: each read fetches at most eight 32-bit words, i.e. four double-precision values, and the two reads together fetch four double-precision complex data in total: ai, bi, ai+1, bi+1, ai+2, bi+2, ai+3, bi+3.
xyr25:24=[w1+=4,1], ztr25:24=[w1+=4,1], xyr35:34=[w1+=4,1] and ztr35:34=[w1+=4,1] read data from matrix B, four 32-bit words at a time, four times in total, yielding the four double-precision complex numbers ci, di, ci+1, di+1, ci+2, di+2, ci+3, di+3.
xr15:14=yr5:4||yr5:4=xr15:14 and zr15:14=tr5:4||tr5:4=zr15:14 are the inter-macro transfer instructions; the transfer is completed in two beats.
QMACCL=DFR53:52*DFR55:54||QMACCH=DFR43:42*DFR45:44,
DFR47:46=QMACCL||DFR49:48=QMACCH,
QMACCL=DFR53:52*DFR45:44||QMACCH=DFR43:42*DFR55:54 and
DFR57:56=QMACCL||DFR59:58=QMACCH represent the double-precision multiplications; the four macros run in parallel, and a total of 16 double-precision multiplications a×d, b×c, b×d, a×c are completed in four beats.
DFHACC0=DFHR27-DFHR29||DFHACC1=DFHR26-DFHR28,
DFHR21=DFHACC0||DFHR20=DFHACC1,
DFHACC2=DFHR37+DFHR39||DFHACC3=DFHR36+DFHR38 and
DFHR31=DFHACC2||DFHR30=DFHACC3 represent the double-precision addition and subtraction operations a×d+b×c and b×d-a×c; the four macros perform 8 additions in total.
DFHACC4=DFHR0+DFHR60||DFHACC5=DFHR1+DFHR61,
DFHR60=DFHACC4||DFHR61=DFHACC5,
DFHACC6=DFHR10+DFHR62||DFHACC7=DFHR11+DFHR63 and
DFHR62=DFHACC6||DFHR63=DFHACC7 are the accumulation operations, completed in four beats, computing e+Σ(a×d+b×c) and f+Σ(b×d-a×c); the four macros perform 8 additions in total.
Through this careful arrangement, the invention combines the multiplication and addition operations of the complex floating-point matrices into four beats, thereby accelerating the operation.
The design of the input side of the invention is as follows:
The HXDSP104X core has four internal data buses in total, two for reading and two for writing, and at most three buses are allowed to work at the same time, i.e. at most two groups of numbers can be fetched in one clock cycle. As described above, each group of operations needs the four numbers a, b, c and d (the four macros take in 16 numbers), and two registers store one number because the data type is double precision. As described above, the four macros take the same numbers a and b of the first complex floating-point matrix A; the specific instructions are as follows:
r23:22=[v1+=2,0]
r33:32=[v1+=2,0]
The second complex floating-point matrix B is fetched with four numbers per period in order to avoid bank conflicts.
The specific fetch instructions are as follows:
xyr5:4=[w2+=4,1]
ztr5:4=[w2+=4,1]
xyr15:14=[w2+=4,1]
ztr15:14=[w2+=4,1]
Here xyr5:4=[w2+=4,1] means that xr5:4 stores c1 and yr5:4 stores d1; ztr5:4=[w2+=4,1] means that zr5:4 stores c2 and tr5:4 stores d2; xyr15:14=[w2+=4,1] means that xr15:14 stores c3 and yr15:14 stores d3; ztr15:14=[w2+=4,1] means that zr15:14 stores c4 and tr15:14 stores d4.
The specific inter-macro transfer instructions are as follows:
xr15:14=yr5:4||yr5:4=xr15:14
zr15:14=tr5:4||tr5:4=zr15:14
xr15:14=yr5:4||yr5:4=xr15:14 means that xr15:14 now stores d1 and yr5:4 stores c3;
zr15:14=tr5:4||tr5:4=zr15:14 means that zr15:14 now stores d2 and tr5:4 stores c4;
through the above instructions it is ensured that c1, c2, c3 and c4 are stored in r5:4 (one in each macro) and d1, d2, d3 and d4 are stored in r15:14.
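The effect of this arrangement can be traced with the following small C sketch, which simulates the register contents of the four macros before and after the two exchange instructions; the register contents follow the description above, and the simulation itself is purely illustrative.

#include <stdio.h>

enum { X, Y, Z, T, NMACRO };

/* Per-macro contents of the register pairs r5:4 and r15:14. */
typedef struct { const char *r5_4, *r15_14; } macro_regs;

int main(void)
{
    /* Contents after the four reads from matrix B (see the read
     * instructions above): the c and d parts of the four B elements
     * are spread over neighbouring macros. */
    macro_regs m[NMACRO] = {
        [X] = { "c1", "c3" },   /* xr5:4 = c1, xr15:14 = c3 */
        [Y] = { "d1", "d3" },   /* yr5:4 = d1, yr15:14 = d3 */
        [Z] = { "c2", "c4" },   /* zr5:4 = c2, zr15:14 = c4 */
        [T] = { "d2", "d4" },   /* tr5:4 = d2, tr15:14 = d4 */
    };

    /* Inter-macro transfers (simultaneous exchanges):
     * xr15:14 = yr5:4 || yr5:4 = xr15:14
     * zr15:14 = tr5:4 || tr5:4 = zr15:14 */
    const char *tmp;
    tmp = m[X].r15_14; m[X].r15_14 = m[Y].r5_4; m[Y].r5_4 = tmp;
    tmp = m[Z].r15_14; m[Z].r15_14 = m[T].r5_4; m[T].r5_4 = tmp;

    /* Each macro now holds one complete complex element of B:
     * a c part in r5:4 and the matching d part in r15:14. */
    const char *name[NMACRO] = { "x", "y", "z", "t" };
    for (int i = 0; i < NMACRO; ++i)
        printf("%s: r5:4=%s  r15:14=%s\n", name[i], m[i].r5_4, m[i].r15_14);
    return 0;
}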
The loop body counting method uses the zero-overhead loop register LC0:
An instruction of the form If LC0 B label represents a zero-overhead loop: at the end of the loop body, LC0 is checked, and if it equals 0 no jump is made and the following instructions are executed sequentially (in this case the instruction is placed at the end of the loop body). The zero-overhead loop is pipelined together with the conditional jump. An instruction of the form If nLC0 B label represents a jump based on LC0: if LC0 is checked to be equal to 0, execution jumps to the label destination address; otherwise execution continues sequentially.
The model of the HXDSP chip used is HXDSP1042.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (6)

1. A double-precision floating-point complex matrix operation optimization method based on an HXDSP chip, characterized by comprising the following steps:
the double-precision floating-point complex matrix operation is that the first double-precision floating-point complex matrix is multiplied by the second double-precision floating-point complex matrix and then added with the third double-precision floating-point complex matrix; the specification of the first double-precision floating point complex matrix is m x n, the specification of the second double-precision floating point complex matrix is n x k, and the specification of the third double-precision floating point complex matrix is m x k;
the operation of multiplying the first double-precision floating-point complex matrix by the second double-precision floating-point complex matrix and adding the third double-precision floating-point complex matrix comprises an inner loop, formed by repeatedly multiplying one element of the first double-precision floating-point complex matrix with the four elements in the corresponding row of the first four columns of the second double-precision floating-point complex matrix and adding the products to the corresponding elements of the third double-precision floating-point complex matrix, and an outer loop, formed by repeating this process for one row of the first double-precision floating-point complex matrix against the corresponding four elements of every group of four columns of the second double-precision floating-point complex matrix;
the process in which one element of the first double-precision floating-point complex matrix is multiplied and accumulated with four elements of the corresponding row of the second double-precision floating-point complex matrix and then added to the corresponding elements of the third double-precision floating-point complex matrix involves optimization based on HXDSP chip instruction parallelism, optimization based on loop unrolling and optimization based on software pipelining;
the optimization based on HXDSP chip instruction parallelism is an optimization in which the same instruction simultaneously controls several arithmetic units to execute the same operation; the optimization based on loop unrolling is an optimization in which the same loop body is unrolled several times within one loop; the optimization based on software pipelining is an optimization in which the multiple identical loop iterations are executed in parallel and interleaved.
2. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the HXDSP chip comprises 4 execution macros, each of which contains a 64-word general register file; the registers of the general register file are divided into A-side and B-side registers, the A-side registers being numbered 0 to 63, with registers 0-19 forming the first register group, 20-39 the second register group and 40-59 the third register group, while B-side registers 40-59 form the fourth register group; the four execution macros can operate in parallel; each execution macro further comprises 8 multipliers, 8 arithmetic logic units, 4 shifters and one super calculator; the HXDSP chip is provided with three address registers, each of which comprises 16 registers;
the optimization based on HXDSP chip instruction parallelism specifically comprises reading 8 data of the first double-precision floating-point complex matrix and 8 data of the second double-precision floating-point complex matrix; the 8 data of the first double-precision floating-point complex matrix are four identical copies of the 2 data corresponding to one element of the first double-precision floating-point complex matrix, and the 8 data of the second double-precision floating-point complex matrix are four elements of the same row of the second double-precision floating-point complex matrix; the reading of the 8 data of the first double-precision floating-point complex matrix and the reading of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
before the loop begins, the 8 data of the third double-precision floating-point complex matrix are read and stored;
the 8 data of the first double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the first and second registers of the four register groups of the HXDSP chip; the 8 data of the second double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the third and fourth registers of the four register groups of the HXDSP chip; the 8 data of the third double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the fifth and sixth registers of the four register groups of the HXDSP chip; the storing of the 8 data of the first double-precision floating-point complex matrix and the storing of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
in one instruction, the first multiplier of each execution macro of the HXDSP chip multiplies the data in the first register by the data in the third register and stores the product in the seventh register, while the second multiplier multiplies the data in the second register by the data in the fourth register and stores the product in the eighth register; the multiplications of the first multiplier and the second multiplier are performed in parallel.
3. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is executed several times within one inner loop.
4. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 3, characterized in that the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is performed four times within one inner loop.
5. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the specific process of the double-precision floating-point complex multiplication in the inner loop is as follows:
first, the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; the second element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the second-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until all data of the first four columns of the second double-precision floating-point complex matrix have been calculated;
then the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the next four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until the first row of the first double-precision floating-point complex matrix has been multiplied with all columns of the second double-precision floating-point complex matrix and added to the first row of the third double-precision floating-point complex matrix, giving the final result of the first row of data;
the specific process of the outer loop is as follows: the inner loop is performed in turn for each row of the first double-precision floating-point complex matrix, from the first row to the last row.
6. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the optimization based on software pipelining specifically comprises:
the inner loop after software pipelining optimization comprises a loop start-up period and a normal loop period; the instructions executed by the inner loop are, in order, read instructions, inter-macro transfer instructions, complex multiplication instructions, complex addition/subtraction instructions and accumulation instructions;
during the start-up period, the four register groups carry out the read instruction, inter-macro transfer instruction, complex multiplication instruction, complex addition/subtraction instruction and accumulation instruction in sequence;
during the normal loop period, the four register groups each carry out read, inter-macro transfer, complex multiplication, complex addition/subtraction and accumulation instructions, and at any given moment the instructions of any two register groups differ.
CN202010047045.XA 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip Active CN111291320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010047045.XA CN111291320B (en) 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010047045.XA CN111291320B (en) 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip

Publications (2)

Publication Number Publication Date
CN111291320A true CN111291320A (en) 2020-06-16
CN111291320B CN111291320B (en) 2023-12-15

Family

ID=71021227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010047045.XA Active CN111291320B (en) 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip

Country Status (1)

Country Link
CN (1) CN111291320B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115408061A (en) * 2022-11-02 2022-11-29 北京红山微电子技术有限公司 Hardware acceleration method, device, chip and storage medium for complex matrix operation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994010638A1 (en) * 1992-11-05 1994-05-11 The Commonwealth Of Australia Scalable dimensionless array
US20110040822A1 (en) * 2009-08-17 2011-02-17 International Business Machines Corporation Complex Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
US20120011348A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Matrix Multiplication Operations Using Pair-Wise Load and Splat Operations
CN102722472A (en) * 2012-05-28 2012-10-10 中国科学技术大学 Complex matrix optimizing method
CN107357552A (en) * 2017-06-06 2017-11-17 西安电子科技大学 The optimization method of floating-point complex vector summation is realized based on BWDSP chips
CN110162742A (en) * 2019-03-31 2019-08-23 西南电子技术研究所(中国电子科技集团公司第十研究所) The floating-point operation circuit implementing method that real number matrix is inverted


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘沛华; 鲁华祥; 龚国良; 刘文鹏: "基于FPGA的全流水双精度浮点矩阵乘法器设计" [Design of a fully pipelined double-precision floating-point matrix multiplier based on FPGA], 智能系统学报, no. 04
朱耀国; 党皓: "基于FPGA的矩阵尺寸自适应的双精度浮点数矩阵乘法器" [An FPGA-based double-precision floating-point matrix multiplier adaptive to matrix size], 电脑知识与技术, no. 14


Also Published As

Publication number Publication date
CN111291320B (en) 2023-12-15

Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
US6088783A (en) DPS having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
US5822606A (en) DSP having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
US6029240A (en) Method for processing instructions for parallel execution including storing instruction sequences along with compounding information in cache
CN111381880B (en) Processor, medium, and operation method of processor
JP2016508640A (en) Solutions for branch branches in the SIMD core using hardware pointers
JPS6028015B2 (en) information processing equipment
US5083267A (en) Horizontal computer having register multiconnect for execution of an instruction loop with recurrance
US5276819A (en) Horizontal computer having register multiconnect for operand address generation during execution of iterations of a loop of program code
US5036454A (en) Horizontal computer having register multiconnect for execution of a loop with overlapped code
CN107357552B (en) Optimization method for realizing floating-point complex vector summation based on BWDSP chip
Schneck Supercomputer architecture
CN111291320B (en) Double-precision floating point complex matrix operation optimization method based on HXDSP chip
WO2024103896A1 (en) Method for implementing matrix transposition multiplication, and coprocessor, server and storage medium
JP2002544587A (en) Digital signal processor calculation core
Dorozhevets et al. The El'brus-3 and MARS-M: Recent advances in Russian high-performance computing
JP3182591B2 (en) Microprocessor
US8332447B2 (en) Systems and methods for performing fixed-point fractional multiplication operations in a SIMD processor
CN114528248A (en) Array reconstruction method, device, equipment and storage medium
US5506974A (en) Method and means for concatenating multiple instructions
JPS60178580A (en) Instruction control system
WO2022067510A1 (en) Processor, processing method, and related device
CN105843589B (en) A kind of storage arrangement applied to VLIW type processors
CN111273889B (en) Floating point complex FIR optimization method based on HXSDSP 1042 processor
JP2006515446A (en) Data processing system with Cartesian controller that cross-references related applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant