CN111291320A - Double-precision floating-point complex matrix operation optimization method based on HXDSP chip

Double-precision floating-point complex matrix operation optimization method based on HXDSP chip

Info

Publication number
CN111291320A
Authority
CN
China
Prior art keywords
double-precision floating-point
complex matrix
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010047045.XA
Other languages
Chinese (zh)
Other versions
CN111291320B (en)
Inventor
苏涛
张丽
朱晨曦
张晓杰
桂宪满
陈琛
董浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010047045.XA
Publication of CN111291320A
Application granted
Publication of CN111291320B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip, belonging to the field of bottom-layer function optimization for digital signal processors. The double-precision floating-point complex matrix operation comprises an inner loop, formed by repeatedly multiplying one element of a first double-precision floating-point complex matrix with four elements in the corresponding row of the first four columns of a second double-precision floating-point complex matrix and adding the products to the corresponding elements of a third double-precision floating-point complex matrix, and an outer loop, formed by repeating this process for one row of the first matrix against every group of four columns of the second matrix. By optimizing the matrix multiplication, the invention avoids the pipeline stalls caused by read conflicts in the memory, makes full use of the hardware resources, and improves operation efficiency.

Description

Double-precision floating-point complex matrix operation optimization method based on HXDSP chip
Technical Field
The invention belongs to the field of bottom-layer function optimization for digital signal processors, and particularly relates to a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip.
Background
With the rapid development of digital signal processing technology, DSP processors are widely used in image processing, communication, radar detection, speech processing, network control, instrumentation, and home appliances. Independent development of DSP processor chips has gradually become an important subject for the development of digital signal processing technology in China. Against this background, a domestic research institute has introduced the high-performance HXDSP1042 DSP processor, whose architecture, instruction system, design implementation and matching software and hardware development environment are completely independently developed, and whose performance even exceeds that of some international products. It has broad application prospects in national defense, public security, the Internet of Things, communication and other industries, and its success breaks the foreign monopoly on high-end digital signal processing chips in China.
The HXDSP1042 integrates two processor cores (eC104+) and adopts a 16-issue (up to 16 instructions per instruction cycle), single-instruction-stream multiple-data-stream architecture. Each eC104+ core contains four execution macros; each execution macro internally contains 2 sets of registers (each set holding 64 registers of 32-bit binary data), 8 arithmetic logic units (ALUs), 8 multipliers, 4 shifters, and 1 super calculator (SPU). The development language is C or assembly. To improve the efficiency of the function operations, assembly language is chosen to implement the library functions: although C is easy to read and highly portable, it is inconvenient for directly controlling the hardware and cannot fully exploit the characteristics of the DSP chip, whereas assembly ensures maximum utilization of the hardware resources. However, assembly code compiled directly from C does not take the hardware characteristics of the DSP chip into account and has certain defects; in particular, double-precision floating-point complex multiplication suffers from severe Bank conflicts.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip. The optimization method multiplies two double-precision floating-point complex matrices and adds the product to another double-precision floating-point complex matrix; it avoids the pipeline stalls caused by Bank conflicts during the multiplication and improves the operation efficiency of double-precision floating-point complex matrices. In addition, the hardware resources of the HXDSP chip are fully utilized, so that the processing efficiency of the bottom-layer function is greatly improved.
In order to achieve the above object, the present invention adopts the following technical solutions.
The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip comprises the following steps:
the double-precision floating-point complex matrix operation is that the first double-precision floating-point complex matrix is multiplied by the second double-precision floating-point complex matrix and then added with the third double-precision floating-point complex matrix; the specification of the first double-precision floating point complex matrix is m x n, the specification of the second double-precision floating point complex matrix is n x k, and the specification of the third double-precision floating point complex matrix is m x k;
the operation of multiplying the first double-precision floating-point complex matrix by the second double-precision floating-point complex matrix and adding the third double-precision floating-point complex matrix comprises an inner loop, formed by repeatedly multiplying one element of the first double-precision floating-point complex matrix with the four elements in the corresponding row of the first four columns of the second double-precision floating-point complex matrix and adding the products to the corresponding elements of the third double-precision floating-point complex matrix, and an outer loop, formed by repeating this process for one row of the first double-precision floating-point complex matrix against the corresponding four elements of every group of four columns of the second double-precision floating-point complex matrix;
the process in which one element of the first double-precision floating-point complex matrix is multiplied and accumulated with four elements of the corresponding row of the second double-precision floating-point complex matrix and then added to the corresponding elements of the third double-precision floating-point complex matrix involves optimization based on HXDSP chip instruction parallelism, optimization based on loop unrolling and optimization based on software pipelining;
the optimization based on HXDSP chip instruction parallelism is an optimization in which the same instruction simultaneously controls several arithmetic units to execute the same operation; the optimization based on loop unrolling is an optimization in which the same loop body is unrolled several times within one loop; the optimization based on software pipelining is an optimization in which the multiple identical loop iterations are executed in parallel and interleaved.
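For reference, the operation being optimized can be written as the following minimal scalar C sketch (C = A×B + C for double-precision complex matrices); the row-major layout and the function name zgemm_ref are illustrative assumptions, and the sketch only defines the mathematical result rather than the optimized HXDSP assembly implementation described below.

#include <complex.h>
#include <stddef.h>

/* Reference (unoptimized) double-precision complex matrix operation
 * C = A*B + C, with A: m x n, B: n x k, C: m x k, all row-major.
 * This scalar sketch only defines the expected result; the invention
 * implements the same computation with HXDSP assembly. */
void zgemm_ref(size_t m, size_t n, size_t k,
               const double complex *A,
               const double complex *B,
               double complex *C)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < k; ++j) {
            double complex acc = C[i * k + j];
            for (size_t p = 0; p < n; ++p) {
                acc += A[i * n + p] * B[p * k + j];
            }
            C[i * k + j] = acc;
        }
    }
}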
Furthermore, the HXDSP chip comprises 4 execution macros, each of which contains a 64-word general register file; the registers of the general register file are divided into A-side and B-side registers, the A-side registers being numbered 0 to 63, with registers 0-19 forming the first register group, 20-39 the second register group and 40-59 the third register group, while B-side registers 40-59 form the fourth register group; the four execution macros can operate in parallel; each execution macro also comprises 8 multipliers, 8 arithmetic logic units (ALUs), 4 shifters and one super calculator; the HXDSP chip is provided with three address registers, each of which comprises 16 registers;
the optimization based on HXDSP chip instruction parallelism specifically comprises reading 8 data of the first double-precision floating-point complex matrix and 8 data of the second double-precision floating-point complex matrix; the 8 data of the first double-precision floating-point complex matrix are four identical copies of the 2 data corresponding to one element of the first double-precision floating-point complex matrix, and the 8 data of the second double-precision floating-point complex matrix are four elements of the same row of the second double-precision floating-point complex matrix; the reading of the 8 data of the first double-precision floating-point complex matrix and the reading of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
before the loop begins, the 8 data of the third double-precision floating-point complex matrix are read and stored;
the 8 data of the first double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the first and second registers of the four register groups of the HXDSP chip; the 8 data of the second double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the third and fourth registers of the four register groups of the HXDSP chip; the 8 data of the third double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the fifth and sixth registers of the four register groups of the HXDSP chip; the storing of the 8 data of the first double-precision floating-point complex matrix and the storing of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
in one instruction, the first multiplier of each execution macro of the HXDSP chip multiplies the data in the first register by the data in the third register and stores the product in the seventh register, while the second multiplier multiplies the data in the second register by the data in the fourth register and stores the product in the eighth register; the multiplications of the first multiplier and the second multiplier are performed in parallel.
Further, the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is executed several times within one inner loop.
Further, the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is performed four times within one inner loop.
Further, the specific process of the double-precision floating-point complex multiplication in the inner loop is as follows:
first, the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; the second element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the second-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until all data of the first four columns of the second double-precision floating-point complex matrix have been calculated;
then the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the next four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until the first row of the first double-precision floating-point complex matrix has been multiplied with all columns of the second double-precision floating-point complex matrix and added to the first row of the third double-precision floating-point complex matrix, giving the final result of the first row of data;
the specific process of the outer loop is as follows: the inner loop is performed in turn for each row of the first double-precision floating-point complex matrix, from the first row to the last row.
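The inner/outer loop order described above can be sketched in scalar C as follows; the row-major layout, the name zgemm_blocked and the assumption that k is a multiple of 4 are simplifications for illustration, while the actual implementation keeps the four accumulators in registers and is written in HXDSP assembly.

#include <complex.h>
#include <stddef.h>

/* Loop order used by the optimization (scalar sketch):
 * one element A[i][p] is multiplied with four elements of row p of B
 * (columns j..j+3) and accumulated into four elements of row i of C.
 * Assumes k is a multiple of 4 for brevity. */
void zgemm_blocked(size_t m, size_t n, size_t k,
                   const double complex *A,
                   const double complex *B,
                   double complex *C)
{
    for (size_t i = 0; i < m; ++i) {              /* outer loop: rows of A      */
        for (size_t j = 0; j < k; j += 4) {       /* blocks of four columns     */
            double complex c0 = C[i * k + j + 0];
            double complex c1 = C[i * k + j + 1];
            double complex c2 = C[i * k + j + 2];
            double complex c3 = C[i * k + j + 3];
            for (size_t p = 0; p < n; ++p) {      /* inner loop                 */
                double complex a = A[i * n + p];  /* one element of A, reused 4x */
                c0 += a * B[p * k + j + 0];       /* four consecutive elements   */
                c1 += a * B[p * k + j + 1];       /* of row p of B: contiguous   */
                c2 += a * B[p * k + j + 2];       /* access, no column stride    */
                c3 += a * B[p * k + j + 3];
            }
            C[i * k + j + 0] = c0;
            C[i * k + j + 1] = c1;
            C[i * k + j + 2] = c2;
            C[i * k + j + 3] = c3;
        }
    }
}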
Further, the optimization based on software pipelining specifically comprises:
the inner loop after software pipelining optimization comprises a loop start-up period and a normal loop period; the instructions executed by the inner loop are, in order, read instructions, inter-macro transfer instructions, complex multiplication instructions, complex addition/subtraction instructions and accumulation instructions;
during the start-up period, the four register groups carry out the read instruction, inter-macro transfer instruction, complex multiplication instruction, complex addition/subtraction instruction and accumulation instruction in sequence;
during the normal loop period, the four register groups each carry out read, inter-macro transfer, complex multiplication, complex addition/subtraction and accumulation instructions, and at any given moment the instructions of any two register groups differ.
Compared with the prior art, the invention has the beneficial effects that:
(1) Different from the traditional matrix multiplication method of multiplying one row by one column, the invention stores the first element of the first double-precision floating-point complex matrix in eight registers and multiplies and accumulates it with the four elements of the corresponding row of the first four columns of the second double-precision floating-point complex matrix, thereby avoiding the pipeline stalls caused by memory bank conflicts in traditional matrix multiplication.
(2) During the inner loop, in order to make full use of resources, instruction parallelism is applied: independent iterations in the loop body are unrolled, which reduces the number of loop tests and jumps, reduces pipeline stalls and improves execution efficiency; through software pipelining, the different instructions are used in a staggered manner across loop iterations, which improves instruction utilization.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1(a) is a schematic diagram of memory data arrangement of an HXDSP1042 chip according to an embodiment of the present invention;
FIG. 1(b) is a schematic diagram of data storage according to an embodiment of the present invention;
FIG. 2(a) is a matrix A in a conventional complex matrix multiplication method;
FIG. 2(B) shows a matrix B in a conventional complex matrix multiplication method;
FIG. 3(a) shows a matrix A in a complex matrix multiplication method according to an embodiment of the present invention;
FIG. 3(B) shows a matrix B in a complex matrix multiplication method according to an embodiment of the present invention;
FIG. 3(C) shows a matrix C in a complex matrix multiplication method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a circular flow of a double-precision floating-point complex matrix multiplication and addition operation according to an embodiment of the present invention;
FIG. 5 is a parallel schematic of the innermost loop of FIG. 4;
FIG. 6(a) is a diagram illustrating data storage of an A-plane register according to an embodiment of the present invention;
FIG. 6(B) is a diagram illustrating data storage of a B-plane register according to an embodiment of the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
The development of bottom-layer functions mainly uses the resources inside the execution macros. The execution units of the processor are contained in four execution macros whose external interfaces and internal structures are completely identical; the four execution macros obtain operation commands from the decoder, fetch operands from the data memory and perform the various specific operations.
The HXDSP chip of the invention comprises 4 execution macros (x, y, z and t). Each execution macro contains a general register file of 64 32-bit registers, numbered 0 to 63, and the operations of the four macros can be performed in parallel. Each execution macro also includes 8 multipliers, 8 arithmetic logic units (ALUs, which also perform the addition operations), 4 shifters and one super calculator. In addition, the chip has three address register files (U, V, W), each containing 16 registers (u0-u15, v0-v15, w0-w15).
The address, multiplication and addition instructions used by the invention are described below.
The three address register files U, V and W are used in the same way; the following description takes the U address registers as an example:
U-unit doubleword addressing instruction: during a doubleword operation, each address and its adjacent address form an address pair, and the other data addresses read from the memory are determined by the address offset Uk. Because doubleword reads are used here, the address offset is multiplied by 2 to obtain the actual increment, which gives the 8 addresses [Un] and [Un+1], [Un+2Uk] and [Un+2Uk+1], [Un+2*2Uk] and [Un+2*2Uk+1], [Un+3*2Uk] and [Un+3*2Uk+1]; the 8 data read from these addresses are sent to the operation macro units, and each of the macros {x, y, z, t} receives 2 data.
The internal memory of the HXDSP1042 chip of the invention comprises a program memory and a data memory. Each block of the data memory is subdivided into 8 banks; the arrangement of the memory data is shown in Fig. 1(a). Each address of the memory in Fig. 1(a) can store a 32-bit binary number, and the addresses and data shown in the figure are only used to illustrate the division into banks.
During doubleword addressing, the address generator produces several effective addresses, and the eight addresses [Un] and [Un+1], [Un+2Uk] and [Un+2Uk+1], [Un+2*2Uk] and [Un+2*2Uk+1], [Un+3*2Uk] and [Un+3*2Uk+1] are accessed simultaneously. Because the size of Uk is not fixed, if two or more of these addresses fall in the same memory bank, a bank conflict occurs; once a bank conflict occurs, the whole pipeline must stall until all data have been correctly read or written, after which the normal pipeline resumes.
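As an illustration of the situation just described, the following C sketch enumerates the eight effective addresses of one doubleword access and checks whether two of them fall into the same bank; the mapping bank = address mod 8 is only an assumption made for this example and may differ from the actual HXDSP bank mapping.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 8u

/* Illustrative bank mapping: consecutive word addresses rotate over the
 * 8 banks.  The exact mapping of the HXDSP data memory may differ; this
 * assumption is only used to show when a conflict can occur. */
static unsigned bank_of(uint32_t addr) { return addr % NUM_BANKS; }

/* The eight effective addresses produced by one U-unit doubleword access
 * with base Un and offset Uk (offset scaled by 2 for doubleword data):
 * [Un], [Un+1], [Un+2Uk], [Un+2Uk+1], [Un+2*2Uk], [Un+2*2Uk+1],
 * [Un+3*2Uk], [Un+3*2Uk+1]. */
static void effective_addrs(uint32_t Un, uint32_t Uk, uint32_t addr[8])
{
    for (unsigned g = 0; g < 4; ++g) {
        addr[2 * g]     = Un + g * 2u * Uk;
        addr[2 * g + 1] = Un + g * 2u * Uk + 1u;
    }
}

/* True if two or more of the eight addresses hit the same bank. */
static bool has_bank_conflict(const uint32_t addr[8])
{
    unsigned used = 0;                 /* bitmask of banks already touched */
    for (unsigned i = 0; i < 8; ++i) {
        unsigned b = bank_of(addr[i]);
        if (used & (1u << b))
            return true;
        used |= 1u << b;
    }
    return false;
}

int main(void)
{
    uint32_t a[8];
    effective_addrs(0, 4, a);          /* Uk = 4: strides of 8 words       */
    printf("Uk=4: conflict=%d\n", has_bank_conflict(a));  /* same banks -> 1 */
    effective_addrs(0, 1, a);          /* Uk = 1: contiguous doublewords   */
    printf("Uk=1: conflict=%d\n", has_bank_conflict(a));  /* distinct   -> 0 */
    return 0;
}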
In conventional matrix multiplication, let the specification of the first double-precision floating-point complex matrix A be m×n and that of the second double-precision floating-point complex matrix B be n×k, and take m = 6, n = 6, k = 6 as an example.
The conventional matrix multiplication process is as follows: matrix A is fetched through the doubleword addressing instruction {x,y,z,t}Rs+1:s=[Un+=2,8], which takes out four rows and places them into registers xRs+1:s, yRs+1:s, zRs+1:s, tRs+1:s; matrix B is fetched through the doubleword addressing instruction {x,y,z,t}Rm+1:m=[Un+=4*2*4,4*2], which takes four numbers by columns and places them into xRm+1:m, yRm+1:m, zRm+1:m, tRm+1:m; Rs+1:s and Rm+1:m are then multiplied as double-precision values. As shown in Fig. 2(a) and Fig. 2(b), conventional matrix multiplication takes the first four elements of the first row of A and multiplies them with the first column of matrix B. In the operation of multiplying the two matrices and then adding another matrix, i.e. A×B+C, the elements of matrix A are a_k×i+b_k (where k is the index), the elements of matrix B are c_k×i+d_k, and the elements of matrix C are e_k×i+f_k.
When the conventional matrix multiplication method is used, the operation is easy to implement, but bank conflicts easily arise when matrix B is accessed. Matrix B is arranged in memory as shown in Fig. 1(b); the storage rule places the imaginary part first and the real part second. When four numbers of one column are fetched at the same time, a bank conflict and the resulting pipeline stall occur if any of them lie in the same bank; for example, in Fig. 1(b), when four numbers of the same column are fetched by doubleword addressing, a bank conflict results if c1 d1 and c5 d5 are in the same bank.
To solve the bank conflict problem in the matrix multiplication process, referring to Figs. 3(a)-3(c), the present invention provides a double-precision floating-point complex matrix operation optimization method based on an HXDSP chip, which specifically comprises:
matrix A is fetched through the doubleword addressing instruction {x,y,z,t}Rs+1:s=[Un+=0,2], which takes the data out and places it into registers xRs+1:s, yRs+1:s, zRs+1:s, tRs+1:s; matrix B is fetched through the doubleword addressing instruction {x,y,z,t}Rm+1:m=[Un+=8,2], which takes four numbers by columns and places them into xRm+1:m, yRm+1:m, zRm+1:m, tRm+1:m; Rs+1:s and Rm+1:m are then multiplied as double-precision values. Fig. 3(a) shows the storage of matrix A, Fig. 3(b) the storage of matrix B and Fig. 3(c) the storage of matrix C. When matrices A and B are multiplied, the shaded portion in the figures is calculated first: the first element of matrix A is fetched and placed in the registers xRs+1:s, yRs+1:s, zRs+1:s, tRs+1:s; it is then multiplied with the first four elements of the first row of matrix B held in xRm+1:m, yRm+1:m, zRm+1:m, tRm+1:m, the products are added to the four elements of matrix C stored in the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k, and the final results are stored back in the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k.
The next element of matrix A is then taken and multiplied with the second-row elements of the first four columns of matrix B, and the results continue to be added to the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k; this is repeated until the first four numbers are obtained and can be output.
Next, the remaining elements of the current row of matrix A are multiplied in turn with the remaining elements of the first four columns of matrix B, and the products are added to the registers xRk+1:k, yRk+1:k, zRk+1:k, tRk+1:k, at which point the first four numbers are obtained and can be output. Then, in the same way, the first-row data of matrix A are multiplied and accumulated with the data of columns 5-8 of matrix B, and so on, until the calculation of the first row of matrix A with all columns of matrix B is finished; the operation of the first row of matrix A with all columns of matrix B is then complete.
The above process is also performed on the remaining rows of matrix A until all rows of matrix A have been calculated; the specific flow is shown in Fig. 4. In the innermost loop, in order to make full use of resources, an instruction-parallel method is adopted; the organization of the innermost loop is shown in Fig. 5.
Specifically, writing an element of matrix A as a×i+b and an element of matrix B as c×i+d, the complex multiplication of the invention is A×B = (a×i+b)×(c×i+d) = (a×d+b×c)×i + (b×d-a×c).
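A short C sketch of this decomposition is given below; the struct name dcplx and the (imaginary, real) field order are illustrative and simply mirror the a×i+b convention above, and the four partial products a×d, b×c, b×d and a×c are the ones later assigned to the hardware multipliers.

/* One double-precision complex multiply-accumulate step, written with the
 * same four partial products (a*d, b*c, b*d, a*c) that the hardware
 * multipliers compute.  Elements are stored as (imaginary, real) pairs,
 * matching the a*i + b convention of the description. */
typedef struct { double im, re; } dcplx;   /* illustrative type */

static void cmul_acc(dcplx a_el, dcplx b_el, dcplx *c_el)
{
    double a = a_el.im, b = a_el.re;   /* A element: a*i + b */
    double c = b_el.im, d = b_el.re;   /* B element: c*i + d */

    double ad = a * d, bc = b * c;     /* imaginary-part products */
    double bd = b * d, ac = a * c;     /* real-part products      */

    c_el->im += ad + bc;               /* e += a*d + b*c */
    c_el->re += bd - ac;               /* f += b*d - a*c */
}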
Each execution macro of the chip internally contains 2 sets of registers, each set consisting of 64 registers (32-bit binary data); the two sets are called the A-plane and B-plane registers respectively, as shown in Fig. 6(a) and Fig. 6(b).
Let the element of matrix A be a_k×i+b_k (where k is the index), the element of matrix B be c_k×i+d_k, and the element of matrix C be e_k×i+f_k.
Each register exists in each of the four macros x, y, z and t. Taking the x macro of the A-plane registers as an example, the data storage during the operation is shown in Fig. 6(a). In the calculation, 20 registers are used as one group: A-plane registers 0-19 form one group, 20-39 a second group, 40-59 a third group, and B-plane registers 40-59 a fourth group (shown in Fig. 6(b)). A-plane registers 60-63 initially store the values taken from matrix C; the multiplication results are accumulated into them, and the final operation results are stored there.
In Fig. 5, one group of numbers enters per cycle, and one register group is assigned to each group of numbers; the loop start-up period is as shown in Table 1:
TABLE 1 operation of four sets of numbers in adjacent cycles in the start of the cycle
As can be seen from the shaded portion of Table 1, the different operations on the first group of numbers are performed one after another, which avoids hardware conflicts.
After entering the normal loop period, the four groups of data perform their respective operations simultaneously in an interleaved manner, as shown in Table 2:
Table 2. Operations of the four groups of numbers in adjacent cycles after the loop body enters the normal loop period
As can be seen from Table 2, the four groups of data use the different operation instructions in a staggered manner, making full use of the hardware resources.
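As a conceptual illustration of this interleaving only (not the beat-accurate schedule of Tables 1 and 2), the following small C program prints a schedule in which each of the four register groups starts one stage later than the previous one and then cycles through the five instruction types.

#include <stdio.h>

/* Conceptual sketch of the software pipeline of the inner loop: four
 * register groups each cycle through the five instruction types, with
 * each group offset by one stage, so that in the steady state the
 * groups execute different instructions in the same time slot.  This
 * reproduces only the interleaving idea, not the exact beat timing. */
int main(void)
{
    static const char *stage[5] = {
        "read", "inter-macro move", "complex multiply",
        "complex add/sub", "accumulate"
    };

    for (int t = 0; t < 9; ++t) {                 /* time slots */
        printf("slot %d:", t);
        for (int g = 0; g < 4; ++g) {             /* register groups 1..4 */
            int s = t - g;                        /* group g starts g slots late */
            if (s >= 0)
                printf("  G%d:%-16s", g + 1, stage[s % 5]);
        }
        printf("\n");
    }
    return 0;
}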
The operation instructions used by the invention for the double-precision floating-point complex matrix are introduced below:
(1) the double multiply instruction is:
QMACCH=DFRm+1:m*DFRn+1:n
DFRs+1:s=QMACCH
or
QMACCL=DFRm+1:m*DFRn+1:n
DFRs+1:s=QMACCL
The above two instructions are used consecutively and perform a 64-bit floating-point multiplication. Each such instruction occupies four multipliers. The 8 multipliers are numbered 0-7 and are grouped as the lower four or the upper four, i.e. multiplier group 0-3 or multiplier group 4-7.
Here DFRm+1:m represents a 64-bit floating-point number, where Rm+1 stores the upper 32 bits and Rm the lower 32 bits.
As described above, each macro contains 8 multipliers, and each group performs the four multiplications a×d, b×c, b×d and a×c. QMACCH=DFRm+1:m*DFRn+1:n is the complex multiplication instruction and DFRs+1:s=QMACCH is the assignment instruction; each occupies four multipliers, and a multiplication and its assignment cannot execute in parallel, so a×c and b×d can be computed in the same beat. The multiplications therefore take two beats, and each group of double-precision multiplications requires four beats to complete.
(2) The Double addition instruction is:
DFHACCk=DFHRm+DFHRn
DFLACCk=DFLRm+DFLRn
DFHRs=DFHACCk
DFLRs=DFLACCk
The above four instructions are used consecutively and perform a 64-bit floating-point addition or subtraction.
The HRm and LRm registers store the upper 32 bits and the lower 32 bits of a 64-bit floating-point number respectively and together represent one 64-bit floating-point number; the index numbers of the two general registers must be consecutive and are written DFRm+1:m, where Rm+1 stores the upper 32 bits and Rm the lower 32 bits. The ALU selection k of the above 4 micro-operations must be the same, so that one 64-bit floating-point addition or subtraction is completed while occupying one ALU.
Within one loop body, a double-precision addition itself requires four beats, and one loop body needs to compute the four groups of additions a×d+b×c, b×d-a×c, e+Σ(a×d+b×c) and f+Σ(b×d-a×c); each macro uses 4 ALUs, and although each macro has 8 ALUs, four beats are still required.
A group of data operations performed simultaneously, shown in Table 3, is selected for illustration:
Table 3. A selected group of simultaneous data operations
For this group of data, the matrix operation of first multiplying and then adding requires four beats; the instructions of each beat are as follows:
first beat
r23:22=[v1+=2,0]||xyr25:24=[w1+=4,1]||QMACCL=DFR53:52*DFR55:54||QM
ACCH=DFR43:42*DFR45:44||DFHACC0=DFHR27-DFHR29||DFHACC1=DFH
R26-DFHR28
||DFHACC4=DFHR0+DFHR60||DFHACC5=DFHR1+DFHR61
Second beat
r33:32=[v1+=2,0]||ztr25:24=[w1+=4,1]||DFR47:46=QMACCL||DFR49:48=QM
ACCH||DFHR21=DFHACC0||DFHR20=DFHACC1||DFHR60=DFHACC4||DFH
R61=DFHACC5
Third beat
xyr35:34=[w1+=4,1]||xr15:14=yr5:4||yr5:4=xr15:14||QMACCL=DFR53:52*DF
R45:44||QMACCH=DFR43:42*DFR55:54||DFHACC2=DFHR37+DFHR39
||DFHACC3=DFHR36+DFHR38||DFHACC6=DFHR10+DFHR62||DFHACC7=DFHR11+DFHR63
The fourth beat
ztr35:34=[w1+=4,1]||zr15:14=tr5:4||tr5:4=zr15:14||DFR57:56=QMACCL||DFR
59:58=QMACCH||DFHR31=DFHACC2||DFHR30=DFHACC3||DFHR62=DFHACC6||DFHR63=DFHACC7
The meaning of each beat of instructions is as follows:
|| denotes instructions executed in parallel. r23:22=[v1+=2,0] and r33:32=[v1+=2,0] read data from matrix A: each read fetches at most eight 32-bit words, i.e. four double-precision values, and the two reads together fetch four double-precision complex data in total: ai, bi, ai+1, bi+1, ai+2, bi+2, ai+3, bi+3.
xyr25:24=[w1+=4,1], ztr25:24=[w1+=4,1], xyr35:34=[w1+=4,1] and ztr35:34=[w1+=4,1] read data from matrix B, four 32-bit words at a time, four times in total, yielding the four double-precision complex numbers ci, di, ci+1, di+1, ci+2, di+2, ci+3, di+3.
xr15:14=yr5:4||yr5:4=xr15:14 and zr15:14=tr5:4||tr5:4=zr15:14 are the inter-macro transfer instructions; the transfer is completed in two beats.
QMACCL=DFR53:52*DFR55:54||QMACCH=DFR43:42*DFR45:44,
DFR47:46=QMACCL||DFR49:48=QMACCH,
QMACCL=DFR53:52*DFR45:44||QMACCH=DFR43:42*DFR55:54 and
DFR57:56=QMACCL||DFR59:58=QMACCH represent the double-precision multiplications; the four macros run in parallel, and a total of 16 double-precision multiplications a×d, b×c, b×d, a×c are completed in four beats.
DFHACC0=DFHR27-DFHR29||DFHACC1=DFHR26-DFHR28,
DFHR21=DFHACC0||DFHR20=DFHACC1,
DFHACC2=DFHR37+DFHR39||DFHACC3=DFHR36+DFHR38 and
DFHR31=DFHACC2||DFHR30=DFHACC3 represent the double-precision addition and subtraction operations a×d+b×c and b×d-a×c; the four macros perform 8 additions in total.
DFHACC4=DFHR0+DFHR60||DFHACC5=DFHR1+DFHR61,
DFHR60=DFHACC4||DFHR61=DFHACC5,
DFHACC6=DFHR10+DFHR62||DFHACC7=DFHR11+DFHR63 and
DFHR62=DFHACC6||DFHR63=DFHACC7 are the accumulation operations, completed in four beats, computing e+Σ(a×d+b×c) and f+Σ(b×d-a×c); the four macros perform 8 additions in total.
Through this careful arrangement, the invention combines the multiplication and addition operations of the complex floating-point matrices into four beats, thereby accelerating the operation.
The design of the input side of the invention is as follows:
The HXDSP104X core has four internal data buses in total, two for reading and two for writing, and at most three buses are allowed to work at the same time, i.e. at most two groups of numbers can be fetched in one clock cycle. As described above, each group of operations needs the four numbers a, b, c and d (the four macros take in 16 numbers), and two registers store one number because the data type is double precision. As described above, the four macros take the same numbers a and b of the first complex floating-point matrix A; the specific instructions are as follows:
r23:22=[v1+=2,0]
r33:32=[v1+=2,0]
The second complex floating-point matrix B is fetched with four numbers per period in order to avoid bank conflicts.
The specific fetch instructions are as follows:
xyr5:4=[w2+=4,1]
ztr5:4=[w2+=4,1]
xyr15:14=[w2+=4,1]
ztr15:14=[w2+=4,1]
Here xyr5:4=[w2+=4,1] means that xr5:4 stores c1 and yr5:4 stores d1; ztr5:4=[w2+=4,1] means that zr5:4 stores c2 and tr5:4 stores d2; xyr15:14=[w2+=4,1] means that xr15:14 stores c3 and yr15:14 stores d3; ztr15:14=[w2+=4,1] means that zr15:14 stores c4 and tr15:14 stores d4.
The specific inter-macro transfer instructions are as follows:
xr15:14=yr5:4||yr5:4=xr15:14
zr15:14=tr5:4||tr5:4=zr15:14
xr15:14=yr5:4||yr5:4=xr15:14 means that xr15:14 now stores d1 and yr5:4 stores c3;
zr15:14=tr5:4||tr5:4=zr15:14 means that zr15:14 now stores d2 and tr5:4 stores c4;
through the above instructions it is ensured that c1, c2, c3 and c4 are stored in r5:4 (one in each macro) and d1, d2, d3 and d4 are stored in r15:14.
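The effect of this arrangement can be traced with the following small C sketch, which simulates the register contents of the four macros before and after the two exchange instructions; the register contents follow the description above, and the simulation itself is purely illustrative.

#include <stdio.h>

enum { X, Y, Z, T, NMACRO };

/* Per-macro contents of the register pairs r5:4 and r15:14. */
typedef struct { const char *r5_4, *r15_14; } macro_regs;

int main(void)
{
    /* Contents after the four reads from matrix B (see the read
     * instructions above): the c and d parts of the four B elements
     * are spread over neighbouring macros. */
    macro_regs m[NMACRO] = {
        [X] = { "c1", "c3" },   /* xr5:4 = c1, xr15:14 = c3 */
        [Y] = { "d1", "d3" },   /* yr5:4 = d1, yr15:14 = d3 */
        [Z] = { "c2", "c4" },   /* zr5:4 = c2, zr15:14 = c4 */
        [T] = { "d2", "d4" },   /* tr5:4 = d2, tr15:14 = d4 */
    };

    /* Inter-macro transfers (simultaneous exchanges):
     * xr15:14 = yr5:4 || yr5:4 = xr15:14
     * zr15:14 = tr5:4 || tr5:4 = zr15:14 */
    const char *tmp;
    tmp = m[X].r15_14; m[X].r15_14 = m[Y].r5_4; m[Y].r5_4 = tmp;
    tmp = m[Z].r15_14; m[Z].r15_14 = m[T].r5_4; m[T].r5_4 = tmp;

    /* Each macro now holds one complete complex element of B:
     * a c part in r5:4 and the matching d part in r15:14. */
    const char *name[NMACRO] = { "x", "y", "z", "t" };
    for (int i = 0; i < NMACRO; ++i)
        printf("%s: r5:4=%s  r15:14=%s\n", name[i], m[i].r5_4, m[i].r15_14);
    return 0;
}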
The loop body counting method uses the zero-overhead loop register LC0:
An instruction of the form If LC0 B label represents a zero-overhead loop: at the end of the loop body, LC0 is checked, and if it equals 0 no jump is made and the following instructions are executed sequentially (in this case the instruction is placed at the end of the loop body). The zero-overhead loop is pipelined together with the conditional jump. An instruction of the form If nLC0 B label represents a jump based on LC0: if LC0 is checked to be equal to 0, execution jumps to the label destination address; otherwise execution continues sequentially.
The model of the HXDSP chip used is HXDSP1042.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (6)

1. A double-precision floating-point complex matrix operation optimization method based on an HXDSP chip, characterized by comprising the following steps:
the double-precision floating-point complex matrix operation is that the first double-precision floating-point complex matrix is multiplied by the second double-precision floating-point complex matrix and then added with the third double-precision floating-point complex matrix; the specification of the first double-precision floating point complex matrix is m x n, the specification of the second double-precision floating point complex matrix is n x k, and the specification of the third double-precision floating point complex matrix is m x k;
the operation of multiplying the first double-precision floating-point complex matrix by the second double-precision floating-point complex matrix and adding the third double-precision floating-point complex matrix comprises an inner loop, formed by repeatedly multiplying one element of the first double-precision floating-point complex matrix with the four elements in the corresponding row of the first four columns of the second double-precision floating-point complex matrix and adding the products to the corresponding elements of the third double-precision floating-point complex matrix, and an outer loop, formed by repeating this process for one row of the first double-precision floating-point complex matrix against the corresponding four elements of every group of four columns of the second double-precision floating-point complex matrix;
the process in which one element of the first double-precision floating-point complex matrix is multiplied and accumulated with four elements of the corresponding row of the second double-precision floating-point complex matrix and then added to the corresponding elements of the third double-precision floating-point complex matrix involves optimization based on HXDSP chip instruction parallelism, optimization based on loop unrolling and optimization based on software pipelining;
the optimization based on HXDSP chip instruction parallelism is an optimization in which the same instruction simultaneously controls several arithmetic units to execute the same operation; the optimization based on loop unrolling is an optimization in which the same loop body is unrolled several times within one loop; the optimization based on software pipelining is an optimization in which the multiple identical loop iterations are executed in parallel and interleaved.
2. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the HXDSP chip comprises 4 execution macros, each of which contains a 64-word general register file; the registers of the general register file are divided into A-side and B-side registers, the A-side registers being numbered 0 to 63, with registers 0-19 forming the first register group, 20-39 the second register group and 40-59 the third register group, while B-side registers 40-59 form the fourth register group; the four execution macros can operate in parallel; each execution macro further comprises 8 multipliers, 8 arithmetic logic units, 4 shifters and one super calculator; the HXDSP chip is provided with three address registers, each of which comprises 16 registers;
the optimization based on HXDSP chip instruction parallelism specifically comprises reading 8 data of the first double-precision floating-point complex matrix and 8 data of the second double-precision floating-point complex matrix; the 8 data of the first double-precision floating-point complex matrix are four identical copies of the 2 data corresponding to one element of the first double-precision floating-point complex matrix, and the 8 data of the second double-precision floating-point complex matrix are four elements of the same row of the second double-precision floating-point complex matrix; the reading of the 8 data of the first double-precision floating-point complex matrix and the reading of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
before the loop begins, the 8 data of the third double-precision floating-point complex matrix are read and stored;
the 8 data of the first double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the first and second registers of the four register groups of the HXDSP chip; the 8 data of the second double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the third and fourth registers of the four register groups of the HXDSP chip; the 8 data of the third double-precision floating-point complex matrix are correspondingly stored in the eight registers formed by the fifth and sixth registers of the four register groups of the HXDSP chip; the storing of the 8 data of the first double-precision floating-point complex matrix and the storing of the 8 data of the second double-precision floating-point complex matrix are performed in parallel;
in one instruction, the first multiplier of each execution macro of the HXDSP chip multiplies the data in the first register by the data in the third register and stores the product in the seventh register, while the second multiplier multiplies the data in the second register by the data in the fourth register and stores the product in the eighth register; the multiplications of the first multiplier and the second multiplier are performed in parallel.
3. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is executed several times within one inner loop.
4. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 3, characterized in that the optimization based on loop unrolling specifically comprises: the double-precision floating-point complex multiplication process is performed four times within one inner loop.
5. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the specific process of the double-precision floating-point complex multiplication in the inner loop is as follows:
first, the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; the second element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the second-row data of the first four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until all data of the first four columns of the second double-precision floating-point complex matrix have been calculated;
then the first element of the first row of the first double-precision floating-point complex matrix is multiplied and summed with the first-row data of the next four columns of the second double-precision floating-point complex matrix, and the results are accumulated into the registers storing the corresponding elements of the third double-precision floating-point complex matrix; this operation is repeated until the first row of the first double-precision floating-point complex matrix has been multiplied with all columns of the second double-precision floating-point complex matrix and added to the first row of the third double-precision floating-point complex matrix, giving the final result of the first row of data;
the specific process of the outer loop is as follows: the inner loop is performed in turn for each row of the first double-precision floating-point complex matrix, from the first row to the last row.
6. The double-precision floating-point complex matrix operation optimization method based on the HXDSP chip according to claim 1, characterized in that the optimization based on software pipelining specifically comprises:
the inner loop after software pipelining optimization comprises a loop start-up period and a normal loop period; the instructions executed by the inner loop are, in order, read instructions, inter-macro transfer instructions, complex multiplication instructions, complex addition/subtraction instructions and accumulation instructions;
during the start-up period, the four register groups carry out the read instruction, inter-macro transfer instruction, complex multiplication instruction, complex addition/subtraction instruction and accumulation instruction in sequence;
during the normal loop period, the four register groups each carry out read, inter-macro transfer, complex multiplication, complex addition/subtraction and accumulation instructions, and at any given moment the instructions of any two register groups differ.
CN202010047045.XA 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip Active CN111291320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010047045.XA CN111291320B (en) 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010047045.XA CN111291320B (en) 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip

Publications (2)

Publication Number Publication Date
CN111291320A true CN111291320A (en) 2020-06-16
CN111291320B CN111291320B (en) 2023-12-15

Family

ID=71021227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010047045.XA Active CN111291320B (en) 2020-01-16 2020-01-16 Double-precision floating point complex matrix operation optimization method based on HXDSP chip

Country Status (1)

Country Link
CN (1) CN111291320B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115408061A (en) * 2022-11-02 2022-11-29 北京红山微电子技术有限公司 Hardware acceleration method, device, chip and storage medium for complex matrix operation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994010638A1 (en) * 1992-11-05 1994-05-11 The Commonwealth Of Australia Scalable dimensionless array
US20110040822A1 (en) * 2009-08-17 2011-02-17 International Business Machines Corporation Complex Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
US20120011348A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Matrix Multiplication Operations Using Pair-Wise Load and Splat Operations
CN102722472A (en) * 2012-05-28 2012-10-10 中国科学技术大学 Complex matrix optimizing method
CN107357552A (en) * 2017-06-06 2017-11-17 西安电子科技大学 The optimization method of floating-point complex vector summation is realized based on BWDSP chips
CN110162742A (en) * 2019-03-31 2019-08-23 西南电子技术研究所(中国电子科技集团公司第十研究所) The floating-point operation circuit implementing method that real number matrix is inverted


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘沛华; 鲁华祥; 龚国良; 刘文鹏: "基于FPGA的全流水双精度浮点矩阵乘法器设计" [Design of a fully pipelined double-precision floating-point matrix multiplier based on FPGA], 智能系统学报, no. 04
朱耀国; 党皓: "基于FPGA的矩阵尺寸自适应的双精度浮点数矩阵乘法器" [An FPGA-based double-precision floating-point matrix multiplier adaptive to matrix size], 电脑知识与技术, no. 14


Also Published As

Publication number Publication date
CN111291320B (en) 2023-12-15

Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
US6088783A (en) DPS having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
US5822606A (en) DSP having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
US6029240A (en) Method for processing instructions for parallel execution including storing instruction sequences along with compounding information in cache
CN111381880B (en) Processor, medium, and operation method of processor
JP2016508640A (en) Solutions for branch branches in the SIMD core using hardware pointers
JPS6028015B2 (en) information processing equipment
US5083267A (en) Horizontal computer having register multiconnect for execution of an instruction loop with recurrance
US5276819A (en) Horizontal computer having register multiconnect for operand address generation during execution of iterations of a loop of program code
US5036454A (en) Horizontal computer having register multiconnect for execution of a loop with overlapped code
CN107357552B (en) Optimization method for realizing floating-point complex vector summation based on BWDSP chip
Schneck Supercomputer architecture
CN111291320B (en) Double-precision floating point complex matrix operation optimization method based on HXDSP chip
WO2024103896A1 (en) Method for implementing matrix transposition multiplication, and coprocessor, server and storage medium
JP2002544587A (en) Digital signal processor calculation core
Dorozhevets et al. The El'brus-3 and MARS-M: Recent advances in Russian high-performance computing
JP3182591B2 (en) Microprocessor
US8332447B2 (en) Systems and methods for performing fixed-point fractional multiplication operations in a SIMD processor
CN114528248A (en) Array reconstruction method, device, equipment and storage medium
US5506974A (en) Method and means for concatenating multiple instructions
JPS60178580A (en) Instruction control system
WO2022067510A1 (en) Processor, processing method, and related device
CN105843589B (en) A kind of storage arrangement applied to VLIW type processors
CN111273889B (en) Floating point complex FIR optimization method based on HXSDSP 1042 processor
JP2006515446A (en) Data processing system with Cartesian controller that cross-references related applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant