CN111273889B

CN111273889B - Floating-point complex FIR optimization method based on the HXDSP1042 processor

Info

Publication number: CN111273889B
Application number: CN202010043946.1A
Authority: CN
Inventors: 苏涛; 朱晨曦; 张丽
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2023-05-02
Anticipated expiration: 2040-01-15
Also published as: CN111273889A

Abstract

The invention belongs to the technical field of signal processing, and in particular relates to a floating-point complex FIR optimization method based on the HXDSP1042 processor. The method comprises the following steps: setting the function interface of the floating-point complex FIR, and determining the parameter-passing register corresponding to each function parameter in the interface according to the register parameter-passing rules of the HXDSP1042 processor; obtaining, from the parameter-passing register of each function parameter, the start addresses of the input sequence, the filter sequence and the output sequence, together with the length of each sequence; performing push (stack-save) protection on the HXDSP1042 processor to obtain a push-protected HXDSP1042 processor; obtaining the loop control variables from the lengths of the input sequence, the filter sequence and the output sequence; performing a convolution of the input sequence with the filter sequence according to the loop control variables to obtain the output sequence, and storing the output sequence at the start address of the output sequence; and resetting (restoring) the push-protected HXDSP1042 processor. The method offers high processing performance and a wide range of application.

Description

Floating-point complex FIR optimization method based on the HXDSP1042 processor

Technical Field

The invention belongs to the technical field of signal processing, and in particular relates to a floating-point complex FIR optimization method based on the HXDSP1042 processor.

Background

With the development of computer science and the integrated circuit industry, digital signal processing has become a key application technology of the information society and is widely used in many scenarios. The digital signal processor (DSP), the key device for applying digital signal processing technology, has developed rapidly and been studied intensively thanks to its typical advantages of high performance and low power consumption.

Commercial signal processing equipment has developed rapidly abroad. Technologically advanced countries in Europe and America have long held the international lead in developing and producing DSP chips, through companies such as Analog Devices Inc. (ADI), Texas Instruments (TI) and Freescale. The execution speed of a DSP processor depends on abundant, dense hardware resources and highly parallel instructions, and the IC manufacturers have successively launched multi-core DSP chips such as ADI's ADSP-TS20x DSP and TI's TMS320C66x DSP. These processors are widely used in communications, video and signal processing, and occupy almost the entire market. Although domestic DSP processors started and developed relatively late, the relevant institutions in China are striving to catch up. Nationwide, many research institutes and university signal processing laboratories have concentrated substantial resources on high-performance digital signal processing equipment and have achieved many results. The National University of Defense Technology successfully developed the YHFT-DSP Galaxy-Feiteng series of very long instruction word floating-point DSPs, compatible with TI's 6000 series, with a clock frequency of 238 MHz and a peak performance of 1.4 billion floating-point operations and 1.9 billion instructions per second. The 14th Research Institute of China Electronics Technology Group Corporation adopted a multi-core architecture combining DSP and CPU cores to complete a multi-core DSP processor chip, the Huarui-1 dedicated DSP processor chip. The 38th Research Institute of China Electronics Technology Group Corporation produced the BWDSP100 processor, a domestic DSP chip with fully independent intellectual property rights, whose architecture, instruction set, design implementation and supporting software and hardware development environment were all developed independently.

In line with the development trend of DSP processors and the performance requirements that radar signal processing places on them, the HXDSP1042 processor was developed by optimizing and partially innovating on the BWDSP100 research of the 38th Research Institute of China Electronics Technology Group Corporation. It has a highly integrated instruction set, a modular structure and greater parallelism. The successful development of this domestic processor not only breaks through a series of technical bottlenecks in developing high-performance general-purpose DSPs, but also largely relieves China's complete dependence on foreign devices for high-end general-purpose digital signal processing, and provides a necessary foundation for the leapfrog development of the country's major equipment. Autonomous and controllable DSP technology means that, in any future military conflict, the blockades and equipment embargoes imposed by some developed countries on China's radar signal processing technology can be broken, further consolidating national security.

FIR (Finite Impulse Response) operations are a cornerstone of digital signal processing. As an important part of digital signal processing, digital filtering extracts the desired signal from a noisy signal while suppressing the unwanted noise, and is commonly used as a signal pre-processing module in digital systems. FIR digital filters are widely used across digital signal processing because of their strictly linear phase and guaranteed stability. Based on the hardware resources and instruction characteristics of the HXDSP1042 processor, this work gives the detailed flow, implementation, calling method, theoretical cycle count, actual cycle count, running time and error comparison of the FIR filter algorithm, and gives the cycle counts of the algorithm on other mainstream high-performance processors for comparison.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a floating-point complex FIR optimization method based on the HXDSP1042 processor. The technical problem to be solved by the invention is addressed by the following technical scheme:

A floating-point complex FIR optimization method based on the HXDSP1042 processor comprises the following steps:

setting the function interface of the floating-point complex FIR, and determining the parameter-passing register corresponding to each function parameter in the interface according to the register parameter-passing rules of the HXDSP1042 processor;

obtaining, from the parameter-passing register of each function parameter, the start addresses of the input sequence, the filter sequence and the output sequence, together with the length of each sequence;

performing push (stack-save) protection on the HXDSP1042 processor to obtain a push-protected HXDSP1042 processor;

obtaining the loop control variables from the lengths of the input sequence, the filter sequence and the output sequence;

performing a convolution of the input sequence with the filter sequence according to the loop control variables to obtain the output sequence, and storing the output sequence at the start address of the output sequence;

and resetting (restoring) the push-protected HXDSP1042 processor.

In one embodiment of the present invention, performing push protection on the HXDSP1042 processor to obtain a push-protected HXDSP1042 processor includes:

performing a push protection operation on the frame pointer (U9), the procedure call return address register (SER) and the callee-saved registers (Callee-Save Registers) of the HXDSP1042 processor, respectively, to obtain the push-protected HXDSP1042 processor.

In one embodiment of the invention, the input sequence comprises multiple groups of data, each group consisting of 8 complex data.

The invention has the beneficial effects that:

The floating-point complex FIR assembly algorithm makes full use of the data buses and computing resources of the domestic HXDSP1042 processor. The assembly program is designed and optimized with three techniques (increased instruction-level parallelism, loop unrolling and software pipelining), register assignment is used to indirectly reduce the amount of data fetched, and pipeline stalls caused by bank conflicts and block conflicts are avoided, giving a marked advantage in processing performance. Compared with the DSPF_sp_fir_cplx function of the current mainstream processor TMS320C6678, the invention improves performance several-fold and can be widely used in domestic signal processing equipment.

The present invention will be described in further detail with reference to the accompanying drawings and examples.

Drawings

Fig. 1 is a schematic flow chart of the floating-point complex FIR optimization method based on the HXDSP1042 processor according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the internal structure of an eC104+ execution macro in the HXDSP1042 processor, for the floating-point complex FIR optimization method based on the HXDSP1042 processor according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the FIR assembly flow of the floating-point complex FIR optimization method based on the HXDSP1042 processor according to an embodiment of the present invention;

Fig. 4 shows the functional test results of the FIR algorithm for the floating-point complex FIR optimization method based on the HXDSP1042 processor according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.

Regarding the HXDSP1042 processor:

The HXDSP1042 processor can be widely applied in high-performance computing fields, such as the signal processing fields of radar, electronic countermeasures, precision guidance, communication support and image processing. The processor adopts a VLIW (Very Long Instruction Word) architecture, has strong parallel processing capability, and can better meet the requirements of high-speed real-time signal processing. The processor also adopts a 16-issue SIMD (Single Instruction Multiple Data) architecture, so that each instruction can simultaneously control 1 to 4 basic operation units in an execution macro, and the processor can be widely applied in computationally intensive processing systems. The hybrid VLIW+SIMD architecture greatly improves the processor's ability to access data, which is a key basis for a high processing rate. The following table compares the resources of the HXDSP1042 with those of comparable mainstream processors.

The HXDSP1042 processor adopts a Harvard architecture and has a 512-bit instruction bus, a 512-bit internal data read bus and a 512-bit internal data write bus. It contains 2 eC104+ cores that are independent of each other. Each eC104+ core adopts a SIMD architecture and contains 1 program controller and 4 execution macros, whose internal structures are identical. Each execution macro within eC104+ contains 8 arithmetic logic units (ALUs), 8 multipliers (MULs), 4 shifters (SHFs), 1 super operator (SPU), 1 general-purpose register set and 1 macro-addition adder. The general-purpose register set in each macro is divided into two sides, A and B, and contains 128 general-purpose registers in total. The internal structure of the eC104+ execution macro and the data transfer bit widths between each operation unit and the register set are shown in Fig. 2, which is a schematic diagram of the eC104+ internal execution macro structure in the HXDSP1042 processor according to an embodiment of the invention.

Referring to Fig. 1, Fig. 1 is a flow chart of the floating-point complex FIR optimization method based on the HXDSP1042 processor according to an embodiment of the present invention. The method includes:

Specifically, the function interface used in the invention is as follows:

void bw_DSPF_sp_fir_cplx(const float *x, const float *h, float *__restrict y, int nh, int ny);

where ny is the length of the output sequence;

nh is the length of the filter coefficient sequence;

x is the single-precision floating-point complex input sequence, of length 2 × (nh + ny - 1);

h is the filter coefficient sequence, of length 2 × nh;

y is the output sequence, of length 2 × ny.

The factor of 2 in each length reflects that every complex value occupies two single-precision floats, a real part and an imaginary part.
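As a point of reference for the interface above, the following is a minimal Python model of a complex FIR over interleaved real/imaginary floats. It is a sketch, not the patented assembly routine: the function name `fir_cplx_ref` is hypothetical, and the alignment y[i] = Σ h[m]·x[i+m] is an assumption taken from the description's pairing of the window X1-X8 with H1, X2-X9 with H2, and so on.

```python
def fir_cplx_ref(x, h, nh, ny):
    """Reference complex FIR on interleaved (re, im) float lists.

    x has 2*(nh+ny-1) floats, h has 2*nh floats; returns 2*ny floats.
    Assumed alignment (not confirmed by the source): y[i] = sum_m h[m] * x[i+m].
    """
    y = [0.0] * (2 * ny)
    for i in range(ny):
        re = 0.0
        im = 0.0
        for m in range(nh):
            xr, xi = x[2 * (i + m)], x[2 * (i + m) + 1]
            hr, hi = h[2 * m], h[2 * m + 1]
            re += xr * hr - xi * hi   # real part of x * h
            im += xr * hi + xi * hr   # imaginary part of x * h
        y[2 * i], y[2 * i + 1] = re, im
    return y


# Three input samples 1, 2, 3 (purely real) and two taps 1 and j.
x = [1.0, 0.0, 2.0, 0.0, 3.0, 0.0]
h = [1.0, 0.0, 0.0, 1.0]
print(fir_cplx_ref(x, h, nh=2, ny=2))
```

A model like this is mainly useful as a golden reference when checking the output of an optimized routine for functional errors.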

According to the register parameter-passing rules of the HXDSP1042, the parameter-passing registers of the function interface are shown in the following table.

Parameter transfer register list

Function parameter	Register used
Pointer parameter x	Address generator U0
Pointer parameter h	Address generator U1
Pointer parameter y	Address generator U2
Integer parameter nh	General purpose register xr3
Integer parameter ny	General purpose register xr4

Here xr3 and xr4 are general-purpose register names of the HXDSP1042: "x" denotes the x macro of either core of the HXDSP1042, "r" is short for register, and "3" denotes register number 3 in the x macro. The naming rules are defined by the HXDSP1042 processor.

Specifically, the calculation formula is as follows:

x * y = (a + bi) * (c + di) = (ac - bd) + (ad + bc)i,

where x and y are complex numbers expressed as a + bi and c + di, respectively. The multiplication of the single-precision floating-point complex numbers x and y can thus be split into 4 single-precision floating-point real multiplications and 2 single-precision floating-point real additions (one of them a subtraction).
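The decomposition can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `cmul` is hypothetical and not part of the patent.

```python
def cmul(a, b, c, d):
    """Multiply (a + bi) by (c + di) using 4 real multiplies and 2 real adds."""
    ac = a * c
    bd = b * d
    ad = a * d
    bc = b * c
    return ac - bd, ad + bc   # (real part, imaginary part)


re, im = cmul(1.0, 2.0, 3.0, 4.0)
# Cross-check against Python's built-in complex arithmetic.
print((re, im), complex(1.0, 2.0) * complex(3.0, 4.0))
```

The four products are independent of one another, which is what allows them to be issued to separate multipliers in the same instruction line.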

Each core of the HXDSP1042 processor has 32 multipliers and 32 adders, so at most 8 single-precision floating-point complex multiplications can be computed in one instruction line. The invention divides the input sequence into groups of 8 complex data and performs multiply-add operations between each group of 8 input complex data and the filter coefficient sequence of length nh. The multiply-add count, the loop count and the remainder therefore need to be calculated at the beginning of the program. As shown in the table below, they are placed in the zero-overhead loop registers lc0, lc1 and lc2, respectively.

First, the zero-overhead loop register lc0 is assigned: lc0 = xr3;

Second, the loop count is calculated with the following code segment, in which r4 stores the length of the filter coefficient. The shift instruction of the HXDSP1042 shifts r4 right by 3 bits (the shift amount -3, equivalent to dividing by 8), the result is assigned to the general-purpose register xr7, and the value of xr7 is then assigned to the zero-overhead loop register lc1:

xr7 = r4 ashift -3

lc1 = xr7

Finally, the remainder is calculated with the following code segment, in which the general-purpose register r4 stores the length of the filter coefficient. The bit-field extraction instruction (fext) of the HXDSP1042 first takes the lower 3 bits of r4 and assigns the result to the general-purpose register xr5, and the value of xr5 is then assigned to the zero-overhead loop register lc2:

xr5 = r4 fext(0:3,0)(z)

lc2 = xr5

Loop control variable list

In the above table nh is the length of the filter coefficient sequence and ny is the length of the output sequence.
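The three loop control values can be modelled in Python as below. This is a sketch under stated assumptions: the helper name `loop_control` is hypothetical, and it follows the parameter transfer register list in taking xr3 as nh and xr4 as ny, whereas the prose above describes r4 as holding the filter coefficient length, so which length is actually divided by 8 is an assumption here.

```python
def loop_control(nh, ny):
    """Model of the lc0/lc1/lc2 setup (assumes xr3 = nh and xr4 = ny)."""
    lc0 = nh          # multiply-add count: one per filter tap
    lc1 = ny >> 3     # arithmetic shift right by 3: number of full groups of 8
    lc2 = ny & 0x7    # low 3 bits: leftover outputs after the full groups
    return lc0, lc1, lc2


lc0, lc1, lc2 = loop_control(nh=16, ny=100)
print(lc0, lc1, lc2)
```

With these values, 8 × lc1 + lc2 always reconstructs the length that was split, which is a convenient sanity check.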

Specifically, the FIR assembly program divides the input sequence into groups of 8 complex numbers and performs multiply-accumulate operations between each group and the filter coefficient sequence. Referring to Fig. 3, which is a schematic diagram of the FIR assembly flow of the floating-point complex FIR optimization method based on the HXDSP1042 processor according to an embodiment of the present invention, Proc1 is the filtering process of the first group of 8 data, Proc2 is the filtering process of the second group of 8 data, Proc3 is the filtering process of the third group of 8 data, and so on. The code that writes out the result of Proc1 and the code that loads the data of Proc2 in pipelined fashion are arranged in parallel to obtain the loop body LOOP0; likewise, the code that writes out the result of Proc2 and the code that loads the data of Proc3 are arranged in parallel to obtain the loop body LOOP1. Because the code of LOOP0, LOOP1, LOOP2 and so on is identical, only a single loop body LOOP needs to be designed. Finally, the pipelined data-loading part of Proc1, which lies outside the loop body, serves as the loop preparation stage.

1. Loop preparation stage

The loop preparation stage uses 2 groups of registers in the operation at the same time: registers r10-r25 serve as the first group of input data registers, registers r18-r33 serve as the second group of input data registers, registers r18-r25 are shared by the first and second groups, and registers r34-r41 store the filter coefficients.

Xn denotes the n-th input datum counted from the start address of any group of 8 input data, and Hm denotes the m-th filter coefficient. The processing is described with the following table, in which each row can be implemented by one line of assembly instructions.

Serial processing procedure

No.	Group 2 registers	Group 1 registers	Filter coefficients	Calculation
1		Read input data X1-X4		
2	Read input data X5-X8		Read coefficient H1	
3	Read input data X6-X9	Inter-macro transfer yields X2-X5	Read coefficient H2	
4	Read input data X7-X10	Inter-macro transfer yields X3-X6	Read coefficient H3	
5	Read input data X8-X11	Inter-macro transfer yields X4-X7	Read coefficient H4	Multiply X1-X8 by H1
6	Read input data X9-X12	Assignment yields X5-X8	Read coefficient H5	Multiply X2-X9 by H2
7	Read input data X10-X13	Assignment yields X6-X9	Read coefficient H6	Multiply X3-X10 by H3

The specific analysis is as follows:

(1) Reading input data X1-X4 into registers xyztr11:10 respectively;

(2) Reading input data X5-X8 into registers xyztr19:18 and reading filter coefficients H1 into registers xyztr35:34, respectively;

(3) Reading input data X6-X9 into a register xyztr21:20, respectively, reading a filter coefficient H2 into a register xyztr37:36, and storing the input data X2-X5 into a register xyztr13:12 through inter-macro transmission;

(4) Reading input data X7-X10 into a register xyztr23:22, respectively, reading a filter coefficient H3 into a register xyztr39:38, and storing the input data X3-X6 into a register xyztr15:14 through inter-macro transmission;

(5) Reading input data X8-X11 into registers xyztr25:24, reading filter coefficient H4 into registers xyztr41:40, storing input data X4-X7 into registers xyztr17:16 through inter-macro transfer, multiplying the input data in registers xyztr11:10 and xyztr19:18 obtained in steps (1) and (2) by the filter coefficient in registers xyztr35:34 obtained in step (2), and storing the result in the B-side registers r1:0, r3:2, r17:16 and r19:18;

(6) Reading input data X9-X12 into registers xyztr27:26, reading filter coefficient H5 into registers xyztr35:34, storing input data X5-X8 into registers xyztr11:10 through an assignment operation, multiplying the input data in registers xyztr13:12 and xyztr21:20 obtained in step (3) by the filter coefficient in registers xyztr37:36, and storing the result in the B-side registers r5:4, r7:6, r21:20 and r23:22;

(7) Reading input data X10-X13 into registers xyztr29:28, reading filter coefficient H6 into registers xyztr37:36, storing input data X6-X9 into registers xyztr13:12 through an assignment operation, multiplying the input data in registers xyztr15:14 and xyztr23:22 obtained in step (4) by the filter coefficient in registers xyztr39:38, and storing the result in the B-side registers r9:8, r11:10, r25:24 and r27:26.

These steps yield the products X1-X8 × H1, X2-X9 × H2 and X3-X10 × H3, which are stored in the B-side registers r1:0, r3:2, r17:16 and r19:18; r5:4, r7:6, r21:20 and r23:22; and r9:8, r11:10, r25:24 and r27:26, respectively.
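The data reuse behind this schedule can be illustrated with a small Python model. It is a sketch only: `filter_group_of_8` is a hypothetical name, Python complex numbers stand in for register pairs, and a deque stands in for the sliding set of input registers. The point it shows is that the window for tap m+1 is the window for tap m shifted by one sample, so each step needs only new data at the leading edge rather than a full reload.

```python
from collections import deque


def filter_group_of_8(x, h, start=0):
    """Accumulate 8 outputs at once: window X(m+1)..X(m+8) times H(m+1)."""
    window = deque(x[start:start + 8], maxlen=8)   # X1..X8
    acc = [0j] * 8
    for m, hm in enumerate(h):
        for k, xk in enumerate(window):
            acc[k] += xk * hm                      # 8 complex multiply-adds per tap
        if m + 1 < len(h):
            window.append(x[start + m + 8])        # slide by one: oldest sample drops out
    return acc


x = [complex(n, 0) for n in range(1, 11)]          # X1..X10
h = [1 + 0j, 0 + 1j, 2 + 0j]                       # H1..H3
print(filter_group_of_8(x, h))
```

In the description, the same effect is obtained with inter-macro transfers and register assignments instead of a software queue.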

2. Loop body

The loop body mainly completes the multiply-accumulate operations and uses the registers assigned in the loop preparation stage. Xn denotes the n-th input datum counted from the start address of any group of 8 input data, and Hm denotes the m-th filter coefficient. The processing is described with the following table, in which each of the first 8 rows can be implemented by one line of assembly instructions; each row uses the zero-overhead loop register lc0 to decide whether the program jumps back to row 1 or goes on to write out the filtering result.

Loop body design steps

The specific analysis is as follows:

(1) Reading input data X11-X14 into registers xyztr31:30, reading filter coefficient H7 into registers xyztr39:38, storing input data X7-X10 into registers xyztr15:14 through an assignment operation, multiplying the input data in registers xyztr17:16 and xyztr25:24 by the filter coefficient in registers xyztr41:40, storing the result in the B-side registers r13:12, r15:14, r29:28 and r31:30, and accumulating the multiplication results in the B-side registers r1:0, r3:2, r17:16 and r19:18 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(2) Reading input data X12-X15 into registers xyztr33:32, reading filter coefficient H8 into registers xyztr41:40, storing input data X8-X11 into registers xyztr17:16 through an assignment operation, multiplying the input data in registers xyztr19:18 and xyztr27:26 obtained in step (4) by the filter coefficient in registers xyztr35:34, storing the result in the B-side registers r1:0, r3:2, r17:16 and r19:18, and accumulating the multiplication results in the B-side registers r5:4, r7:6, r21:20 and r23:22 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(3) Reading input data X13-X16 into registers xyztr19:18, reading filter coefficient H9 into registers xyztr35:34, multiplying the input data in registers xyztr21:20 and xyztr29:28 by the filter coefficient in registers xyztr37:36, storing the result in the B-side registers r5:4, r7:6, r21:20 and r23:22, and accumulating the multiplication results in the B-side registers r9:8, r11:10, r25:24 and r27:26 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(4) Reading input data X14-X17 into registers xyztr21:20, reading filter coefficient H10 into registers xyztr37:36, multiplying the input data in registers xyztr23:22 and xyztr31:30 by the filter coefficient in registers xyztr39:38, storing the result in the B-side registers r9:8, r11:10, r25:24 and r27:26, and accumulating the multiplication results in the B-side registers r13:12, r15:14, r29:28 and r31:30 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(5) Reading input data X15-X18 into registers xyztr23:22, reading filter coefficient H11 into registers xyztr39:38, multiplying the input data in registers xyztr25:24 and xyztr33:32 by the filter coefficient in registers xyztr41:40, storing the result in the B-side registers r13:12, r15:14, r29:28 and r31:30, and accumulating the multiplication results in the B-side registers r1:0, r3:2, r17:16 and r19:18 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(6) Reading input data X16-X19 into registers xyztr25:24, reading filter coefficient H12 into registers xyztr41:40, multiplying the input data in registers xyztr11:10 and xyztr19:18 by the filter coefficient in registers xyztr35:34, storing the result in the B-side registers r1:0, r3:2, r17:16 and r19:18, and accumulating the multiplication results in the B-side registers r5:4, r7:6, r21:20 and r23:22 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(7) Reading input data X17-X20 into registers xyztr27:26, reading filter coefficient H13 into registers xyztr35:34, storing input data X13-X16 into registers xyztr11:10 through an assignment operation, multiplying the input data in registers xyztr13:12 and xyztr21:20 by the filter coefficient in registers xyztr37:36, storing the result in the B-side registers r5:4, r7:6, r21:20 and r23:22, and accumulating the multiplication results in the B-side registers r9:8, r11:10, r25:24 and r27:26 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(8) Reading input data X18-X21 into registers xyztr29:28, reading filter coefficient H14 into registers xyztr37:36, storing input data X14-X17 into registers xyztr13:12 through an assignment operation, multiplying the input data in registers xyztr15:14 and xyztr23:22 by the filter coefficient in registers xyztr39:38, storing the result in the B-side registers r9:8, r11:10, r25:24 and r27:26, and accumulating the multiplication results in the B-side registers r13:12, r15:14, r29:28 and r31:30 through the accumulator ACC; the zero-overhead loop register lc0 decides whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

(9) Writing out the filtering results obtained in steps (1)-(8), and reading the next group of data in pipelined fashion in the same way as in the loop preparation stage; the zero-overhead loop register lc1 decides whether to jump to step (1): if lc1 is not 0, execution jumps to step (1), and if lc1 is 0, execution jumps to step (10) for remainder processing;

(10) Writing out the filtering result of the remainder part and exiting the program.
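The overall control flow (full groups of 8 outputs, a tap loop inside each group, then a remainder pass) can be sketched in Python as follows. This is a behavioural model under assumptions, not the assembly schedule: the names `fir_blocked` and `fir_direct` are hypothetical, the group count and remainder are taken from the output length, and the alignment y[i] = Σ h[m]·x[i+m] follows the window/coefficient pairing in the tables.

```python
def fir_direct(x, h, ny):
    """Straightforward definition used as a cross-check."""
    return [sum(h[m] * x[i + m] for m in range(len(h))) for i in range(ny)]


def fir_blocked(x, h, ny):
    """Process outputs in groups of 8, then handle the remainder."""
    nh = len(h)
    full_groups, remainder = ny >> 3, ny & 7
    y = []
    for g in range(full_groups):              # outer loop, counted by lc1 in the text
        base = 8 * g
        acc = [0j] * 8
        for m in range(nh):                   # tap loop, counted by lc0 in the text
            for k in range(8):                # 8 lanes computed side by side
                acc[k] += x[base + m + k] * h[m]
        y.extend(acc)                         # write out one group of results
    base = 8 * full_groups
    tail = [0j] * remainder                   # remainder pass, sized by lc2 in the text
    for m in range(nh):
        for k in range(remainder):
            tail[k] += x[base + m + k] * h[m]
    y.extend(tail)
    return y


x = [complex(n, -n) for n in range(20)]
h = [1 + 1j, 2 - 1j, 0.5j]
print(fir_blocked(x, h, ny=11) == fir_direct(x, h, ny=11))
```

Comparing the blocked result against the direct definition is one simple way to confirm that grouping and remainder handling do not change the output.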

The assembly code corresponding to the step (2) in the loop body is given below, and the code segment is explained:

.code_align 16

if nlc0 b__fir_cplx_out

||r19:18=[u3+=u4,u5]||r35:34=[v3+=v4,v5]||r13:12=r29:28

||cfr1:0=cfr35:34*fr19<BA>||cfr3:2=cfr35:34*fr18<BA>

||cfr17:16=cfr35:34*fr27<BA>||cfr19:18=cfr35:34*fr26<BA>

||cfacc1:0+=cfr5:4<AB>||cfacc3:2+=cfr7:6<AB>

||cfacc5:4+=cfr21:20<AB>||cfacc7:6+=cfr23:22<AB>

Specifically, if nlc0 b__fir_cplx_out is a jump instruction supported by the HXDSP1042. It uses the zero-overhead loop register lc0 to decide whether to jump to step (9): if lc0 is not 0, execution continues sequentially, and if lc0 is 0, execution jumps to step (9) to write out the filtering result;

r19:18=[u3+=u4,u5] means that input data is read from the address generator U3, with address interval U5 and address offset U4;

r35:34=[v3+=v4,v5] means that data is read from the address generator V3, with address interval V5 and address offset V4;

r13:12=r29:28 is an assignment operation;

cfr1:0=cfr35:34*fr19<BA>||cfr3:2=cfr35:34*fr18<BA>||cfr17:16=cfr35:34*fr27<BA>||cfr19:18=cfr35:34*fr26<BA> means that the input data in registers xyztr19:18 and xyztr27:26 is multiplied by the filter coefficient in registers xyztr35:34, and the result is stored in the B-side registers r1:0, r3:2, r17:16 and r19:18;

cfacc1:0+=cfr5:4<AB>||cfacc3:2+=cfr7:6<AB>||cfacc5:4+=cfr21:20<AB>||cfacc7:6+=cfr23:22<AB> means that the multiplication results in the B-side registers r5:4, r7:6, r21:20 and r23:22 are accumulated by the accumulator ACC.

Finally, the push-protected HXDSP1042 processor is reset (restored).

The push protection operation is performed on the frame pointer U9, the procedure call return address register SER and the callee-saved registers (Callee-Save Registers) of the HXDSP1042 processor, respectively, to obtain the push-protected HXDSP1042 processor.

Specifically, when protecting the frame pointer U9 and the procedure call return address register SER: the HXDSP1042 automatically records the return address of a subroutine call in the subroutine return address register (Subroutine Return Address Register, SER). Because of the procedure call relationship, the contents of the frame pointer U9 and of SER need to be saved on the procedure stack, as shown in the following code segment:

xr39=SER||yr39=u9

[u8+=-2,-1]=xyr39

u9=u8

where xr39 is register 39 of the x macro of either core of the HXDSP1042, yr39 is register 39 of the y macro, and xyr39 denotes register 39 of both the x and y macros. The values of the subroutine return address register SER and the frame pointer U9 are first assigned to xr39 and yr39, respectively; the contents of xr39 and yr39 are then written to the stack area pointed to by the address generator U8; finally, the value of the address generator U8 is saved in U9 by assignment.

When protecting the callee-saved registers: according to the register classification of the HXDSP1042 processor, when a program uses registers that the called function is responsible for protecting (Callee-Save Registers), their old values must be saved at the beginning of the function and restored before the function returns. Because the algorithm uses the zero-overhead loop register (Loop Count Register, LC) lc2 to store a loop control variable, and lc2 is a callee-saved register, the old value of lc2 is saved as follows.

xr39=lc2

[u8+=-2,-1]=xr39,

where xr39 is register 39 of the x macro of any core of the HXDSP1042 processor. The value of the zero-overhead loop register LC2 is first assigned to xr39, and the contents of xr39 are then written to the stack area pointed to by the address generator U8.
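The save-on-entry, restore-before-return discipline described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the register file is a dictionary, the stack is a Python list rather than memory addressed through U8, the function name run_protected is hypothetical, and the restore order (reverse of the pushes) is assumed because the epilogue code is not shown in this section.

```python
def run_protected(regs, stack, body):
    """Model of push protection: save SER, U9 and LC2, run body, restore them.

    regs  -- dict with keys "SER", "U9", "LC2" (a stand-in for real registers)
    stack -- Python list used as the procedure stack (stand-in for [u8 ...])
    body  -- callable that may overwrite any of the protected registers
    """
    # Prologue, step 1: push the return address and the frame pointer together,
    # mirroring the paired write of xyr39.
    stack.append((regs["SER"], regs["U9"]))
    # Step 2: record the current stack position in U9 (models "u9=u8").
    regs["U9"] = len(stack)
    # Step 3: push the old value of the callee-save loop register LC2.
    stack.append(regs["LC2"])
    try:
        body(regs)  # the function body is now free to use LC2
    finally:
        # Epilogue (assumed): pop in the reverse order of the pushes.
        regs["LC2"] = stack.pop()
        regs["SER"], regs["U9"] = stack.pop()
```

After run_protected returns, the caller sees the same SER, U9 and LC2 values as before the call, regardless of what the body wrote to them.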

The implementation of the invention is described in further detail below in conjunction with the diagram and the assembly instructions:

Input data	x macro	y macro	z macro	t macro
X1	xr11:10	-	-	-
X2	xr13:12	yr11:10	-	-
X3	xr15:14	yr13:12	zr11:10	-
X4	xr17:16	yr15:14	zr13:12	tr11:10
X5	xr19:18	yr17:16	zr15:14	tr13:12
X6	xr21:20	yr19:18	zr17:16	tr15:14
X7	xr23:22	yr21:20	zr19:18	tr17:16
X8	xr25:24	yr23:22	zr21:20	tr19:18
X9	xr27:26	yr25:24	zr23:22	tr21:20
X10	xr29:28	yr27:26	zr25:24	tr23:22
X11	xr31:30	yr29:28	zr27:26	tr25:24
X12	xr33:32	yr31:30	zr29:28	tr27:26
X13	-	yr33:32	zr31:30	tr29:28
X14	-	-	zr33:32	tr31:30
X15	-	-	-	tr33:32
…

where X1, X2, … are the elements of the input sequence; x and y denote the x macro and the y macro of any core of the HXDSP1042 processor; r is short for register; and xr11:10 denotes the two adjacent registers R10 and R11 of the x macro of any core of the HXDSP1042 processor, where R10 holds the real part and R11 holds the imaginary part, and so on.

The floating-point complex FIR assembly algorithm makes full use of the data buses and arithmetic resources of the domestic HXDSP1042 processor. The assembly program is designed and optimized with three techniques: increased instruction-level parallelism, loop unrolling, and software pipelining. Register assignment is used to indirectly reduce the amount of data that must be fetched, and pipeline stalls caused by bank conflicts and block conflicts are avoided, which gives a clear advantage in processing performance. Compared with the DSPF_sp_fir_cplx function of the current mainstream TMS320C6678 processor, the invention improves performance several-fold and can be widely used in domestic signal processing equipment.
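The register listing above follows a regular pattern: each input element is held in up to four macros, with the register pair shifted by one position per macro. The following Python sketch reproduces that pattern. The macro order, the constants and the function name register_slots are inferred from the listing for illustration and are not taken from the assembly source.

```python
MACROS = ["x", "y", "z", "t"]  # macro order inferred from the register names
FIRST_REG = 10                 # low register of the first pair (r11:10)
NUM_PAIRS = 12                 # pairs r11:10 through r33:32


def register_slots(k):
    """Return the register pairs that hold input element X<k> (k is 1-based)."""
    slots = []
    for m, macro in enumerate(MACROS):
        pair_index = k - m  # each successive macro lags by one input element
        if 1 <= pair_index <= NUM_PAIRS:
            low = FIRST_REG + 2 * (pair_index - 1)  # real part register
            slots.append(f"{macro}r{low + 1}:{low}")  # imaginary:real
    return slots


if __name__ == "__main__":
    for k in range(1, 16):
        print(f"X{k}", *register_slots(k))
```

Seen this way, the four macros each hold a window of 12 consecutive input elements, and the windows are offset from one another by one element.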

The proposed method was also verified experimentally:

1. Functional testing

Following the design method, random test stimuli are generated in Matlab and fed to the assembly function under test, and the computed results are compared with the Matlab results. Referring to fig. 4, fig. 4 shows the functional test results of the FIR algorithm obtained with the floating-point complex FIR optimization method based on the HXDSP1042 processor according to an embodiment of the invention. It gives the error results under 4 sets of different test stimuli, which indicate that the algorithm functions correctly.

2. Performance testing

The execution efficiency of a function is an important index for measuring a processor. During development of the assembly functions, the actual number of clock cycles of each function is compared with its theoretical number of clock cycles; when the actual count is less than 1.5 times the theoretical count, the function is considered to meet the design requirement.

Under the arithmetic resource limit, the theoretical running time is MAX1 = nh × ny / 8; under the throughput limit, it is MAX2 = (nh + ny) / 8. Protecting the callee-save registers before and after the function body requires C = 41 serial cycles. The theoretical running time of the floating-point complex FIR function is therefore: L_cycle = C + max{MAX1, MAX2} = C + nh × ny / 8. The actual running cycles of the function are shown in the following table; all of them are within 1.5 times the theoretical running time.

FIR running cycle test

Input data specification	512*64	1024*128	2048*256
Actual number of clock cycles	1589	5289	18437
Theoretical number of clock cycles	1074	4146	16434
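The 1.5 times acceptance criterion can be checked directly against the figures in the table. The following Python sketch does so; the dictionary names and the function meets_requirement are introduced here for illustration only.

```python
# Cycle counts copied from the table above, keyed by input data specification.
ACTUAL = {"512*64": 1589, "1024*128": 5289, "2048*256": 18437}
THEORETICAL = {"512*64": 1074, "1024*128": 4146, "2048*256": 16434}


def meets_requirement(actual_cycles, theoretical_cycles, factor=1.5):
    """True when the actual count is below factor times the theoretical count."""
    return actual_cycles < factor * theoretical_cycles


def cycle_ratios():
    """Ratio of actual to theoretical cycles for every specification."""
    return {spec: ACTUAL[spec] / THEORETICAL[spec] for spec in ACTUAL}


if __name__ == "__main__":
    for spec, ratio in cycle_ratios().items():
        ok = meets_requirement(ACTUAL[spec], THEORETICAL[spec])
        print(f"{spec}: ratio {ratio:.2f}, meets requirement: {ok}")
```

The ratio falls as the input size grows, which is what one would expect when a fixed overhead is spread over more work.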

The DSPF_sp_fir_cplx function of the mainstream TMS320C6678 processor was also tested, and its clock cycle counts are shown in the following table. The clock frequency of the TMS320C6678 is 1.25 GHz and that of the HXDSP1042 is 700 MHz; combining these with the cycle counts gives the time taken by the floating-point complex FIR on the two processors.

HXDSP1042 and TMS320C6678 cycle comparison

Input data specification	512*64	1024*128	2048*256
HXDSP1042 actual number of clock cycles	1589	5289	18437
TMS320C6678 actual number of clock cycles	11304	38949	274469
HXDSP1042 time taken	2.27 μs	7.55 μs	26.3 μs
TMS320C6678 time taken	9.04 μs	31.159 μs	219 μs

Comparing the actual time taken by the two processors in the table shows that, for the three input data specifications, the HXDSP1042 is 3.98, 4.13 and 8.32 times as fast as TI's TMS320C6678 respectively, which meets the design optimization requirement.
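The conversion from cycle counts to elapsed time, and the resulting speed ratios, can be reproduced with a few lines of Python. The sketch below uses only the clock frequencies and cycle counts stated above; the names elapsed_us and speedup are illustrative. Because it works from the unrounded cycle counts, its ratios agree with the reported figures only to within rounding.

```python
HXDSP1042_HZ = 700e6     # 700 MHz, as stated in the text
TMS320C6678_HZ = 1.25e9  # 1.25 GHz, as stated in the text

# (HXDSP1042 cycles, TMS320C6678 cycles) per input data specification.
CYCLES = {
    "512*64": (1589, 11304),
    "1024*128": (5289, 38949),
    "2048*256": (18437, 274469),
}


def elapsed_us(cycles, clock_hz):
    """Elapsed time in microseconds for a given cycle count and clock rate."""
    return cycles / clock_hz * 1e6


def speedup(hx_cycles, ti_cycles):
    """Ratio of TMS320C6678 time to HXDSP1042 time for the same workload."""
    return elapsed_us(ti_cycles, TMS320C6678_HZ) / elapsed_us(hx_cycles, HXDSP1042_HZ)


if __name__ == "__main__":
    for spec, (hx, ti) in CYCLES.items():
        print(
            f"{spec}: HXDSP1042 {elapsed_us(hx, HXDSP1042_HZ):.2f} us, "
            f"TMS320C6678 {elapsed_us(ti, TMS320C6678_HZ):.2f} us, "
            f"ratio {speedup(hx, ti):.2f}"
        )
```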

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A floating-point complex FIR optimization method based on the HXDSP1042 processor, characterized by comprising the following steps:

specifically, the calculation formula is as follows:

x*y=(a+bi)*(c+di)=(ac-bd)+(ad+bc)i,

where x and y are single-precision floating-point complex numbers expressed as a+bi and c+di respectively, so that the multiplication of x and y can be split into 4 single-precision floating-point real multiplications and 2 single-precision floating-point real additions;

the number of multiply-add operations, the number of loop iterations, and the remainder are placed in the zero-overhead loop registers lc0, lc1, and lc2, respectively.

First, the zero overhead loop register lc0 is assigned: lc0=xr3;

xr7=r4 ashift-3,

lc1=xr7,

finally, the remainder is calculated according to the following code segment, wherein the general register r4 stores the length of the filter coefficients: the fext instruction of the HXDSP1042 is first used to extract the low 3 bits of r4 and assign the result to the general register xr5, and the value of xr5 is then assigned to the zero-overhead loop register lc2;

xr5=r4 fext(0:3,0)(z),

lc2=xr5,

performing a convolution operation on the input sequence and the filter sequence according to the loop control variables to obtain an output sequence, and storing the output sequence at the first address of the output sequence;

and restoring the HXDSP1042 processor after push protection.
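The arithmetic in claim 1 can be illustrated with a short scalar reference model in Python. This is a sketch, not the optimized assembly: cmul shows the split into 4 real multiplications and 2 real additions, loop_controls shows the shift-by-3 and low-3-bit extraction applied to the filter length, and fir_cplx is a plain convolution whose indexing convention and input length requirement are assumptions made here for illustration. All three function names are hypothetical.

```python
def cmul(a, b, c, d):
    """(a+bi)*(c+di) using 4 real multiplications and 2 real additions."""
    return (a * c - b * d, a * d + b * c)


def loop_controls(nh):
    """Split the filter length into groups of 8 and a remainder.

    nh >> 3 mirrors the arithmetic shift by -3 that feeds lc1;
    nh & 0x7 mirrors extracting the low 3 bits that feed lc2.
    """
    return nh >> 3, nh & 0x7


def fir_cplx(x, h, ny):
    """Reference complex FIR: y[n] = sum over k of h[k] * x[n + nh - 1 - k].

    x and h are lists of (real, imag) pairs. The indexing convention is an
    assumption; x must hold at least ny + nh - 1 samples.
    """
    nh = len(h)
    if len(x) < ny + nh - 1:
        raise ValueError("x must hold at least ny + nh - 1 samples")
    y = []
    for n in range(ny):
        acc_re, acc_im = 0.0, 0.0
        for k in range(nh):
            x_re, x_im = x[n + nh - 1 - k]
            h_re, h_im = h[k]
            p_re, p_im = cmul(x_re, x_im, h_re, h_im)
            acc_re += p_re  # accumulate real part
            acc_im += p_im  # accumulate imaginary part
        y.append((acc_re, acc_im))
    return y
```

A model of this kind is mainly useful as a correctness reference against which an optimized implementation can be compared.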

2. The floating-point complex FIR optimization method based on the HXDSP1042 processor according to claim 1, characterized in that performing push protection on the HXDSP1042 processor to obtain a push-protected HXDSP1042 processor comprises:

3. The floating-point complex FIR optimization method based on the HXDSP1042 processor according to claim 1, characterized in that the input sequence comprises several sets of data, each set consisting of 8 complex data.