CN107357552A

CN107357552A - Optimization method for floating-point complex vector summation on BWDSP chips

Info

Publication number: CN107357552A
Application number: CN201710419846.2A
Authority: CN
Inventors: 苏涛; 秦越; 王瑞昕; 邱瑾; 张杏
Original assignee: Xidian University; Xian Cetc Xidian University Radar Technology Collaborative Innovation Research Institute Co Ltd
Current assignee: Xidian University; Xian Cetc Xidian University Radar Technology Collaborative Innovation Research Institute Co Ltd
Priority date: 2017-06-06
Filing date: 2017-06-06
Publication date: 2017-11-17
Anticipated expiration: 2037-06-06
Also published as: CN107357552B

Abstract

The invention belongs to the field of low-level library function optimization for digital signal processors and discloses an optimization method for floating-point complex vector summation on the high-performance general-purpose signal processor BWDSP. The floating-point complex vector summation is the sum of a first floating-point complex vector and a second floating-point complex vector, and is implemented as a loop made up of multiple floating-point complex summation steps. Each summation step combines three optimizations: optimization based on BWDSP instruction-level parallelism, in which one instruction controls multiple arithmetic units so that they perform the same operation simultaneously; optimization based on loop unrolling, in which several copies of the same loop code are unrolled into one loop iteration; and optimization based on software pipelining, in which several iterations of the same loop code are interleaved and executed in parallel. The method makes full use of the hardware resources of the BWDSP chip and yields efficient low-level functions.

Description

Optimization method for floating-point complex vector summation on BWDSP chips

Technical field

The invention belongs to the field of low-level library function optimization for digital signal processors, and in particular relates to an optimization method for floating-point complex vector summation on the high-performance general-purpose signal processor BWDSP.

Background art

With the rapid development of large-scale integrated circuits and digital computers, digital signal processors have gradually matured to meet the demands of high-speed real-time signal processing. Independently developing DSP chips has become an important topic in the development of digital signal processing technology in China. Against this background, a research institute introduced the high-performance BWDSP processor series. From the architecture and the instruction set through design and implementation to the supporting software and hardware development environment, this DSP was developed entirely independently, and its performance is stated to exceed that of some general-purpose products available internationally. It has broad application prospects in national defense, public safety, the Internet of Things, communications and other industries, and its success would break the foreign monopoly on high-end digital signal processing chips in China.

The BWDSP chip is a 32-bit static superscalar processor with 16-way issue (up to 16 instructions can be issued in each instruction cycle) and a single instruction, multiple data (SIMD) architecture, in which one instruction processes multiple data items simultaneously. The instruction bus is 512 bits wide; the internal data bus is an asymmetric full-duplex bus 256 bits wide. The processor supports both 16-bit and 32-bit fixed-point data formats and uses a very long instruction word (VLIW) architecture, giving it strong parallel processing capability that suits high-speed real-time signal processing well.

Software for BWDSP is developed in C or in assembly language. C is readable and portable, but it does not allow direct control of the hardware and cannot exploit the characteristics of the DSP chip itself. To improve the execution efficiency of library functions and to maximize the utilization of processor hardware resources, the library functions are therefore written in assembly language. However, assembly code obtained by compiling C directly does not take the hardware characteristics of the DSP chip into account and has certain shortcomings.

Summary of the invention

In view of the above shortcomings of the prior art, the object of the invention is to provide an optimization method for floating-point complex vector summation on the high-performance general-purpose signal processor BWDSP that makes full use of the hardware resources of the BWDSP chip and yields efficient low-level functions.

To achieve this object, the invention adopts the following technical solution.

An optimization method for floating-point complex vector summation based on a BWDSP chip, wherein:

The floating-point complex vector summation is the sum of a first floating-point complex vector and a second floating-point complex vector, and this summation is a loop made up of multiple floating-point complex summation steps;

Each floating-point complex summation step includes optimization based on BWDSP instruction-level parallelism, optimization based on loop unrolling, and optimization based on software pipelining;

The optimization based on BWDSP instruction-level parallelism uses one instruction to control multiple arithmetic units so that they perform the same operation simultaneously; the optimization based on loop unrolling unrolls several copies of the same loop body into one loop iteration; the optimization based on software pipelining interleaves several iterations of the same loop so that they execute in parallel.

Features and further improvements of the technical solution of the invention:

(1) The BWDSP chip contains 4 execution macros. Each execution macro contains a 64-word general-purpose register file, whose registers are numbered 0 to 63, and 8 adders;

The optimization based on BWDSP instruction-level parallelism specifically comprises:

In one instruction line, reading 8 data words from the first floating-point complex vector and 8 data words from the second floating-point complex vector, the two reads being performed in parallel; the first and second floating-point complex vectors each contain N data words, where N is much greater than 8;

The 8 data words from the first floating-point complex vector are stored, in order, in the eight registers formed by the first register and the second register of each of the four register files of the BWDSP chip; the 8 data words from the second floating-point complex vector are stored, in order, in the eight registers formed by the third register and the fourth register of each of the four register files; the two stores are performed in parallel;

In one instruction, in each execution macro of the BWDSP chip, a first adder adds the data in the first register to the data in the third register and stores the sum in a fifth register, and a second adder adds the data in the second register to the data in the fourth register and stores the sum in a sixth register; the addition in the first adder and the addition in the second adder are performed in parallel.

(2) The optimization based on loop unrolling is specifically:

Multiple floating-point complex summation steps are executed in one loop iteration.

(3) The optimization based on loop unrolling is specifically:

Three floating-point complex summation steps are executed in one loop iteration.

(4) The optimization based on software pipelining is specifically:

Iter1, Iter2 and Iter3 denote three adjacent loop iterations, and i0 to i5 denote the instructions that each iteration must execute: i0 reads data from the two vectors, i1 is the summation instruction, i2 writes the result, i3 increments the counter, i4 decrements the data length, and i5 is the branch decision instruction;

After optimization based on software pipelining, the overall loop consists, in order, of a loop start phase (prologue), a loop core phase (kernel) and a loop end phase (epilogue), 8 instruction cycles in total:

The loop start phase consists of instruction cycles 0 and 1. In this phase, Iter1 executes the read instruction i0 and the summation instruction i1 in turn; Iter2 executes the read instruction i0; Iter3 executes no instruction;

The loop core phase runs from instruction cycle 2 to instruction cycle 5. In this phase, Iter1 executes in turn the write instruction i2, the counter increment i3, the data length decrement i4 and the decision instruction i5; Iter2 executes in turn the summation instruction i1, the write instruction i2, the counter increment i3 and the data length decrement i4; Iter3 executes in turn the read instruction i0, the summation instruction i1, the write instruction i2 and the counter increment i3;

The loop end phase consists of instruction cycles 6 and 7. In this phase, Iter1 executes no instruction; Iter2 executes the decision instruction i5; Iter3 executes the data length decrement i4 and the decision instruction i5 in turn.

With instruction-level parallelism, the instruction slots of each instruction line are used as fully as possible, and SIMD instructions are used wherever possible so that one instruction controls multiple arithmetic units performing the same operation simultaneously. This makes the code more compact and, more importantly, greatly improves execution efficiency. On top of instruction-level parallelism, the independent iterations of the loop are unrolled, so that several copies of the loop code are placed one after another in the loop body. The conditional jump at the end of the loop code incurs several pipeline stall cycles; the more iterations there are, the more often this last instruction executes and the greater the cost of its stalls. Loop unrolling reduces the number of times the conditional jump executes and therefore improves efficiency to some extent. Finally, software pipelining is applied on top of instruction-level parallelism and loop unrolling: according to the characteristics of the instruction pipeline, the unrolled iterations inside the loop body are rearranged so that adjacent copies of the loop code are interleaved, which eliminates pipeline stalls and improves efficiency significantly.

Brief description of the drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the structure of a single execution macro according to an embodiment of the invention;

Fig. 2 is a schematic diagram of data access under instruction-level parallelism according to an embodiment of the invention;

Fig. 3 is a schematic diagram of software pipelining according to an embodiment of the invention;

Fig. 4 is a schematic diagram of the instruction arrangement after pipelining optimization according to an embodiment of the invention;

Fig. 5 is a schematic diagram of data grouping after pipelining optimization according to an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

The resources used when developing low-level functions are mainly the execution macros. The execution units of the processor are contained in four execution macros, named macro x, macro y, macro z and macro t. The four execution macros have identical external interfaces and internal structure. They obtain operation instructions from the decoder and operands from data memory, and carry out the specific operations.

Each execution macro contains a local 64-word general-purpose register file (R0 to R63), 8 adders, 8 multipliers, 2 shifters and 1 super-calculator unit. The internal structure of an execution macro is shown in Fig. 1. An adder obtains its source operands from the register file of the execution macro, performs the addition on them, and returns the result to the register file.

Besides the resources in the execution macros, development frequently uses the address generators. The BWDSP processor has three address generation units, named U, V and W, with identical structure and function. Each address generation unit has 16 address registers, numbered 0 to 15.

The commonly used addressing and addition instructions are described below:

Linear addressing is a form of direct addressing within an instruction: once the base address is fixed, the other required addresses are computed from the base address and an offset according to a linear relationship. There are two forms, single-word addressing and double-word addressing. Address generation unit U is used as the example below; the addressing operations of units V and W are identical to those of U.

Single-word addressing in unit U: {x, y, z, t}Rs=[Un+=Um, Uk]; Un is the base address, Um is the base address modifier, and Uk is the address offset used when several data words are read at once. The subscripts are address register numbers and therefore range from 0 to 15. Single-word addressing reads 4 data words at a time, so 4 addresses must be supplied. The actual addresses are [Un], [Un+Uk], [Un+2Uk] and [Un+3Uk]; the 4 data words read are sent to the 4 execution macros, one per macro. Whether a macro receives data depends on whether its mnemonic appears in {x, y, z, t}: if it appears, the macro reads; otherwise it does not. After the instruction executes, Um modifies the base address for the next read, that is, Un=Un+Um. If none of {x, y, z, t} appears in the instruction, the instruction runs on all 4 execution macros simultaneously.

Double-word addressing in unit U: {x, y, z, t}Rs+1:s=[Un+=Um, Uk]; in a double-word operation, each address and its adjacent address form an address pair, and the other addresses read from memory are determined by the address offset Uk. Because double words are read, the actual step between pairs is the offset multiplied by 2. This gives the 8 addresses [Un] and [Un+1], [Un+2Uk] and [Un+2Uk+1], [Un+2*2Uk] and [Un+2*2Uk+1], [Un+3*2Uk] and [Un+3*2Uk+1]; the 8 data words read are sent to the execution macros. Each macro named in {x, y, z, t} receives 2 data words.
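The address arithmetic of the two addressing forms can be sketched in C. This is only an illustrative model of the description above, not BWDSP code: the function names `agu_single` and `agu_double` and the use of `unsigned` for addresses are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of single-word addressing {x,y,z,t}Rs=[Un+=Um, Uk]:
   four addresses Un, Un+Uk, Un+2Uk, Un+3Uk (one per execution macro),
   after which the base address is post-modified by Um. */
void agu_single(unsigned *un, unsigned um, unsigned uk, unsigned addr[4])
{
    assert(un != NULL && addr != NULL);
    for (unsigned i = 0; i < 4; ++i)
        addr[i] = *un + i * uk;
    *un += um;
}

/* Hypothetical model of double-word addressing {x,y,z,t}Rs+1:s=[Un+=Um, Uk]:
   four address pairs (Un + i*2Uk, Un + i*2Uk + 1), two words per macro,
   after which the base address is post-modified by Um. */
void agu_double(unsigned *un, unsigned um, unsigned uk, unsigned addr[8])
{
    assert(un != NULL && addr != NULL);
    for (unsigned i = 0; i < 4; ++i) {
        addr[2 * i]     = *un + i * 2 * uk;
        addr[2 * i + 1] = *un + i * 2 * uk + 1;
    }
    *un += um;
}
```

With Uk = 1 and Um = 8, `agu_double` yields eight consecutive addresses and advances the base by 8, which corresponds to the access pattern written as [u0+=8,1] later in the description.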

Addition: {macro}Rs=Rm+Rn; register Rm is added to Rn and the result is stored in register Rs.

Each instruction can run on a single execution macro or on several; if {macro} does not appear in the instruction, the instruction runs on all 4 execution macros simultaneously.

Complex addition: {macro}CRs+1:s=CRm+1:m+CRn+1:n; the complex values CRm+1:m and CRn+1:n are added and the result is stored in CRs+1:s. CRx+1:x denotes one complex value whose real and imaginary parts are held in the two adjacent registers Rx+1 and Rx; the prefix 'C' marks a complex operand. Executing this instruction in one macro requires two adders and is equivalent to performing Rs+1=Rm+1+Rn+1 and Rs=Rm+Rn at the same time. If {macro} does not appear in the instruction, the instruction runs on all 4 execution macros simultaneously.

Zero-overhead loop: if lc0 b label; a conditional jump based on lc0. At the end of the loop, the zero-overhead loop register lc0 is checked. If lc0 equals 0, no jump is taken and execution continues with the next instruction; if lc0 is not 0, lc0 is automatically decremented by one and execution jumps to label.

Because the floating-point complex vector summation function has the standard three processing stages (read, single-step computation, write), it is used as the example for explaining the three optimizations. Its C interface is int cvectadd(complex_float *in1, complex_float *in2, unsigned int n, complex_float *out); where complex_float is a user-defined complex structure. The algorithm consists mainly of one loop, whose C code is as follows:
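A minimal C sketch of such a loop is given here for orientation. It is an assumption-based illustration rather than the listing from the original filing: the field names and field order of `complex_float` (imaginary part at the lower address, following the storage order stated later in the description) and the return value 0 are choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: imaginary part at the lower address, real part at the
   higher address. The field names are hypothetical. */
typedef struct {
    float im;
    float re;
} complex_float;

/* Element-wise sum of two complex vectors of n elements.
   Returning 0 on completion is an assumption; the text does not specify
   the meaning of the int return value. */
int cvectadd(complex_float *in1, complex_float *in2, unsigned int n,
             complex_float *out)
{
    assert(in1 != NULL && in2 != NULL && out != NULL);
    for (unsigned int i = 0; i < n; ++i) {
        out[i].re = in1[i].re + in2[i].re;
        out[i].im = in1[i].im + in2[i].im;
    }
    return 0;
}
```

Each iteration has the read, add and write stages that the assembly versions discussed afterwards rearrange.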

Before the assembly program is written from the parameter list of the C interface, it must be determined how the parameters are passed. The following table shows the parameter and return value passing conventions supported by the compiler.

Scalar parameters located within the first 8 parameter words are all passed in general-purpose registers of macro x. Pointer parameters are passed in U address registers only when they lie within the first 4 parameter words. In the interface of the vector summation, the first two parameters in1 and in2 have type complex_float* and, being pointers, are passed in U registers; parameter n has type unsigned int and is passed in a general-purpose register of macro x; parameter out is likewise passed in a U register. The register index of each parameter is its relative position in the parameter list. For this function the parameters are therefore passed as follows: U0 points to in1, U1 points to in2, n is stored in xr2, and U3 points to out.

Writing the assembly code for the above C function directly, without any optimization, gives the native assembly code below. Here lc0 is the zero-overhead loop register; its value is the vector data length n, so that _mainLoop is executed n times. When the last statement is reached, lc0 is automatically decremented by one if it is not zero; once lc0 is zero, execution no longer jumps to _mainLoop and continues with the code that follows.

Code description: before _mainLoop, the value of u1 is assigned to v0 and the value of u3 is assigned to w0, so that u0, v0 and w0 point to in1, in2 and out respectively. Inside _mainLoop, the first instruction line xr10=[u0+=1,1] reads one value from the address in u0 into register xr10 and advances u0 by 1 to the next data word of in1. The second instruction line xr12=[v0+=1,1] reads one value from the address in v0 into register xr12 and advances v0 by 1 to the next data word of in2. The third instruction line sends the values of xr10 and xr12 to an adder in macro x, which adds them and returns the result to register xr22. The fourth instruction line [w0+=1,1]=xr22 writes the result in xr22 to the address that w0 points to. Each pass through _mainLoop produces one data word of the sum of the two vectors.

With the data length n set to 960, this code takes 7693 cycles to run, which is very inefficient.

The native assembly code is optimized as follows; the principle, the implementation and the result of each optimization are given below:

(1) Optimization based on instruction-level parallelism:

The basic processing cores of the BWDSP operate in SIMD fashion with respect to one another: one instruction can control multiple arithmetic units so that they perform the same operation simultaneously. Inside each basic processing core the operation is MIMD: each arithmetic unit is controlled by its own instruction and completes its computation independently. Each instruction can control up to 4 basic processing cores, and up to 16 instructions can execute in parallel in one instruction cycle. Instructions are separated by the separator "||"; the instructions separated by "||" on one line are called instruction slots. Making full use of the read and write buses and of the internal arithmetic resources greatly improves execution efficiency and also reduces code size.

To make full use of the adders in each execution macro of the BWDSP processor, the floating-point complex addition instruction CFRs+1:s=CFRm+1:m+CFRn+1:n is used. The F in the instruction denotes a floating-point operation. This one instruction controls all four execution macros so that they perform the addition simultaneously. The reads are likewise performed in parallel. The improved code is as follows:

Code description: see Fig. 2. The first instruction line uses two instruction slots to read from the two input vectors. r13:12 denotes registers r13 and r12 in the four macros x, y, z and t, 8 registers in total. r13:12=[u0+=8,1] therefore reads 8 data words from the address in u0 in one pass: the data at [u0], [u0+1], [u0+2], [u0+3], [u0+4], [u0+5], [u0+6] and [u0+7] are passed to xr12, xr13, yr12, yr13, zr12, zr13, tr12 and tr13 respectively. Complex values are stored in memory with the imaginary part at the lower address and the real part at the higher address, so in the first iteration [u0] and [u0+1] hold the imaginary and real parts of the first value, and [u0] to [u0+7] hold the first 4 complex values, 8 data words in total. Similarly, r11:10=[v0+=8,1] reads 8 data words from the address in v0 in one pass. After the reads, u0 and v0 are each advanced by 8 to the addresses to be read in the next iteration.

The second instruction line is the complex addition instruction. It is equivalent to executing fr22=fr12+fr10 and fr23=fr13+fr11 in parallel: in each macro, the values in registers r12 and r10 are sent to adder 1 of that macro, the values in registers r11 and r13 are sent to adder 2 of that macro, and the results of the two adders are returned to registers r22 and r23 of the same macro.

The third instruction line [w0+=8,1]=r23:22 writes the sums held in registers r22 and r23 of the four macros, 4 complex values or 8 data words in total, to addresses [w0] to [w0+7]. [w0] and [w0+1] hold the imaginary and real parts of the first complex value, and the following data words are laid out in the same way. After each write, w0 is advanced by 8 to the addresses to be written in the next iteration.

Because one iteration now computes 8 data words of the output vector, lc0 is set to the integer part of n divided by 8, so that _mainLoop executes n/8 times. For n equal to 960, this code takes 973 cycles, an improvement of about one order of magnitude over the native assembly code.
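The data flow of this three-line loop body can be modelled in C as a rough sketch. It is an illustration under stated assumptions, not the BWDSP assembly: the function name `vadd_parallel` is hypothetical, the vectors are treated as flat arrays of 32-bit words (two per complex value), and n is assumed to be a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>

enum { MACROS = 4, REGS_PER_MACRO = 2,
       WORDS_PER_LINE = MACROS * REGS_PER_MACRO };

/* Hypothetical C model of the parallel loop body.
   n counts data words and must be a multiple of 8 (assumption).
   Returns the number of loop iterations executed, i.e. n / 8. */
unsigned vadd_parallel(const float *in1, const float *in2, unsigned n,
                       float *out)
{
    assert(in1 != NULL && in2 != NULL && out != NULL);
    assert(n % WORDS_PER_LINE == 0);

    unsigned iterations = 0;
    for (unsigned base = 0; base < n; base += WORDS_PER_LINE) {
        float a[MACROS][REGS_PER_MACRO];  /* models r12, r13 of x, y, z, t */
        float b[MACROS][REGS_PER_MACRO];  /* models r10, r11 of x, y, z, t */
        float s[MACROS][REGS_PER_MACRO];  /* models r22, r23 of x, y, z, t */

        /* Line 1: two load slots, 8 words from each input vector. */
        for (unsigned m = 0; m < MACROS; ++m)
            for (unsigned r = 0; r < REGS_PER_MACRO; ++r) {
                a[m][r] = in1[base + REGS_PER_MACRO * m + r];
                b[m][r] = in2[base + REGS_PER_MACRO * m + r];
            }

        /* Line 2: complex add, two adders in each of the four macros. */
        for (unsigned m = 0; m < MACROS; ++m)
            for (unsigned r = 0; r < REGS_PER_MACRO; ++r)
                s[m][r] = a[m][r] + b[m][r];

        /* Line 3: store 8 result words. */
        for (unsigned m = 0; m < MACROS; ++m)
            for (unsigned r = 0; r < REGS_PER_MACRO; ++r)
                out[base + REGS_PER_MACRO * m + r] = s[m][r];

        ++iterations;
    }
    return iterations;
}
```

For n = 960 the model performs 120 iterations, matching the n/8 loop count stated above; the reported 973 cycles then correspond to roughly 8 cycles per iteration.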

(2) Optimization based on loop unrolling:

The time-consuming part of a program is usually concentrated in its larger loops, so the core of program optimization is loop optimization. Loop unrolling trades program size for execution speed by replicating the loop body several times. It gives the instruction scheduler more room, reduces the overhead of the loop branch instruction, and allows data to be prefetched more effectively.

The loop body of this function can be divided into three sequential steps: fetch the operands, sum, and store the result. The order of these three steps cannot be changed. Although the statements inside one iteration have a strict order in time, adjacent iterations are independent: they perform separate processing on different data, so there is no data dependence between them. The code is therefore optimized by loop unrolling, merging three consecutive iterations into one, which reduces the number of loop condition checks and jump instructions and improves code efficiency. The code after optimization based on loop unrolling is as follows:

Code description: loop unrolling builds on instruction-level parallelism. The first three instruction lines of the parallel code above (read, compute, write) are replicated three times within one iteration, so that the loop condition is checked once per three original iterations, reducing the overhead of condition checks and branches.

Because one iteration now contains three unrolled copies of the loop code, one execution of _mainLoop produces 24 data words of the output vector. lc0 is therefore set to the integer part of n divided by 24, so that _mainLoop executes n/24 times. For n equal to 960, this code takes 893 cycles, a further improvement. Unrolling more copies reduces the cycle count further, but the code size grows accordingly.
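The unrolling by three can be sketched in C as follows. This is an illustrative model only: the names `add_group` and `vadd_unrolled3` are hypothetical, n is counted in data words and assumed to be a multiple of 8, and the remainder loop for lengths that are not a multiple of 24 is an addition for this sketch (the description only states that the loop count is the integer part of n/24).

```c
#include <assert.h>
#include <stddef.h>

enum { GROUP_WORDS = 8, UNROLL = 3 };

/* One read/add/write step over 8 data words (hypothetical helper). */
void add_group(const float *a, const float *b, float *o)
{
    for (unsigned k = 0; k < GROUP_WORDS; ++k)
        o[k] = a[k] + b[k];
}

/* Hypothetical model of the loop unrolled three times.
   Returns the number of trips through the unrolled main loop (n / 24). */
unsigned vadd_unrolled3(const float *in1, const float *in2, unsigned n,
                        float *out)
{
    assert(in1 != NULL && in2 != NULL && out != NULL);
    assert(n % GROUP_WORDS == 0);

    unsigned trips = 0;
    unsigned i = 0;
    /* Main loop: three copies of the body, one branch per 24 words. */
    for (; i + UNROLL * GROUP_WORDS <= n; i += UNROLL * GROUP_WORDS) {
        add_group(in1 + i, in2 + i, out + i);
        add_group(in1 + i + GROUP_WORDS, in2 + i + GROUP_WORDS,
                  out + i + GROUP_WORDS);
        add_group(in1 + i + 2 * GROUP_WORDS, in2 + i + 2 * GROUP_WORDS,
                  out + i + 2 * GROUP_WORDS);
        ++trips;
    }
    /* Remainder: leftover groups of 8 words (assumption of this sketch). */
    for (; i < n; i += GROUP_WORDS)
        add_group(in1 + i, in2 + i, out + i);

    return trips;
}
```

For n = 960 the main loop runs 40 times instead of 120, so the loop branch executes a third as often; this is the saving that the drop from 973 to 893 cycles reflects.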

(3) Optimization based on software pipelining:

Like loop unrolling, software pipelining is a technique that reduces the latency between instructions by reorganizing the loop. It does not simply unroll the loop; it interleaves instructions from different loop iterations so that they execute in parallel, which reduces the cycle count substantially. Combined with loop unrolling, it removes the dependences between adjacent instructions, while the instructions within a single iteration remain serial. In this way the DSP hardware resources are fully used and the loop overhead is reduced at the same time.

In software pipelining, while some processing units execute the operations of the current iteration, other processing units execute operations of other iterations at the same time. This requires the dependences between registers and operands of different iterations to be reduced; the loop can be unrolled only when the output of the current iteration is not the input of the next iteration. The assembly pseudo-code for floating-point complex vector summation is as follows:

For this pseudo-code, refer to the software pipelining diagram in Fig. 3. Iter1 to Iter3 denote three independent iterations, and i0 to i5 denote the instructions that each iteration must execute: i0 reads data from the two vectors, i1 is the summation instruction, i2 writes the result, i3 increments the counter that records the number of iterations, i4 decrements the length, which serves as the branch condition for i5, and i5 is the decision instruction, which ends the loop when the length is 0 and otherwise jumps back to i0 for the next iteration.

In instruction cycles 0 and 1, Iter1 executes the read instruction i0 and the summation instruction i1, Iter2 executes the read instruction i0, and Iter3 executes no instruction. This is the loop start phase, in which some iterations are still being set up. From instruction cycle 2, Iter3 starts by executing the read instruction i0; by instruction cycle 5, Iter1 has executed the last step of its iteration, the decision instruction i5. During this period Iter1 to Iter3 execute in parallel; this is the loop core phase.

In instruction cycles 6 and 7, not all iterations execute in parallel any longer, because the iterations complete their i5 instructions one after another. The pipeline enters the end phase, in which the results started in the core phase are completed.
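The schedule described above follows a simple rule: iteration Iter k (k = 1, 2, 3) issues instruction i j in instruction cycle (k - 1) + j. The C sketch below encodes that rule; the function names are hypothetical and the code models only the schedule, not the instructions themselves.

```c
#include <assert.h>

enum { ITERS = 3, STAGES = 6 };  /* Iter1..Iter3, instructions i0..i5 */

/* Total instruction cycles needed when the iterations are staggered
   by one cycle each. */
int total_cycles(void)
{
    return ITERS + STAGES - 1;
}

/* Instruction index (0..5) that iteration `iter` (1-based) executes in
   the given cycle, or -1 if that iteration is idle in that cycle. */
int stage_at(int iter, int cycle)
{
    assert(iter >= 1 && iter <= ITERS);
    assert(cycle >= 0 && cycle < total_cycles());
    int stage = cycle - (iter - 1);
    return (stage >= 0 && stage < STAGES) ? stage : -1;
}
```

Evaluating `stage_at` for cycles 0 to 7 reproduces the three phases: only Iter1 and Iter2 are active in cycles 0 and 1, all three iterations are active in cycles 2 to 5, and only Iter2 and Iter3 remain in cycles 6 and 7, for 3 + 6 - 1 = 8 cycles in total.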

Because of pipeline stalls, the following instruction arrangement rules apply: a computation result can be used only after an interval of two instruction lines; a value read from memory can be used only after an interval of two instruction lines; a result to be written must be ready two instruction lines before the write.

According to the principle, reading in _ mainLoop circulations, sum, write and to be separated by two dos command line DOSs respectively between number, In order to make full use of the resource of the two dos command line DOSs, by previous cycle it is adjacent under circulation twice carry out loop unrolling, it is and current Circulation interweaves in _ mainLoop to improve efficiency, i.e. lower two instructions in previous cycle read instruction (/ summation/writes number) Row carries out down the reading circulated twice (/ summation/writes number).Corresponding Optimized code is as follows：

Above-mentioned code description：On the basis of loop unrolling code, according to streamline feature, deploy in general _ mainLoop Circulation weave in three times.

The pipelining steps of six dos command line DOSs in _ mainLoop specific description, i-th of instruction have been subjected in accompanying drawing 4 Row respectively reads 8 numbers, using these data as one group, referred to as Group i from two input vectors；To the operating procedure of data It is divided into three steps：1. the reading from input vector address location；2. summation operation；3. result of calculation is write the address of output vector In unit.During first time execution _ mainLoop, dos command line DOS 1 reads preceding 8 data in input vector, after two dos command line DOSs Carry out adding computing, add the result of computing to return to present instruction row 1 after two dos command line DOSs, enter row write result, now writing result is In second of execution _ mainLoop；Instruction 2,3 similarly, reading plus computing, writes number and will be separated by two dos command line DOSs and perform. The repetitive instruction row 1~3 of dos command line DOS 4~6, identical computing form and step, because first three rows have handled 24 data, equally Three rows are also 24 data of processing afterwards；The repetitive instruction row 1~3 of dos command line DOS 4~6 is the embodiment of loop unrolling, is so reduced most A line decision instruction if lc0 b_mainLoop execution number afterwards because the expense for performing the sentence is larger, cycle-index compared with Efficiency can be significantly reduced when more.

In the first instruction line, r13:12=[u0+=24,1] || r11:10=[v0+=24,1]: u0 points to input vector in1 and v0 points to input vector in2. 8 values are read from in1 into registers r12 and r13 of the four macros, and 8 values are read from in2 into registers r10 and r11 of the four macros. After the read, u0 and v0 are each incremented by 24, so that the next use of u0 and v0 reads from in1 and in2 starting 24 values further on. cfr23:22=cfr13:12+cfr11:10 performs the summation using two adders in each macro and returns the 8 results to registers. [w0+=24,1]=r23:22 writes the results back to the memory of the output vector pointed to by w0; after the write, w0 is incremented by 24 so that it points to the next 24 address units, ready for the operation two rows later.

The second and third instruction lines perform the same operations as the first, where u1=u0+8, v1=v0+8, u2=u1+8, v2=v1+8; after each read, their addresses are likewise incremented by 24. The first three instruction lines together therefore compute 24 data results. As shown in Figure 5, in one execution of _mainLoop the first instruction line processes the data of region A, the second instruction line processes the data of region B, and the third instruction line processes the data of region C. Each instruction line reads 8 data from each of the two input vectors in1 and in2. The first instruction line reads the first 8 values of in1 in order into registers xr10, xr11, yr10, yr11, zr10, zr11, tr10, tr11, and the first 8 values of in2 in order into xr12, xr13, yr12, yr13, zr12, zr13, tr12, tr13. The second instruction line reads the next 8 data of in1 and of in2 into xr14~tr15, and the third instruction line reads the following 8 data of in1 and of in2 into registers xr18~tr19. The 8 values of the output vector out computed for each group are placed in turn in registers xr22~tr23, xr24~tr25, and xr26~tr27. Instruction lines 4~6 repeat the data operations of instruction lines 1~3, so that loop unrolling improves efficiency.
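The addressing scheme above (three pointers offset by 8, each advanced by 24 after its read) can be checked with a small Python sketch. It assumes addresses are counted in data elements, and it models only the three pointers named in the text; the function name `read_indices` and the parameter names are hypothetical.

```python
def read_indices(num_rounds, lanes=3, width=8):
    """List the element indices read by three instruction lines per round.

    Lane k starts at offset k * width (u0, u1 = u0 + 8, u2 = u1 + 8) and is
    advanced by lanes * width = 24 after every read, as in the description.
    """
    stride = lanes * width
    pointers = [lane * width for lane in range(lanes)]
    seen = []
    for _ in range(num_rounds):
        for lane in range(lanes):
            start = pointers[lane]
            seen.extend(range(start, start + width))
            pointers[lane] += stride  # models the "+= 24" post-increment
    return seen


# Two rounds of three lines touch elements 0..47 exactly once, in order.
print(read_indices(2) == list(range(48)))
```

The check shows why the offset of 8 and the stride of 24 fit together: regions A, B, and C tile the input without gaps or overlap.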

When n is 960, this code runs in 211 cycles. The analysis is as follows. Each instruction line in one execution of _mainLoop computes 8 data results, and _mainLoop contains 6 instruction lines, so one execution of _mainLoop yields 48 data results. The final branch instruction if lc0 b_mainLoop stalls for 6 beats because of the instruction pipeline itself. The 5 instruction lines before it have been pipeline-optimized and produce no stalls among themselves, so one execution takes 11 beats in total. The theoretical cycle count obtained in this way is close to the actual cycle count of 211. After the code is optimized by software pipelining, efficiency improves by a factor of 40.
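The formula for the theoretical cycle count is not preserved in this text. The Python sketch below is one plausible reading of the quantities stated above (48 results and 11 beats per execution of _mainLoop), offered as an assumption rather than as the original expression; the function and parameter names are hypothetical, and n is taken to be counted in the same units as the "data results".

```python
def estimated_cycles(n, results_per_line=8, lines_per_pass=6, branch_stall=6):
    """Estimate cycles as (passes through _mainLoop) * (beats per pass)."""
    results_per_pass = results_per_line * lines_per_pass  # 8 * 6 = 48
    beats_per_pass = (lines_per_pass - 1) + branch_stall   # 5 + 6 = 11
    passes = n // results_per_pass                         # assumes n divisible by 48
    return passes * beats_per_pass


print(estimated_cycles(960))  # 20 passes * 11 beats
```

Under these assumptions the estimate for n = 960 is 220 cycles, which is of the same order as the measured 211.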

Specifically, in a practical engineering implementation, the code is optimized in the following steps:

Step 1: Based on the algorithm of the floating-point complex vector summation function, first consider how the assembly program is to be written. It consists mainly of one loop: read values from memory into general registers, sum the data in the general registers, and write the results to the memory of the output vector. The loop is first written in assembly following the loop-processing pattern of the C program.

Step 2: Apply loop unrolling to the original assembly code while maximizing instruction parallelism. Several mutually independent copies of the loop body are unrolled, and within the code of one iteration parallel instructions are used so that a single instruction controls as many arithmetic units as possible performing the same operation. In this function, the corresponding data pairs of the two vectors all undergo the same addition, so all 4 basic execution units can be used at once through instructions that control the 4 units, and the 8 adders in each unit can be fully used, making full use of on-chip resources.
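The effect of step 2 can be sketched in Python by comparing a direct element-by-element loop with a version that handles 8 values per "instruction line" and unrolls three such lines per pass. This is a behavioural illustration only, not BWDSP assembly. The names are hypothetical, and the complex vectors are assumed to be stored as flat lists of floats (real and imaginary parts interleaved), so complex addition reduces to element-wise addition.

```python
WIDTH = 8   # values handled by one instruction line (from the text)
UNROLL = 3  # loop bodies unrolled into one pass (assumed three-way unrolling)


def add_reference(in1, in2):
    """Baseline: one addition per loop iteration, like a direct C-style loop."""
    out = []
    for a, b in zip(in1, in2):
        out.append(a + b)
    return out


def add8(in1, in2, start):
    """Model of one SIMD instruction line: 8 independent additions at once."""
    return [in1[start + k] + in2[start + k] for k in range(WIDTH)]


def add_unrolled(in1, in2):
    """Three independent add8 bodies per pass, so one loop test per 24 values."""
    n = len(in1)
    block = WIDTH * UNROLL
    if len(in2) != n or n % block != 0:
        raise ValueError("lengths must match and be a multiple of 24")
    out = [0.0] * n
    for base in range(0, n, block):
        out[base:base + WIDTH] = add8(in1, in2, base)
        out[base + WIDTH:base + 2 * WIDTH] = add8(in1, in2, base + WIDTH)
        out[base + 2 * WIDTH:base + 3 * WIDTH] = add8(in1, in2, base + 2 * WIDTH)
    return out


a = [float(i) for i in range(48)]
b = [0.5 * i for i in range(48)]
print(add_unrolled(a, b) == add_reference(a, b))
```

The three `add8` calls have no data dependence on one another, which is the property that lets the unrolled bodies later be interleaved.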

Step 3: On the basis of step 2, consider the influence of pipeline stalls on instruction scheduling. The result of a read and the result of a computation both take effect two instruction lines later, so the two instruction lines of this pipeline stall can be filled by unrolling the two adjacent iterations that follow. All instruction lines are then used effectively and the pipeline does not stall.

Step 4: The final optimization is carried out on the basis of step 3. The read of the first iteration is the first line of the loop code; two code lines below it is the instruction line in which this iteration performs its summation; two lines further down, the write operation can be performed. At this point lc0 is tested: if the loop has not ended (that is, not all data have been computed), execution jumps back to the first line of the loop, where the write of the first iteration is performed. The instruction line in which the first iteration performs its summation can at the same time perform the read of the fourth iteration, and on re-entering the first line of the loop the summation of the fourth iteration is performed, and so on for the following iterations. In this way every instruction line is used to the maximum: each instruction line performs read, compute, and write operations, only for different iterations. This also limits the growth in the number of instruction lines in the loop caused by loop unrolling. Interleaving multiple iterations in this way is one implementation of software pipelining.
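The interleaving in step 4 can be illustrated with a software-pipelined loop in Python in which every step writes one group, adds another, and reads a third. This is a minimal sketch under stated assumptions: `pipelined_add`, `width`, and `gap` are hypothetical names, a distance of three steps between stages models "two lines in between", and the two queues stand in for the registers that hold data between stages.

```python
from collections import deque


def pipelined_add(in1, in2, width=8, gap=3):
    """Add two equal-length lists in groups of `width`, software-pipelined.

    Group g is read at step g, added at step g + gap, and written at
    step g + 2 * gap, so each step serves three different groups.
    """
    n = len(in1)
    if len(in2) != n or n % width != 0:
        raise ValueError("lengths must match and be a multiple of width")
    groups = n // width
    out = [None] * n
    loaded = deque()  # groups that were read but not yet added
    summed = deque()  # groups that were added but not yet written
    for t in range(groups + 2 * gap):
        if t >= 2 * gap:  # write stage for group t - 2 * gap
            g, vals = summed.popleft()
            out[g * width:(g + 1) * width] = vals
        if gap <= t < groups + gap:  # add stage for group t - gap
            g, a, b = loaded.popleft()
            summed.append((g, [x + y for x, y in zip(a, b)]))
        if t < groups:  # read stage for group t
            lo, hi = t * width, (t + 1) * width
            loaded.append((t, in1[lo:hi], in2[lo:hi]))
    return out


a = [float(i) for i in range(48)]
b = [2.0 * i for i in range(48)]
print(pipelined_add(a, b) == [3.0 * i for i in range(48)])
```

The first `2 * gap` steps fill the pipeline and the last `2 * gap` steps drain it; in between, every step does useful work in all three stages.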

Assembly functions are written by making full use of the underlying hardware resources of the BWDSP processor chip and of the instruction set developed for its hardware characteristics; on this basis, a series of optimization methods is applied to achieve the best efficiency. In particular, when a program contains loop code and the assembly is written directly in the manner of the C program, each iteration processes only a small amount of data, the ability of each instruction cycle to execute 16 instructions in parallel is not exploited, and the SIMD instructions developed for the underlying hardware are not used. Many operations that could execute in parallel are then executed sequentially, which costs considerable time, and when the loop is large efficiency drops significantly. Instruction-parallel optimization makes maximum use of the instruction slots of each instruction line and uses SIMD instructions wherever possible, so that one instruction controls multiple arithmetic units performing the same operation at the same time; this makes the code more concise and, more importantly, greatly improves execution efficiency. On top of instruction parallelism, mutually independent iterations of the loop body are unrolled, so that several copies of the loop code are placed one after another in the code segment. Because the branch instruction at the end of the loop incurs several pipeline stall beats, the more iterations there are, the more often this last instruction is executed and the greater the effect of its stalls on efficiency; loop unrolling therefore reduces the number of times the branch instruction is executed and improves efficiency to some extent. Finally, combining instruction parallelism and loop unrolling, software pipelining is applied: according to the characteristics of the instruction pipeline, the unrolled iterations in the loop body are reordered so that adjacent iterations are interleaved, which eliminates pipeline stalls and improves efficiency very significantly.

When developing underlying functions for this chip, hardware resources should be fully used and suitable instructions selected, and optimization should be considered while the code is being written, so that instruction parallelism, loop unrolling, and software pipelining are embodied in the code and efficient underlying functions are obtained.

The above is only a specific embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of the claims.

Claims

1. An optimization method for realizing floating-point complex vector summation based on a BWDSP chip, characterized in that:

the floating-point complex vector summation is the summation of a first floating-point complex vector and a second floating-point complex vector, and the summation of the first floating-point complex vector and the second floating-point complex vector is a loop composed of multiple floating-point complex summation processes;

each floating-point complex summation process includes an optimization based on BWDSP chip instruction parallelism, an optimization based on loop unrolling, and an optimization based on software pipelining;

the optimization based on BWDSP chip instruction parallelism is an optimization in which one instruction controls multiple arithmetic units performing the same operation simultaneously; the optimization based on loop unrolling is an optimization in which multiple identical loop bodies are unrolled within one loop; the optimization based on software pipelining is an optimization in which multiple identical loop bodies are executed in parallel in an interleaved manner.
2. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 1, characterized in that the BWDSP chip comprises 4 execution macros, each execution macro contains a general register group of 64 words, the general registers in the general register group are numbered from 0 to 63, and each execution macro further contains 8 adders;

the optimization based on BWDSP chip instruction parallelism specifically comprises:

in one instruction, reading 8 data from the first floating-point complex vector and 8 data from the second floating-point complex vector, the reading of the 8 data from the first floating-point complex vector and the reading of the 8 data from the second floating-point complex vector being performed in parallel; the first floating-point complex vector and the second floating-point complex vector each contain N data, and N is much greater than 8;

the 8 data from the first floating-point complex vector are stored in order in the eight registers formed by the first register and the second register of the four register groups of the BWDSP chip; the 8 data from the second floating-point complex vector are stored in order in the eight registers formed by the third register and the fourth register of the four register groups of the BWDSP chip; and the storing of the 8 data from the first floating-point complex vector and the storing of the 8 data from the second floating-point complex vector are performed in parallel;

in one instruction, in each execution macro of the BWDSP chip, a first adder adds the data in the first register to the data in the third register and stores the sum in a fifth register, and a second adder adds the data in the second register to the data in the fourth register and stores the sum in a sixth register; and the addition by the first adder and the addition by the second adder are performed in parallel.
3. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 1, characterized in that the optimization based on loop unrolling is specifically: performing multiple floating-point complex summation processes in one loop iteration.
4. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 3, characterized in that the optimization based on loop unrolling is specifically:

performing three floating-point complex summation processes in one loop iteration.
5. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 4, characterized in that the optimization based on software pipelining is specifically:

Iter1, Iter2, and Iter3 denote three adjacent iterations, and i0~i5 denote the instructions that each iteration must execute: i0 is the instruction that reads from the two vectors, i1 is the summation instruction, i2 is the write instruction, i3 is the counter-increment instruction, i4 is the data-length-decrement instruction, and i5 is the branch decision instruction;

after the optimization based on software pipelining, the large loop comprises, in order, a loop start phase, a loop kernel phase, and a loop end phase, 8 instruction cycles in total:

the loop start phase consists of instruction cycle 0 and instruction cycle 1; in the loop start phase, iteration Iter1 executes the read instruction i0 and the summation instruction i1 in turn, iteration Iter2 executes the read instruction i0, and iteration Iter3 executes no instruction;

the loop kernel phase runs from instruction cycle 2 to instruction cycle 5; in the loop kernel phase, iteration Iter1 executes in turn the write instruction i2, the counter-increment instruction i3, the data-length-decrement instruction i4, and the decision instruction i5; iteration Iter2 executes in turn the summation instruction i1, the write instruction i2, the counter-increment instruction i3, and the data-length-decrement instruction i4; iteration Iter3 executes in turn the read instruction i0, the summation instruction i1, the write instruction i2, and the counter-increment instruction i3;

the loop end phase consists of instruction cycle 6 and instruction cycle 7; in the loop end phase, iteration Iter1 executes no instruction, iteration Iter2 executes the decision instruction i5, and iteration Iter3 executes in turn the data-length-decrement instruction i4 and the decision instruction i5.
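The eight-cycle arrangement recited in claim 5 can be reproduced with a short Python sketch. It assumes, as the phases in the claim imply, that each iteration issues one instruction per instruction cycle and starts one cycle after the previous iteration; the function name `claim5_schedule` is hypothetical.

```python
INSTRUCTIONS = ["i0", "i1", "i2", "i3", "i4", "i5"]


def claim5_schedule(iterations=3):
    """Return table[cycle][iteration]: the instruction issued, or None if idle."""
    total_cycles = len(INSTRUCTIONS) + iterations - 1  # 6 + 3 - 1 = 8
    table = []
    for cycle in range(total_cycles):
        row = []
        for it in range(iterations):
            k = cycle - it  # iteration `it` starts `it` cycles late
            row.append(INSTRUCTIONS[k] if 0 <= k < len(INSTRUCTIONS) else None)
        table.append(row)
    return table


for cycle, row in enumerate(claim5_schedule()):
    if cycle < 2:
        phase = "start"
    elif cycle < 6:
        phase = "kernel"
    else:
        phase = "end"
    print(cycle, phase, row)
```

In the printed table, cycles 2 to 5 are the only ones in which all three iterations issue an instruction, which corresponds to the loop kernel phase of the claim.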