CN107357552A - Optimization method for floating-point complex vector summation based on the BWDSP chip - Google Patents
- Publication number
- CN107357552A (CN201710419846.2A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- floating
- point complex
- loop
- summation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Advance Control (AREA)
- Complex Calculations (AREA)
Abstract
The invention belongs to the field of low-level library function optimization for digital signal processors and discloses an optimization method for floating-point complex vector summation on the high-performance general-purpose signal processor BWDSP. The floating-point complex vector summation is the sum of a first floating-point complex vector and a second floating-point complex vector, and is implemented as a loop made up of multiple floating-point complex summation processes. Each floating-point complex summation process includes three optimizations: an optimization based on BWDSP instruction parallelism, in which a single instruction controls multiple arithmetic units performing the same operation simultaneously; an optimization based on loop unrolling, in which several copies of the same loop body are unrolled into a single loop iteration; and an optimization based on software pipelining, in which several iterations of the same loop are interleaved and executed in parallel. The method makes full use of the hardware resources of the BWDSP chip and yields an efficient low-level function.
Description
Technical field
The invention belongs to the field of low-level library function optimization for digital signal processors, and in particular relates to an optimization method for floating-point complex vector summation on the high-performance general-purpose signal processor BWDSP.
Background art
With the rapid development of large-scale integrated circuits and digital computers, digital signal processors have gradually matured to meet the demands of high-speed real-time signal processing. Independently developing DSP chips has become an important topic in the development of China's digital signal processing technology. Against this background, a research institute introduced the high-performance DSP series BWDSP. This DSP was developed entirely independently, from the architecture and instruction set through the design and implementation to the supporting software and hardware development environment, and its performance is even higher than that of some widely used international products. It has broad application prospects in national defense, public safety, the Internet of Things, communications and other industries, and its success will break the foreign monopoly on China's high-end digital signal processing chips.
The BWDSP chip is a 32-bit static superscalar processor. It is 16-issue (at most 16 instructions can be issued simultaneously in each instruction cycle) and uses a single instruction, multiple data (SIMD) architecture, in which one instruction processes multiple data items at the same time. The instruction bus is 512 bits wide; the internal data bus is an asymmetric full-duplex bus with a width of 256 bits. The processor is also compatible with 16-bit and 32-bit fixed-point data formats and uses a very long instruction word (VLIW) architecture, giving it strong parallel processing capability that meets the requirements of high-speed real-time signal processing well.
Developing for BWDSP requires software support. The available development languages are C and assembly. C is readable and portable, but it does not lend itself to direct control of the hardware and cannot exploit the characteristics of the DSP chip itself. To improve function efficiency, library functions are therefore written in assembly so as to maximize utilization of the processor's hardware resources. However, assembly code obtained by directly compiling C does not take the hardware characteristics of the DSP chip into account and is therefore deficient.
Summary of the invention
In view of the above shortcomings of the prior art, the object of the invention is to provide an optimization method for floating-point complex vector summation on the high-performance general-purpose signal processor BWDSP that makes full use of the hardware resources of the BWDSP chip and yields an efficient low-level function.
To achieve the above object, the invention adopts the following technical solution.
An optimization method for floating-point complex vector summation based on the BWDSP chip, wherein:
The floating-point complex vector summation is the sum of a first floating-point complex vector and a second floating-point complex vector, and this sum is a loop made up of multiple floating-point complex summation processes;
Each floating-point complex summation process includes an optimization based on BWDSP instruction parallelism, an optimization based on loop unrolling, and an optimization based on software pipelining;
The optimization based on BWDSP instruction parallelism uses one instruction to control multiple arithmetic units performing the same operation simultaneously; the optimization based on loop unrolling unrolls multiple copies of the same loop body into a single loop iteration; the optimization based on software pipelining interleaves multiple iterations of the same loop and executes them in parallel.
Features and further improvements of the technical solution of the invention are as follows:
(1) The BWDSP chip includes 4 execution macros. Each execution macro contains a 64-word general register group, whose registers are numbered 0 to 63, and 8 adders;
The optimization based on BWDSP instruction parallelism specifically includes:
In one instruction, 8 data items are read from the first floating-point complex vector and 8 data items are read from the second floating-point complex vector, the two reads being performed in parallel; the first and second floating-point complex vectors each contain N data items, where N is much larger than 8;
The 8 data items of the first floating-point complex vector are stored in order in the eight registers formed by the first register and the second register of each of the four register groups of the BWDSP chip; the 8 data items of the second floating-point complex vector are stored in order in the eight registers formed by the third register and the fourth register of each of the four register groups; the two sets of 8 data items are stored in parallel;
In one instruction, in each execution macro of the BWDSP chip, a first adder adds the data in the first register to the data in the third register and stores the sum in a fifth register, and a second adder adds the data in the second register to the data in the fourth register and stores the sum in a sixth register; the additions of the first adder and the second adder are performed in parallel.
(2) The optimization based on loop unrolling is specifically:
Multiple floating-point complex summation processes are performed in one loop iteration.
(3) The optimization based on loop unrolling is specifically:
Three floating-point complex summation processes are performed in one loop iteration.
(4) The optimization based on software pipelining is specifically:
Iter1, Iter2 and Iter3 denote three adjacent loop iterations, and i0 to i5 denote the instructions each iteration must execute: i0 is the instruction that reads from the two vectors, i1 is the summation instruction, i2 is the write instruction, i3 is the counter-increment instruction, i4 is the data-length decrement instruction, and i5 is the decision instruction;
After software pipelining, the overall loop consists, in order, of a loop prologue, a loop kernel and a loop epilogue, 8 instruction cycles in total:
The prologue consists of instruction cycles 0 and 1. In the prologue, Iter1 executes the read instruction i0 and then the summation instruction i1; Iter2 executes the read instruction i0; Iter3 executes no instruction;
The kernel runs from instruction cycle 2 to instruction cycle 5. In the kernel, Iter1 executes in turn the write instruction i2, the counter-increment instruction i3, the data-length decrement instruction i4 and the decision instruction i5; Iter2 executes in turn the summation instruction i1, the write instruction i2, the counter-increment instruction i3 and the data-length decrement instruction i4; Iter3 executes in turn the read instruction i0, the summation instruction i1, the write instruction i2 and the counter-increment instruction i3;
The epilogue consists of instruction cycles 6 and 7. In the epilogue, Iter1 executes no instruction; Iter2 executes the decision instruction i5; Iter3 executes the data-length decrement instruction i4 and then the decision instruction i5.
With instruction parallelism, the instruction slots of each instruction line are used as fully as possible and SIMD instructions are used wherever possible, so that one instruction controls several arithmetic units performing the same operation. This makes the code more concise and, more importantly, greatly improves execution efficiency. On top of instruction parallelism, independent iterations of the loop are unrolled, that is, several copies of the loop code are placed one after another in the loop body. The final conditional jump of the loop incurs several pipeline stalls; the more iterations there are, the more often this instruction executes and the greater the impact of its stalls on efficiency. Loop unrolling reduces the number of times the conditional jump executes and so improves efficiency to some extent. Finally, combining instruction parallelism and loop unrolling, software pipelining is applied: according to the characteristics of the instruction pipeline, the unrolled iterations within the loop body are reordered so that adjacent iterations are interleaved, which eliminates pipeline stalls and improves efficiency significantly.
Brief description of the drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the architecture of a single execution macro provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of data access under instruction parallelism provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of software pipelining provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of the instruction arrangement under pipeline optimization provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of the data grouping under pipeline optimization provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
The main resources used when developing low-level functions are the execution macros. The processor's execution units are contained in four execution macros, referred to as macro x, macro y, macro z and macro t. The four macros have identical external interfaces and internal structure; they obtain operation instructions from the decoder and operands from data memory, and carry out the specific operations.
Each execution macro contains a local 64-word general register group (R0 to R63), 8 adders, 8 multipliers, 2 shifters and 1 super-arithmetic unit. The internal structure of an execution macro is shown in Fig. 1: an adder takes its source operands from the macro's register group, performs the addition on them, and returns the result to the register group.
Besides the resources in the execution macros, address generators are frequently used during development. The BWDSP processor has three address generation units, designated U, V and W, with identical structure and function. Each address generation unit has 16 address registers, numbered 0 to 15.
The commonly used addressing and addition instructions are described below:
Linear addressing is a direct addressing mode encoded in the instruction: once the base address is fixed, the other required addresses are computed linearly from the base address and an offset. There are two forms, single-word addressing and double-word addressing. Address generation unit U is used as the example below; addressing in the V and W units works the same way.
Single-word addressing in the U unit: {x, y, z, t} Rs = [Un += Um, Uk]; Un is the base address, Um is the base-address modifier, and Uk is the address offset used when reading several data items at once. The subscripts are address register numbers, so they range from 0 to 15. Single-word addressing can read 4 data items at a time, which requires 4 addresses: the actual addresses are [Un], [Un+Uk], [Un+2Uk] and [Un+3Uk]. The 4 data items read are sent to the 4 execution macros, one per macro. Whether a macro receives its item depends on whether its mnemonic appears in {x, y, z, t}: if it appears the item is read, otherwise it is not. After the instruction executes, Um updates the base address ready for the next read, that is, Un = Un + Um. If {x, y, z, t} is omitted, the instruction runs on all 4 execution macros simultaneously.
Double-word addressing in the U unit: {x, y, z, t} Rs+1:s = [Un += Um, Uk]; in a double-word operation, each address and its adjacent address form an address pair. The remaining addresses read from memory are determined by the offset Uk; because double-word reads are used, the offset is multiplied by 2 to give the actual stride. This yields [Un] and [Un+1], [Un+2Uk] and [Un+2Uk+1], [Un+2*2Uk] and [Un+2*2Uk+1], [Un+3*2Uk] and [Un+3*2Uk+1], 8 addresses in total, from which 8 data items are read and sent to the execution macros. Each macro named in {x, y, z, t} receives 2 data items.
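The two addressing modes amount to simple address arithmetic, which can be sketched in C as follows. This is an illustrative model only: the type `agu_t` and both function names are hypothetical and are not part of any BWDSP toolchain; the mapping of the low word to Rs and the high word to Rs+1 follows the register assignment described for r13:12 later in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one set of address registers. */
typedef struct {
    size_t un; /* base address Un, updated after each access */
    size_t um; /* modifier Um, added to the base after the access */
    size_t uk; /* offset Uk between the addresses of one access */
} agu_t;

/* Single-word access: one address per macro (x, y, z, t), then Un += Um. */
void single_word_addrs(agu_t *u, size_t addr[4])
{
    assert(u != NULL && addr != NULL);
    for (size_t j = 0; j < 4; ++j)
        addr[j] = u->un + j * u->uk;
    u->un += u->um;
}

/* Double-word access: one address pair per macro, pairs spaced 2*Uk apart. */
void double_word_addrs(agu_t *u, size_t addr[8])
{
    assert(u != NULL && addr != NULL);
    for (size_t j = 0; j < 4; ++j) {
        addr[2 * j]     = u->un + j * 2 * u->uk; /* low word, goes to Rs */
        addr[2 * j + 1] = addr[2 * j] + 1;       /* high word, goes to Rs+1 */
    }
    u->un += u->um;
}
```

With Un = 0, Um = 8 and Uk = 1, the double-word model produces the eight consecutive addresses 0 to 7 and leaves the base at 8, which matches the access pattern of an instruction such as r13:12=[u0+=8,1].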
Addition: {macro} Rs = Rm + Rn; register Rm is added to Rn and the result is stored in register Rs. Each instruction can run on a single execution macro or on several; if {macro} is omitted, the instruction runs on all 4 execution macros simultaneously.
Complex addition: {macro} CRs+1:s = CRm+1:m + CRn+1:n; the complex numbers CRm+1:m and CRn+1:n are added and the result is stored in CRs+1:s. CRx+1:x denotes one complex value whose real and imaginary parts are held in the two adjacent registers Rx+1 and Rx; the prefix 'C' marks a complex operand. Executing this instruction in one macro takes two adders and is equivalent to performing Rs+1 = Rm+1 + Rn+1 and Rs = Rm + Rn at the same time. If {macro} is omitted, the instruction runs on all 4 execution macros simultaneously.
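The effect of the complex add instruction inside a single macro can be modelled in C as two independent additions on adjacent registers. The sketch below is an assumption-laden illustration: the register file is modelled as a plain array, and the function name `complex_add_regs` is hypothetical.

```c
#include <assert.h>

enum { NUM_REGS = 64 }; /* R0..R63 of one execution macro */

/* Hypothetical model of CRs+1:s = CRm+1:m + CRn+1:n in one macro. */
void complex_add_regs(float r[NUM_REGS], int s, int m, int n)
{
    assert(s >= 0 && s + 1 < NUM_REGS);
    assert(m >= 0 && m + 1 < NUM_REGS);
    assert(n >= 0 && n + 1 < NUM_REGS);
    /* Both sums are formed before either result is written, which mimics
     * two adders working at the same time. */
    float lo = r[m] + r[n];         /* first adder:  Rs   = Rm   + Rn   */
    float hi = r[m + 1] + r[n + 1]; /* second adder: Rs+1 = Rm+1 + Rn+1 */
    r[s] = lo;
    r[s + 1] = hi;
}
```

Calling `complex_add_regs(r, 22, 12, 10)` corresponds to the register choice used later in the text, where r12/r13 and r10/r11 hold the operands and r22/r23 receive the result.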
Zero-overhead loop: if lc0 b label; a conditional jump based on lc0. At the end of the loop, the zero-overhead loop register lc0 is checked. If it equals 0, no jump is taken and execution continues with the next instruction; if it is not 0, lc0 is automatically decremented by one and execution jumps to label.
Because the floating-point complex vector summation function has the standard three processing stages (read, one-step operation, write), this function is used as the example to illustrate the three optimizations. Its C interface is int cvectadd(complex_float *in1, complex_float *in2, unsigned int n, complex_float *out); where complex_float is a user-defined complex structure. The algorithm consists mainly of one loop, whose C code is as follows:
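The loop listing itself is not present in this text. A minimal C sketch of such a function is given here under stated assumptions: the field names and order of `complex_float` are hypothetical (the text says only that it is a complex structure), `n` is taken to count complex elements, and returning 0 on completion is an assumed convention.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout; the field names and their order are hypothetical. */
typedef struct {
    float im; /* imaginary part */
    float re; /* real part */
} complex_float;

/* Element-wise sum out[i] = in1[i] + in2[i]; returns 0 (assumed convention). */
int cvectadd(complex_float *in1, complex_float *in2, unsigned int n,
             complex_float *out)
{
    assert(in1 != NULL && in2 != NULL && out != NULL);
    for (unsigned int i = 0; i < n; ++i) {
        out[i].re = in1[i].re + in2[i].re; /* real parts */
        out[i].im = in1[i].im + in2[i].im; /* imaginary parts */
    }
    return 0;
}
```

Each iteration performs the three stages named above: it reads one element from each input, adds them, and writes the result.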
To write the assembly program for the parameter list of this C interface, it must first be determined how the parameters are passed. The table below shows the parameter and return-value passing conventions supported by the compiler.
For scalar data types, parameters within the first 8 parameter words are all passed in general registers of macro x. Pointer data is passed in U address registers, but only when it lies within the first 4 parameter words. In the interface of the vector summation function, the first two parameters in1 and in2 have type complex_float*; they are pointers and are passed in U registers. Parameter n has type unsigned int and is passed in a general register of macro x. Parameter out is likewise passed in a U register. The register index of each parameter is its relative position in the parameter list, so the concrete parameter passing for this function is: U0 points to in1, U1 points to in2, n is stored in xr2, and U3 points to out.
Writing the assembly code for the above C function directly, without any optimization, gives the following native assembly code. Here lc0 is the zero-overhead loop register; its value is the vector data length n, meaning that _mainLoop runs n times. When the last statement is reached, if lc0 is not zero it is automatically decremented by one and execution jumps back to _mainLoop; when lc0 is zero, no jump is taken and execution continues with the following code.
Description of the above code: before _mainLoop, the value of u1 is assigned to v0 and the value of u3 is assigned to w0, so that u0, v0 and w0 point to in1, in2 and out respectively. Inside _mainLoop, the first instruction line xr10=[u0+=1,1] reads one value from the address in u0 into register xr10, and u0 is incremented by 1 to point to the next data item in in1. The second instruction line xr12=[v0+=1,1] reads one value from the address in v0 into register xr12, and v0 is incremented by 1 to point to the next data item in in2. The third instruction line passes the values of xr10 and xr12 to an adder in macro x, which adds them and returns the result to register xr22. The fourth instruction line [w0+=1,1]=xr22 writes the result in xr22 to the address pointed to by w0. Each pass through _mainLoop produces one data item of the sum of the two vectors.
With the data length n set to 960, this code takes 7693 cycles to run, about 8 cycles per loop iteration, which is very inefficient.
This native assembly code is optimized below; the principles, implementation and results are as follows:
(1) Optimization based on instruction parallelism:
The basic arithmetic cores of BWDSP operate in SIMD fashion with respect to one another: one instruction can control several arithmetic units performing the same operation at once. Within a basic arithmetic core, operation is MIMD: each arithmetic unit is controlled by its own instruction and completes its operation independently. Each instruction can control at most 4 basic arithmetic cores, and up to 16 instructions can execute in parallel in one instruction cycle. Instructions are separated by the separator "||"; each of the instructions separated by "||" on one line is said to occupy an instruction slot. Making full use of the read/write buses and the internal arithmetic resources greatly improves execution efficiency and also reduces code size.
To make full use of the adders in each execution macro of the BWDSP processor, the floating-point complex add instruction CFRs+1:s = CFRm+1:m + CFRn+1:n is used, where F denotes a floating-point operation. This instruction controls four arithmetic units performing the addition simultaneously. Reads are likewise performed in parallel. The improved code is as follows:
Description of the above code, with reference to Fig. 2: the first instruction line uses two instruction slots to read from the two input vectors. r13:12 denotes the r13 and r12 registers of the four macros x, y, z and t, 8 registers in all, so r13:12=[u0+=8,1] reads 8 values per iteration starting at the address in u0: the data at [u0], [u0+1], [u0+2], [u0+3], [u0+4], [u0+5], [u0+6] and [u0+7] are passed to xr12, xr13, yr12, yr13, zr12, zr13, tr12 and tr13 respectively. Complex numbers are stored in memory with the imaginary part at the lower address and the real part at the higher address. In the first iteration, [u0] and [u0+1] therefore hold the imaginary and real parts of the first complex number, and [u0] to [u0+7] hold the first 4 complex numbers, 8 data items in all. Likewise, r11:10=[v0+=8,1] reads 8 values per iteration starting at the address in v0. After the reads, u0 and v0 are each incremented by 8 to point to the data read in the next iteration.
The second instruction line is the complex add instruction, which is equivalent to executing fr22=fr12+fr10 and fr23=fr13+fr11 in parallel. In each macro, the values in registers r12 and r10 are passed to adder 1 of that macro, and the values in registers r13 and r11 are passed to adder 2 of that macro; the results of the two adders are returned to registers r22 and r23 of the same macro.
The third instruction line [w0+=8,1]=r23:22 writes the sums held in registers r22 and r23 of the four macros, 4 complex numbers or 8 data items in all, to addresses [w0] to [w0+7]. [w0] and [w0+1] hold the imaginary and real parts of the first complex number, and the remaining data follow in the same order. After each write, w0 is incremented by 8 to point to the addresses written in the next iteration.
Because 8 data items of the output vector are now computed per iteration, lc0 is now the integer part of n divided by 8, meaning that _mainLoop executes n/8 times. When n is 960, this code runs in 973 cycles. Compared with the 7693 cycles of the native assembly code, this is about 7.9 times faster, close to an order of magnitude.
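The data flow of the instruction-parallel loop can be mirrored in plain C. The sketch below is only a sequential model of what the hardware does simultaneously: the function name `vadd_by8` and its word-based interface are hypothetical, and the inner loop over four macros stands in for one SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>

enum { MACROS = 4, WORDS_PER_LINE = 2 * MACROS }; /* 4 macros x 2 words */

/* Models the parallel loop: every iteration handles 8 words, that is,
 * 4 complex numbers. Returns the number of loop iterations executed. */
size_t vadd_by8(const float *in1, const float *in2, float *out, size_t n_words)
{
    assert(in1 != NULL && in2 != NULL && out != NULL);
    assert(n_words % WORDS_PER_LINE == 0);
    size_t iterations = 0;
    for (size_t base = 0; base < n_words; base += WORDS_PER_LINE) {
        for (size_t m = 0; m < MACROS; ++m) {        /* macros x, y, z, t */
            size_t lo = base + 2 * m;                /* imaginary part */
            out[lo]     = in1[lo] + in2[lo];         /* adder 1 */
            out[lo + 1] = in1[lo + 1] + in2[lo + 1]; /* adder 2 */
        }
        ++iterations;
    }
    return iterations;
}
```

For 960 words the function reports 120 iterations, which is the n/8 trip count stated above.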
(2) Optimization based on loop unrolling:
The time-consuming parts of a program are usually concentrated in its larger loops, so the core of program optimization is loop optimization. Loop unrolling trades program size for execution speed by replicating the loop body several times. It gives the instruction scheduler more room, reduces the overhead of loop branch instructions, and allows data to be prefetched more effectively.
The loop of this function can be divided into three steps executed in order: read, sum, and store the result. The order of these three steps cannot be changed. Although the statements within one iteration have a strict order, adjacent iterations are independent: they operate on different data, so there is no data dependence between them. The code is therefore optimized by loop unrolling: three consecutive iterations are merged into one, which reduces the number of loop-condition checks and jump instructions executed and improves code efficiency. The code after the optimization based on loop unrolling is as follows:
Description of the above code: loop unrolling builds on instruction parallelism. The first three instruction lines of the instruction-parallel code above (read, operate, write) are replicated three times within one loop iteration, so that the loop condition is tested once for every three original iterations, reducing the overhead of condition checks and branch jumps.
Because one iteration now contains three unrolled copies of the loop body, executing _mainLoop once produces 24 data items of the output vector. lc0 is therefore now the integer part of n divided by 24, meaning that _mainLoop executes n/24 times. When n is 960, this code runs in 893 cycles, about 8% fewer than the 973 cycles of the instruction-parallel version. Unrolling more iterations reduces the cycle count further, but also increases code size.
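The same idea can be expressed in C as a loop whose body is written out three times. This is a minimal sketch, not the assembly listing the text refers to: the helper `add_group` and the function `vadd_unrolled3` are hypothetical names, and the returned count stands for the number of branch tests executed.

```c
#include <assert.h>
#include <stddef.h>

enum { WORDS_PER_LINE = 8, UNROLL = 3 };

/* Adds one group of 8 words; stands in for one read/add/write sequence. */
void add_group(const float *in1, const float *in2, float *out)
{
    for (size_t k = 0; k < WORDS_PER_LINE; ++k)
        out[k] = in1[k] + in2[k];
}

/* Loop unrolled three times: 24 words per iteration and one branch test
 * per iteration. Returns the number of loop iterations executed. */
size_t vadd_unrolled3(const float *in1, const float *in2, float *out,
                      size_t n_words)
{
    const size_t step = WORDS_PER_LINE * UNROLL;
    assert(in1 != NULL && in2 != NULL && out != NULL);
    assert(n_words % step == 0);
    size_t iterations = 0;
    for (size_t i = 0; i < n_words; i += step) {
        add_group(in1 + i,      in2 + i,      out + i);      /* copy 1 */
        add_group(in1 + i + 8,  in2 + i + 8,  out + i + 8);  /* copy 2 */
        add_group(in1 + i + 16, in2 + i + 16, out + i + 16); /* copy 3 */
        ++iterations;
    }
    return iterations;
}
```

For 960 words the loop condition is evaluated for 40 iterations instead of 120, which is the n/24 trip count given above.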
(3) Optimization based on software pipelining:
Like loop unrolling, software pipelining is a technique that reorganizes a loop to reduce the delay between instructions. It does not simply unroll the loop; it interleaves instructions from different loop iterations so that they execute in parallel, which can reduce the cycle count severalfold. Combined with loop unrolling, it removes the dependences between adjacent instructions while the instructions within a single iteration remain serial. In this way the DSP hardware resources are fully used and the loop overhead is reduced.
In software pipelining, while some processing units execute the current iteration, other processing units execute other iterations at the same time. This requires reducing the dependences between the registers and operands of the loop body: iterations can be unrolled and overlapped only when the output of the current iteration is not an input of the next iteration. The pseudo-assembly code for floating-point complex vector summation is as follows:
For the above pseudocode, refer to the software pipelining diagram in Fig. 3. Iter1 to Iter3 denote three mutually independent iterations, and i0 to i5 denote the instructions each iteration must execute: i0 is the instruction that reads from the two vectors; i1 is the summation instruction; i2 is the write instruction; i3 is the counter-increment instruction, which records the number of iterations; i4 is the length decrement instruction, whose result is the condition tested by the branch in i5; i5 is the decision instruction: the loop ends when the length is 0, otherwise execution jumps to i0 for the next iteration.
In instruction cycles 0 and 1, Iter1 executes the read instruction i0 and the operation instruction i1, Iter2 executes the read instruction i0, and Iter3 executes no instruction. This is the loop prologue, in which only some of the iterations have started and are performing initial work. Starting at instruction cycle 2, Iter3 begins with the read instruction i0; by instruction cycle 5, Iter1 has executed the last step of its iteration, the decision instruction i5. During this period Iter1 to Iter3 execute in parallel; this is the loop kernel.
In instruction cycles 6 and 7, not all iterations execute in parallel, because some iterations have already completed instruction i5. The pipeline enters the epilogue, in which the remaining results of the kernel phase are obtained.
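The prologue, kernel and epilogue described above follow from one rule: each iteration starts one instruction cycle after the previous one and then executes i0 to i5 in consecutive cycles. A small C sketch of that rule is shown here; the function names `instr_at` and `total_cycles` are hypothetical, and the one-instruction-per-cycle model is the idealized one used in the description.

```c
#include <assert.h>

enum { NUM_INSTR = 6 }; /* i0..i5 */

/* Returns which instruction (0..5 for i0..i5) iteration `iter` (1-based)
 * executes in instruction cycle `cycle`, or -1 if the iteration is idle.
 * Assumes each iteration starts one cycle after the previous one. */
int instr_at(int iter, int cycle)
{
    assert(iter >= 1 && cycle >= 0);
    int k = cycle - (iter - 1);
    return (k >= 0 && k < NUM_INSTR) ? k : -1;
}

/* Total instruction cycles needed for `iters` overlapped iterations. */
int total_cycles(int iters)
{
    assert(iters >= 1);
    return NUM_INSTR + iters - 1;
}
```

For three iterations `total_cycles(3)` gives 8, the cycle count stated for the pipelined loop, whereas running the same three iterations back to back would take 18 instruction cycles under the same model.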
Because of pipeline stalls, the following instruction arrangement rules apply: a computed result can be used only after an interval of two instruction lines; a value that has been read can be used only after an interval of two instruction lines; a result to be written must be ready two instruction lines in advance.
According to these rules, the read, sum and write within _mainLoop must each be separated by two instruction lines. To make use of those two instruction lines, the next two iterations after the current one are unrolled and interleaved with the current iteration in _mainLoop to improve efficiency: the two instruction lines following the current iteration's read (or sum, or write) instruction carry out the read (or sum, or write) of the next two iterations. The corresponding optimized code is as follows:
Description of the above code: starting from the loop-unrolled code, the three unrolled iterations in _mainLoop are interleaved according to the characteristics of the pipeline.
Fig. 4 describes the pipelining of the six instruction lines in _mainLoop in detail. The i-th instruction line reads 8 values from each of the two input vectors; these data are treated as one group, called Group i. Each group goes through three steps: 1. read from the input vector addresses; 2. sum; 3. write the result to the output vector addresses. In the first execution of _mainLoop, instruction line 1 reads the first 8 data items of each input vector; the addition is performed after an interval of two instruction lines; after a further interval of two instruction lines, execution has returned to instruction line 1, where the result is written, so this write takes place during the second execution of _mainLoop. Instruction lines 2 and 3 behave the same way: read, add and write are each separated by two instruction lines. Instruction lines 4 to 6 repeat instruction lines 1 to 3 with the same operations and steps; the first three lines process 24 data items and the last three lines likewise process 24. Repeating lines 1 to 3 as lines 4 to 6 is the loop unrolling; it reduces the number of times the final decision instruction if lc0 b _mainLoop executes. That statement is expensive and would noticeably reduce efficiency when the iteration count is large.
In the first instruction line, r13:12=[u0+=24,1] || r11:10=[v0+=24,1]: u0 points to input vector in1 and v0 points to input vector in2. 8 numbers are read from in1 into registers r12 and r13 of the four macros, and 8 numbers are read from in2 into registers r10 and r11 of the four macros. After the read, the values of u0 and v0 are increased by 24, so that the next use of u0 and v0 reads from the following 24 numbers of in1 and in2. cfr23:22=cfr13:12+cfr11:10 uses the two adders in each macro to perform the summation and returns the 8 data results to registers. [w0+=24,1]=r23:22 writes the results back to the memory of the output vector pointed to by w0; after the write, w0 is increased by 24 so that it points to the next 24 address units, ready for the operation two lines later.
The second and third instruction lines perform the same operations as the first instruction line, with u1=u0+8, v1=v0+8, u2=u1+8, and v2=v1+8; after each read their addresses are likewise increased by 24. The first three instruction lines together compute 24 data results. As shown in Figure 5, in one execution of _mainLoop the first instruction line processes the data of region A, the second instruction line processes the data of region B, and the third instruction line processes the data of region C. Each instruction line reads 8 data from each of the two input vectors in1 and in2. The first instruction line reads the first 8 numbers of in1 into registers xr10, xr11, yr10, yr11, zr10, zr11, tr10, tr11 in turn, and the first 8 numbers of in2 into xr12, xr13, yr12, yr13, zr12, zr13, tr12, tr13 in turn. The second instruction line reads the following 8 data of in1 and in2 into xr14~tr15, and the third instruction line reads the next 8 data of in1 and in2 into registers xr18~tr19. The 8 data of the output vector out computed from each group are placed in registers xr22~tr23, xr24~tr25, and xr26~tr27 in turn. The 4th~6th instruction lines repeat the data operations of the 1st~3rd instruction lines, so that loop unrolling improves efficiency.
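The pointer arithmetic described above can be checked with a short Python trace. It is a sketch only: `read_offsets` is a hypothetical helper, offsets are counted in units of one "number", and it assumes that lines 1 and 4 use u0, lines 2 and 5 use u1, and lines 3 and 6 use u2, each with a post-increment of 24.

```python
GROUP = 8    # numbers read per instruction line (from the text)
STRIDE = 24  # post-increment applied to each pointer after a read (from the text)


def read_offsets(num_loops):
    """Return (loop index, line number, start offset) for every read of in1."""
    pointers = [0, GROUP, 2 * GROUP]  # u0, u1 = u0 + 8, u2 = u1 + 8
    trace = []
    for loop in range(num_loops):
        for line in range(6):      # six instruction lines per _mainLoop
            p = line % 3           # lines 4~6 reuse the pointers of lines 1~3
            trace.append((loop, line + 1, pointers[p]))
            pointers[p] += STRIDE
    return trace


if __name__ == "__main__":
    for loop, line, start in read_offsets(2):
        print(f"loop {loop} line {line}: numbers {start}..{start + GROUP - 1}")
```

With these assumptions the reads tile the input without gaps or overlap, 48 numbers per execution of _mainLoop; the same pattern applies to v0~v2 for in2.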
When n is 960, this code runs in 211 cycles. Analysis shows that each instruction line in each execution of _mainLoop computes 8 data results and that _mainLoop contains 6 instruction lines, so one execution of _mainLoop yields 48 data results. Because of the instruction pipeline itself, the final judgment instruction if lc0 b_mainLoop stalls for 6 beats. The preceding 5 instruction lines have been optimized for the pipeline and produce no stalls between them, so one execution takes 11 beats in total. The theoretical number of running cycles obtained from this is close to the actual cycle count of 211. After the code is optimized by software pipelining, efficiency improves 40-fold.
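The estimate can be worked through numerically. The Python sketch below is one possible reading of the figures quoted above, not the patent's own formula: `estimated_cycles` is a hypothetical name, n is assumed to be counted in the same units as the 48 results per execution, and the start-up and ending stages are ignored.

```python
def estimated_cycles(n, results_per_line=8, lines_per_loop=6, branch_stall=6):
    """Rough cycle estimate for the software-pipelined _mainLoop."""
    results_per_loop = results_per_line * lines_per_loop  # 8 * 6 = 48
    beats_per_loop = (lines_per_loop - 1) + branch_stall  # 5 + 6 = 11
    loops = -(-n // results_per_loop)                     # ceiling division
    return loops * beats_per_loop


if __name__ == "__main__":
    print(estimated_cycles(960))  # 20 executions of _mainLoop at 11 beats each
```

For n = 960 this reading gives 20 executions at 11 beats each, that is 220 beats, which is of the same order as the measured 211 cycles.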
Specifically, in practical engineering implementation, the code is optimized according to the following steps:
Step 1: According to the algorithm of the floating-point complex vector summation function, first consider the steps for writing the assembly program. It mainly consists of one loop: read data from memory into general registers, sum the data in the general registers, and write the result to the memory of the output vector. The initial assembly function is written with its loop following the loop processing used in the C program.
Step 2: Apply loop unrolling to the original assembly code while maximizing instruction parallelism. Consider unrolling several mutually independent loop iterations, and consider instruction parallelism within the code of one iteration, so that a single instruction controls as many arithmetic units as possible performing the same operation at the same time. In this function, the corresponding data pairs of the two vectors all undergo the same addition, so the 4 basic arithmetic units can be used simultaneously with instructions that control all 4 units, and the 8 adders in each arithmetic unit can be fully used, which makes full use of on-chip resources.
Step 3: On the basis of Step 2, consider the effect of pipeline stalls on instruction layout. Read results and computed results both take effect only two instruction lines later, so the two following adjacent iterations can be unrolled into the two instruction lines of this pipeline stall. All instruction lines are then used effectively, and no pipeline stall occurs.
Step 4: The final optimization is made on the basis of Step 3. The read of the first iteration is used as the first line of the loop code. After the next two code lines comes the instruction line in which this iteration performs its summation, and after two more lines the write operation can be performed. At this point lc0 is tested; when the loop has not finished (that is, not all data have been computed), execution jumps to the first line of the loop, which performs the write of the first iteration. The instruction line in which the first iteration performs its summation can at the same time perform the read of the fourth iteration, and on re-entering the loop the first line performs the summation of the fourth iteration; subsequent iterations follow by analogy. In this way every instruction line is used to the maximum: each instruction line performs a read, a computation, and a write, only for different iterations. This also limits the growth in the number of instruction lines in the loop caused by loop unrolling. Interleaving multiple iterations in this way is one implementation of software pipelining.
Assembly functions are written by making full use of the underlying hardware resources of the BWDSP processor chip and of the instruction set developed for its hardware characteristics; on this basis, a series of optimization methods is considered in order to achieve optimal efficiency. In particular, when the program contains loop code, directly writing the corresponding assembly code in the manner of the C program means that each loop iteration can process only a small amount of data. The ability to execute 16 instructions in parallel in each instruction cycle cannot be fully used, and the SIMD instructions developed for the underlying hardware are not used, so many operations that could execute in parallel are executed sequentially, which wastes time; especially when the loop is large, this greatly reduces efficiency.
With instruction-parallel optimization, the instruction slots of each instruction line are used to the maximum and SIMD instructions are used wherever possible, so that one instruction controls multiple arithmetic units performing the same operation at the same time. This makes the code more concise and, more importantly, greatly improves execution efficiency. On the basis of instruction parallelism, mutually independent iterations of the loop body are unrolled, placing the code of several iterations sequentially in one code segment. The final judgment-and-jump instruction of the loop code incurs several pipeline stalls; the more loop iterations there are, the more often this last instruction executes, and the greater the effect of its stalls on efficiency. Loop unrolling therefore reduces the number of times the judgment-and-jump instruction executes and improves efficiency to a certain extent. Finally, combining instruction parallelism and loop unrolling, software pipelining optimization is performed: according to the characteristics of the instruction pipeline, the iterations unrolled in the loop body are rearranged so that adjacent iterations are interleaved, which eliminates pipeline stalls and improves efficiency very significantly.
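The effect of loop unrolling on branch overhead can be illustrated with a few lines of Python. This is a sketch under assumptions: `branch_stall_beats` is a hypothetical helper, the 6-beat stall per judgment instruction is the figure quoted earlier in the description, and all other costs are ignored.

```python
def branch_stall_beats(n, numbers_per_pass, stall_per_branch=6):
    """Total stall beats spent on the loop's final judgment-and-jump instruction."""
    passes = -(-n // numbers_per_pass)  # ceiling division: one branch per pass
    return passes * stall_per_branch


if __name__ == "__main__":
    n = 960
    print("8 numbers per pass: ", branch_stall_beats(n, 8))
    print("48 numbers per pass:", branch_stall_beats(n, 48))
```

Under this model a loop that handles 8 numbers per pass spends 720 beats on branch stalls for n = 960, while a loop unrolled to 48 numbers per pass spends 120.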
When developing low-level functions for this chip, hardware resources should be fully used and suitable instructions selected. Optimization should be considered while writing the code, so that instruction parallelism, loop unrolling, and software pipelining are all reflected in the code, thereby producing efficient low-level functions.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
- 1. An optimization method for realizing floating-point complex vector summation based on a BWDSP chip, characterized in that: the floating-point complex vector summation is the summation of a first floating-point complex vector and a second floating-point complex vector, and the summation of the first floating-point complex vector and the second floating-point complex vector is a loop composed of multiple floating-point complex summation processes; each floating-point complex summation process includes optimization based on BWDSP chip instruction parallelism, optimization based on loop unrolling, and optimization based on software pipelining; the optimization based on BWDSP chip instruction parallelism uses one instruction to control multiple arithmetic units to perform the same operation simultaneously; the optimization based on loop unrolling unrolls multiple identical loop iterations within one loop; the optimization based on software pipelining executes the multiple identical loop iterations in an interleaved, parallel manner.
- 2. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 1, characterized in that the BWDSP chip includes 4 execution macros; each execution macro includes a general register group of 64 words, the registers in the general register group are numbered from 0 to 63, and each execution macro also includes 8 adders; the optimization based on BWDSP chip instruction parallelism specifically includes: reading, in one instruction, 8 data from the first floating-point complex vector and 8 data from the second floating-point complex vector, the reading of the 8 data from the first floating-point complex vector and the reading of the 8 data from the second floating-point complex vector being performed in parallel; the number of data in each of the first floating-point complex vector and the second floating-point complex vector is N, and N is much greater than 8; the 8 data from the first floating-point complex vector are stored in turn in the eight registers formed by the first register and the second register in the four register groups of the BWDSP chip; the 8 data from the second floating-point complex vector are stored in turn in the eight registers formed by the third register and the fourth register in the four register groups of the BWDSP chip; the storing of the 8 data from the first floating-point complex vector and the storing of the 8 data from the second floating-point complex vector are performed in parallel; in one instruction, in each execution macro of the BWDSP chip, the first adder adds the data in the first register to the data in the third register and stores the sum in the fifth register, and the second adder adds the data in the second register to the data in the fourth register and stores the sum in the sixth register; the addition by the first adder and the addition by the second adder are performed in parallel.
- 3. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 1, characterized in that the optimization based on loop unrolling is specifically: performing multiple floating-point complex summation processes in one loop.
- 4. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 3, characterized in that the optimization based on loop unrolling is specifically: performing three floating-point complex summation processes in one loop.
- 5. The optimization method for realizing floating-point complex vector summation based on a BWDSP chip according to claim 4, characterized in that the optimization based on software pipelining is specifically: Iter1, Iter2, and Iter3 denote three adjacent loop iterations, and i0~i5 denote the instructions that each iteration needs to execute, where i0 is the instruction that reads data from the two vectors, i1 is the summation instruction, i2 is the write instruction, i3 is the counter-increment instruction, i4 is the data-length-decrement instruction, and i5 is the judgment instruction; after optimization based on software pipelining, the overall loop consists in turn of a loop start-up stage, a loop core stage, and a loop ending stage, 8 instruction cycles in total; the loop start-up stage consists of instruction cycle 0 and instruction cycle 1, and in the loop start-up stage Iter1 executes in turn the read instruction i0 and the summation instruction i1, Iter2 executes the read instruction i0, and Iter3 does not execute any instruction; the loop core stage runs from instruction cycle 2 to instruction cycle 5, and in the loop core stage Iter1 executes in turn the write instruction i2, the counter-increment instruction i3, the data-length-decrement instruction i4, and the judgment instruction i5, Iter2 executes in turn the summation instruction i1, the write instruction i2, the counter-increment instruction i3, and the data-length-decrement instruction i4, and Iter3 executes in turn the read instruction i0, the summation instruction i1, the write instruction i2, and the counter-increment instruction i3; the loop ending stage consists of instruction cycle 6 and instruction cycle 7, and in the loop ending stage Iter1 does not execute any instruction, Iter2 executes the judgment instruction i5, and Iter3 executes in turn the data-length-decrement instruction i4 and the judgment instruction i5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710419846.2A CN107357552B (en) | 2017-06-06 | 2017-06-06 | Optimization method for realizing floating-point complex vector summation based on BWDSP chip |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710419846.2A CN107357552B (en) | 2017-06-06 | 2017-06-06 | Optimization method for realizing floating-point complex vector summation based on BWDSP chip |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357552A (en) | 2017-11-17 |
CN107357552B (en) | 2020-10-16 |
Family
ID=60272219
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710419846.2A Active CN107357552B (en) | 2017-06-06 | 2017-06-06 | Optimization method for realizing floating-point complex vector summation based on BWDSP chip |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357552B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109474268A (en) * | 2018-12-19 | 2019-03-15 | 北京比特大陆科技有限公司 | Circuit structure, circuit board and supercomputer equipment |
CN110058884A (en) * | 2019-03-15 | 2019-07-26 | 佛山市顺德区中山大学研究院 | For the optimization method of calculation type store instruction set operation, system and storage medium |
CN111273889A (en) * | 2020-01-15 | 2020-06-12 | 西安电子科技大学 | Floating point complex FIR optimization method based on HXDDSP 1042 processor |
CN111291320A (en) * | 2020-01-16 | 2020-06-16 | 西安电子科技大学 | Double-precision floating-point complex matrix operation optimization method based on HXDDSP chip |
CN111553123A (en) * | 2020-04-26 | 2020-08-18 | 西安电子科技大学 | Code execution optimization method under complex function limited register based on DSP |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130151822A1 (en) * | 2011-12-09 | 2013-06-13 | International Business Machines Corporation | Efficient Enqueuing of Values in SIMD Engines with Permute Unit |
CN104615584A (en) * | 2015-02-06 | 2015-05-13 | 中国人民解放军国防科学技术大学 | Method for vectorization computing of solution of large-scale trigonometric linear system of equations for GPDSP |
CN104715463A (en) * | 2015-04-09 | 2015-06-17 | 哈尔滨工业大学 | Optimizing method for performing ultrasonic image smooth treating program based on DSP (Digital Signal Processor) |
Non-Patent Citations (1)
Title |
---|
甄扬: "基于多核VLIW_DSP的数字信号变换函数并行优化", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109474268A (en) * | 2018-12-19 | 2019-03-15 | 北京比特大陆科技有限公司 | Circuit structure, circuit board and supercomputer equipment |
CN109474268B (en) * | 2018-12-19 | 2024-02-06 | 北京比特大陆科技有限公司 | Circuit structure, circuit board and super computing device |
CN110058884A (en) * | 2019-03-15 | 2019-07-26 | 佛山市顺德区中山大学研究院 | For the optimization method of calculation type store instruction set operation, system and storage medium |
CN111273889A (en) * | 2020-01-15 | 2020-06-12 | 西安电子科技大学 | Floating point complex FIR optimization method based on HXDDSP 1042 processor |
CN111291320A (en) * | 2020-01-16 | 2020-06-16 | 西安电子科技大学 | Double-precision floating-point complex matrix operation optimization method based on HXDDSP chip |
CN111291320B (en) * | 2020-01-16 | 2023-12-15 | 西安电子科技大学 | Double-precision floating point complex matrix operation optimization method based on HXPS chip |
CN111553123A (en) * | 2020-04-26 | 2020-08-18 | 西安电子科技大学 | Code execution optimization method under complex function limited register based on DSP |
CN111553123B (en) * | 2020-04-26 | 2024-03-26 | 西安电子科技大学 | Code execution optimization method based on DSP (digital Signal processor) under complex function limited register |
Also Published As
Publication number | Publication date |
---|---|
CN107357552B (en) | 2020-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107357552A (en) | The optimization method of floating-point complex vector summation is realized based on BWDSP chips | |
JP2928695B2 (en) | Multi-thread microprocessor using static interleave and instruction thread execution method in system including the same | |
US6675376B2 (en) | System and method for fusing instructions | |
TWI599949B (en) | Method and apparatus for implementing a dynamic out-of-order processor pipeline | |
US5710902A (en) | Instruction dependency chain indentifier | |
Chow | The Mips-X RISC Microprocessor | |
US8677330B2 (en) | Processors and compiling methods for processors | |
CN101957744B (en) | Hardware multithreading control method for microprocessor and device thereof | |
US5404555A (en) | Macro instruction set computer architecture | |
CN110321159A (en) | For realizing the system and method for chain type blocks operation | |
CN107038019A (en) | The method and computing system of process instruction in single-instruction multiple-data computing system | |
CN104536914B (en) | The associated processing device and method marked based on register access | |
Ashok et al. | ASIC design of MIPS based RISC processor for high performance | |
Uhrig et al. | A two-dimensional superscalar processor architecture | |
Nielsen et al. | A behavioral synthesis frontend to the haste/tide design flow | |
JP2874351B2 (en) | Parallel pipeline instruction processor | |
Yeh et al. | Branch history table indexing to prevent pipeline bubbles in wide-issue superscalar processors | |
CN111291320B (en) | Double-precision floating point complex matrix operation optimization method based on HXPS chip | |
JP2828219B2 (en) | Method of providing object code compatibility, method of providing object code compatibility and compatibility with scalar and superscalar processors, method for executing tree instructions, data processing system | |
Akram | A study on the impact of instruction set architectures on processor’s performance | |
Huang et al. | ASIA: Automatic synthesis of instruction-set architectures | |
JP2008523523A (en) | Compiling method, compiling device and computer system for loop in program | |
JP2006515446A (en) | Data processing system with Cartesian controller that cross-references related applications | |
CN101907999A (en) | Binary translation method of super-long instruction word program | |
JP2861234B2 (en) | Instruction processing unit |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||