CN1514358A

CN1514358A - Optimization method for parallel operation processing based on the StarCore digital signal processor

Info

Publication number: CN1514358A
Application number: CNA021605467A
Authority: CN
Inventors: 翔陈; 陈翔; 丁剑锋; 李火林; 蔺荣岩; 周海军; 温占波
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2002-12-31
Filing date: 2002-12-31
Publication date: 2004-07-21

Abstract

The method comprises: analyzing the code in a loop body and decomposing it; simply merging the decomposed code and eliminating redundancy; aggregating the code into several code blocks according to causal dependence; determining the code trunk and aggregating the other code into it; adjusting the code combination by staggered (dislocation) combination of code blocks; parallelizing and merging the head and tail code that cannot be integrated into the new loop body with part of the code outside the original loop body; and translating the code into SC140 assembly language.

Description

An optimization method for parallel operation processing based on the StarCore digital signal processor

Technical field

The present invention relates to methods for optimizing computation on a digital signal processor (DSP) with a parallel arithmetic structure, and in particular to a method for optimizing computation on a DSP based on StarCore (the name of a new DSP architecture, which is a registered trademark).

Background technology

Modern communication products commonly use DSPs to perform special operations with very high computational cost, such as speech coding and decoding, image or video coding and decoding, and encryption and decryption. The better the computation speed is optimized, the higher the processing capacity of each DSP and the higher the performance-to-price ratio of the product. Optimizing DSP computation is therefore often one of the key technologies in communication products.

In a DSP with only a single arithmetic logic unit (ALU), the approach to optimization is intuitive and fairly simple. As more ALUs are integrated into a single DSP chip and memory continues to grow, parallel optimization of computation speed has become increasingly important.

In a program, most of the computational cost is generally concentrated in a small number of loops, especially nested loops, so concentrating effort on optimizing this code yields a large gain for a small investment. Motorola's technical document on program restructuring and optimization for DSPs based on the StarCore core (Programming Techniques for Parallel DSP Architectures: Optimal Performance on the StarCore SC140) describes several mainstream optimization methods, chiefly the split-summation technique and the multisample technique. The main idea is to exploit the independence of code in the loop body to turn a single thread of computation into several that advance in parallel, so that the multiple data arithmetic logic units (DALUs) and address generation units (AGUs) of the DSP carry out several independent arithmetic and address operations simultaneously. This makes full use of the DSP resources and increases execution speed. Besides these two main techniques, some auxiliary techniques are also frequently used in code optimization, including multi-byte alignment (which lets a single instruction transfer several words or double words between registers and memory in one machine cycle) and loop unrolling.
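The split-summation idea mentioned above can be sketched in C as follows. This is a minimal illustration, not code from the Motorola document: the function names `dot_serial` and `dot_split4` are hypothetical, the accumulation uses plain 32-bit integers without saturation (with saturating ITU operators the regrouping could change the result), and `n` is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>

/* Single accumulation chain: every multiply-add waits for the previous one. */
int32_t dot_serial(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];
    return acc;
}

/* Split summation: four independent chains that a 4-DALU core could
   advance in the same cycle. Assumes n % 4 == 0 and no overflow. */
int32_t dot_split4(const int16_t *x, const int16_t *y, int n)
{
    assert(n % 4 == 0);
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = 0; i < n; i += 4) {
        a0 += (int32_t)x[i]     * y[i];
        a1 += (int32_t)x[i + 1] * y[i + 1];
        a2 += (int32_t)x[i + 2] * y[i + 2];
        a3 += (int32_t)x[i + 3] * y[i + 3];
    }
    /* The partial sums are combined once, outside the loop. */
    return a0 + a1 + a2 + a3;
}
```

The four partial sums have no dependence on one another inside the loop, which is what allows them to be issued together; the technique relies on the iterations being independent.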

However, as modern algorithms grow more complex and more varied, these intuitive optimization methods alone cannot fully optimize some modules. The SC140 (StarCore DSP) core has 4 independent DALUs and 2 AGUs, and at peak efficiency it can complete 4 multiply-accumulate operations and 2 address operations in one DSP machine cycle. Because of restrictions within the DSP instruction set, however, this potential for parallel operation is often hard to exploit fully. In particular, when the computations of two adjacent iterations of a loop are closely dependent on each other, and the result of the code in the loop body is uncertain (for example, when it contains an if statement block), common optimization methods cannot achieve sufficient parallelism. Cases in which one iteration depends on the result of the previous iteration are ubiquitous.

Summary of the invention

Existing optimization methods cannot make full use of DSP resources and cannot fully optimize loop bodies that have tight causal dependences and uncertain computation results. To address this shortcoming, the object of the present invention is to provide an efficient optimization method for parallel operation processing in which, after a special staggered (dislocation) recombination, the loop body becomes code that can be executed with a high degree of parallelism, greatly improving the computational efficiency of the digital signal processor.

To achieve the above object, the present invention adopts the following technical scheme:

a. First, analyze the code in the loop body and decompose it so that the internal complexity of each statement is as small as possible and each statement corresponds one-to-one to SC140 digital signal processor assembly code;

b. Simply merge the code above, using operations such as variable substitution to make the code concise and eliminate redundancy, so that it is easy to aggregate and recombine;

c. Aggregate the code in the loop into several code blocks, also called code queues, according to causal dependence, and annotate each line of code with the corresponding assembly instruction and the execution unit it uses, indicating whether it is a data arithmetic logic unit instruction or an address generation unit instruction;

d. Find the code block with the longest running time in the loop and take it as the code trunk; progressively add the code in the other code blocks to this block and aggregate it, forming several blocks of code that can execute in parallel (parallel blocks), with the aim of minimizing the number of parallel blocks in the loop;

e. The result of step d is generally not yet optimal; that is, the data arithmetic logic unit and address generation unit resources of the digital signal processor are not yet fully used, and the code combination needs to be adjusted. Provided that the wraparound order of the code in the loop is unchanged, the resource limits and syntax of the SC140 digital signal processor are not exceeded, and no causal conflict arises in the computation, the way the code blocks in the loop are staggered and combined can be chosen flexibly, as follows:

e1. If the total code length in the loop exceeds the length of the code trunk (measured in parallel blocks), the position at which each parallel block of a code block is merged relative to the trunk can be adjusted. Even placing the head of the code block at the tail of the loop is acceptable: the code that follows it wraps around to the front of the loop body, and its wraparound order is unchanged;

e2. If the total code length in the loop equals the length of the code trunk, the head and tail parts of the code trunk can be merged into one parallel block by adjusting the code, compressing the trunk length and further increasing the degree of parallelism;

This adjustment usually has to be repeated several times to reach the best result;

f. Because the recombination in the previous step is in essence a staggered recombination across several loop iterations, the head and tail code of the original loop that cannot be integrated into the new loop body must also be parallelized and merged with part of the code outside the original loop body;

g. Translate the processed code one-to-one into SC140 assembly language.

With the above technical scheme, the optimization method provided by the invention recombines the code within the same loop, and the total code length is generally smaller than with existing optimization methods. When the computations of adjacent loop iterations are tightly related, and particularly when the code contains if statement blocks, the method of the invention greatly improves the effect of parallel optimization.

Description of drawings

Fig. 1 is a section of C code before optimization in a specific embodiment of the invention;

Fig. 2 shows the code of Fig. 1 after variable substitution and decomposition;

Fig. 3 shows the code of Fig. 2 after it has been aggregated into blocks according to how closely the code is causally related;

Fig. 4 shows the three code queues of the loop in Fig. 3 after preliminary combination;

Fig. 5 shows the result of ring-shifting the second code queue of Fig. 4 and recombining;

Fig. 6 shows the final result after the code in the loop has been optimized and recombined;

Fig. 7 shows the loop of Fig. 6 unrolled, to show its relation to the loop code before recombination;

Fig. 8 is a schematic diagram of the staggered recombination of loop body code according to the invention;

Fig. 9 is a flowchart of the optimization method for parallel operation processing of the invention.

Embodiment

A specific embodiment of the invention is described in detail below with reference to the accompanying drawings.

Fig. 1 is a section of C code before optimization in this embodiment. To achieve the best optimization and also allow step-by-step simulation and debugging, this embodiment uses C code to simulate the parallel optimization process; once the final optimized result is reached, the C code is translated one-to-one into DSP assembly language.

This allows every step of the optimization to be simulated, tested, and checked for errors in an integrated development environment, so that each step eliminates most possible mistakes. If the DSP code after final translation still proves faulty on verification, the optimization steps recorded in C serve as a reference that makes it easy to locate the error.

Several special functions in the code are basic operators defined by the International Telecommunication Union (ITU); their definitions can be found in the relevant documents. Briefly, add, mult, L_mac, L_msu, and round mean the following: add: addition; mult: multiplication; L_mac: multiply-accumulate; L_msu: multiply-subtract; round: round the low word into the high word and extract the high word. _1_4 and _1_2 represent 1/4 and 1/2 respectively.

The concrete steps of the optimization process are described below:

a. As shown in Fig. 2, decompose the code in the loop body. Fig. 2 is the code formed from the original code after variable substitution and decomposition. The goal of this step is to decompose the C code until the complexity inside each statement is as small as possible, so that each line of C code corresponds to one simple DSP assembly instruction (the few lines in Fig. 2 that consist of two C statements can also each correspond to an SC140 DSP assembly statement). For example, the statement s = L_msu(L_mult(alp, sq1), sq, alp_16) is split into the following two lines:

s1 = L_mult(alp, sq1);

s2 = L_mult(sq, alp_16);

The expression s = s1 - s2 does not appear because the following statement if (s > 0) can be converted into the equivalent if (s1 > s2), which can be implemented with a single assembly statement. The decomposition of the other code is straightforward and is not explained line by line here.
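The equivalence of `if (s > 0)` and `if (s1 > s2)` can be checked with a small C sketch. The helpers below are simplified stand-ins written for this illustration, not the ITU reference implementations: `L_mult_sketch` and `L_sub_sketch` are hypothetical names, and they model only the saturating fractional multiply and the saturating subtraction that `L_msu` performs after its multiply.

```c
#include <stdint.h>

/* Stand-in for L_mult: fractional multiply (a * b * 2), saturated.
   Only -32768 * -32768 can exceed the 32-bit range. */
int32_t L_mult_sketch(int16_t a, int16_t b)
{
    int64_t p = (int64_t)a * b * 2;
    return p > INT32_MAX ? INT32_MAX : (int32_t)p;
}

/* Stand-in for the saturating subtraction inside L_msu. */
int32_t L_sub_sketch(int32_t x, int32_t y)
{
    int64_t d = (int64_t)x - y;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}

/* Original form: compound statement followed by a sign test on s. */
int test_original(int16_t alp, int16_t sq1, int16_t sq, int16_t alp_16)
{
    int32_t s = L_sub_sketch(L_mult_sketch(alp, sq1),
                             L_mult_sketch(sq, alp_16));
    return s > 0;
}

/* Decomposed form: two independent multiplies and one comparison. */
int test_decomposed(int16_t alp, int16_t sq1, int16_t sq, int16_t alp_16)
{
    int32_t s1 = L_mult_sketch(alp, sq1);
    int32_t s2 = L_mult_sketch(sq, alp_16);
    return s1 > s2;
}
```

Saturation does not disturb the equivalence in this model: a difference that saturates upward is still positive, and one that saturates downward is still negative, so the sign of `s` always agrees with the comparison of `s1` and `s2`. The two multiplies in the decomposed form are independent of each other, which is what makes them candidates for the same parallel block.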

b. This step is optional: simply merge and simplify part of the decomposed code, so that the code is as concise as possible and easiest to recombine in parallel, as shown in Fig. 3. For example, when decomposing the statement alp1 = L_mac(alp0, rr[i1][i1], _1_4), a pointer variable prr1 = &rr[i1][i1] is added. The 3 statements after decomposition are much more likely to execute in parallel with other statements.

alp1 = *prr1; prr1 += n0;

alp1 = alp1 >> 2;

alp1 = alp1 + alp0;    (1)

When decomposing the statement alp1 = L_mac(alp1, rr[i0][i1], _1_2), besides adding a pointer variable prr2 = &rr[i0][i1], a variable alp2 is also added, so that alp2 can later be computed in parallel with alp1.

alp2 = *prr2; prr2 += n0;

alp2 = alp2 >> 1;

alp2 = alp2 + alp1;

Considering the code alp_16 = round(alp2) together with code line (1) above, the computation of alp_16 can be rewritten as alp_16 = (L_add(alp0, alp2)) >> 16; code line (1) above can then be removed, with exactly the same effect.
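The regrouping described above rests on the fact that the additions can be re-associated. The following C sketch compares the two orderings. It is an illustration under stated assumptions, not the code of Fig. 3: the function names are hypothetical, `r1` and `r2` stand for the values loaded through `prr1` and `prr2`, plain 32-bit addition replaces the saturating `L_add`, both variants extract the high word with `>> 16`, and the rounding offset of `round` is left out because the text does not say how it is absorbed. Inputs are assumed to be non-negative and small enough not to overflow.

```c
#include <stdint.h>

/* Chained form: alp1 needs alp0 before alp2 can be completed. */
int16_t alp16_chain(int32_t alp0, int32_t r1, int32_t r2)
{
    int32_t alp1 = r1 >> 2;
    alp1 = alp1 + alp0;               /* code line (1) */
    int32_t alp2 = r2 >> 1;
    alp2 = alp2 + alp1;
    return (int16_t)(alp2 >> 16);
}

/* Regrouped form: line (1) is removed and alp0 is added at the end,
   so alp1 and alp2 no longer wait for alp0. */
int16_t alp16_regrouped(int32_t alp0, int32_t r1, int32_t r2)
{
    int32_t alp1 = r1 >> 2;
    int32_t alp2 = r2 >> 1;
    alp2 = alp2 + alp1;
    return (int16_t)((alp0 + alp2) >> 16);
}
```

Under these assumptions both functions compute the high word of the same sum; the regrouped form simply has one statement fewer in the dependence chain that leads to `alp2`.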

c. Continuing with Fig. 3, the decomposed code is aggregated into several code blocks, also called code queues, according to how closely the statements are related. Code within a block has direct sequential causal dependence, while dependence between blocks is weaker; the code queues formed in this way are the basis for the subsequent parallel optimization. At the same time, each line of C code is annotated with the DSP assembly instruction it translates into and whether that instruction is a DALU or an AGU instruction. The benefit is an intuitive view of DALU and AGU usage in each parallel block during the parallel recombination and adjustment in the following steps. With these annotations, the final translation into assembly language also takes almost no effort. Note that each aggregated code block can be further divided into several parallel blocks; only C code belonging to the same parallel block can execute in parallel after it is finally translated into DSP code.

In this example the code in the loop body is divided into three blocks, separated from one another by blank lines (as shown in Fig. 3). The longest, the trunk code, is the second block in the loop body shown in Fig. 3, which can be divided into four parallel blocks, marked in the leftmost column as 21) 22) 23) 24), where the high-order digit 2 indicates the second code block. The other two code blocks are each divided into three parallel blocks.

d. From these code blocks, find the one with the most parallel blocks and take it as the trunk. Subject to the dependences between code, the number of DALUs and AGUs, and the restrictions on the translated DSP assembly code, append the code of the other code blocks step by step to the parallel blocks of the trunk, making full use of the DALU and AGU resources remaining in each parallel block.

d1. The SC140 instruction set has an important feature: while a variable is being used in one computation, the result of another computation can be written to the same variable at the same time. The two operations can execute in parallel without affecting the result. Although both complete in the same machine cycle, the read in fact takes place slightly before the write. This feature is used frequently in parallel optimization and is very important.
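The read-before-write behaviour described in d1 can be modelled in C by sampling all source operands before committing any result. This is a model of the semantics as the text describes them, not something taken from SC140 documentation; the `Regs` structure and the function `execute_set` are hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t a, b, c, t;
} Regs;

/* Models one parallel block containing two operations issued together:
       t = a + b    (reads a)
       a = c        (writes a)
   All reads happen first, then all writes, so the addition still sees
   the old value of a. */
void execute_set(Regs *r)
{
    assert(r != NULL);

    /* Read phase: sample every source operand. */
    int32_t sum   = r->a + r->b;
    int32_t new_a = r->c;

    /* Write phase: commit every result. */
    r->t = sum;
    r->a = new_a;
}
```

When simulating parallel blocks in sequential C, as the embodiment does, the same effect is obtained by ordering the statements of a block so that every use of a variable precedes its overwrite.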

d2. The wraparound order of the code inside each code block must not be disturbed, but the starting position at which a code block is attached to the trunk can be varied within the given restrictions, so that every cycle on the trunk makes full use of the DALU and AGU resources. Each parallel execution cycle can contain at most 4 DALU operations and at most 2 AGU operations.
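The unit limits stated in d2 lend themselves to a simple check while recombining code by hand. The sketch below counts the DALU and AGU annotations of one parallel block. The names `enum unit` and `block_fits` are hypothetical, and the check covers only the unit counts; it does not model data dependences or the other instruction-set restrictions the text mentions.

```c
#include <assert.h>
#include <stddef.h>

enum unit { DALU, AGU };

#define MAX_DALU 4   /* DALU operations per parallel block */
#define MAX_AGU  2   /* AGU operations per parallel block  */

/* Returns 1 if the operations annotated in ops[0..n-1] fit within the
   per-cycle unit limits, 0 otherwise. */
int block_fits(const enum unit *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    int dalu = 0, agu = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == DALU)
            dalu++;
        else
            agu++;
    }
    return dalu <= MAX_DALU && agu <= MAX_AGU;
}
```

A block annotated with 4 DALU and 1 AGU operations passes this check, while a block with 5 DALU operations does not and would have to be split across two cycles.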

Fig. 4 shows the new code combination formed after preliminary combination; it consists of five parallel blocks, separated by blank lines.

e. The code produced in step d clearly does not yet achieve the optimal parallelism that makes full use of the DSP resources, and should be adjusted further, on condition that the wraparound order inside each code block is unchanged, the resource limits and syntax of the SC140 DSP are respected, and no causal conflict arises in the computation. The specific procedure is as follows:

e1. The total code length in the loop is 5 parallel blocks, which exceeds the trunk length of 4 parallel blocks. The trunk code block (the second queue) is therefore ring-shifted upward once: the two lines marked 21) that were in the first parallel block move to the last parallel block, and the code marked 22) 23) 24) each moves up one parallel block. The code is then split and aggregated again according to DALU and AGU usage, giving the arrangement in Fig. 5.

e2. Fig. 5 shows that the number of parallel blocks is one fewer than in Fig. 4 and now equals the number of parallel blocks of the trunk, but this is still not optimal. The trunk length can now be compressed by merging the head and tail of the trunk into one parallel block: the code marked 24) and 21) in the 3rd and 4th parallel blocks is respectively the tail and the head of the original second code queue. Merging the 3rd and 4th parallel blocks into one turns out to satisfy all the restrictions.
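The ring shift of step e1 can be pictured as rotating the trunk queue by one parallel block. The sketch below uses the labels 21 to 24 from the text; the function name `ring_shift_up` is hypothetical, and the sketch models only the reordering of the labelled parts, not the subsequent splitting and aggregation by DALU and AGU usage.

```c
#include <assert.h>
#include <stddef.h>

/* Rotates a code queue upward by one slot: every part moves up one
   parallel block and the former head becomes the tail. The circular
   (wraparound) order of the parts is preserved. */
void ring_shift_up(int *slots, size_t n)
{
    assert(slots != NULL && n > 0);
    int head = slots[0];
    for (size_t i = 0; i + 1 < n; i++)
        slots[i] = slots[i + 1];
    slots[n - 1] = head;
}
```

Applied to the queue {21, 22, 23, 24}, the function yields {22, 23, 24, 21}, which matches the arrangement described for Fig. 5, where 24) and 21) sit in the 3rd and 4th parallel blocks. Because the head now follows the tail inside the loop body, the head part in a given pass does work for a later original iteration, which is where the staggering between iterations comes from.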

Fig. 6 shows the loop code after recombination. The code is implemented in three parallel blocks; that is, once translated into the corresponding DSP assembly, one iteration completes in only 3 DSP machine cycles. The DSP resources used by the three parallel blocks are (4 DALU, 1 AGU), (3 DALU), and (4 DALU, 2 AGU). The optimization is thus very thorough, and the loop cannot be compressed further to 2 DSP cycles.

Comparing Fig. 3 with Fig. 6 shows that after recombination only the first code block of Fig. 3 keeps its order; the code in the other 2 code blocks has wrapped around. In the trunk code of Fig. 6, the computation of alp_16 is also combined into the same parallel block as the assignments of alp1 and alp2, which reduces the trunk length from four parallel blocks to three.

The code in Fig. 6 has been recombined, but it is not yet the final result of the optimization, because the dependences among the code in the new loop body are no longer the same as before: in each iteration, part of the code depends on results computed one or even two iterations earlier, and part performs computation in advance for the next 1 to 2 iterations. The first iteration after recombination must therefore be preceded by code from the beginning of the original loop body, and the last iteration must be followed by code from the end of the original loop body. So although the code in the recombined loop body corresponds one-to-one to the decomposed code of the original loop, the new iteration count ≤ the original iteration count - 2. This is why the iteration count is 8 in Fig. 1, Fig. 2, and Fig. 3 but 6 in Fig. 6.

f. Unroll the first and last few iterations of the recombined code, as shown in Fig. 7. The number in the first column of the figure shows which iteration of the original loop each statement belongs to. As the figure shows, new iteration 1 after recombination contains statements from original iterations 1, 2, and 3; new iteration 2 contains statements from original iterations 2, 3, and 4; and so on. In addition, code belonging to original iterations 1 and 2 lies outside the new loop, before its first iteration; this code can generally be parallelized and merged with code outside the loop body, and the extra code at the tail is handled in the same way.
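The relationship between the original loop, the recombined loop, and the extra head and tail code can be sketched in C with a generic three-stage loop body. This is not the code of Fig. 6 or Fig. 7: the stage functions are a hypothetical workload chosen only so that each stage depends on the previous one, and the function names are assumptions of this sketch.

```c
#include <assert.h>

/* Three dependent stages of one original iteration (hypothetical). */
static int stage_a(int x) { return x * 2; }
static int stage_b(int a) { return a + 1; }
static int stage_c(int b) { return b * b; }

/* Original loop: the stages of each iteration run one after another. */
void loop_original(const int *in, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = stage_c(stage_b(stage_a(in[i])));
}

/* Staggered loop: each pass holds stage C of iteration k, stage B of
   iteration k+1 and stage A of iteration k+2. The trip count is n - 2. */
void loop_staggered(const int *in, int *out, int n)
{
    assert(n >= 2);

    /* Head code: start of original iterations 0 and 1. */
    int a = stage_a(in[0]);
    int b = stage_b(a);
    a = stage_a(in[1]);

    for (int k = 0; k + 2 < n; k++) {
        /* These three statements do not depend on one another, so on
           the target they could share one parallel block. */
        int c_out  = stage_c(b);          /* iteration k     */
        int b_next = stage_b(a);          /* iteration k + 1 */
        int a_next = stage_a(in[k + 2]);  /* iteration k + 2 */
        out[k] = c_out;
        b = b_next;
        a = a_next;
    }

    /* Tail code: end of the last two original iterations. */
    out[n - 2] = stage_c(b);
    b = stage_b(a);
    out[n - 1] = stage_c(b);
}
```

With `n` equal to 8 the staggered loop runs 6 times, matching the counts quoted for Fig. 1 and Fig. 6, and the head and tail code corresponds to the statements that step f merges with code outside the loop.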

g. Translate the C code, with all processing complete, one-to-one into SC140 assembly language. This completes the parallel optimization of the C code in this example.

Fig. 8 illustrates graphically the core concept of the invention, staggered (dislocation) recombination. The grey vertical boxes in the figure represent iterations 1, 2, 3, 4, 5, ... of the original loop body. Suppose the code of each iteration has been preliminarily recombined into 3 parts, corresponding to the 3 identical digits in each box. After staggered recombination, the iterations of the new loop body are recombined iteration 1, recombined iteration 2, recombined iteration 3, and so on, shown by the dashed boxes. Each recombined iteration spans three iterations of the original loop body, which makes the idea of staggered combination clear at a glance.
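The index arithmetic behind this picture can be written down in a few lines of C. The sketch assumes the stagger direction implied by causality, namely that the tail part of a recombined iteration comes from the earliest original iteration and the head part from the latest; the function names and the 1-based numbering are choices made for this illustration.

```c
#include <assert.h>

/* Original iteration (1-based) that supplies part `part` of recombined
   iteration `k` (1-based), where part 1 is the head and part `parts`
   is the tail of an original iteration. */
int source_iteration(int k, int part, int parts)
{
    assert(part >= 1 && part <= parts);
    return k + (parts - part);
}

/* Trip count of the recombined loop when every recombined iteration
   spans `parts` original iterations. */
int recombined_trip_count(int original_trips, int parts)
{
    assert(parts >= 1 && original_trips >= parts);
    return original_trips - (parts - 1);
}
```

With 3 parts, recombined iteration 1 draws on original iterations 1, 2, and 3, and an original loop of 8 iterations leaves 6 recombined iterations, consistent with the counts given for the embodiment.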

Existing parallel optimization methods rename the variables of several iterations of the loop body and then merge those iterations. The iteration count falls by the corresponding factor, but the number of lines of code in the loop grows by the same factor, which multiplies the number of variables in the loop. Since the number of registers a DSP can provide is limited, register contents must first be saved to memory and reloaded when needed, which adds considerable computational overhead. As the above embodiment shows, in the method of the invention the number of variables in the loop changes very little, the registers provided by the SC140 DSP core are sufficient, and this extra overhead does not arise.

The disclosed optimization method for computation on the StarCore SC140 DSP has been described in detail with reference to an embodiment. Those skilled in the art will understand that various obvious modifications in form and detail can be made without departing from the principle and novel features of the invention; the invention is not limited to the embodiment described above, but is to be accorded the widest scope consistent with its principles and novel features.

Claims

1. An optimization method for parallel operation processing based on the StarCore digital signal processor, characterized in that the method comprises the following steps:

a. analyzing the code in a loop body and decomposing it;

b. simply merging the decomposed code and eliminating redundancy;

c. aggregating the code into several code blocks according to causal dependence;

d. determining the code trunk and aggregating the other code blocks into the code trunk to form several blocks of code that can execute in parallel, with the aim of minimizing the number of parallel blocks in the loop;

e. adjusting the code combination by staggered combination of code blocks, on condition that the wraparound order of the code is unchanged, the resource limits and syntax of the SC140 digital signal processor are not exceeded, and no causal conflict arises in the computation;

f. parallelizing and merging the head and tail code of the original loop body that cannot be integrated into the new loop body with part of the code outside the original loop body;

g. translating the code into SC140 assembly language.

2. The optimization method for parallel operation processing based on the StarCore digital signal processor according to claim 1, characterized in that the decomposition in step a means decomposing the code in the loop body so that the internal complexity of each statement is minimal and each statement corresponds one-to-one to SC140 assembly code.

3. The optimization method for parallel operation processing based on the StarCore digital signal processor according to claim 1, characterized in that the code trunk in step d is the code block with the longest running time in the loop.

4. The optimization method for parallel operation processing based on the StarCore digital signal processor according to claim 1, characterized in that adjusting the code combination by staggered combination of code blocks in step e further comprises:

e1. if the total code length in the loop exceeds the length of the code trunk, adjusting the position at which the parallel blocks inside a code block are merged relative to the trunk;

e2. if the total code length in the loop equals the length of the code trunk, merging the head and tail parts of the code trunk into one parallel block.