CA2458199A1 - Method for the translation of programs for reconfigurable architectures - Google Patents

Method for the translation of programs for reconfigurable architectures

Info

Publication number
CA2458199A1
Authority
CA
Canada
Prior art keywords
data
memories
recited
configuration
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002458199A
Other languages
French (fr)
Inventor
Martin Vorbach
Frank May
Armin Nueckel
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
PACT XPP Technologies AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/967,847 external-priority patent/US7210129B2/en
Priority claimed from PCT/EP2002/002398 external-priority patent/WO2002071248A2/en
Application filed by PACT XPP Technologies AG filed Critical PACT XPP Technologies AG
Publication of CA2458199A1 publication Critical patent/CA2458199A1/en
Abandoned legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/447Target code generation

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Executing Machine-Instructions (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to data processing with multidimensional fields and high-level language codes which can be used advantageously therefor.

Description

Method for Translating Programs for Reconfigurable Architectures

1. Introduction

The present invention relates to that which is claimed in the definition of the species. The present invention thus relates to the question of how reconfigurable architectures may be optimally used and in particular how instructions in a given high-level language may be optimally executed in reconfigurable architectures.
To implement execution of instructions for handling data (programs) that are written in high-level languages on a particular architecture used for data handling, there are known compilers which translate the instructions of the high-level language into instructions that are better adapted to the architecture used. Compilers which support highly parallel architectures in particular are thus parallelizing compilers.
Parallelizing compilers according to the related art normally use special constructs such as semaphores and/or other methods for synchronization. Technology-specific methods are typically used. Known methods are not suitable for combining functionally specified architectures with the particular dynamic response and imperatively specified algorithms. The methods used therefore yield satisfactory results only in special cases.
Compilers for reconfigurable architectures, in particular for reconfigurable processors, generally use macros created specifically for the intended reconfigurable hardware, mostly using hardware description languages such as Verilog, VHDL, or SystemC to create the macros. These macros are then called up from the program flow (instantiated) by an ordinary high-level language (e.g., C, C++).
There are known compilers for parallel computers which map program parts onto multiple processors on a coarse-grained structure, mostly based on complete functions or threads.
In addition, there are known vectorizing compilers which convert extensive linear data processing, such as computations of large expressions, into a vectorized form and thus permit computation on superscalar processors and vector processors (e.g., Pentium, Cray). A method for automatic mapping of functionally or imperatively formulated computation procedures onto different target technologies is described here, in particular onto ASICs, reconfigurable modules (FPGAs, DPGAs, VPUs, Chess Array, KressArray, Chameleon, etc.; hereinafter combined under the term VPU), sequential processors (CISC-/RISC-CPUs, DSPs, etc.;
hereinafter summarized by the term CPU) and parallel computer systems (SMP, MMP, etc.). In this connection, reference is made in particular to the following patents and patent applications by the present applicant: P 44 16 881.0-53, DE 197 81 412.3, DE 197 81 483.2, DE 196 54 846.2-63, DE 196 54 593.5-53, DE 197 04 044.6-53, DE 198 80 129.7, DE 198 61 088.2-53, DE 199 80 312.9, PCT/DE 00/01869, DE 100 36 627.9-33, DE 100 28 397.7, DE 101 10 530.4,

DE 101 11 014.6, PCT/EP 0 /10516, EP 01 102 6 .7, PACT13, PACT17, PACT18, PACT22, PACT24, PACT25, PACT26US, PACT02, PACT04, PACT08, PACT10, PACT15, PACT18(a), PACT27, PACT19.
These are hereby fully incorporated into the present patent application for disclosure purposes.
VPUs are basically composed of a multidimensional, homogeneous or inhomogeneous, flat or hierarchical arrangement (PA) of cells (PAEs) which are able to execute any functions, in particular logic functions and/or arithmetic functions and/or memory functions and/or network functions. The PAEs are typically assigned a loading unit (CT) which determines the function of the PAEs by configuration and optionally reconfiguration. This method is based on an abstract parallel machine model which, in addition to the finite automaton, also integrates imperative problem specifications and permits an efficient algorithmic derivation of an implementation on different technologies.
The following compiler classes are known according to the related art:

Classical compilers, which often generate stack machine code and are suitable for very simple processors which are essentially designed as normal sequencers (see N. Wirth, Compiler Design, Teubner Verlag).
Vectorizing compilers construct mostly linear code which is intended for special vector computers or for highly pipelined processors. These compilers were originally available for vector computers such as the Cray. Modern processors such as the Pentium require similar methods because of their long pipeline structure. Since the individual computation steps are performed as vectorized (pipelined) steps, the code is much more efficient. However, the conditional jump causes problems for the pipeline. Therefore, a jump prediction which assumes a jump destination is appropriate. If this assumption is incorrect, however, the entire processing pipeline must be deleted. In other words, each jump is problematic for these compilers, and there is actually no parallel processing.
Jump predictions and similar mechanisms require considerable additional complexity in terms of hardware.
There are hardly any coarse-grained parallel compilers in the actual sense, parallelism typically being marked and administered by the programmer or the operating system; for example, it is usually handled on the thread level in MMP computer systems such as various IBM architectures, ASCI Red, etc. A thread is a largely independent program block or even a separate program. Therefore, threads are easily parallelized on a coarse-grained level. Synchronization and data consistency must be ensured by the programmer or the operating system. This is complex to program and requires a significant portion of the computing power of a parallel computer. In addition, only a fragment of the parallelism which is actually possible is in fact usable due to this coarse parallelization.
Fine-grained parallel compilers (e.g., VLIW) attempt to map the parallelism in a fine-grained form onto VLIW arithmetic means that are capable of executing a plurality of computation operations in one clock cycle but have a common register set. This limited register set is a significant problem because it must provide the data for all the computation operations. In addition, data dependencies and inconsistent read/write operations (LOAD/STORE) make parallelization difficult.
Reconfigurable processors have a large number of independent arithmetic units, typically located in a field. These are typically interconnected by buses instead of by a common register set. Therefore, vector arithmetic units are easily constructed, while it is also possible to perform simple parallel operations. In contrast to traditional register concepts, data dependencies are resolved by the bus connections.
The object of the present invention is to provide something novel for commercial application.
The method for achieving this object is characterized in the independent patent claims. Preferred embodiments are characterized in the subclaims.
It is therefore proposed that the concepts of vectorizing compilers and parallelizing compilers (e.g., VLIW) shall be used at the same time for reconfigurable processors and thus vectorization and parallelization shall be performed on a fine-grained level.
An important advantage is that the compiler need not map onto a fixedly predetermined hardware structure; instead, the hardware structure may be configured using the method according to the present invention so that it is optimally suitable for mapping the particular compiled algorithm. The finite automaton is used as the basis for processing practically any method for specifying algorithms. The structure of a finite automaton is shown in Figure 1. A simple finite automaton is broken down into a combinatory network and a register stage for temporary storage of data between the individual data processing cycles.

A finite automaton executes a number of purely combinatory data manipulations (i.e., logic and/or arithmetic) and then reaches a stable state which is represented in a register (set). Based on this stable state, a decision is made as to which next state is to be reached in the next processing step and thus also which combinatory data manipulations are to be performed in the next step.
For example, a finite automaton is represented by a processor or sequencer. In a first processing step, a subtraction of data may be performed. The result is stored. In the next step, a jump may be performed, based on the result of the subtraction, resulting in different further processing depending on the plus or minus sign of the result.
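To make this view concrete, the following minimal Python sketch models one processing step of the sequencer just described: a purely combinatory manipulation (the subtraction) followed by a state decision on the sign of the result. The function and state names are illustrative assumptions of this sketch, not taken from the patent.

    # Hedged sketch: one step of a finite automaton as described above.
    def automaton_step(a, b):
        result = a - b                                        # combinatory stage
        next_state = "branch_minus" if result < 0 else "branch_plus"
        return result, next_state                             # result latched in the register (set)

    print(automaton_step(3, 5))                               # (-2, 'branch_minus')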
The finite automaton permits mapping of complex algorithms onto any sequential machines, as illustrated in Figure 2. The complex finite automaton depicted here is made up of a complex combinatory network, a memory for storing data, and an address generator for addressing the data in the memory.
Any sequential program may fundamentally be interpreted as a finite automaton, but usually the result is a very large combinatory network. In programming classical "von Neumann" architectures - i.e., with all CPUs - the combinatory operations are therefore broken down into a sequence of simple, individual, fixedly predetermined operations (opcodes) on CPU-internal registers. This breakdown results in states for control of the combinatory operation, which is broken down into a sequence, but these states are not present per se within the original combinatory operation or are not required.
The states of a von Neumann machine that are to be processed therefore differ fundamentally from the algorithmic states of a combinatory network, i.e., from the registers of finite automatons.
It has now been recognized that the VPU technology (such as that defined essentially by some or all of the publications PACT01, PACT02, PACT03, PACT04, PACT05, PACT08, PACT10, PACT13, PACT17, PACT18, PACT22, PACT24, which are herewith fully incorporated into the present patent application through this reference), in contrast with the rigid opcodes of CPUs, makes it possible to compile complex instructions according to an algorithm to be mapped, as in flexible configurations.
In the operation of the compiler, it is particularly advantageous if the complex instructions are generated so that they are executed for the longest possible period of time in the PAE matrix without reconfiguration.
In addition, the compiler generates the finite automaton preferably from the imperative source text in such a way that it is particularly well adapted to the particular PAE matrix, i.e., operations are provided which use the coarse-grained logic circuits typically present (ALUs, etc.) and, if necessary, also the fine-grained elements that are present (FPGA cells in the VPU, state machines, etc.) particularly efficiently.
The compiler-generated finite automaton is then broken down into configurations.
The processing (interpretation) of the finite automaton is performed on a VPU in such a way that the generated configurations are mapped successively onto the PAE matrix and the operating data and/or states which are to be transmitted between the configurations are stored in the memory. The method known from PACT04 and/or the corresponding architecture may be used to this end. This memory is determined or provided by the compiler. A configuration here represents a plurality of instructions; at the same time, a configuration determines the operation of the PAE matrix for a plurality of cycles, and during these cycles a certain amount of data is processed in the matrix. This data comes from a VPU-external source and/or an internal memory and is written to an external source and/or an internal memory. The internal memories here replace the register set of a CPU according to the related art in such a way that, for example, a register is represented by a memory so that, instead of one data word per register, a full data set is stored per memory.
It may also be essential that the data and/or states of the processing of a configuration that is running are stored in the memory in a manner determined by the compiler and are thus available for the next configuration to be run.
A significant difference from compilers which parallelize on the basis of instructions is thus that the method configures and reconfigures the PAE matrix in such a way that a configured sequence of combinatory networks is emulated on a VPU, while traditional compilers combine loaded sequences of instructions (opcodes), where one instruction may be considered a combinatory network.
The functioning of the compiler is illustrated below on the basis of a simple language as an example. This assumes a language the principles of which are already known, but a known publication [reference "Armin Nückel dissertation"] describes only the mapping of a function onto a static

combinatory network, whereas with the present invention, the mapping is performed onto configurations which are then mapped onto the PAE matrix in a chronological order according to the algorithm and the states that result during processing.
The programming language assumes that there exists a "WHILE" instruction in addition to simple logical and/or arithmetic links, and this instruction is defined by the following syntax:

WHILE ...

Possible constructs are thus:
Instruction
Sequence of instructions
Loop

An instruction or a sequence of instructions is mappable onto a combinatory network using the compiler method described here.
Figure 3a shows a combinatory network and the pertinent variables. The content of one and the same variable (e.g., x1) may change from one step (0301) of the network to the next step (0302).
This change is depicted in Figure 3b for the assignment x1 := x1 + 1.
For addressing for reading the operands and for storing the results, address generators may now be synchronized with the combinatory network of the assignment. With each variable processed, corresponding new addresses are generated for operands and results (Figure 3c). In principle, any type of address generator may be used, depending on the addressing schemes of the compiled application. Shared, combined, or completely independent address generators may be implemented for operands and results.
Typically, a plurality of data is processed within a certain configuration of PAEs in the present data processing model. Therefore, the compiler is preferably designed for the simple FIFO mode which is possible in many if not most applications and is applicable at least for the data memories, which are used within this description for storing data and states of the data processing (more or less as a replacement for a traditional register set of conventional CPUs). In other words, memories are used for temporary storage of variables between configurations. Here again, a configuration is like an instruction of a normal processor, and the memories (in particular a plurality of memories) are comparable to the register set of a normal processor.
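As an illustration of this memory-for-register idea, the following Python sketch models one such compiler-managed data memory as a simple FIFO: an entire data set, rather than a single word, is buffered between two configurations. The class, its depth, and the variable names are assumptions made for this sketch, not part of the described architecture.

    from collections import deque

    class ConfigMemory:
        # Hedged sketch: a FIFO-backed stand-in for one "register" between
        # configurations; a full data set is stored instead of one data word.
        def __init__(self, depth):
            self.depth = depth
            self.fifo = deque()

        def full(self):
            return len(self.fifo) >= self.depth

        def empty(self):
            return len(self.fifo) == 0

        def write(self, word):
            if self.full():
                raise RuntimeError("producer must stall (no ACK)")
            self.fifo.append(word)

        def read(self):
            if self.empty():
                raise RuntimeError("consumer must stall (no RDY)")
            return self.fifo.popleft()

    r_x = ConfigMemory(depth=128)
    for value in range(100):                       # a first configuration writes a data set
        r_x.write(3 * value)
    results = [r_x.read() for _ in range(100)]     # the next configuration reads it back in order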
A sequence of the exemplary assignment may be generated as follows (Figure 4a):

x1 := 0;
WHILE TRUE DO
    x1 := x1 + 1;
This sequence may now be mapped via an assignment according to 2.2.1 and address generators for operands and results.
Finite sequences

For the sake of thoroughness, a particular embodiment of sequences apart from the defined constructs of the WHILE language will now be discussed. A finite sequence of the exemplary assignment may be generated as follows:

FOR i := 1 TO 10
    x1 := x1 + 1;

Such a sequence may be implemented by two methods:
a) By generating an adder for computing i according to the WHILE construct (see 2.2.4) and another adder for computing x1. The sequence is mapped as a loop and is computed iteratively (Figure 5a).
b) By rolling out the loop, so that the computation of i as a function is eliminated. The computation of x1 is instantiated i times and is constructed as a pipeline, resulting in i adders connected in series (Figure 5b). A short sketch contrasting the two variants follows below.
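The two variants may be contrasted in a small Python sketch, purely as an illustration of the control flow; the ten-stage pipeline of variant b) is modeled as a chain of add functions. Names and the fixed bound of 10 follow the FOR example above.

    # Variant a): iterative mapping - one adder is reused, i is computed by its own adder
    def rolled_loop(x1):
        for i in range(1, 11):
            x1 = x1 + 1
        return x1

    # Variant b): rolled-out mapping - the computation of i disappears and the
    # ten additions form a pipeline of adders connected in series
    def unrolled_pipeline(x1):
        stages = [lambda v: v + 1] * 10
        for stage in stages:
            x1 = stage(x1)
        return x1

    assert rolled_loop(0) == unrolled_pipeline(0) == 10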
2.2.4 Conditions

Conditions may be expressed via WHILE. For example:

x1 := 0;
WHILE x1 < 10 DO
    x1 := x1 + 1;
The mapping generates an additional PAE for processing the comparison. The result of the comparison is represented by a status signal (see Trigger in PACT08), which is analyzed by the PAEs processing the instruction and by the address generators.
Figure 4b shows the resulting mapping.
By analyzing the condition (WHILE here; in general and reasonably, other instructions such as IF or CASE are also implementable), a status is generated which may be made available to the following data processing (DE 197 04 728.9) and/or may be sent to the CT or a local load control (DE 196 54 846.2), which then derives information therefrom regarding the further program flow and any pending reconfigurations.

According to the previous method, each program is mapped in a system having the following structure:

1. Memory for operands (comparable to the register of a CPU)
2. Memory for results (comparable to the register of a CPU)
3. Network of a) assignments and/or b) compare instructions, i.e., conditions such as IF, CASE, loops (WHILE, FOR, REPEAT)
4. Optional address generator(s) for triggering the memories according to 1 and 2

2.2.6 Handling of states

For the purposes of the compiler described here, a distinction is now made between states that are relevant to the algorithm and those that are not. States are usually depicted in the VPU technology by status signals (e.g., trigger in PACT08) and/or handshakes (e.g., RDY/ACK in PACT02). In general, states (in particular in other technologies, e.g., FPGAs, DPGAs, Chameleon modules, Morphics, etc.) may be represented by any signals, signal bundles and/or registers. The compiler method disclosed here may also be applied to such technologies, although essential parts of the description are preferentially focused on the VPU.

Relevant states are necessary within the algorithm to describe its correct function. They are essential for the algorithm.
Irrelevant states occur due to the hardware used and/or the selected mapping or for other secondary reasons. They are thus essential for the mapping (i.e., for the hardware).
Only the relevant states must be obtained with the data.
Therefore, these states are stored together with the data in the memories because they occur either as a result of the processing with the data or they are necessary as operands with the data for the next processing cycle.
Irrelevant states, however, are necessary only locally and/or chronologically locally and therefore need not be stored.
Example:
a) The state information of a comparison is relevant for the further processing of data because it determines the functions to be executed.
b) It is assumed that a sequential divider is formed, e.g., by mapping a division instruction onto hardware which supports only sequential division. This results in a state which characterizes the computation step within the division. This state is irrelevant because only the result (i.e., the division that is performed) is necessary for the algorithm. Thus, only the result and the time information (i.e., the availability) are required in this case. The compiler preferably differentiates between such relevant and irrelevant states.

The time information is accessible, for example, via the RDY/ACK handshake in VPU technology according to PACT01, 02, 13. However, it should be pointed out here that the handshake also does not represent a relevant state because it only signals the validity of the data, so this in turn reduces the remaining relevant information to the existence of valid data.
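The divider example in b) may be pictured with a short Python sketch, assuming a simple restoring division by repeated subtraction; the loop progress is a state of this particular implementation (irrelevant for the algorithm), while only the final result and its availability matter.

    def sequential_divide(numerator, denominator):
        # Hedged sketch: the progress of this loop is an implementation-internal
        # (irrelevant) state; only the returned result is algorithmically relevant.
        quotient = 0
        remainder = numerator
        while remainder >= denominator:
            remainder -= denominator
            quotient += 1
        return quotient, remainder

    print(sequential_divide(17, 5))    # (3, 2)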
2.2.7 Handling of time

In many programming languages, in particular in sequential languages such as C, an exact chronological sequence is determined implicitly by the language. In sequential programming languages, this is accomplished, for example, by the sequence of individual instructions. If this is required by the programming language and/or the algorithm, the compiler method is executed in such a way that the time information may be mapped onto synchronization models such as RDY/ACK and/or REQ/ACK or a time stamp method according to DE 101 10 530.4.
In other words, the implicit time information of sequential languages is mapped in a handshake protocol in such a way that the handshake protocol (RDY/ACK protocol) transmits the time information and in particular ensures the sequence of instructions.
For example, the following FOR loop is run and iterated only when the variable inputstream is acknowledged using a RDY for each run. If there is no RDY, the loop run is stopped until RDY arrives.
while TRUE
    i := 0
    for i := 1 to 3
        s := s + inputstream;
The sequential language property of being controlled only by instruction processing is thus linked in compiling to the data flow principle, which controls the processing through the data stream, i.e., the existence of data. In other words, a command and/or an instruction (e.g., s := s + inputstream;) is processed only when the operation may be executed and the data is available.
It is noteworthy that this method does not usually result in any change in the syntax or semantics of a high-level language. Thus, existing high-level language code may be

brought to execution on a VPU without problems by recompiling.
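The data-flow-controlled loop may be sketched in Python with a blocking queue standing in for the handshaked channel; an iteration only proceeds once a data word is actually available, which is the software analogue of waiting for RDY. The channel name and the fixed iteration count are assumptions of this sketch.

    import queue

    inputstream = queue.Queue()            # stands in for the RDY/ACK-handshaked channel

    def accumulate(n):
        # Each iteration advances only when a word is available (RDY);
        # an empty channel stalls the loop, as described above.
        s = 0
        for _ in range(n):
            s = s + inputstream.get()      # blocks until the next data word arrives
        return s

    for word in (1, 2, 3):                 # a producer acting as the data source
        inputstream.put(word)
    print(accumulate(3))                   # 6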
2.2.8 Loading and storing data

For the principles of the load/store operations, the following is noteworthy.
The following types of addressing are treated differently:
1. External addressing, i.e., data transfers with external modules.
2. Internal addressing, i.e., data transfer between PAEs, in particular between RAM PAEs and ALU PAEs.
Furthermore, special attention should be devoted to the chronological decoupling of data processing and data loading and storing.
Bus transfers are broken down into internal and external transfers.
bt1) External read accesses (load operations) are separated, and in one possible embodiment they are also preferentially translated into a separate configuration. The data is transferred from an external memory to an internal memory.

bt2) Internal accesses are linked to the data processing, i.e., the internal memories (register operations) are read and/or written to for the data processing.

bt3) External write accesses (store operations) are separated, i.e., in a preferred possible embodiment they are also translated into a separate configuration. The data is transferred from an internal memory to an external memory.

It is essential that bt1, bt2 and bt3 - i.e., loading the data (load), processing the data (data processing and bt2) and writing the data (bt3) - may be translated into different configurations which may, if necessary, be executed at a different point in time.
This method will now be illustrated on the basis of the following:
function example (a, b : integer) -> x : integer
    for i := 1 to 100
        for j := 1 to 100
            x[i] := a[i] * b[j]
This function may be transformed by the compiler into three parts, i.e., configurations (subconfig):
example#dload: loads the data from external (memory, peripherals, etc.) and writes it into internal memories.
Internal memories are identified by r# and the name of the original variable.
example#process: corresponds to the actual data processing. It reads the data out of the internal operand memories and writes the results back into internal memories.
example#dstore: writes the results from the internal memories to external (memories, peripherals, etc.).

function example# (a, b : integer) -> x : integer

subconfig example#dload
    for i := 1 to 100
        r#a[i] := a[i]
    for j := 1 to 100
        r#b[j] := b[j]

subconfig example#process
    for i := 1 to 100
        for j := 1 to 100
            r#x[i] := r#a[i] * r#b[j]

subconfig example#dstore
    for i := 1 to 100
        x[i] := r#x[i]
One essential effect of this method is that instead of i*j = 100 * 100 = 10,000 external accesses, only i+j = 100 + 100 = 200 external accesses are executed to read the operands. These accesses are also completely linear, which greatly accelerates the transfer speed with today's bus systems (burst) and/or memories (SDRAM, DDRAM, RAMBUS, etc.). The internal memory accesses are performed in parallel because different memories have been assigned to the operands.
For writing the results, i = 100 external accesses are necessary, and these may also again be performed linearly with maximum performance.
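The access-count argument can be checked with a few lines of Python; the functions below merely count external operand reads for the unsplit loop nest and for the example#dload variant, using the same n = m = 100 as above.

    def naive_operand_reads(n=100, m=100):
        reads = 0
        for i in range(n):
            for j in range(m):
                reads += 1        # at least one external operand fetch per loop step
        return reads              # i*j = 10,000

    def split_operand_reads(n=100, m=100):
        return n + m              # example#dload copies a and b once, linearly: i+j = 200

    print(naive_operand_reads(), split_operand_reads())    # 10000 200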
If the number of data transfers is not known in advance (e.g., WHILE loop) or is very large, a method may be used that reloads the operands as needed through subprogram calls and/or that writes the results externally. In a preferred embodiment, the states of the FIFOs may (also) be queried:
'empty' when the FIFO is empty and 'full' when the FIFO is full.
The program flow responds according to the states. It should be pointed out that certain variables (e.g., ai, bi, xi) are defined globally. For performance optimization, a scheduler may execute the configurations example#dloada, example#dloadb even before the call of example#process according to the methods already described, so that data is already preloaded.
Likewise, example#dstore(n) may also be called up after termination of example#process in order to empty r#x.
subconfig example#dloada(n)
    while !full(r#a) AND ai<=n
        r#a[ai] := a[ai]
        ai++

subconfig example#dloadb(n)
    while !full(r#b) AND bi<=n
        r#b[bi] := b[bi]
        bi++

subconfig example#dstore(n)
    while !empty(r#x) AND xi<=n
        x[xi] := r#x[xi]
        xi++
subconfig example#process
    for i := 1 to n
        for j := 1 to m
            if empty(r#a) then example#dloada(n)
            if empty(r#b) then example#dloadb(m)
            if full(r#x) then example#dstore(n)
            r#x[i] := r#a[i] * r#b[j]
        bi := 1

The subprogram calls and the administration of the global variables are comparatively complex for reconfigurable architectures. Therefore, in a preferred embodiment, the

following optimization may be performed, in which all the configurations run largely independently and terminate after complete processing. Since the data b[j] is needed repeatedly, example#dloadb must accordingly be run several times. Two alternatives are presented for this as an example:
Alternative 1: example#dloadb terminates after each run and is reconfigured by example#process for each restart.
Alternative 2: example#dloadb runs infinitely and is terminated by example#process.
A configuration is inactive (waiting) during 'idle'.

subconfig example#dloada(n)
    for i := 1 to n
        while full(r#a)
            idle
        r#a[i] := a[i]
    terminate

subconfig example#dloadb(n)
    while 1                                // ALTERNATIVE 2
    for i := 1 to n
        while full(r#b)
            idle
        r#b[i] := b[i]
    terminate

subconfig example#dstore(n)
    for i := 1 to n
        while empty(r#x)
            idle
        x[i] := r#x[i]
    terminate

subconfig example#process
    for i := 1 to n
        for j := 1 to m
            while empty(r#a) or empty(r#b) or full(r#x)
                idle
            r#x[i] := r#a[i] * r#b[j]
        config example#dloadb(n)           // ALTERNATIVE 1
    terminate

To prevent waiting cycles, configurations may also be terminated as soon as they are temporarily unable to continue fulfilling their function. The corresponding configuration is removed from the reconfigurable module but remains in the scheduler. The instruction 'reenter' is used for this purpose below. The relevant variables are saved before termination and are restored in the repeated configuration:
subconfig example#dloada(n)
    for ai := 1 to n
        if full(r#a)
            reenter
        r#a[ai] := a[ai]
    terminate

subconfig example#dloadb(n)
    while 1                                // ALTERNATIVE 2
    for bi := 1 to n
        if full(r#b)
            reenter
        r#b[bi] := b[bi]
    terminate

subconfig example#dstore(n)
    for xi := 1 to n
        if empty(r#x)
            reenter
        x[xi] := r#x[xi]
    terminate

subconfig example#process
    for i := 1 to n
        for j := 1 to m
            if empty(r#a) or empty(r#b) or full(r#x)
                reenter
            r#x[i] := r#a[i] * r#b[j]
        config example#dloadb(n)           // ALTERNATIVE 1
    terminate example#dloadb               // ALTERNATIVE 2
    terminate

2.3 Macros

More complex functions of a high-level language, such as loops, are typically implemented by macros. The macros are therefore preselected by the compiler and are instantiated at translation time (see Figure 4).
The macros are either constructed from simple language constructs of the high-level language or they are constructed at the assembler level. Macros may be parameterized to permit simple adaptation to the algorithm described (see Figure 5, 0502). In the present case, the macros are also to be incorporated.
2.4 Feedback loops and registers

Undelayed feedback loops may occur within the mapping of an algorithm into a combinatory network and may oscillate in an uncontrolled manner.

In the VPU technologies implemented in practical terms according to PACT02, this is prevented by a construct of the PAE in which at least one register is defined, more or less fixedly, in the PAEs for decoupling.
In general, undelayed feedback loops are discernible through graph analysis of the resulting combinatory network.
Registers are then deliberately inserted for separation into the data pathways in which there is an undelayed feedback loop. The compiler may thus manage register means, i.e., memory means.
Due to the use of handshake protocols (e.g., RDY/ACK according to 2.2.7), the correct functioning of the computation is also ensured by the insertion of registers.
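Undelayed feedback can be found by a simple cycle search over the operator graph, as the following Python sketch shows; a cycle that passes through no register node indicates a path that must be broken by inserting a register. The graph representation and node names are assumptions of this sketch.

    def has_undelayed_feedback(edges, registers):
        # edges: (source, destination) pairs of the combinatory network
        # registers: set of nodes that already decouple their path
        graph = {}
        for src, dst in edges:
            graph.setdefault(src, []).append(dst)

        def reaches(start, node, seen):
            if node in registers:          # a register breaks the combinatory path
                return False
            if node == start:
                return True                # closed an undelayed cycle
            if node in seen:
                return False
            seen.add(node)
            return any(reaches(start, nxt, seen) for nxt in graph.get(node, []))

        return any(reaches(src, dst, set())
                   for src in graph if src not in registers
                   for dst in graph[src])

    print(has_undelayed_feedback([("a", "b"), ("b", "a")], registers=set()))   # True
    print(has_undelayed_feedback([("a", "b"), ("b", "a")], registers={"b"}))   # False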
2.5 Time-domain multiplexing (TDM)

Basically, each PAE matrix implemented has virtually only a finite size. In the following step, a partitioning of the algorithm according to section 2.2.5, paragraph 4 a/b, into a plurality of configurations, which are configured one after the other into the PAE matrix, must therefore be performed.
The goal is typically to compute the largest possible number of data packets in the network without having to perform a reconfiguration.
Between configurations, temporary storage is performed, with the temporary memory - similar to a register in the case of CPUs - storing the data between the individual configurations that are executed sequentially.
A sequential processor model is thus created by reconfiguring configurations, by data processing in the PAE matrix and by temporary storage in the memories.

In other words, due to the compiling described here, an opcode is not executed sequentially in the VPU technology; instead, complex configurations are executed. While in the case of CPUs an opcode typically processes one data word, in VPU technology a plurality of data words (a data packet) is processed by a configuration. The efficiency of the reconfigurable architecture is therefore increased due to a better ratio between reconfiguration complexity and data processing.
At the same time, instead of a register, a memory is used in VPU technology because, instead of processing data words, data packets are processed between configurations.

This memory may be designed as a random access memory, a stack, a FIFO or any other memory architecture, although a FIFO is typically the best and easiest option to implement.
Data is then processed by the PAE matrix according to the configured algorithm and is stored in one or more memories.
The PAE matrix is reconfigured after processing a volume of data, and the new configuration withdraws the intermediate results from the memory (memories) and continues the execution of the program. New data may then of course be incorporated into the computation from external memories and/or the periphery, and likewise results may be written to external memories and/or the periphery.
In other words, the typical sequence in data processing is reading out from internal RAMs, processing the data in the matrix and writing data to the internal memories; any external sources or targets for the data transfer may be used additionally or instead of the internal memories for the data processing.
Whereas "sequencing" in CPUs is defined as the reloading of an opcode, "sequencing" of VPUs is thus defined as the (re)configuring of configurations according to what was said above. However, this does not mean that, under certain conditions, parts of the field may not be operated as a sequencer in the conventional sense.
The information as to when and/or how sequencing is performed, i.e., which next configuration is to be configured, may be represented by various information which may be used individually or in combination. For example, the following strategies are appropriate for deriving the information, either alone and/or in combination, i.e., alternatively:
a) defined by the compiler at translation time;
b) defined by the event network (trigger, DE 197 04 728.9), whereby the event network may represent internal and/or external states;
c) by the degree of filling of the memories (trigger, DE 197 04 728.9, DE 196 54 846.2-53). A sketch combining these criteria is given below.
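The following Python sketch merely illustrates how the three sources of information a) to c) might be combined when selecting the next configuration; the priorities, the configuration names and the threshold are assumptions made for the sketch and are not prescribed by the method.

    def next_configuration(static_order, position, triggers, fifo_fill, threshold=0.75):
        if "reconfigure" in triggers:          # b) the event network requests a switch
            return triggers["reconfigure"]
        if fifo_fill >= threshold:             # c) memory fill level forces draining
            return "example#dstore"
        return static_order[position + 1]      # a) fall back to the compile-time order

    order = ["example#dload", "example#process", "example#dstore"]
    print(next_configuration(order, 0, triggers={}, fifo_fill=0.2))   # example#process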
2.5.1 Influence of the TDM on the 'processor model'

Partitioning of the algorithm determines to a significant extent the relevant states which are stored in the memories between the various configurations. If a state is relevant only within a configuration (locally relevant state), it is not necessary to store it, which is taken into account preferentially by the compiler method.
For the purposes of debugging the program to be executed, however, it may nevertheless be expedient to store these states to permit access by the debugger. Reference is made to DE 101 42 904.5, which is herewith incorporated to the full extent for disclosure purposes.
A simple ~xample should show a differentiating factor for locally relevant sCates:
a) A branching of the type "if () then ... else ..." fits completely into a single configuration, i.e., both data paths (branches) are mapped jointly completely within Lhe configuration. The state that is obtained in Lhe comparison is rele~rar~t, but locally, because it is no longer needed in the following configurations.
b) The same branching is too large to fit completely into a single corifiguraLivn. Multiple ooafiguratioris are necessary to map the complete data paths. In this case, the statue is globally relevant and murst be stored az~.d assigned to the pafLicular data because the following configurations requa.xe the particular state of the comparison in further processing of the data.
2.6 Task switching

The possible use of an operating system has an additional influence on the consideration and handling of states.
Operating systems use task schedulers, for example, to manage multiple tasks and make multitasking available.

Task schedulers interrupt tasks at a certain point in time, start other tasks, and, after their processing, return to further processing of the interrupted task.
If it is ensured that a configuration - which may correspond here to the processing of a task - is terminated only after being completely processed, i.e., when all the data and states to be processed within this configuration cycle have been stored, then locally relevant states may remain unstored. This method, i.e., complete processing of a configuration and the subsequent task switch, is the preferred method for operating reconfigurable processors and corresponds essentially to the sequence in a normal processor, which first completes the assignments currently being processed and then changes tasks.
For some applications, however, a particularly short response to a task change request is necessary, e.g., in realtime applications. It may be expedient here to abort configurations before they are completely processed.
If the task scheduler aborts configurations before they are completely processed, local states and/or data must be stored.
In addition, this is advantageous when the processing time of a configuration is not predictable. In conjunction with the known halting problem and the risk that a configuration may not terminate at all, e.g., due to an error, this also seems appropriate so that a deadlock of the entire system may be prevented. Therefore, in the present case, the states necessary for a task change and a renewed correct resumption of the data processing are also to be regarded as relevant, taking task switching into account.
Thus in task switching, the memory for results and, if necessary, the memory for operands are to be secured and restored at a later point in time, i.e., when returning to this task. This may be performed in a manner comparable to the PUSH/POP instructions and methods according to the related art. In addition, the state of the data processing must be secured, i.e., the pointer to the last operand processed completely. Reference is made here to PACT18 in particular.
Depending on the optimization of task switching, there are two possibilities here, for example:
a) The aborted configuration is reconfigured and only the operands are loaded. Data processing thus begins anew as if processing of the configuration had not yet begun. In other words, all data computations are simply executed from the beginning, with computations already performed previously being executed again, if necessary. This possibility is simple but not efficient.
b) The aborted configuration is reconfigured, with the operands and the results already computed being loaded into the particular memories. Data processing continues with the operands that have not yet been computed completely. This method is much more efficient, but it presupposes that additional states which occur during processing of the configuration may become relevant, if necessary; e.g., at least one pointer to the last operands completely computed must be secured so that, after a successful reconfiguration, processing may be resumed at their successors. A sketch of this restart variant is given below.
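Restart variant b) may be pictured with the following Python sketch: a saved pointer marks the last operand processed completely, and data processing resumes at its successor after the configuration has been loaded again. The dictionary key and the operation in the loop body are placeholders assumed for this sketch.

    def run_configuration(operands, results, state):
        # Hedged sketch: 'last_done' is the pointer secured at the task switch.
        start = state.get("last_done", 0)
        for idx in range(start, len(operands)):
            results[idx] = 2 * operands[idx]   # placeholder for the real operation
            state["last_done"] = idx + 1       # keep the pointer current
        return results

    state = {"last_done": 3}                   # configuration was aborted after 3 operands
    print(run_configuration([1, 2, 3, 4, 5], [2, 4, 6, None, None], state))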
2.7 Context switching

A particularly preferred variant for managing relevant data is made available through the context switch described below. In task switching and/or in execution of configurations and in changing them (see, for example, DE 102 06 653.1, which is herewith incorporated to the full extent for disclosure purposes), it may be necessary to save data or states which are typically not stored together with the operating data in the memory because they only mark an end value for a subsequent configuration.
A second configuration which connects the REG to one or more global memories in a suitable manner and in a defined sequence may then be leaded.
2o The configuration may use address generators, for example, to access the global memory (memories). Thus it is not necessary to first have each individual memory site defined by the compiler and/or to access REG embodied as memories.
26 According to the configured connection betwe~n the R~Gs, the contents of the REC3s are written in a defined sequence into the global memory, the particular addresses be~,ng preselected by address generators. The address generator generates the addresses for the global memory (memories) in such a way that the memory regions written (POSH AREA) may be unambiguously assigned to the first configuration that has been removed.
Different address spaces are thus provided preferentially for different configurations. The configuration hare aorr~spond to NV01 687128 v 1 2 $

a PUSH Of ordinary processors.
Other configurations then use the resources.
Now the first configuration is to be started again. First, a third configuration which connects the REGs of the first configuration in a defined sequence is started.
The configuration may in turn use address generators, for example, to access the global memory or memories and/or to access REGs embodied as memories.
An address generator preferably generates addresses in such a way that correct access to the PUSH AREA assigned to the first configuration takes place. The generated addresses and the configured sequence of the REGs are such that the data is written from the memories into the REGs in the original order. The configuration corresponds to a POP of ordinary processors.
The first configuration is then started again.
In summary, a context switch is preferably performed in such a way that, by loading particular configurations which operate like the PUSH/POP of known processor architectures, the data to be saved is exchanged with a global memory. This data exchange via global memories using PUSH/POP exchange configurations is regarded as being particularly relevant.
This function is to be illustrated in an example: A function adds up two rows of numbers; the length of the rows is not known at translation time and is only known at run time.
proc example
    while i<length do
        x[i] := a[i] + b[i]
        i := i + 1

The function is then interrupted during its execution, e.g., by a task switch or because the memory provided for x is full;
a, b, x are at this point in time in memories according to the present invention; however, i and length, if necessary, must be saved.

To do so, the example configuration is terminated, with the register contents being saved, and a push configuration is started, which reads i and length out of the registers and writes them to a memory.

proc push
    mem[<push_adr_example>] := i
    push_adr_example++
    mem[<push_adr_example>] := length

After execution, push is terminated and the register contents may be deleted.
Other configurations are executed. After a period of time, the example configuration is started again.
First a pop configuration is started, which reads the register contents out of the memory again.
proc pop
    i := mem[<push_adr_example>]
    push_adr_example++
    length := mem[<push_adr_example>]

After execution, pop is terminated and the register contents are kept. The example configuration is started again.
2.8 Algorithmic optimization

Control structures are separated from algorithmic structures by the translation method described here. For example, a loop is broken down into a body (WHILE) and an algorithmic structure (instructions).
The algorithmic structures are preferably optionally optimizable by an additional tool downstream from the separation.
For example, downstream algebra software may optimize and minimize the programmed algorithms. Such tools are known by names such as AXIOM, MARBLE, etc., for example. By minimization, a more rapid execution of the algorithm and/or a substantially reduced space requirement may be achieved.
The result of the optimization is then sent back to the compiler and processed further accordingly.
It should also be pointed out that modern compilers (front ends) have already implemented a number of optimizations for algorithms (including algebraic optimizations in some cases), which are of course still usable as part of the method described here.
It should be pointed out explicitly that the method described here, in particular section 2.2.7 "Handling of time" and section 2.3 "Macros," may also be used with compilers according to PACT20. PACT20 is herewith included to the full extent in
3 npplicabilit~fcr ,.ode~ssors accordincLto th,~related art in nart'~lar pith a 'TLIVP arc~E~tecty~rd It should be pointed out in particular that inetead.of a PAE
matrix, an arrangement of arithmetic logic units (ALUs) according to the related art may be used, such as those conventionally used in VLIW processors, and/or an arrangement of complete processors such as those conventionally used in multiprocessor systems, for example. The use of a single ALU
constitutes a special case here, so that this method is also applicable for normal CPUs.
A method which permits translation of the WHILE language into semantically correct finite automatons was developed in the dissertation [reference dissertation Armin Nückel]. In addition, a finite automaton may be used as a "subprogram" and vice versa. This yields the possibility of mapping a configuration onto different implementation technologies such as CPUs, symmetrical multiprocessors, FPGAs, ASICs, VPUs.
In particular, it is possible to assign the optimally suitable hardware to parts of an application and/or to determine a particular suitability and assign optimum hardware on the basis of the better or worse suitability. Preferably, temporary distribution and reservation of resources may also be detectable. In other words, for example, a data flow structure would be assigned to a data flow architecture, while a sequential structure is mapped onto a sequencer, if these are available and/or present.
The corresponding problems of the resource assignments for the individual algorithms may be solved, e.g., by a "job assignment" algorithm for managing the assignment.

4. Implementation

The implementation of a compiler according to the present invention should start from a "normal" sequential programming language such as C or Pascal. These languages have the property that, due to their sequential character, a chronological sequence is implicitly and artificially generated by the language definition per se.
Example A:
Line 1: i++
Line 2: a = i * b
Line 3: x = p - a
Line 4: j = i * i

The language definition fixedly predetermines that line 1 is executed before line 2, line 2 before line 3, and line 3 before line 4. However, line 4 could also be executed directly after line 1 and thus in parallel with lines 2 and 3.
In other words, through sequential languages, additional artificial states which are not algorithmically determined are incorporated. Only the correct chronological sequence of the computations in example A is important. Line 4 must not be computed until i is correctly defined, i.e., after the processing of line 1. Line 2 also must not be processed until after the correct definition of i (i.e., after processing line 1). Line 3 requires the result of line 2 (variable a) and therefore may not be computed until after the correct definition of variable a. This results in a data dependence but no particular states.

On the basis of the data dependencies of variable a in lines 2 and 3 (line 3 uses a as an operand; a is the result of line 2), the following transformation may thus be performed automatically by the compiler for representation of the parallelizability and/or vectorizability (ParVec transformation):

Line 2: VEC{a = i * b;
Line 3: x = p - a}
VEC means that each term separated by ';' is processed in succession; the terms within the braces may basically be pipelined. All computations must preferably be performed and concluded at the end of VEC{} so that data processing is continued downstream from the VEC.
In an internal representation of the data structures in the compiler, the two computations are preferably marked as a vector:
VEC{a=i*b; x=p-a}
Line 4 yields a simple vector:

VEC{j=i*i}
Since line 4 may be computed at the same time as lines 2 and 3, the parallelism may be expressed as follows:
PAR{{VEC{a=i*b; x=p-a}}{VEC{j=i*i}}}
PAR means that each term separated by braces '{}' may be processed at the same time. All computations must preferably be performed and concluded at the end of PAR{} so that data processing is continued downstream from the PAR.

If line 1 is included, this yields:
VEC{i++; PAR{{VEC{a=i*b; x=p-a}}{VEC{j=i*i}}}}
Since VEC{j=i*i} represents a vector with only one element, it may also be written as follows:

VEC{i++; PAR{{VEC{a=i*b; x=p-a}}{j=i*i}}}
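A toy version of this ParVec grouping can be written in a few lines of Python: statements that read a result of an earlier chain are appended to that chain (VEC), while independent statements open a new parallel chain (PAR). The substring-based dependency test is a deliberate simplification assumed for this sketch; a real compiler would work on its intermediate representation rather than on strings.

    def parvec(statements):
        # statements: ordered list of (target, expression-string) pairs
        chains = []                                    # each chain becomes one VEC{...}
        for target, expr in statements:
            for chain in chains:
                if any(t in expr for t, _ in chain):   # reads a result of this chain
                    chain.append((target, expr))
                    break
            else:
                chains.append([(target, expr)])        # independent: new parallel chain
        vecs = ["VEC{" + "; ".join(f"{t}={e}" for t, e in chain) + "}" for chain in chains]
        return vecs[0] if len(vecs) == 1 else "PAR{" + "".join("{" + v + "}" for v in vecs) + "}"

    # lines 2-4 of example A: a = i*b; x = p-a; j = i*i
    print(parvec([("a", "i*b"), ("x", "p-a"), ("j", "i*i")]))
    # -> PAR{{VEC{a=i*b; x=p-a}}{VEC{j=i*i}}}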
Another example shows a real state.
Example B:
Line 1: i++
Line 2: a = i * b
Line 3: if a < 100 {
Line 4:     x = p - a
Line 5: } else {
Line 6:     j = i * i
}
Line 6 may now be executed only after the computation of lines 2 and 3. The computations of lines 4 and 6 take place alternatively. Thus, the state of line 3 is relevant for further data processing (relevant state).
Conditional states may be expressed by IF in the case of a transformation, for example:
Line 1-2: VEC{i++; a=i*b}
Line 3: IF{{a<100}{line 4}{line 6}}
Line 4: VEC{x=p-a}
Line 6: VEC{j=i*i}

In summary, this yields:

VEC{i++; a=i*b; IF{{a<100}{VEC{x=p-a}}{VEC{j=i*i}}}}

Other relevant states are generated by looping:
Example C:
Line 1: for (i = 1; i < 100; i++)
Line 2:     a = a * i
Line 3: q = p / a

Line 3 may be executed only after the loop has been terminated. Thus there are relevant states at conditional jumps.
A first transformation of the loop yields:
Line 1: i=1;
Line 2: loop: if i >= 100 then exit
Line 3:     a = a * i
Line 4:     i++
Line 5:     jump loop
Line 6: exit: q = p / a

Lines 3 and 4 may be computed in parallel because line 4 does not depend on the result of line 3:
PAR{{a=a*i}{i++}}
Line 5 yields a vector with the generated PAR because only after complete computation of the values is it possible to jump back into the loop (so there is a time dependence here).
VEC{PAR{{a=a*i}{i++}}; jump loop}

This yields the following for the condition:
loop: IF{{i>=100}{jump exit}{VEC{PAR{{a=a*i}{i++}}; jump loop}}}

Line 1 forms a vector with the condition since it must be executed before the condition (the IF uses i as an operand; i is the result of line 1).
Line 6 is in turn a vector with the condition since a is used as its operand and a is the result of the condition.
This yields the following (in simplified notation):
VEC{
    i++;
    loop: IF{
        {i>=100}
        {jump exit}
        {VEC{
            PAR{
                {a=a*i}
                {i++}
            };
            jump loop
        }}
    };
    exit: q=p/a
}

The contents of VEC{} and PAR{} may be considered as purely combinatory networks.
VEC and PAR are preferably designed as a Petri net to control further processing after the complete processing of the particular content, as is preferred.
Due to the possibility of considering VEC and PAR as purely combinatory networks, it becomes necessary to ensure the state of the loop. In other words, in this case it is in fact necessary to create a finite automaton.
The instruction REG{} stores variables in a register to this end. Thus, a finite automaton which is constructed exactly according to the algorithm is obtained due to the use of the combinatory networks VEC and PAR in conjunction with the register REG:
VEC{
    i++;
    loop: IF{
        {i>=100}
        {jump exit}
        {VEC{
            PAR{
                {a=a*i}
                {i++}
            };
            REG{a, i};
            jump loop
        }}
    };
    exit: q=p/a
}

It should be pointed out in particular that in the applicant's VPU technology (see PACT02) input and/or output registers are provided on the PAEs, and the correctness over time and the availability of data are ensured by an integrated handshake protocol (RDY/ACK). To this extent, the requirement is preferably met on leaving VEC{} or PAR{} that their internal data processing is concluded automatically for all variables to be used subsequently (if data processing were not concluded, the following computation steps would wait for the end and the arrival of the data). Oscillatory feedback is also ruled out through the integrated registers.
To this extent, the following term is correct for this technology:
VEC{PAR{{a=a*i}{i++}}; jump loop}
For other technologies which do not have the aforementioned embodiments or have them only partially, the term should be formulated as follows:
VEC{PAR{{a=a*i}{i++}}; REG{a,i}; jump loop}

It should be pointed out that this form also results in correct and optimum mapping of the algorithm onto the reconfigurable processor in the applicant's VPU technology. REG may be used within the combinatory networks VEC and PAR.
Strictly considered, VEC and PAR thus lose the property of combinatory networks. Abstractly, however, REG may be regarded as a complex element (REG element) of a combinatory network on which its own processing time is based. Processing of the following elements is made dependent on the termination of the computation of the REG element.

In awareness of this conceptual inaccuracy, the use of REG within VEC and PAR is allowed in the remaining course and is in particular also necessary.
As already mentioned above, the use of REG is typically not necessary within a configuration of a VPU of the applicant, but it is explicitly always necessary when the computation results of a configuration are stored, so that REG in fact corresponds to the explicit register of a finite automaton in this application case.
In addition to the synthesis of finite automatons for loops, finite automatons are necessary in particular in another case:
If an algorithm is too large to be processed completely within the PAEs of a reconfigurable processor, it must be broken down into several partial algorithms. Each partial algorithm represents a configuration for the reconfigurable processor.
The partial algorithms are configured in succession, i.e., sequentially, on the processor, the results of the preceding configuration(s) being used as operands for the particular new configuration.
In other words, the reconfiguration results in a finite automaton which processes and stores data at a point in time t, and at a point in time t + 1, possibly after a reconfiguration, the stored data is processed differently, if necessary, and stored again. It is essential that t is not defined in the traditional sense by clock pulses or instructions but instead by configurations. Reference is made here in particular to the processor model presentation (PACT, October 2000, San Jose).
In other words, a configuration includes a combinatory network of VEC and/or PAR, the results of which are stored (REG) for further use in the next configuration:
Configuration 1: VEC{Operands; {VEC|PAR}; REG{Results1}}
Configuration 2: VEC{Results1; {VEC|PAR}; REG{Results2}}
For the sake of simplicity, the above examples and descriptions have introduced the constructs VEC, PAR and REG into the high-level language and structured them accordingly.

Typically and preferably, this structuring is only performed at the level of the intermediate language (see Principles of Compiler Design (Red Dragon), Aho, Sethi, Ullman).

It should be pointed out in particular that the structuring of algorithms with VEC, PAR and REG is typically performed completely automatically by the compiler, by methods such as graph analysis, for example.

In particular, however, it is also conceivable and to some extent advantageous to allow the programmer himself to perform the structuring in the high-level language, so that VEC, PAR and REG are still writable directly in the high-level language as described above.

Automatic generation of VEC, PAR and REG may be performed at different levels of a compilation operation. The initially most enlightening is performed during a preprocessor run on the basis of the source code, as in the preceding examples. However, a specially adapted compiler is then necessary for further compilation.

Another aspect is that compilers usually perform automatic optimization of code (e.g., in loops). Therefore, an efficient breakdown of the code is reasonable only after the optimization runs, in particular when compilers (such as SUIF, University of Stanford) are already optimizing the code for parallelization and/or vectorization.

The particularly preferred method is therefore to tie the analyses into the back end of a compiler. The back end translates a compiler-internal data structure into the commands of the target processor.

Pointer structures such as DAGs/GRGs, trees or 3-address codes are usually used as the compiler-internal data structures (see Principles of Compiler Design (Red Dragon), Aho, Sethi, Ullman). To some extent, stack machine codes are also used (see "Compiler self-tailored," C'T, 1986, 1-5). Since the data formats are equivalent in principle and may be transformed into one another, the preferred method according to the present invention is based on further processing of graphs, preferably trees. Data dependencies and possible parallelisms corresponding to the method described above may be discerned automatically within trees simply on the basis of the structure. To do so, known and established methods of graph analysis may be used, for example. Alternatively or optionally, an algorithm may be analyzed for data dependencies, loops, jumps, etc. through appropriately adapted parsing methods. A method similar to the method used for analyzing terms in compilers may be used.
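Purely by way of illustration, the following C sketch (the node numbering, the example dependencies and the printed values are hypothetical, not part of the method specified here) shows how such a structural analysis may proceed on a dependency graph: each operation node is assigned to the earliest stage after all of its predecessors, and the number of nodes sharing a stage corresponds to the parallelism available in that stage.

/* Illustrative sketch: stage assignment in a dependency graph.
   pred[i][j] = 1 means node j must finish before node i starts. */
#include <stdio.h>

#define NODES 6

int main(void)
{
    int pred[NODES][NODES] = { 0 };
    int level[NODES] = { 0 };

    /* hypothetical example graph: 1,2 depend on 0; 3 on 1; 4 on 2; 5 on 3 and 4 */
    pred[1][0] = 1; pred[2][0] = 1;
    pred[3][1] = 1;
    pred[4][2] = 1;
    pred[5][3] = 1; pred[5][4] = 1;

    /* earliest stage = 1 + maximum stage of all predecessors
       (nodes are already numbered in dependency order)          */
    for (int i = 0; i < NODES; i++)
        for (int j = 0; j < i; j++)
            if (pred[i][j] && level[j] + 1 > level[i])
                level[i] = level[j] + 1;

    /* nodes sharing a stage may execute simultaneously */
    for (int p = 0; p <= level[NODES - 1]; p++) {
        int par = 0;
        for (int i = 0; i < NODES; i++)
            if (level[i] == p)
                par++;
        printf("stage %d: %d parallel operations\n", p, par);
    }
    return 0;
}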
Mapping

Further transformation of the algorithm now depends greatly on the target architecture. For example, the present applicant's processor architecture (VPU, XPP) offers automatic data synchronization in the hardware. This means that the correct chronological sequence of data dependencies is automatically handled in the hardware. Other architectures in some cases additionally require synthesis of suitable state machines for controlling the data transfer.

Handling of conditional jumps is particularly interesting. For example, the present applicant's processor architecture makes a plurality of mechanisms available for mapping and execution:

1. Reconfiguration of the processor or parts of the processor by a higher-level configuration unit (see the patent application(s) PACT01, 04, 05, 10, 13, 17).

2. Rolling the function into the array of PAEs (see the patent application PACT08), where both possible branches of a comparison are mapped onto the array at the same time, for example.

3. Wave reconfiguration according to the patent application(s) PACT08, 13, 17: a token is given to the data which is to be processed differently, and the token selects each valid configuration.
It should be pointed out that mechanism 1 is in general the case typically to be used. Mechanism 2 is already very complex with most technologies or is not even implementable, and case 3 is so far known only from the VPU technology of the present applicant.

The execution method to be selected in each case will depend on the complexity of the algorithm, the required data throughput (performance) and the precise embodiment of the target processor (e.g., the number of PAEs).
Examples:
A simple comparison should compute the following:

if i < 0 then a=a*(-i) else a=a*i

Reconfiguration of the processor (mechanism 1) depending on the result of the comparison seems to be less appropriate. It is basically possible to roll both branches into the array (mechanism 2). Depending on the result of the comparison, the PAEs computing either a=a*(-i) or a=a*i are triggered (see PACT08).

It is particularly efficient in terms of space to superimpose the two computations (mechanism 3) so that after the comparison the same PAEs continue to process the data regardless of the result, but the data is provided with a token which then selects either the function a=a*(-i) or a=a*i locally in the data processing PAEs which follow, as a function of the result of the comparison (see PACT08, 13, 17).
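The effect of mechanism 3 may be sketched in software as follows; this is only a behavioral illustration (the structure word_t and the two stage functions are assumed for the example), not the hardware implementation:

/* Behavioral sketch of mechanism 3: the comparison attaches a token to
   each data word; the following stage computes a=a*(-i) or a=a*i,
   selected locally from the token rather than by reconfiguration.     */
#include <stdio.h>

typedef struct { int a; int i; int token; } word_t;  /* data word plus token */

static word_t compare_stage(int a, int i)
{
    word_t w = { a, i, (i < 0) };          /* token = 1 selects the (-i) branch */
    return w;
}

static int multiply_stage(word_t w)
{
    return w.token ? w.a * (-w.i) : w.a * w.i;   /* local selection by token */
}

int main(void)
{
    int samples[3] = { -2, 0, 3 };
    for (int k = 0; k < 3; k++) {
        word_t w = compare_stage(5, samples[k]);
        printf("i = %d  ->  a = %d\n", samples[k], multiply_stage(w));
    }
    return 0;
}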
According to mechanism 1, there is a globally relevant state because the complete following configuration depends on it.

According to mechanisms 2 and 3, there is only a locally relevant state because it is no longer needed beyond the computation, which is completely implemented.

In other words, the local or global relevance of states may also depend on the selected mapping onto the processor architecture.

A state which is relevant beyond a configuration and thus beyond the combinatory network of the finite automaton representing a configuration (i.e., needed by following finite automatons) may basically be considered as global. The diffuse terminology of the term combinatory network as used here should also be pointed out.
Instruction model of the resulting processor

According to the present invention, there is a processing model for reconfigurable processors which includes all the essential instructions:

Arithmetic/logic instructions are mapped directly into the combinatory network.

Jumps (jump/call) are either rolled directly into the combinatory network or are implemented by reconfiguration.

Condition and control flow instructions (if, etc.) are either completely resolved and processed in the combinatory network or are relayed to a higher-level configuration unit which then executes a reconfiguration according to the resulting status.

Load/store operations are preferably mapped into separate configurations and are implemented by address generators similar to the known DMAs, which write internal memories (REG{}) into external memories or load them from external memories and/or peripherals.
However, they may also be configured and operated together with the data processing configuration.
Register move operations are implemented in the combinatory network by buses between the internal memories (REG{}).
Push/pop operations are implemented by separate configurations which optionally write certain internal registers in the combinatory network and/or the internal memories (REG{}) via address generators into external memories or read them out of external memories, and which are preferably executed before or after the actual data processing configurations.

5. Description of the figures

The following figures show implementation examples and exemplary embodiments of the compiler.
Figure 1a shows the design of an ordinary finite automaton in which a combinatory network (0101) is linked to a register (0102). Data may be sent directly to 0101 (0103) and 0102 (0104). Processing of a state as a function of the preceding state(s) is possible via feedback (0105) from the register to the combinatory network. The processing results are represented by 0106.

Figure 1b shows a representation of the finite automaton by a reconfigurable architecture according to PACT01 and PACT04 (PACT04 Figures 12 through 15). The combinatory network from Figure 1a (0101) is replaced (0101b) by a configuration of PAEs 0107. The register (0102) is implemented by a memory (0102b) which is capable of storing a plurality of cycles. The feedback according to 0105 takes place via 0105b. Inputs (0103b and/or 0104b) are equivalent to 0103 and 0104, respectively. Direct access to 0102b may be implemented via a bus through array 0101b, if necessary. Output 0106b is in turn equivalent to 0106.

Figure 2 shows the mapping of a finite automaton onto a reconfigurable architecture. 0201(x) represents the combinatory network (which may be designed as PAEs according to Figure 1b). There exist one or more memories for operands (0202) and one or more memories for results (0203). Additional data inputs according to 0103b, 0104b, 0106b are not shown for the sake of simplicity. An address generator (0204, 0205) is assigned to each of the memories.

Operand memory and results memory (0202, 0203) are physically or virtually interlinked in such a way that the results of a function and/or an operation may in turn be used as operands, for example, and/or both results and newly added operands of one function may be used as operands for another function. Such a coupling may be established, for example, by bus systems or by (re)configuration through which the function and networking of the memories with 0201 is newly configured.
Figure 3 shows different aspects for handling variables. In Figure 3a, 0301, 0302 and 0303 indicate different stages of computation. These stages may be purely combinatory or they may be separated via registers, where f1, f2, f3 are functions and x1 is a variable according to the patent description.

Figure 3b shows the use of a variable x1 in the function x1 := x1 + 1.

Figure 3c shows the behavior of a finite automaton for the computation of x2 := x1 + 1 within a configuration. In the next configuration, 0306 and 0304 are to be exchanged in order to obtain a complete finite automaton. 0305 represents the address generators for memories 0304 and 0306.

Figure 4 shows implementations of loops. The modules shown hatched may be generated by macros (0420, 0421). 0421 may be inserted by analysis of the graph into undelayed feedback.
Figure 4a shows the implementation of a simple loop of the type

WHILE TRUE DO
x1 := x1 + 1;

At the core of the loop is the counter +1 (0401). 0402 is a multiplexer which at the beginning carries the starting value of x1 (0403) to 0401 and then causes feedback (0404a, 0404b) with each iteration. A register (see REG{}) (0405) is used to prevent undelayed and thus uncontrolled feedback of the output of 0401 to its input. 0405 is pulsed with the working clock pulse of the VPU and thus determines the number of iterations per unit of time. The particular counter content could be picked up at 0404a or 0404b. Depending on the definition of the high-level language, however, the loop does not terminate. For example, the signal at 0404 would be useful in an HDL (according to the related art, e.g., VHDL, Verilog), whereas in a sequential programming language (e.g., C) 0404 is not usable since the loop does not terminate and thus does not yield any exit value.
Multiplexer 0402 implements a macro formed from the loop construct. This macro is instantiated by the translation of WHILE.

Register 0405 is either also part of the macro or is inserted, according to a graph analysis according to the related art, exactly at the location and time where there is undelayed feedback, in order to thus rule out the oscillation tendency.
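A purely behavioral model of this macro, with multiplexer 0402, counter +1 (0401) and register 0405 expressed as C variables (a sketch under the assumption of one register update per simulated working clock), might look as follows:

/* Behavioral sketch of the loop macro of Figure 4a: the multiplexer
   first passes the starting value of x1, afterwards the fed-back value;
   the register limits the feedback to one update per working clock.   */
#include <stdio.h>

int main(void)
{
    int start = 0;     /* starting value of x1 (0403)                */
    int reg   = 0;     /* register 0405 holding the fed-back value   */
    int first = 1;     /* multiplexer control: pass start value once */

    for (int clk = 0; clk < 5; clk++) {    /* five simulated working clocks */
        int mux_out = first ? start : reg; /* multiplexer 0402              */
        int counter = mux_out + 1;         /* counter +1 (0401)             */
        reg   = counter;                   /* register 0405 is clocked      */
        first = 0;
        printf("clock %d: x1 = %d\n", clk, reg);
    }
    return 0;   /* the real WHILE TRUE loop does not terminate by itself */
}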
Figure 4b shows the design of a true loop of the type

WHILE x1 < 10 DO
x1 := x1 + 1;

This design corresponds essentially to Figure 4a, which is why the same references have been used.

In addition, a circuit is shown which checks the validity of the results (0410) and relays the signal of 0404a to the downstream functions (0411) only when the abortion criterion of the loop has been reached. The abortion criterion is determined by the comparison x1 < 10 (comparison stage 0412).

As a result of the comparison, the particular status flag (0413) is sent to multiplexer means 0402 for controlling the loop and to the function 0411 for controlling the forwarding of results. Status flag 0413 may be implemented, for example, by triggers according to DE 197 04 728.9. Likewise, status flag means 0413 may be sent to a CT which then recognizes that the loop has been terminated and performs a reconfiguration.
Figure 5a shows the iterative computation of

FOR i := 1 TO 10
x1 := x1 * x1;

Essentially, the basic functions correspond to those in Figure 4b, which is why the same references have also been used here.

Function block 0501 computes the multiplication. The FOR loop is then implemented by another loop according to Figure 4b and is indicated only by block 0503. Block 0503 supplies the status of the comparison with the abortion criterion. The status is used directly for triggering the iteration, thus largely eliminating means 0412 (represented by 0502).

Figure 5b shows the rolling of the computation of

FOR i := 1 TO 10
x1 := x1 * x1;
Since the number of iterations is known exactly at translation time, the computation may be mapped into a sequence of i multipliers (0510).

Figure 6 shows the execution of a WHILE loop according to Figure 4b over several configurations. The state of the loop (0601) here is a relevant state because it has a significant influence on the function in the subsequent configurations.

The computation extends over four configurations (0602, 0603, 0604, 0605). Between configurations, the data is stored in memories (see REG{}) (0606, 0607). 0607 also replaces 0405.

The filling level of the memories may be used as a reconfiguration criterion as indicated by 0606, 0607: memory full/empty and/or 0601, which indicates abortion of the loop.
In other words, the filling level of the memories generates triggers (see PACT01, PACT05, PACT08, PACT10) which are sent to the configuration unit and trigger a reconfiguration. The state of the loop (0601) may also be sent to the CT. The CT may then configure the subsequent algorithms on reaching the abortion criterion and/or may first process the remaining parts of the loop (0603, 0604, 0605) if necessary and then load the subsequent configurations.
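The fill-level criterion may be sketched in software as follows; the memory depth, the abortion value and the printed messages are assumptions made only for this illustration:

/* Behavioral sketch of Figure 6: the fill level of the result memory
   and the abortion of the loop generate triggers to the configuration
   unit (CT), which then performs a reconfiguration.                   */
#include <stdio.h>

#define MEM_DEPTH 8

int main(void)
{
    int mem[MEM_DEPTH];
    int fill = 0;
    int x1 = 0;

    for (;;) {
        x1 = x1 + 1;                 /* loop body of the running configuration */
        mem[fill++] = x1;            /* result stored in memory (0607)         */

        if (fill == MEM_DEPTH) {     /* trigger: memory full                   */
            printf("trigger to CT: memory full, reconfigure\n");
            fill = 0;                /* next configuration empties the memory  */
        }
        if (x1 >= 20) {              /* trigger: loop abortion (0601)          */
            printf("trigger to CT: loop aborted, load next configuration\n");
            break;
        }
    }
    return 0;
}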
6. Parallelizability

Figure 6 shows potential limits to parallelizability.

If the computation of the operands is independent of feedback 0608, the loop may be computed in blocks, i.e., by filling memory 0606/0607 each time. This achieves a high degree of parallelism. If the computation of an operand depends on the result of the previous computation, i.e., feedback 0608 or the like enters into the computation, the method is less efficient because it is possible to compute only one operand within the loop.

If the usable ILP (instruction level parallelism) within the loop is high and the time for reconfiguration is low (see PACT02, PACT04, PACT13, PACT17), a computation rolled onto PAEs may also be efficient on a VPU.
If this is not the case, it is appropriate to map the loop onto a sequential architecture (a processor separate from the PA or implemented within the PA according to DE 196 51 075.9-53, DE 196 54 846.2-53 and in particular DE 199 26 538.0 (Figures 5, 11, 16, 17, 23, 30, 31, 33)).

Analysis of the computation times may be performed either in the compiler at translation time according to the following section and/or it may be measured empirically at run time to induce a subsequent optimization, which results in a compiler capable of learning, in particular a self-learning compiler.
Analytical methods and parallelization methods are important for the present invention.

Various methods according to the related art are available for analysis and implementation of the parallelization.
A preferred method is described below.
Functions to be mapped are represented by graphs (see PACT13; DE 199 26 538.0); an application may be composed of any number of different functions. The graphs are analyzed for the parallelism they contain; all methods of optimization may be used in advance.

For example, the following tests are to be performed:

ILP expresses which instructions may be executed simultaneously (see PAR{}). Such an analysis is easily possible on the basis of the consideration of dependencies of nodes in a graph. Corresponding methods are known sufficiently well from the related art and mathematics per se, and reference shall be made to VLIW compilers and synthesis tools as examples.

However, conditional executions (IF), interlinked if necessary, for example, deserve special attention because a correct statement regarding the paths to be executed in parallel is often impossible or hardly possible, since there is a strong dependence on the value space of the individual parameters, which is often known only inadequately or not at all. Furthermore, an exact analysis may take so much computation time that it may no longer be performed reasonably.

In such cases, the analysis may be simplified by instructions from the programmer, for example, and/or it is possible to operate on the basis of corresponding compiler switches so that in case of doubt either high parallelizability is to be assumed (possibly squandering resources) or low parallelizability is to be assumed (possibly squandering performance).

Likewise, in these cases an empirical analysis may be performed at run time. According to PACT10, PACT17, methods are known which make it possible to compile statistics regarding the program behavior at run time. It is thus possible to initially assume maximum parallelizability, for example. The individual paths return messages to a statistics unit (e.g., implemented in a CT or some other stage, see PACT10, PACT17, but basically also units according to PACT04 may be used) regarding each run cycle. Using statistical measures, it is now possible to analyze which paths are actually run through. In addition, there is also the possibility of evaluating which paths are run through in parallel frequently, rarely or never, based on the data at run time. This type of path usage message is not obligatory but is advantageous.
Accordingly, the execution may be optimized in a next program call. It should be pointed out that the statistics information may now be written away, in particular in a nonvolatile manner, e.g., on a hard drive. It is known from PACT22, PACT24 that a plurality of configurations may either be configured simultaneously and then triggered by triggers (PACT08) or only a subset may be configured, the remaining configurations being reloaded as necessary by sending the corresponding triggers to a loading unit (CT, PACT10).
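A simple software analogue of such path-usage statistics (the counter names, the sample loop and the reporting by printf are assumptions made only for this illustration) could look like this:

/* Sketch: per-path run-time counters whose values may be used in a
   later program call to decide which branches are worth configuring
   simultaneously and which should only be reloaded on demand.        */
#include <stdio.h>

enum { PATH_THEN, PATH_ELSE, PATH_COUNT };

static unsigned long path_hits[PATH_COUNT];   /* statistics unit, e.g., in the CT */

static int conditional(int i, int a)
{
    if (i < 0) { path_hits[PATH_THEN]++; return a * (-i); }
    else       { path_hits[PATH_ELSE]++; return a * i;    }
}

int main(void)
{
    for (int i = -50; i < 950; i++)   /* representative run through the program */
        (void)conditional(i, 3);

    for (int p = 0; p < PATH_COUNT; p++)
        printf("path %d executed %lu times\n", p, path_hits[p]);
    return 0;
}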
The value PAR(p) which is used below indicates for the purpose of clarification which parallelism is achievable on an instruction level, i.e., how many ILPs may be achieved at a certain stage (p) within the data flow graph transformed from the function (Figure 7a).

Vector parallelism is also important (see VEC{}). Vector parallelism is useful when large volumes of data must be processed. In this case, linear sequences of operations are vectorizable, i.e., all operations may process data at the same time, typically with each separate operation processing a separate data word.

This procedure may be impossible to some extent within loops.
Therefore, analyses and optimizations are necessary. For example, the graph of a function may be expressed by a Petri network. Petri networks have the property that the forwarding of results is controlled by nodes, so that loops may be modeled, for example. The data throughput is determined by the feedback of the result into a loop. Examples (illustrated by the sketch following this list):

- The result of computation n is needed for computation n + 1: only one computation may be executed within the loop.

- The result of computation n is needed for computation n + m: m - 1 computations may be executed within the loop.

- The result determines the abort of the loop, but it does not enter into the computation of the results: no feedback is necessary. Although incorrect values (too many values) may enter the loop under some circumstances, the output of results may be interrupted directly on reaching the final condition at the end of the loop.

Before the analysis of loops, they may be optimized according to the related art. For example, certain instructions may be extracted from the loop and placed before or after the loop.
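The three feedback situations listed above may be illustrated by the following loops; the array sizes, the distance M and the abort function are hypothetical values chosen only for this sketch:

/* Illustration of the three feedback cases for loops. */
#include <stdio.h>

#define N 16
#define M 4

static int f(int v) { return v + 7; }                  /* hypothetical loop body  */
static int abort_condition(int n) { return n >= 10; }  /* hypothetical abort test */

int main(void)
{
    int x[N], y[N], z[N];
    int n;
    for (n = 0; n < N; n++) { x[n] = 1; z[n] = n; }

    /* (1) result of iteration n is needed in iteration n + 1:
           only one computation per loop turn is possible            */
    for (n = 1; n < N; n++)
        x[n] = x[n - 1] + x[n - 1];

    /* (2) result of iteration n is needed only in iteration n + M:
           M - 1 computations may overlap within the loop            */
    for (n = M; n < N; n++)
        x[n] = x[n - M] + 1;

    /* (3) the result only determines the abort and does not enter
           the computation: no feedback of results into the loop     */
    for (n = 0; n < N && !abort_condition(n); n++)
        y[n] = f(z[n]);

    printf("x[%d] = %d, last computed y index = %d\n", N - 1, x[N - 1], n - 1);
    return 0;
}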
The value VEC which is used below for clarification may illustrate the degree of vectorizability of a function. In other words, VEC indicates how many data words may be processed simultaneously within a quantity of operations. For example, VEC may be computed from the number of arithmetic means nnodes required for a function and the data ncalc calculable within the vector at the same time, e.g., by VEC = ncalc/nnodes.

If a function may be mapped onto five arithmetic means (nnodes = 5), for example, and if data may be processed at the same time in each of the arithmetic means (ncalc = 5), then VEC = 1 (Figure 7b).

However, if a function may be mapped onto five arithmetic means (nnodes = 5) and data may be processed in only one arithmetic means, e.g., based on feedback of the results of the pipeline back to the input (ncalc = 1), then VEC = 1/5 (Figure 7c).

VEC may be computed for an entire function and/or for partial sections of a function. For the compiler according to the present invention, both variants may be advantageous, as is the case in general for the determination and analysis of VEC.
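As a purely arithmetic illustration of this definition (the function vec() and the two value pairs are taken from the examples above and are not part of the compiler itself):

/* VEC = ncalc / nnodes: data words processable simultaneously per
   quantity of operations, as used in the two examples above.       */
#include <stdio.h>

static double vec(double n_calc, double n_nodes)
{
    return n_calc / n_nodes;
}

int main(void)
{
    printf("VEC = %.2f\n", vec(5.0, 5.0));   /* Figure 7b: 5 nodes, 5 data words -> 1.00 */
    printf("VEC = %.2f\n", vec(1.0, 5.0));   /* Figure 7c: feedback, 1 data word -> 0.20 */
    return 0;
}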
According to Figure 7a, PAR(p) is determined for each line of a graph, as is possible in an advantageous manner. A line of a graph is defined in that it is executed within one clock unit. The number of operations depends on the implementation of the particular VPU.
If PAR(p) corresponds to the number of nodes in line p, then all the nodes may be executed in parallel.
If PAR(p) is smaller, certain nodes are executed only alternatively. The alternative executions of one node are each combined in one PAE. A selection device permits activation of the alternative corresponding to the status of data processing at run time, as described in PACT08, for example.

VEC is also assigned to each line of a graph. If VEC = 1 for one line, this means that the line remains as a pipeline stage. If one line is smaller than 1, then all the following lines which are also smaller than 1 are combined because pipelining is impossible. According to the sequence of operations, they are combined to yield a sequence which is then configured into a PAE and is processed sequentially at run time. Corresponding methods are known from PCT/DE 97/02949 and/or PCT/DE 97/02998, for example.
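The partitioning rule just described may be sketched as follows; the per-line VEC values in the table are hypothetical and serve only to show the decision:

/* Sketch of the rule above: a line with VEC = 1 remains a pipeline
   stage; consecutive lines with VEC < 1 are merged into one sequence
   that is configured into a single PAE and processed sequentially.   */
#include <stdio.h>

int main(void)
{
    double vec_of_line[] = { 1.0, 1.0, 0.2, 0.5, 1.0, 0.25, 0.25 };
    int lines = (int)(sizeof vec_of_line / sizeof vec_of_line[0]);

    for (int p = 0; p < lines; ) {
        if (vec_of_line[p] >= 1.0) {
            printf("line %d: pipeline stage\n", p);
            p++;
        } else {
            int start = p;
            while (p < lines && vec_of_line[p] < 1.0)  /* merge the run of lines */
                p++;
            printf("lines %d..%d: one sequence for a single PAE\n", start, p - 1);
        }
    }
    return 0;
}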
With the method described here, any complex parallel processor models may be constructed by groups of sequencers. In particular, sequencer structures may be generated for mapping reentrant code.

The synchronizations required to accomplish this may be implemented, for example, by the time stamp method described in PACT18 or preferably by the triggering method described in PACT08.

If a plurality of sequencers or sequential parts is mapped onto a PA, then it is preferable to coordinate the power of the individual sequencers for power consumption reasons. This may be accomplished in a particularly preferable manner by adapting the working frequencies of the sequencers to one another. For example, methods which allow individual clocking of individual PAEs or groups of PAEs are known from PACT25 and PACT18.

The frequency of a sequencer may be determined on the basis of the number of cycles typically needed for processing the function assigned to it.

For example, if it needs five clock cycles for processing its function while the remaining system needs exactly one clock cycle to process assigned tasks, then its clocking should be five times higher (factor 5) than the clocking of the remaining system. In the case of a plurality of sequencers, different clock cycles are possible. A clock multiplication and/or a clock division may also be provided.
Functions are partitioned according to the aforementioned method. In partitioning, memories for data and relevant status are inserted accordingly. Additional alternative and/or more extensive methods are known from PACT13 and PACT16.

According to PACT01, PACT10, PACT13, PACT17, PACT22, PACT24, some VPUs offer the possibility of differential reconfiguration. This may be used if only relatively few changes are necessary within the configuration of PAEs in a reconfiguration. In other words, only the changes in a configuration in comparison with the instantaneous configuration are reconfigured. The partitioning in this case may be such that the (differential) configuration following a configuration contains only the required reconfiguration data and does not constitute a complete configuration. The compiler of the present invention is preferably designed to recognize and support this.

Scheduling of the reconfiguration may take place through the status which a function reports to a loading unit (CT), which in turn selects, on the basis of the incoming status, the next configuration or partial configuration and configures it. In detail, such methods are known from PACT01, PACT05, PACT10, PACT13, PACT17.

In addition, the scheduling may support the possibility of preloading configurations during the run time of another configuration. A plurality of configurations may also be preloaded speculatively, i.e., without being sure that the configurations are needed at all. This is particularly preferred when the CT is at least largely unloaded and in particular has little or no task load, e.g., in the case of lengthy data streams that are processable without reconfiguration. Due to selection mechanisms such as those described according to DE 197 04 728.9, for example, the configurations to be used are then selected at run time (see also the example NLS in PACT22/24).

Likewise, the local sequencers may be controlled by the status of their data processing, as is known, for example, from DE 196 51 075.9-53, DE 196 54 846.2-53, DE 199 26 538.0. To perform their reconfiguration, another dependent or independent status may be reported to the CT (see PACT04, LLBACK, for example).
The present invention will now be described with reference to additional figures, where the following characters are used below to simplify the notation: ∨ or, ∧ and.

Figure 8a shows the graph according to Figure 7a mapped onto a group of PAEs with maximum achievable parallelism. All operations (instructions i1-i12) are mapped into individual PAEs.

Figure 8b shows the same graph, e.g., with maximum usable vectorizability, but the quantities of operations V2 = {i2, i3}, V3 = {i4, i5, i6, i7, i8}, V4 = {i9, i10, i11} are not parallel (par({2,3,4}) = 1). This makes it possible to save on resources by assigning a quantity P2, P3, P4 of operations to one PAE each. A status signal for each data word in each stage selects the operation to be executed in the particular PAE.
The PAEs are networked as a pipeline (vector) and each PAE executes one operation on a different data word in each clock cycle.
This yields the following sequence:
PAE1 computes data and sends it to PAE2. Together with the data, it forwards a status signal which indicates whether i2 or i3 is to be executed.

PAE2 continues to compute the data of PAE1. According to the incoming status signal, the operation to be executed (i2, i3) is selected and computed. According to the computation, PAE2 forwards a status signal to PAE3, indicating whether (i4 ∨ i5) ∨ (i6 ∨ i7 ∨ i8) is to be executed.

PAE3 continues to compute the data of PAE2. According to the incoming status signal, the operation to be executed (i4 ∨ i5) ∨ (i6 ∨ i7 ∨ i8) is selected and computed. According to the computation, PAE3 forwards a status signal to PAE4, indicating whether i9 ∨ i10 ∨ i11 is to be executed.

PAE4 continues to compute the data of PAE3. According to the incoming status signal, the operation to be executed i9 ∨ i10 ∨ i11 is selected and computed.

PAE5 continues to compute the data of PAE4.
One possible corresponding method and hardware which allow a particularly favorable implementation of the method described here are described in DE 197 04 728.9 (Figures 5 and 6).
PACT04 and PACT10, PACT13 also describe methods that may be used in general but are more complex.
Figure 8c again shows the same graph. In this example, vectorization is impossible but PAR(p) is high, which means that within one line a plurality of operations may be executed at the same time. The operations to be implemented in parallel are P2 = {i2 ∧ i3}, P3 = {i4 ∧ i5 ∧ i6 ∧ i7 ∧ i8}, P4 = {i9 ∧ i10 ∧ i11}. The PAEs are networked so that they may exchange any data among themselves in any way. The individual PAEs perform operations only when there is an ILP in the corresponding cycle; otherwise they behave neutrally (NOP), and there may optionally be a clocking down and/or a clock shutdown and/or a current shutdown to minimize power loss.
The following sequence is provided:
In the first cycle, only PAE2 operates and forwards the data to PAE2 and PAE3.

In the second cycle, PAE2 and PAE3 are operating in parallel and forward their data to PAE1, PAE2, PAE3, PAE4, PAE5.

In the third cycle, PAE1, PAE2, PAE3, PAE4 and PAE5 are operating and forward the data to PAE2, PAE3 and PAE5.

In the fourth cycle, PAE2, PAE3 and PAE5 are operating and forward the data to PAE2.
In the fifth cycle, only PAE2 is operating.
This function thus requires five cycles for the computation.
The corresponding sequencer should thus operate with a five-fold clock cycle in relation to its environment in order to achieve a corresponding performance.

A possible corresponding method is described in PACT02 (Figures 19, 20 and 21). PACT04 and PACT10, PACT13 also describe methods that may be used in general but are more complex. Additional methods and/or hardware may also be used.

Figure 8d shows the graph according to Figure 7a in case there is no usable parallelism. To compute a data word, each stage must be run through in succession. Within the stages, only one of the branches is always processed.
The function also requires five cycles for the computation: cy1 = (i1), cy2 = (i2 ∨ i3), cy3 = (i4 ∨ i5 ∨ i6 ∨ i7 ∨ i8), cy4 = (i9 ∨ i10 ∨ i11), cy5 = (i12). Thus the corresponding sequencer should work with a clock cycle five times faster in relation to its environment to achieve a corresponding performance.

Such a function is mappable, for example, as in Figure 8c by using a simpler sequencer according to PACT02 (Figures 19, 20 and 21). PACT04 and PACT10, PACT13 also describe methods that may be used in general but are more complex.

The illustrations represented in Figure 8 may be combined and grouped at will.
In Figure 9a, for example, the same function is shown in which the paths (i2 ∧ (i4 ∨ i5) ∧ i9) and (i3 ∧ (i6 ∨ i7 ∨ i8) ∧ (i9 ∨ i10)) are executable in parallel. (i4 ∨ i5), (i6 ∨ i7 ∨ i8), (i9 ∨ i10) are each alternatives. The function is also vectorizable. It is thus possible to construct a pipeline in which the particular function to be executed is determined on the basis of status signals for three PAEs (PAE4, PAE5, PAE7).

Figure 9b shows a similar example in which vectorization is impossible. However, the paths (i1 ∧ i2 ∧ (i4 ∨ i5) ∧ i9 ∧ i12) and (i3 ∧ (i6 ∨ i7 ∨ i8) ∧ (i10 ∨ i11)) are parallel. Thus the optimum performance may be achieved with the use of two PAEs which also process the parallel paths in parallel. The PAEs are synchronized with one another by status signals which are generated preferably by PAE1 because it computes the start (i1) and the end (i12) of the function.

It should be pointed out in particular that a symmetrically parallel processor model (SMP) or the like could yield the multiprocessor models in use today by a multiple configuration of sequencers.

It should also be pointed out that all configuration registers for scheduling may also be loaded with new configurations in the background and/or during data processing.
This is possible, for example, if the hardware has the design known from DE 196 51 075.9-53. Independent memory areas or registers are then available and may be addressed independently. It is possible to jump to certain locations due to incoming triggers; jump instructions (JMP, CALL/RET) are also possible and may also be executed conditionally, if necessary.

According to DE 196 54 846.2-53, independent read and write pointers are available so that basically there is an independence and thus the possibility of access in the background. In particular, it is possible to segment the memories, which provides additional independence. Jumps are possible via jump instructions (JMP, CALL/RET) which may also be executed conditionally if necessary.

According to DE 197 04 728.9, the individual registers which may be selected by the triggers are basically independent and therefore allow independent configuration, in particular in the background. Jumps within the registers are not possible, and the choice is made exclusively via the trigger vectors.
An essential factor for evaluating the efficiency of PAR and VEC is the type of data processed by the particular structure.

For example, it is worthwhile to roll a structure, i.e., to pipeline or parallelize it, which processes a large volume of data, as is the case with video data or telecommunications data, for example. It is not worthwhile to roll structures which process low data volumes (e.g., keyboard input, mouse, etc.); on the contrary, they would block resources for other algorithms.

Thus, on the basis of different instructions, it is proposed that only the algorithms, structures or parts of algorithms which appropriately process large volumes of data shall be parallelized and vectorized.
Such instructions might be, for example:
1. The data type (on the basis of the higher data volume, arrays and streams should be rolled sooner than individual characters, for example).

2. The type of access (linear program sequences should be mapped into sequencers, for example, whereas it is worthwhile to roll loops because of the high number of run-throughs, for example).

3. The type of source and/or target (keyboard and mouse, for example, have a data rate that is too low to be rolled efficiently, but the data rate with a network, for example, and/or video sources or targets is much higher).
Any quantity of these instructions may be used for analysis.
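Such an analysis may be sketched, for example, as a simple decision function; the enumerations, the scoring and the threshold below are assumptions made only for this illustration:

/* Sketch: a structure is rolled (parallelized/vectorized) only if
   the indications of a high data volume predominate; otherwise it is
   mapped onto a sequencer.                                            */
#include <stdio.h>

typedef enum { T_CHAR, T_ARRAY, T_STREAM }               data_type_t;
typedef enum { A_LINEAR, A_LOOP }                        access_type_t;
typedef enum { S_KEYBOARD, S_MOUSE, S_NETWORK, S_VIDEO } source_t;

static int should_roll(data_type_t t, access_type_t a, source_t s)
{
    int score = 0;
    if (t == T_ARRAY || t == T_STREAM)   score++;  /* 1. data type        */
    if (a == A_LOOP)                     score++;  /* 2. type of access   */
    if (s == S_NETWORK || s == S_VIDEO)  score++;  /* 3. source or target */
    return score >= 2;
}

int main(void)
{
    printf("video stream in a loop:  roll = %d\n",
           should_roll(T_STREAM, A_LOOP, S_VIDEO));
    printf("keyboard input, linear:  roll = %d\n",
           should_roll(T_CHAR, A_LINEAR, S_KEYBOARD));
    return 0;
}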
7. Definition of Terms

Locally relevant state: a state that is relevant only within a certain configuration;

Globally relevant state: a state that is relevant in multiple configurations and must be exchanged among configurations;

Relevant state: a state that is needed within an algorithm for correct execution thereof and is thus described by the algorithm and used by it;

Irrelevant state: a state that is not important for the actual algorithm and is also not described in the algorithm but is needed by the executing hardware, depending on the implementation.

Claims (20)

What Is Claimed Is:
1. A method for translating high-level languages onto reconfigurable architectures, wherein a finite automaton is constructed for the computation in such a way that a complex combinatory network of individual functions is formed and memories are assigned to the network for storage of the operands and the results.
2. A method for data manipulation and/or data processing having a multidimensional field using reconfigurable ALUs, wherein a high-level language code is provided and translated in such a way as to construct a finite automaton for the computation, a complex combinatory network being formed from individual functions and memories being assigned to the network for storage of the operands and/or results.
3. ~The method as recited in one of the preceding claims, wherein the complex combinatory network is constructed and/or broken down in such a way that the PAE matrix is operated for the longest possible period of time without reconfiguration.
4. ~The method as recited in the preceding claim, wherein complex instructions are determined to construct the complex combinatory network and/or break it down in such a way that the PAE matrix is operated for the longest possible period of time without reconfiguration.
5. ~The method as recited in one of the preceding claims, wherein a finite automaton is constructed directly from imperative source text.
6. ~The method as recited in one of the preceding claims, wherein a finite automaton is constructed from operations adapted to coarse-grained logic circuits and/or existing fine-grained elements (FPGA cells in the VPU, state machines, etc.), in particular being adapted exclusively to such elements.
7. The method as recited in one of the preceding claims, wherein a finite automaton is then broken down into configurations.
8. The method as recited in one of the preceding claims, wherein generated configurations are mapped successively onto the PAE matrix and operating data and/or states that are to be transmitted among the configurations are stored in memories.
9. The method as recited in the preceding claim, wherein the memory is provided and/or determined by the compiler.
10. The method as recited in the preceding claim, wherein during a configuration, data from a VPU-external source and/or an internal memory is processed and written to an external source and/or an internal memory.
11. The method as recited in one of the preceding claims, wherein a memory is provided for an entire data record which is more extensive than a single data word.
12. The method as recited in one of the preceding claims, wherein data is stored in the memories as determined by the compiler during the processing of a running configuration.
13. The method as recited in one of the preceding claims, wherein a memory for operands, a memory for results and a network of assignments and/or comparison instructions, i.e., conditions such as IF, CASE, loop (WHILE, FOR, REPEAT) and optional address generator(s) are provided for triggering the memories with the automaton.
14. The method as recited in one of the preceding claims, wherein states are assigned to memories as necessary, differentiating here between algorithmically relevant and irrelevant states, in particular such relevant states which are necessary within the algorithm to describe its correct function and such irrelevant states which occur due to the hardware used and/or the selected mapping or for other secondary reasons.
15. The method as recited in one of the preceding claims, wherein load/store operations are provided, including external addressing, i.e., data transfers to external modules, and/or internal addressing, i.e., data transfers between PAEs, in particular between RAM PAEs and ALU PAEs.
16. The method as recited in one of the preceding claims, wherein a first configuration is removed during data processing and data to be saved remains in the corresponding memories (REG) (memories, registers, counters, etc.).
17. The method as recited in one of the preceding claims, wherein the first configuration is reloaded and the previously saved data assigned to it is accessed.
18. The method as recited in one of the preceding claims, wherein a second configuration is loaded for access to previously saved data, this second configuration connecting the REGs in a suitable manner and in a defined order to one or more global memories, in particular to access the global memory (memories) using address generators, the address generator generating the address for the global memory (memories), preferably in such a way that the memory areas described (PUSH AREA) of the first configuration removed may be assigned unambiguously.
19. The method as recited in one of the preceding claims, wherein transformation for representing the parallelizability and/or vectorizability (par-vec transformation) is performed automatically and/or VEC and PAR components are designed as a Petri network to control further processing after complete processing of the particular contents as preferred.
20. The method as recited in one of the preceding claims, wherein arithmetic/logic instructions are mapped directly into the combinatory network and/or jumps (jump/call) are either rolled directly into the combinatory network and/or are implemented by reconfiguration and/or conditions and control flow instructions (if, etc.) are either completely resolved and/or processed in the combinatory network and/or relayed to a higher-level configuration unit which then performs a reconfiguration according to the resulting status and/or load/store operations are mapped into separate configurations and/or are implemented by address generators which write internal memories (REG{}) to external memories via address generators and/or load them from external memories and/or peripherals and/or register move operations are implemented in the combinatory network by buses between the internal memories (REG{}) and/or push/pop operations are implemented by separate configurations, writing certain internal registers in the combinatory network and/or writing the internal memories (REG{}) via address generators to external memories or reading from external memories, and which are preferably executed before or after the actual data processing configurations.
CA002458199A 2001-08-16 2002-08-16 Method for the translation of programs for reconfigurable architectures Abandoned CA2458199A1 (en)

Applications Claiming Priority (19)

Application Number Priority Date Filing Date Title
DE10139170.6 2001-08-16
DE10139170 2001-08-16
DE10142903 2001-09-03
DE10142903.7 2001-09-03
DE10144732.9 2001-09-11
DE10144732 2001-09-11
DE10145792.8 2001-09-17
DE10145792 2001-09-17
US09/967,847 2001-09-28
US09/967,847 US7210129B2 (en) 2001-08-16 2001-09-28 Method for translating programs for reconfigurable architectures
DE10154260.7 2001-11-05
DE10154260 2001-11-05
DE10207225.6 2002-02-21
DE10207225 2002-02-21
PCT/EP2002/002398 WO2002071248A2 (en) 2001-03-05 2002-03-05 Methods and devices for treating and/or processing data
EPPCT/EP02/02398 2002-03-05
EPPCT/EP02/09131 2002-08-15
EP0209131 2002-08-15
PCT/EP2002/010065 WO2003017095A2 (en) 2001-08-16 2002-08-16 Method for the translation of programs for reconfigurable architectures

Publications (1)

Publication Number Publication Date
CA2458199A1 true CA2458199A1 (en) 2003-02-27

Family

ID=41210636

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002458199A Abandoned CA2458199A1 (en) 2001-08-16 2002-08-16 Method for the translation of programs for reconfigurable architectures

Country Status (4)

Country Link
JP (1) JP2005508029A (en)
AU (1) AU2002340879A1 (en)
CA (1) CA2458199A1 (en)
WO (1) WO2003017095A2 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6542998B1 (en) 1997-02-08 2003-04-01 Pact Gmbh Method of self-synchronization of configurable elements of a programmable module
US9037807B2 (en) 2001-03-05 2015-05-19 Pact Xpp Technologies Ag Processor arrangement on a chip including data processing, memory, and interface elements
US7444531B2 (en) 2001-03-05 2008-10-28 Pact Xpp Technologies Ag Methods and devices for treating and processing data
US7996827B2 (en) 2001-08-16 2011-08-09 Martin Vorbach Method for the translation of programs for reconfigurable architectures
EP1514193B1 (en) 2002-02-18 2008-07-23 PACT XPP Technologies AG Bus systems and method for reconfiguration
US8914590B2 (en) 2002-08-07 2014-12-16 Pact Xpp Technologies Ag Data processing method and device
AT501479B8 (en) * 2003-12-17 2007-02-15 On Demand Informationstechnolo DIGITAL COMPUTER DEVICE
DE102005005073B4 (en) * 2004-02-13 2009-05-07 Siemens Ag Computer device with reconfigurable architecture for the parallel calculation of arbitrary algorithms
JP5141151B2 (en) * 2007-09-20 2013-02-13 富士通セミコンダクター株式会社 Dynamic reconfiguration circuit and loop processing control method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5794062A (en) * 1995-04-17 1998-08-11 Ricoh Company Ltd. System and method for dynamically reconfigurable computing using a processing unit having changeable internal hardware organization
US5966534A (en) * 1997-06-27 1999-10-12 Cooke; Laurence H. Method for compiling high level programming languages into an integrated processor with reconfigurable logic

Also Published As

Publication number Publication date
JP2005508029A (en) 2005-03-24
WO2003017095A3 (en) 2004-10-28
AU2002340879A1 (en) 2003-03-03
WO2003017095A2 (en) 2003-02-27

Similar Documents

Publication Publication Date Title
US8869121B2 (en) Method for the translation of programs for reconfigurable architectures
US7657877B2 (en) Method for processing data
US7210129B2 (en) Method for translating programs for reconfigurable architectures
US10031733B2 (en) Method for processing data
US20200057749A1 (en) Method for providing subapplications to an array of alus
Callahan et al. Instruction-level parallelism for reconfigurable computing
US20150074352A1 (en) Multiprocessor Having Segmented Cache Memory
US20030046607A1 (en) Method for debugging reconfigurable architectures
US20090113404A1 (en) Optimum code generation method and compiler device for multiprocessor
US20170364338A1 (en) Tool-level and hardware-level code optimization and respective hardware modification
CA2458199A1 (en) Method for the translation of programs for reconfigurable architectures
Zimmermann et al. An approach to machine-independent parallel programming
US8549466B2 (en) Tiered register allocation
Keßler et al. Integrating synchronous and asynchronous paradigms: The Fork95 parallel programming language
US20110161977A1 (en) Method and device for data processing
Haug et al. Reconfigurable hardware as shared resource in multipurpose computers
Brossard et al. A model for programming data-intensive applications on fpgas: A genomics case study
Su et al. Array prefetching for irregular array accesses in titanium
Kindratenko et al. Accelerating scientific applications with reconfigurable computing: getting started
Andersson et al. Automatic local memory architecture generation for data reuse in custom data paths
Saghir et al. A comparison of traditional and VLIW DSP architecture for compiled DSP applications
Willis et al. Use of embedded scheduling to compile VHDL for effective parallel simulation
Scipione Toward portable HPC applications with SYCL: a molecular docking case study
thi Huong et al. Support of cross calls between a microprocessor and FPGA in CPU-FPGA coupling architecture
Cancare et al. Dedicated hardware accelerators for the epistatic analysis of human genetic data

Legal Events

Date Code Title Description
FZDE Dead