WO2003017095A2 - Method for the translation of programs for reconfigurable architectures - Google Patents
Method for the translation of programs for reconfigurable architectures
- Publication number
- WO2003017095A2 WO2003017095A2 PCT/EP2002/010065 EP0210065W WO03017095A2 WO 2003017095 A2 WO2003017095 A2 WO 2003017095A2 EP 0210065 W EP0210065 W EP 0210065W WO 03017095 A2 WO03017095 A2 WO 03017095A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- memory
- configuration
- configurations
- memories
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/447—Target code generation
Definitions
- the present invention relates to the subject matter claimed in the preamble.
- the present invention addresses the question of how reconfigurable architectures can be optimally used and, in particular, how instructions in a given high-level language can be optimally executed in reconfigurable architectures.
- in order to implement instructions for handling data (programs) written in so-called high-level languages on a respective architecture used for data handling, so-called compilers are known which translate the high-level language instructions into instructions that are better adapted to the architecture used. Compilers that particularly support highly parallel architectures are accordingly called parallelizing compilers.
- Prior art parallelizing compilers typically use special constructs such as semaphores and / or other methods of synchronization.
- Technology-specific processes are typically used.
- Known methods are not suitable for combining functionally specified architectures with the associated time behavior and imperatively specified algorithms. The methods used therefore only deliver satisfactory results in special cases.
- Compilers for reconfigurable architectures, in particular for reconfigurable processors, usually use macros that have been created specifically for the particular reconfigurable hardware; hardware description languages such as Verilog, VHDL or System-C are mostly used to create these macros. The macros are then called (instantiated) from the program flow of an ordinary high-level language (e.g. C, C++).
- Compilers for parallel computers are known which map program parts onto several processors in a coarse-grained manner, mostly based on complete functions or threads.
- vectorizing compilers are known which support extensive linear data processing, e.g. convert calculations of large expressions into a vectorized form and thus enable calculation on superscalar processors and vector processors (e.g. Pentium, Cray).
- VPUs basically consist of a multidimensional homogeneous or inhomogeneous, flat or hierarchical arrangement (PA) of cells (PAEs) that can perform arbitrary functions, in particular logical and/or arithmetic functions and/or storage functions and/or network functions.
- a loading/configuration unit (CT) is typically assigned to the PAEs, which determines the function of the PAEs through configuration and, if necessary, reconfiguration.
- the method is based on an abstract parallel machine model which, in addition to the finite automaton, also integrates imperative problem specifications and enables an efficient algorithmic derivation of an implementation on different technologies.
- Vectorizing compilers build largely linear code that is tailored to special vector computers or heavily pipelined processors. These compilers were originally available for vector computers such as the CRAY. Because of their long pipeline structure, modern processors such as the Pentium require similar methods. Since the individual calculation steps are vectorized (pipelined), the code is much more efficient. However, conditional jumps pose problems for the pipeline. A branch prediction that assumes a jump target therefore makes sense; if the assumption is incorrect, the entire processing pipeline must be flushed. In other words, every jump is problematic for these compilers, and parallel processing in the actual sense is not given. Branch predictions and similar mechanisms require a considerable amount of additional hardware.
- Coarse-grained parallelizing compilers hardly exist in the actual sense; the parallelism is typically marked and managed by the programmer or the operating system, for example on MPP computer systems such as various IBM architectures, ASCI Red, etc., mostly at thread level.
- a thread is a largely independent program block or even another program. Coarsely granular threads are therefore easy to parallelize. Synchronization and data consistency must be ensured by the programmer or the operating system. This is difficult to program and requires a significant proportion of the computing power of a parallel computer.
- this coarse parallelization means that only a fraction of the parallelism that is actually possible can be used.
- Fine-grained parallelizing (e.g. VLIW) compilers try to map the parallelism in a fine-grained manner onto VLIW arithmetic units, which can perform several arithmetic operations in parallel in one cycle but have a common register set.
- This limited register set is a major problem because it has to provide the data for all computing operations.
- data dependencies and inconsistent read / write operations make parallelization difficult.
- Reconfigurable processors have a large number of independent arithmetic units, which are typically arranged in a field. These are typically not linked to one another by a common register set but by buses. On the one hand, this makes it easy to build vector arithmetic units; on the other hand, simple parallel operations can also be carried out. In contrast to conventional register concepts, data dependencies are resolved by the bus connections.
- the object of the present invention is to provide something new for commercial use.
- a significant advantage is that the compiler does not need to map onto a fixed hardware structure; rather, the hardware structure can be configured with the method according to the invention so that it is optimally suited for mapping the respective compiled algorithm.
- the finite state machine is used as the basis for processing practically any method of specifying algorithms.
- the structure of a finite automaton is shown in FIG. 1.
- a simple finite state machine breaks down into a combinatorial network and a register stage for the temporary storage of data between the individual data processing cycles.
- a finite automaton performs a number of purely combinatorial (e.g. logical and / or arithmetic) data manipulations in order to then achieve a stable state, which is represented in a register (set). Based on this stable state, a decision is made as to which next state is to be reached in the next processing step and thus also which combinatorial data manipulations are to be carried out in the next step.
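- As an illustration only (not part of the patent text), the finite-automaton model just described can be sketched in a few lines of C: a purely combinational function derives the next state from the current register contents and the input, and a register stage carries that state into the next processing cycle. The identifiers and the toy accumulation task are invented for this sketch.

```c
#include <stdio.h>

typedef enum { LOAD, ACCUMULATE, DONE } state_t;

/* purely combinational part: derives the next state and the data
   manipulation from the current state and the input data word     */
static state_t next_state(state_t s, int data, int *acc)
{
    switch (s) {
    case LOAD:       *acc = data;  return ACCUMULATE;
    case ACCUMULATE: *acc += data; return (*acc > 100) ? DONE : ACCUMULATE;
    default:         return DONE;
    }
}

int main(void)
{
    int input[] = { 10, 20, 30, 40, 50 };
    int acc = 0;
    state_t reg = LOAD;                      /* register stage holding the state */

    for (int cycle = 0; cycle < 5 && reg != DONE; ++cycle)
        reg = next_state(reg, input[cycle], &acc);   /* one register update per cycle */

    printf("final state %d, acc %d\n", reg, acc);
    return 0;
}
```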
- a processor or sequencer represents a finite state machine.
- data can be subtracted.
- the result is saved.
- a jump can be made based on the result of the subtraction which, depending on the sign of the result, leads to different further processing.
- the finite state machine enables complex algorithms to be mapped to any sequential machine, as shown in FIG. 2.
- the complex finite automaton shown consists of a complex combinatorial network, a memory for storing data and an address generator for addressing the data in the memory.
- any sequential program can be interpreted fundamentally as a finite automaton, but mostly a very large combinatorial network is created.
- the combinatorial operations are therefore broken down into a sequence of individual, simple, fixed operations (OpCodes) on internal CPU registers.
- This decomposition creates states for controlling the combinatorial operation that has been broken down into a sequence; these states, however, do not exist per se within the original combinatorial operation and are not required by it. The states of a von Neumann machine that are to be processed must therefore be fundamentally distinguished from the algorithmic states of a combinatorial network, i.e. the registers of the finite automaton.
- VPU technology (essentially defined by some or all of PACT01, PACT02, PACT03, PACT04, PACT05, PACT08, PACT10, PACT13, PACT17, PACT18, PACT22, PACT24, which are fully incorporated by reference), in contrast to the rigid OpCodes of CPUs, allows complex instructions to be configured as flexible configurations according to the algorithm to be mapped.
- the compiler furthermore preferably generates the finite automaton from the imperative source text in such a way that it is particularly well adapted to the respective PAE matrix, that is to say operations are provided which use the typically coarse-grained logic circuits (ALUs, etc.) and possibly also fine-grained elements (FPGA cells in the VPU, state machines, etc.) particularly efficiently.
- the compiler-generated finite state machine is then broken down into configurations.
- processing (interpretation) of the finite automaton takes place on a VPU in such a way that the generated configurations are successively mapped onto the PAE matrix and the working data and/or states that are to be transferred between the configurations are stored in memories.
- the method or the corresponding architecture known from PACT04 can be used.
- These memories are determined or provided by the compiler. A configuration represents a plurality of instructions; at the same time, a configuration determines the mode of operation of the PAE matrix for a large number of clock cycles, during which
- a large amount of data is processed in the matrix. These data come from a VPU-external source and/or an internal memory and are written to an external destination and/or an internal memory. The internal memories replace the register set of a CPU according to the prior art in such a way that, for example, a memory takes the place of a register: instead of one data word per register, an entire data record is stored per memory.
- a significant difference from compilers that parallelize on an instruction basis is that the method configures and reconfigures the PAE matrix in such a way that a configured sequence of combinatorial networks is emulated on a VPU, whereas conventional compilers combine loaded sequences of instructions (OpCodes), in which case an instruction can be viewed as a combinatorial network.
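- A hedged sketch of this difference, with invented names and a trivial operation: where an OpCode transforms one data word per invocation, a configuration behaves like a routine that is set up once and then streams an entire data packet from an operand memory into a result memory.

```c
#include <stdio.h>
#include <stddef.h>

/* one rigid "OpCode": transforms a single data word per invocation */
static int op_add3(int x) { return x + 3; }

/* one "configuration": once mapped onto the array, it streams an entire
   data packet from an operand memory into a result memory              */
static void config_add3(const int *operand_mem, int *result_mem, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        result_mem[k] = op_add3(operand_mem[k]);
}

int main(void)
{
    int in[6] = { 1, 2, 3, 4, 5, 6 }, out[6];
    config_add3(in, out, 6);          /* one configuration, many data words */
    for (int k = 0; k < 6; ++k) printf("%d ", out[k]);
    printf("\n");
    return 0;
}
```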
- An instruction or a sequence of instructions can be mapped onto a combinatorial network using the compiler method described.
- Figure 3a shows a combinatorial network with the associated variables.
- for addressing, for reading the operands and for storing the results, address generators can be synchronized with the combinatorial network of the assignment. New addresses for operands and results are generated for each processed variable (FIG. 3c).
- the type of address generator is arbitrary and depends on the addressing scheme of the compiled application. Common, combined or completely independent address generators can be implemented for operands and results.
- a plurality of data are processed within a specific configuration of the PAEs.
- the compiler is therefore preferably designed for the simple FIFO mode, which is possible in many if not most applications and which is at least applicable to the data memories that are used within this description for storing data and data-processing states (as a replacement, so to speak, for the ordinary register set of conventional CPUs).
- memories are used to temporarily store variables between configurations.
- a configuration is similar to an instruction of a normal processor and the memories (in particular a plurality) are comparable to the register set of a normal processor.
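- The following minimal C sketch (an assumption for illustration, not the patent's implementation) models such a memory as a FIFO with queryable 'empty' and 'full' states, used to pass a data record from one configuration to the next.

```c
#include <stdbool.h>
#include <stdio.h>

#define DEPTH 8                                    /* illustrative FIFO depth */

typedef struct { int data[DEPTH]; int rd, wr, fill; } fifo_t;

static bool fifo_empty(const fifo_t *f) { return f->fill == 0; }
static bool fifo_full (const fifo_t *f) { return f->fill == DEPTH; }

/* producer configuration writes a result word; 'full' is a queryable state */
static bool fifo_push(fifo_t *f, int v)
{
    if (fifo_full(f)) return false;
    f->data[f->wr] = v; f->wr = (f->wr + 1) % DEPTH; f->fill++;
    return true;
}

/* consumer configuration reads an operand word; 'empty' is a queryable state */
static bool fifo_pop(fifo_t *f, int *v)
{
    if (fifo_empty(f)) return false;
    *v = f->data[f->rd]; f->rd = (f->rd + 1) % DEPTH; f->fill--;
    return true;
}

int main(void)
{
    fifo_t m = { {0}, 0, 0, 0 };                   /* memory between two configurations */
    for (int k = 0; k < 5; ++k) fifo_push(&m, k * k);
    for (int v; fifo_pop(&m, &v); ) printf("%d ", v);
    printf("\n");
    return 0;
}
```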
- WHILE x1 < 10 DO x1 := x1 + 1;
- the mapping generates an additional PAE for processing the comparison.
- the comparison result is represented by a status signal (cf. trigger in PACT08), which is generated by the PAE processing the instruction and is evaluated by the PAEs processing the data and by the address generators.
- besides WHILE, other instructions such as IF and CASE can obviously be implemented in the same general way.
- a status is generated which can be made available to the subsequent data processing (DE 197 04 728.9) and/or can be sent to the CT or a local loading control (DE 196 54 846.2), which derives from it information about the further program flow and possibly upcoming reconfigurations.
- each program is mapped into a system that is structured as follows:
- Optional address generator (s) for controlling the memories according to 1 and 2.
- states are usually represented by status signals (e.g. trigger in PACT08) and / or handshakes (e.g. RDY / ACK in PACT02).
- states can be represented by any signals, signal bundles and/or registers.
- the compilation method disclosed can also be applied to such, although essential parts of the description preferably focus on the VPU.
- Irrelevant states are only necessary locally and / or temporally locally and therefore do not have to be saved.
- a sequential divider is created, for example, by mapping a division command onto hardware that only supports sequential division. This creates a state that identifies the calculation step within the division. This state is irrelevant since the algorithm only requires the result (ie the division performed). In this case, only the result and the time information (i.e. availability) are required.
- the compiler preferably differentiates between such relevant and irrelevant states.
- the time information is available, for example, in the VPU technology according to PACT01, 02, 13 through the RDY / ACK handshake.
- the handshake also does not represent a relevant state, since it only signals the validity of the data, which in turn reduces the remaining relevant information to the existence of valid data.
- the implicit time information of sequential languages is mapped in a handshake protocol in such a way that the handshake protocol (RDY / ACK protocol) transmits the time information and in particular guarantees the sequence of the assignments.
- Bus transfers are broken down into internal and external transfers. bt1) External read accesses (load operations) are separated out and, in one possible version, preferably translated into a separate configuration.
- the data is transferred from an external memory to an internal one.
- External write accesses (store operation) are separated, in a preferred possible embodiment also translated into a separate configuration.
- the data is transferred from an internal memory to an external one.
- bt1, bt2, bt3 - i.e. the loading of the data (load), the processing of the data (data processing and bt2) and the writing of the data (bt3) - can be translated into different configurations and, if necessary, executed at different times.
- example#dload loads the data from external sources (memory, peripherals, etc.) and writes it to internal memories. Internal memories are marked with r# plus the name of the original variable.
- example#process corresponds to the actual data processing. It reads the data from internal operand memories and writes the results back to internal memories.
- example#dstore writes the results from the internal memory to external destinations (memory, peripherals, etc.).
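- A simplified sketch, assuming an element-wise addition as the processed function and plain C arrays standing in for the internal memories r#a, r#b and r#x, may illustrate the three-way split into example#dload, example#process and example#dstore:

```c
#include <stdio.h>
#define N 4

static int r_a[N], r_b[N], r_x[N];          /* internal memories r#a, r#b, r#x */

/* example#dload: fetch operands from external memory into internal memories */
static void example_dload(const int *ext_a, const int *ext_b)
{
    for (int j = 0; j < N; ++j) { r_a[j] = ext_a[j]; r_b[j] = ext_b[j]; }
}

/* example#process: the actual data processing, internal memories only */
static void example_process(void)
{
    for (int j = 0; j < N; ++j) r_x[j] = r_a[j] + r_b[j];
}

/* example#dstore: write the results from the internal memory back out */
static void example_dstore(int *ext_x)
{
    for (int j = 0; j < N; ++j) ext_x[j] = r_x[j];
}

int main(void)
{
    int a[N] = {1, 2, 3, 4}, b[N] = {5, 6, 7, 8}, x[N];
    example_dload(a, b);       /* configuration 1 */
    example_process();         /* configuration 2 */
    example_dstore(x);         /* configuration 3 */
    printf("%d %d %d %d\n", x[0], x[1], x[2], x[3]);
    return 0;
}
```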
- a procedure can be used that reloads the operands and/or writes out the results externally as required by means of subroutine calls.
- the states of the FIFOs can be queried: 'empty' if the FIFO is empty, and 'full' if the FIFO is full.
- the program flow reacts according to these conditions. It should be noted that certain variables (e.g. ai, bi, xi) are defined globally. To optimize performance, a scheduler can already execute the configurations example#dloada and example#dloadb before calling example#process, so that data is already preloaded.
- example#dstore(n) can also be called after example#process has been scheduled out, in order to empty r#x.
- the subroutine calls and the management of the global variables are comparatively complex for reconfigurable architectures. In a preferred embodiment, the following optimization can therefore be carried out, in which all configurations run largely independently and terminate after complete processing. Since the data b[j] are required several times, example#dloadb must accordingly be run several times. Two alternatives are presented by way of example:
- example#dloadb terminates after each run and is reconfigured by example#process for each restart.
- configurations can also be scheduled as soon as they can no longer carry out their task temporarily.
- the corresponding configuration is removed from the reconfigurable module, but remains in the scheduler.
- the 'reenter' command is used for this below.
- the relevant variables are saved before scheduling and restored during repeated configuration:
- More complex high-level language constructs, such as loops, are typically implemented using macros.
- the macros are specified by the compiler and instantiated at translation time. (see Figure 4).
- the macros are either built up from simple language constructs of the high-level language or created at assembler level. Macros can be parameterized in order to allow simple adaptation to the described algorithm (cf. FIG. 5, 0502).
- the macros are also to be integrated here.
- Mapping an algorithm into a combinatorial network can result in undelayed feedback that oscillates in an uncontrolled manner.
- undelayed feedback can be detected by graph analysis of the resulting combinatorial network.
- Registers for decoupling are then inserted into the data paths in which there is an undelayed feedback.
- the compiler can thus manage register or storage means.
- the compilation described does not execute an OpCode sequentially, but rather complex configurations. While an opcode typically processes a data word in CPUs, a plurality of data words (a data packet) are processed by one configuration in VPU technology. This increases the efficiency of the reconfigurable architecture through a better relationship between reconfiguration effort and data processing.
- in VPU technology a memory is used instead of a register, since it is not data words but data packets that are processed between the configurations.
- This memory can be designed as a random access memory, stack, FIFO or any other memory architecture, with a FIFO typically giving the best and easiest to implement option.
- Data is now processed by the PAE matrix in accordance with the configured algorithm and saved in one or more memories.
- the PAE matrix is reconfigured and the new configuration takes the intermediate results from the memory (s) and continues the execution of the program.
- new data from external memories and / or the peripherals can also be included in the calculation, and results can also be written to external memories and / or the peripherals.
- the typical course of data processing is the reading out of internal RAMs, the processing of the data in the matrix and the writing of data into the internal memories, and any external sources or destinations for data transfers in addition to or instead of the internal memories for data processing can be used.
- the information when and / or how to sequence, i.e. Which next configuration is to be configured can be represented by various information that can be used individually or in combination.
- the following strategies for deriving the information make sense alone and/or in combination or alternatively: a) defined by the compiler at translation time; b) defined by the event network (trigger, DE 197 04 728.9), whereby the event network can represent internal and/or external states; c) by the fill level of the memories.
- a simple example is intended to show a distinguishing feature for locally relevant states: a) A branch of the type "if () then ... else ..." fits completely into a single configuration, i.e. both data paths (branches) are mapped completely within the configuration. The state that results from the comparison is relevant, but only locally, since it is no longer required in the subsequent configurations. b) The same branch is too large to fit completely into a single configuration; several configurations are necessary to map the complete data paths. In this case the state is globally relevant and must be saved and assigned to the respective data, since the following configurations require the current state of the comparison during further processing.
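- The distinction can be sketched in C (illustrative only; the function and memory layout are invented): in case a) the comparison status never leaves the configuration, in case b) it is written to memory next to the data so that the following configuration can use it.

```c
#include <stdio.h>

/* case a): the branch fits into one configuration; the comparison result is
   consumed inside the same configuration and never stored (locally relevant) */
static int config_local(int v)
{
    int cond = (v > 0);                 /* status used only here              */
    return cond ? v * 2 : -v;           /* both data paths in one configuration */
}

/* case b): the branch is split over two configurations; the comparison result
   must be saved next to the data for the following configuration (globally
   relevant state)                                                            */
static void config_part1(const int *in, int *data_mem, int *status_mem, int n)
{
    for (int k = 0; k < n; ++k) {
        data_mem[k]   = in[k];
        status_mem[k] = (in[k] > 0);    /* saved and assigned to the data     */
    }
}

static void config_part2(const int *data_mem, const int *status_mem,
                         int *out, int n)
{
    for (int k = 0; k < n; ++k)
        out[k] = status_mem[k] ? data_mem[k] * 2 : -data_mem[k];
}

int main(void)
{
    int in[4] = { 3, -1, 7, -5 }, d[4], s[4], out[4];
    config_part1(in, d, s, 4);          /* first configuration                */
    config_part2(d, s, out, 4);         /* reconfiguration, then continuation */
    for (int k = 0; k < 4; ++k) printf("%d/%d ", config_local(in[k]), out[k]);
    printf("\n");
    return 0;
}
```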
- operating systems use task schedulers to manage multiple tasks and thus provide multitasking.
- Task schedulers interrupt tasks at a certain point in time, start other tasks and, after these have been processed, return to continue the interrupted task. If it is ensured that a configuration - which here can correspond to the processing of a task - is only terminated after complete processing, i.e. when all data and states to be processed within this configuration cycle have been saved, locally relevant states can remain unsaved.
- This procedure, i.e. the complete processing of a configuration and the subsequent task switch, is the preferred method for operating reconfigurable processors and essentially corresponds to the sequence in a normal processor, which also finishes the instructions currently being processed before changing the task.
- if a particularly short reaction to a task-change request is required, for example in real-time applications, it can make sense to interrupt configurations before they are completely processed. If the task scheduler interrupts configurations before they have been fully processed, local states and/or data must be saved. This is also advantageous if the processing time of a configuration cannot be predicted, and, in connection with the known halting problem and the risk that a configuration (e.g. due to an error) does not terminate at all, in order to prevent the entire system from being deadlocked. Therefore, taking task changes into account, states that are necessary for a task change and a correct restart of the data processing are also to be regarded as relevant.
- the memory for results and possibly also the memory for the operands must be saved and restored at a later point in time, i.e. when the task is returned. This can be done comparable to the PUSH / POP commands and methods according to the prior art.
- the state of the data processing must also be saved, i.e. the pointer to the last operands that were completely processed. Special reference is made to PACT18.
- a particularly preferred variant for managing relevant data is provided by the context switch described below.
- for changing tasks and/or for executing and changing configurations, see, for example, patent application DE 102 06 653.1, which is fully incorporated for disclosure purposes.
- it may be necessary to store data or states that are typically not stored together with the working data, since they merely mark an end value, for example, and must therefore be secured for the subsequent configuration.
- the context switch which is preferably implemented according to the invention is carried out in such a way that a first configuration is removed and the data to be backed up remain in corresponding memories (REG) (memories, registers, counters, etc.).
- a second configuration can then be loaded, this connects the REG with one or more global memories in a suitable manner and in a defined sequence.
- the configuration can use address generators to access the global memory (s). It is therefore not necessary to have each individual memory location determined in advance by the compiler and / or to access REG configured as a memory.
- the contents of the REG are written into the global memory in a defined order, the respective addresses being specified by address generators.
- the address generator generates the addresses for the global memory (s) in such a way that the described memory areas (PUSHAREA) of the remote first configuration can be clearly assigned.
- the configuration corresponds to a PUSH of ordinary processors.
- a third configuration is started beforehand, which connects the REG of the first configuration with one another in a defined sequence.
- the configuration can in turn use address generators, for example, to access the global memory (s) and / or to access REGs configured as memories.
- An address generator preferably generates addresses in such a way that correct access to the PUSHAREA assigned to the first configuration takes place.
- the generated addresses and the configured sequence of the REG are such that the data of the REG are written from the memories into the REG in the original order.
- the configuration corresponds to a POP of ordinary processors.
- a context switch is preferably carried out in such a way that, by loading particular configurations which operate like the PUSH/POP of known processor architectures, the data to be backed up are exchanged with a global memory.
- This data exchange via global storage using push / pop exchange configurations is considered to be particularly relevant.
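- A minimal sketch of such PUSH/POP exchange configurations, assuming a small fixed register set and a two-entry global memory with one PUSHAREA per suspended configuration (all sizes and names invented for illustration):

```c
#include <stdio.h>

#define NUM_REG   4
#define AREA_SIZE NUM_REG

/* one PUSHAREA per suspended configuration in a global memory */
static int global_mem[2][AREA_SIZE];

/* "PUSH" configuration: reads the REG (registers, counters, memories) of the
   removed configuration in a defined order and writes them to its PUSHAREA
   via addresses produced by an address generator                            */
static void config_push(const int *reg, int task_id)
{
    for (int addr = 0; addr < NUM_REG; ++addr)     /* address generator */
        global_mem[task_id][addr] = reg[addr];
}

/* "POP" configuration: restores the REG contents in the original order */
static void config_pop(int *reg, int task_id)
{
    for (int addr = 0; addr < NUM_REG; ++addr)
        reg[addr] = global_mem[task_id][addr];
}

int main(void)
{
    int reg[NUM_REG]   = { 7, 42, 3, 99 };         /* state of configuration 0 */
    config_push(reg, 0);                           /* context switch out       */
    int other[NUM_REG] = { 0, 0, 0, 0 };           /* another task runs        */
    config_push(other, 1);
    config_pop(reg, 0);                            /* context switch back in   */
    printf("%d %d %d %d\n", reg[0], reg[1], reg[2], reg[3]);
    return 0;
}
```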
- the function should be illustrated in an example:
- a function adds two rows of numbers whose length was not known at translation time but only at runtime.
- a, b and x are at this time, in accordance with the invention, located in memories; i and possibly length, however, must be saved.
- the configuration example is terminated, the register contents are retained and a configuration push is started, which reads i and length from the registers and writes them to a memory.
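- In C terms the example might look roughly as follows (illustrative sketch; the variable names follow the description): a, b and x live in memories, while i and length form the state that the push configuration writes out.

```c
#include <stdio.h>

/* a, b and x reside in (internal) memories; i and length are the loop state
   that the push configuration would write to memory on an interruption      */
static void example(const int *a, const int *b, int *x, int length)
{
    for (int i = 0; i < length; ++i)
        x[i] = a[i] + b[i];
}

int main(void)
{
    int a[] = { 1, 2, 3 }, b[] = { 10, 20, 30 }, x[3];
    int length = 3;                     /* only known at runtime */
    example(a, b, x, length);
    printf("%d %d %d\n", x[0], x[1], x[2]);
    return 0;
}
```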
- Control structures are separated from algorithmic structures by the described translation process. For example, a loop breaks down into a body (WHILE) and an algorithmic structure (instructions).
- WHILE body
- instructions instructions
- the algorithmic structures can now preferably be optionally optimized by an additional tool connected after the separation.
- a downstream algebra software can optimize and minimize the programmed algorithms.
- Such tools are e.g. known as AXIOM, MARBLE, etc. By minimizing, a faster execution of the algorithm and / or a significantly reduced space requirement can be achieved.
- Line 4 may only be calculated if i is correctly defined, that is after line 1 has been processed.
- Line 2 may likewise only be processed after i has been correctly defined (i.e. after line 1 has been processed).
- Line 3 requires the results from line 2 (variable a) and can therefore only be calculated after it has been correctly defined. This results in a data dependency but no special conditions.
- VEC means that every expression separated by ';' is processed one after the other, whereby the expressions within the curly brackets can in principle be pipelined. Preferably, all calculations must be carried out and completed at the end of VEC{ } so that the data processing can continue after the VEC.
- Line 4 gives a simple vector:
- VEC{ j = i * i }
- PAR means that every expression enclosed in '{ .. }' can be processed at the same time. Preferably, all calculations must be carried out and completed at the end of PAR{ } so that the data processing can continue after the PAR.
- Example B shows a real state.
- line 6 can only be executed after the calculation of line 2 and line 3. Lines 4 and 6 are calculated alternatively, so the state of line 3 is relevant for the further data processing (relevant state).
- Conditional states can be expressed in a transformation by IF:
- Line 3 may only be executed after the loop is terminated. So there are relevant conditions for conditional jumps.
- a first transformation of the loop results in:
- Lines 3 and 4 can be calculated in parallel since line 4 does not depend on the result of line 3:
- Line 5 results in a vector with the generated PAR, since it is only allowed to jump back into the loop after the values have been fully calculated (there is a time dependency here).
- VEC{ PAR{ { a = a * i } { i++ } }; jump loop }
- Line 6 is again a vector with the condition that a as
- VEC{ { i > 100 } { jump exit } }
- VEC{ } and PAR{ } can be viewed as purely combinatorial networks.
- VEC and PAR are preferably designed as Petri nets in order to control the further processing after complete processing of the respective contents.
- treating VEC and PAR as purely combinatorial networks creates the need to save the state of the loop; in this case it actually becomes necessary to build a finite automaton.
- the REG{ } instruction saves variables in a register.
- the use of the combinatorial networks VEC and PAR in conjunction with the register REG thus creates a finite state machine that is constructed exactly according to the algorithm:
- VEC{ { i > 100 } { jump exit } }
- VPU technology provides input and/or output registers on the PAEs, and the correctness in time and the availability of data are ensured by an integrated handshake protocol (RDY/ACK).
- the requirement that VEC{ } or PAR{ } must have completed their internal data processing before they are left is thereby automatically fulfilled for all subsequently used variables (if the data processing had not ended, the subsequent calculation steps would have to wait for its termination and for the arrival of the data).
- the integrated registers also prevent oscillating feedback.
- VEC{ PAR{ { a = a * i } { i++ } }; jump loop }
- VEC{ PAR{ { a = a * i } { i++ } }; REG{ a, i }; jump loop }
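- The finite automaton that results from this combination can be simulated in plain C as a sketch (the loop bound is shortened to 10 here purely to keep the product small; this is not the patent's code). The point is that both PAR expressions read only the old register values, and the REG stage commits the new ones before the jump back.

```c
#include <stdio.h>

int main(void)
{
    /* REG{ a, i }: the register stage of the generated finite automaton */
    long long a = 1;
    int i = 1;

    for (;;) {                      /* loop:                             */
        if (i > 10)                 /* VEC{ { i > 10 } { jump exit } }   */
            break;
        long long a_next = a * i;   /* PAR{ { a = a * i } { i++ } }:     */
        int       i_next = i + 1;   /*   both expressions use only the   */
        a = a_next;                 /*   old register values, so they    */
        i = i_next;                 /*   can execute simultaneously      */
    }                               /* jump loop                         */

    printf("a = %lld, i = %d\n", a, i);
    return 0;
}
```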
- REG can be used within the combinatorial networks VEC and PAR. Strictly speaking, VEC and PAR then lose the property of being combinatorial networks. In abstract terms, however, REG can be viewed as a complex element (REG element) of a combinatorial network that has its own processing time; the processing of the subsequent elements is made dependent on the completion of the calculation of the REG element.
- Each sub-algorithm represents a configuration for the reconfigurable processor.
- the sub-algorithms are configured successively, that is to say sequentially, on the processor, the results of the previous configuration (s) serving as operands for the respective new configuration.
- the reconfiguration results in a finite state machine that processes and stores data at a time t and, at a time t+1, possibly after a reconfiguration, processes the stored data differently and stores it again if necessary. It is essential that t is not defined in the classic sense by clock cycles or instructions, but by configurations.
- the processor model presentation (PACT, October 2000, San Jose) is particularly referenced here.
- a configuration consists of a combinatorial network of VEC and / or PAR, the results of which are stored (REG) in order to be used in the next configuration:
- Configuration 1: VEC{ operands; ... }
- VEC, PAR and REG can be carried out on different levels of a compilation process. The most obvious at first is during a preprocessor run based on the source code as described in the previous examples. However, a specially adapted compiler is required for further compilation.
- compilers mostly optimize code automatically (e.g. in loops).
- An efficient decomposition of the code only makes sense after the optimization runs, especially if compilers (such as SUIF, Stanford University) are already optimizing the code for parallelization and / or vectorization.
- the method that is particularly preferred is the integration of the analyzes into the backend of a compiler.
- the backend translates an internal compiler data structure to the commands of a target processor.
- Pointer structures such as DAGs / GAGs, trees or 3-address codes are mostly used as compiler-internal data structures
- the method preferred according to the invention is based on the further processing of graphs, such as preferably trees.
- Data dependencies and possible parallelism in accordance with the method described above can easily be recognized automatically based on the structure within Trees.
- Known and established methods of graph analysis can be used for this purpose, for example.
- an algorithm can be examined for data dependencies, loops, jumps, etc. using appropriately adapted parsing methods.
- a method similar to evaluating expressions in compilers can be used.
- VPU processor architecture
- XPP X-programmable gate array
- mechanism 1 is the generally typical case to be used.
- Mechanism 2 is already very complex or cannot be implemented in most technologies and case 3 is only known from the applicant's VPU technology.
- the execution method to be selected depends on the complexity of the algorithm, the required data throughput (performance) and the exact design of the target processor (e.g. number of PAEs). Examples:
- Mechanism 1 creates a globally relevant state, since the complete configuration that follows depends on it.
- Mechanisms 2 and 3 only result in a locally relevant state, as this is no longer required beyond the calculation - which is fully implemented.
- the local or global relevance of states can also depend on the chosen mapping to the processor architecture.
- a state that is relevant beyond a configuration and thus beyond the combinatorial network of the finite machine representing a configuration (that is, is required by subsequent finite machines) can in principle be regarded as global.
- attention should again be drawn to the somewhat loose use of the term "combinatorial network" here.
- a processor model for reconfigurable processors is created, which includes all essential commands:
- Arithmetic / logical commands are mapped directly into the combinatorial network.
- Jumps (Jump / Call) are either rolled out directly into the combinatorial network or realized through reconfiguration.
- Condition and control flow commands (if, etc) are either completely resolved and processed in the combinatorial network or forwarded to a higher-level configuration unit, which then carries out a reconfiguration according to the status that has arisen.
- Load/store operations are preferably mapped into separate configurations and implemented by address generators, similar to known DMAs, which write internal memories (REG{ }) to external memories or load them from external memories and/or peripherals by means of address generators. However, they can also be configured and operate together with the data-processing configuration.
- Register move operations are implemented in the combinatorial network by buses between the internal memories (REG{ }).
- Push/pop operations are implemented by separate configurations that, if necessary, write certain internal registers of the combinatorial network and/or the internal memories (REG{ }) to external memories, or read them from external memories, by means of address generators; they are preferably executed before or after the actual data-processing configurations.
- Figure 1a shows the structure of a conventional finite state machine, in which a combinatorial network (0101) is linked to a register (0102). Data can be sent directly to 0101 (0103) and 0102 (0104). Feedback (0105) of the register to the combinatorial network makes it possible to process a state as a function of the previous states. The processing results are represented by 0106.
- Figure 1b shows a representation of the finite state machine by a reconfigurable architecture according to PACT01 and PACT04 (PACT04 Fig. 12-15).
- the combinatorial network from FIG. 1a (0101) is replaced by an arrangement of PAEs 0107 (0101b).
- the register (0102) is implemented by a memory (0102b) which can store data across several cycles.
- the feedback according to 0105 is carried out by 0105b.
- the inputs (0103b or 0104b) are equivalent to 0103 or 0104. Direct access to 0102b can be realized via a bus through the array 0101b.
- Output 0106b is again equivalent to 0106.
- FIG. 2 shows the mapping of a finite automaton onto a reconfigurable architecture.
- 0201 (x) represent the combinatorial network (which can be designed as PAEs according to FIG. 1b).
- An address generator (0204, 0205) is assigned to each of the memories.
- the operand and result memories (0202, 0203) are physically or virtually coupled to one another in such a way that, for example, the results of one function or an operation of another can serve as operands and / or both results and newly added operands of a function of another as operands can serve.
- Such a coupling can be established, for example, by bus systems or by a (re) configuration by which the function and networking of the memories with the 0201 is reconfigured.
- Figure 3 shows various aspects of dealing with variables.
- 0306 and 0304 must be interchanged to get a complete finite state machine.
- 0305 represents the address generators for memories 0304 and 0306.
- Figure 4 shows implementations of loops.
- the hatched modules can be generated by macros (0420, 0421). 0421 can also be inserted by analyzing the graph for undelayed feedback.
- 0402 is a multiplexer which initially feeds the start value of x1 (0403) to 0401 and then causes the feedback (0404a, 0404b) for each iteration.
- a register (cf. REG{ }) (0405) is inserted in the feedback to prevent an undelayed and thus uncontrolled feedback of the output from 0401 to its input.
- 0405 is clocked with the working clock of the VPU and thus determines the number of iterations per unit of time. The respective counter value would be available at 0404a or 0404b.
- the loop does not terminate.
- the multiplexer 0402 realizes a macro that was created from the loop construct; the macro is instantiated by the translation of WHILE.
- Register 0405 is either also part of the macro or is inserted, according to a prior-art graph analysis, exactly where there is undelayed feedback, in order to suppress the tendency to oscillate.
- a circuit is shown that checks the validity of the result (0410) and only forwards the signal from 0404a to the subsequent functions (0411) when the abort criterion of the loop has been reached.
- the termination criterion is determined by the comparison x1 < 10 (comparison stage 0412).
- the resulting status flag (0413) is passed to the multiplexer 0402 for controlling the loop and to the functions 0411 for controlling the forwarding of the result.
- the status flag 0413 can be implemented, for example, by triggers according to DE 197 04 728.9.
- the status flag means 0413 can also be sent to a CT, which then recognizes the termination of the loop and carries out a reconfiguration.
- the basic function corresponds essentially to FIG. 4b, which is why the references have been adopted.
- Function block 0501 calculates the multiplication.
- the FOR loop is replaced by a further loop implemented according to 4b and is only indicated by block 0503.
- Block 0503 supplies the status of the comparison with the termination criterion. The status is used directly to control the iteration, which means that the circuitry 0412 (represented by 0502) is largely omitted.
- Figure 6 shows the execution of a WHILE loop according to Figure 4b over several configurations.
- the state of the loop (0601) is a relevant state here, since this has a significant influence on the function in the following configurations.
- the calculation spans four configurations (0602, 0603, 0604, 0605). The data are stored in memories (cf. REG{ }) (0606, 0607); 0607 also replaces 0405.
- the fill level of the memories can be used as a reconfiguration criterion, indicated via 0606, 0607: memory full / empty and / or 0601, which indicates the termination of the loop.
- the fill level generates the memory triggers (cf. PACT01, PACT05, PACT08, PACT10), which are sent to the configuration unit and trigger a reconfiguration.
- the state of the loop (0601) can also be sent to the CT.
- the CT can then configure the following algorithms, or, if necessary, initially the remaining parts of the loop (0603,
- FIG. 6 shows potential limits of the parallelizability.
- the loop can be carried out in blocks, i.e. can be calculated by filling the memory 0606/0607. This ensures a high degree of parallelism.
- the analysis of the calculation times can either take place in the compiler at translation time in accordance with the following section and/or be measured empirically at runtime in order to bring about a subsequent optimization, which results in a learning, in particular self-learning, compiler. Analysis and parallelization methods are important for the invention.
- Functions to be mapped are represented by graphs (cf. PACT13; DE 199 26 538 .0), whereby an application can be composed of any number of different functions.
- the graphs are examined for their parallelism, whereby all methods of optimization can be used in advance.
- ILP expresses which commands can be executed at the same time (see PAR ⁇ ). Such an analysis is easily possible based on the dependency of nodes on a graph. Corresponding methods are sufficiently known per se in the prior art and in mathematics; for example, reference should be made to VLIW compilers and synthesis tools.
- an empirical analysis can also be carried out at runtime.
- from PACT10 and PACT17, methods are known which allow statistics on program behavior to be generated at runtime. For example, maximum parallelizability can initially be assumed.
- the individual paths return messages to a statistical unit (e.g. implemented in a CT or another stage, see PACT10, PACT17, but units according to PACT04 can also be used in principle) about each run.
- This type of path usage notification is not mandatory, but is advantageous.
- the value PAR(p) used in the following for clarification indicates the parallelism achievable at instruction level, i.e. how much ILP can be reached at a certain stage (p) within the data flow graph transformed from the function (FIG. 7a).
- Vector parallelism is also important (cf. VEC ⁇ ).
- Vector parallelism can be used when larger amounts of data have to be processed.
- linear sequences of operations can be vectorized, i.e. all operations can process data simultaneously, typically each separate operation processing a separate data word.
- the graph of a function can be expressed by a Petri net.
- Petri nets have the property that the results are passed on by nodes in a controlled manner, which means that loops can be modeled, for example.
- the data throughput is determined by the feedback of the result in a loop. Examples:
- the result determines the termination of the loop but is not included in the calculation of the results: feedback is not necessary. If, where appropriate, wrong (too many) values run into the loop, the output of the results can be interrupted as soon as the end condition at the end of the loop is reached.
- VEC used in the following for clarification can illustrate the degree of vectorizability of a function.
- VEC shows how many data words can be processed simultaneously within a set of operations.
- VEC 1 (FIG. 7b).
- VEC can be calculated for an entire function and / or for partial sections of a function. Both variants can be advantageous for the compiler according to the invention, as is generally advantageous for determining and evaluating VEC.
- PAR(p) is advantageously determined for every line of a graph.
- a line of a graph is defined by executing it within a clock unit. The number of operations depends on the implementation of the respective VPU.
- if PAR(p) corresponds to the number of nodes in line p, all nodes can be executed in parallel. If PAR(p) is smaller, certain nodes are only executed alternatively.
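- As a hedged illustration of how such a per-line value could be derived from a dependency graph, the following C sketch assigns each node of a small DAG (loosely modelled on FIG. 7a; the exact edges are assumed here, not taken from the patent) to a line and counts the nodes per line, i.e. the upper bound on ILP including alternative nodes:

```c
#include <stdio.h>

#define NODES 12

/* dependency graph modelled abstractly: dep[n] lists the predecessors of
   node n (i1..i12 are indices 0..11); the edges are illustrative only     */
static const int dep[NODES][3] = {
    /* i1  */ {-1},       /* i2  */ {0, -1},    /* i3  */ {0, -1},
    /* i4  */ {1, -1},    /* i5  */ {1, -1},    /* i6  */ {2, -1},
    /* i7  */ {2, -1},    /* i8  */ {2, -1},    /* i9  */ {3, 4, -1},
    /* i10 */ {5, 6, -1}, /* i11 */ {7, -1},    /* i12 */ {8, 9, 10}
};
static const int ndep[NODES] = { 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 3 };

int main(void)
{
    int level[NODES], par[NODES] = { 0 };

    for (int n = 0; n < NODES; ++n) {          /* nodes are in topological order */
        int lv = 0;
        for (int d = 0; d < ndep[n]; ++d)
            if (level[dep[n][d]] + 1 > lv) lv = level[dep[n][d]] + 1;
        level[n] = lv;
        par[lv]++;                             /* count the nodes sharing line p  */
    }
    for (int p = 0; p < NODES && par[p]; ++p)
        printf("PAR(%d) = %d\n", p, par[p]);
    return 0;
}
```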
- the alternative versions of a node are summarized in a PAE.
- a selection device enables the activation of the alternative corresponding to the status of the data processing at runtime, as described for example in PACT08.
- sequencer structures for mapping reentrant code can be generated.
- the synchronizations required for this can be carried out, for example, using the TimeStamp method described in PACT18 or preferably using the trigger method described in PACT08.
- if sequencers or sequential parts are mapped onto a PA, it is preferred for power-consumption reasons to coordinate the performance of the individual sequencers. This can particularly preferably be done in such a way that the operating frequencies of the sequencers are matched to one another. Methods are known from PACT25 and PACT18, for example, which allow individual clocking of individual PAEs or PAE groups. The frequency of a sequencer can be determined on the basis of the number of cycles it typically needs to process the function assigned to it.
- for example, if a sequencer needs 5 clock cycles to process its function while the rest of the system needs exactly one clock cycle to process its assigned tasks, its clock should be 5 times higher than the clock of the rest of the system. With a large number of sequencers, different clock rates are possible; clock multiplication and/or clock division can be provided.
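- A trivial sketch of this clock matching (illustrative only; the function name and rounding rule are assumptions, not the patent's method):

```c
#include <stdio.h>

/* a sequencer that needs n_seq cycles per function, embedded in surroundings
   that need n_env cycles for their share, is clocked faster by this factor  */
static unsigned clock_multiplier(unsigned n_seq, unsigned n_env)
{
    return (n_seq + n_env - 1) / n_env;      /* rounded up */
}

int main(void)
{
    printf("%u\n", clock_multiplier(5, 1));  /* the 5-cycle example: factor 5 */
    return 0;
}
```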
- some VPUs offer the option of differential reconfiguration. This can be used if only relatively few changes within the arrangement of PAEs are necessary for a reconfiguration. In other words, only the changes in a configuration compared to the current configuration are reconfigured. In this case, the partitioning can be such that the (differential) configuration following a configuration only contains the necessary reconfiguration data and does not represent a complete configuration.
- the compiler of the present invention is preferably designed to recognize and support this.
- the reconfiguration can be scheduled by the status that reports completion of the function(s) to a loading unit (CT), which in turn selects and configures the next configuration or partial configuration based on the incoming status.
- the scheduling can support the possibility of preloading configurations during the runtime of another configuration.
- Several configurations can possibly also be preloaded speculatively, i.e. without it being certain that the configurations are needed at all. This is particularly preferred if, for longer data streams that can be processed without reconfiguration, the CT is at least largely idle and, in particular, not or only slightly burdened by other tasks.
- the local sequencers can also be controlled by the status of their data processing, as is known, for example, from DE 196 51 075.9-53, DE 196 54 846.2-53, DE 199 26 538.0.
- Another dependent or independent status can be reported to the CT (see, for example, PACT04, LLBACK).
- FIG. 8a shows the mapping of the graph according to FIG. 7a onto a group of PAEs with the maximum achievable parallelism. All operations (instructions i1-i12) are mapped onto individual PAEs.
- FIG. 8b shows the same graph, for example with maximum usable vectorizability.
- a status signal for each data word in each stage selects the operation to be carried out in the respective PAE.
- the PAEs are networked as a pipeline (vector), and each PAE carries out an operation on different data words in each cycle.
- PAE1 calculates data and passes it on to PAE2. Together with the data, it passes on a status signal that indicates whether i1 or i2 should be executed.
- PAE2 further calculates the data from PAE1.
- the operation to be performed is based on the incoming status signal
- PAE2 forwards a status signal to PAE3, which indicates whether (i4 v i5) v (i6 v i7 v i8) should be executed.
- PAE3 further calculates the data from PAE2.
- the operation (i4 v i5) v (i6 v i7 v i8) is selected and calculated in accordance with the incoming status signal. After the calculation, PAE3 passes a status signal to PAE4, which indicates whether i9 v i10 v i11 should be carried out.
- PAE4 further calculates the data from PAE3.
- the operation i9 v i10 v i11 to be carried out is selected and calculated in accordance with the incoming status signal.
- PAE5 further calculates the data from PAE4.
- FIG. 8c again shows the same graph.
- PAR (p) is high, which means that a large number of operations can be carried out simultaneously within one line.
- the PAEs are networked in such a way that they can exchange any data with each other.
- the individual PAEs only carry out operations if there is an ILP in the corresponding cycle, otherwise they behave neutrally (NOP), whereby clocking down and / or a clock and / or current shutdown can take place to minimize the power loss.
- PAE2 works in the first cycle and passes the data on to PAE2 and PAE3.
- PAE2 and PAE3 work in parallel and pass on their data to PAE1, PAE2, PAE3, PAE4, PAE5.
- PAE1, PAE2, PAE3, PAE4, PAE5 work and pass the data on to PAE2, PAE3, PAE5.
- PAE2, PAE3, PAE5 work and pass the data on to PAE2. Only PAE2 works in the fifth cycle.
- the function therefore requires 5 cycles for the calculation.
- the corresponding sequencer should therefore work at five times the clock rate relative to its surroundings in order to achieve a corresponding performance.
- PACT02 Figures 19, 20 and 21
- PACT04 and PACT10, 13 also describe generally usable but more complex methods. Other methods and / or hardware can be used.
- FIG. 8d shows the graph according to FIG. 7a in the event that there is no usable parallelism. To calculate a data word, each stage must be run through one after the other. Within the stages, only one of the branches is processed at a time.
- the corresponding sequencer should therefore work at five times the clock rate relative to its surroundings in order to achieve a corresponding performance.
- Such a function can be mapped, for example, similar to FIG. 8c, using a simple sequencer according to PACT02 (FIGS. 19, 20 and 21).
- PACT04 and PACT10, 13 also describe generally usable but more complex methods.
- FIG. 9a shows the same function in which the paths (i2 → (i4 v i5) → i9) and (i3 → (i6 v i7 v i8) → (i9 v i10)) can be carried out in parallel.
- (i4 v i5), (i6 v i7 v i8), (i9 v i10) are alternatives.
- the function can still be vectorized.
- a pipeline can thus be built up, in which the respective function to be carried out is determined for 3 PAEs (PAE4, PAE5, PAE7) on the basis of status signals.
- FIG. 9b shows a similar example in which vectorization is not possible.
- the paths (i1 → i2 → (i4 v i5) → i9 → i12) and (i3 → (i6 v i7 v i8) → (i10 v i11)) are parallel.
- the PAEs are synchronized with one another by status signals, which are preferably generated by PAE1, since it calculates the start (i1) and the end (i12) of the function.
- SMP symmetrically parallel processor model
- the individual registers that can be selected by the triggers are basically independent and therefore allow independent configuration, especially in the background. Jumps within the registers are not possible, the selection is made exclusively via the trigger vectors.
- An essential factor for evaluating the efficiency of PAR and VEC is the type of data processed by the respective structure. For example, it is worth rolling out, i.e. pipelining and/or parallelizing, a structure that processes a large amount of data, as is the case with video data or telecom data, for example. Structures that process little data (e.g. keyboard input, mouse, etc.) are not worth rolling out; on the contrary, they would only block the resources of other algorithms.
- the data type (arrays and streams, for example, should rather be rolled out than individual characters, due to the large amount of data).
- the type of source and / or destination (keyboard and mouse, for example, have a data rate that is too low to be rolled out efficiently, whereas, for example, the data rate for network and / or video sources or destinations is significantly higher).
- Irrelevant state: a state that is irrelevant for the actual algorithm and is also not described in the algorithm, but which is required by the executing hardware depending on the implementation.
Abstract
Description
Claims
Priority Applications (26)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/486,771 US7996827B2 (en) | 2001-08-16 | 2002-08-16 | Method for the translation of programs for reconfigurable architectures |
EP02774585A EP1493084A2 (de) | 2001-08-16 | 2002-08-16 | Verfahren zum übersetzen von programmen für rekonfigurierbare architekturen |
CA002458199A CA2458199A1 (en) | 2001-08-16 | 2002-08-16 | Method for the translation of programs for reconfigurable architectures |
AU2002340879A AU2002340879A1 (en) | 2001-08-16 | 2002-08-16 | Method for the translation of programs for reconfigurable architectures |
JP2003521938A JP2005508029A (ja) | 2001-08-16 | 2002-08-16 | リコンフィギュアラブルアーキテクチャのためのプログラム変換方法 |
AU2003223892A AU2003223892A1 (en) | 2002-03-21 | 2003-03-21 | Method and device for data processing |
EP03720231A EP1518186A2 (de) | 2002-03-21 | 2003-03-21 | Verfahren und vorrichtung zur datenverarbeitung |
US10/508,559 US20060075211A1 (en) | 2002-03-21 | 2003-03-21 | Method and device for data processing |
PCT/DE2003/000942 WO2003081454A2 (de) | 2002-03-21 | 2003-03-21 | Verfahren und vorrichtung zur datenverarbeitung |
AU2003286131A AU2003286131A1 (en) | 2002-08-07 | 2003-07-23 | Method and device for processing data |
EP03776856.1A EP1537501B1 (de) | 2002-08-07 | 2003-07-23 | Verfahren und vorrichtung zur datenverarbeitung |
PCT/EP2003/008081 WO2004021176A2 (de) | 2002-08-07 | 2003-07-23 | Verfahren und vorrichtung zur datenverarbeitung |
US10/523,764 US8156284B2 (en) | 2002-08-07 | 2003-07-24 | Data processing method and device |
PCT/EP2003/008080 WO2004015568A2 (en) | 2002-08-07 | 2003-07-24 | Data processing method and device |
AU2003260323A AU2003260323A1 (en) | 2002-08-07 | 2003-07-24 | Data processing method and device |
JP2005506110A JP2005535055A (ja) | 2002-08-07 | 2003-07-24 | データ処理方法およびデータ処理装置 |
EP03784053A EP1535190B1 (de) | 2002-08-07 | 2003-07-24 | Verfahren zum gleichzeitigen Betreiben eines sequenziellen Prozessors und eines rekonfigurierbaren Arrays |
US12/570,943 US8914590B2 (en) | 2002-08-07 | 2009-09-30 | Data processing method and device |
US12/621,860 US8281265B2 (en) | 2002-08-07 | 2009-11-19 | Method and device for processing data |
US12/729,090 US20100174868A1 (en) | 2002-03-21 | 2010-03-22 | Processor device having a sequential data processing unit and an arrangement of data processing elements |
US12/729,932 US20110161977A1 (en) | 2002-03-21 | 2010-03-23 | Method and device for data processing |
US12/947,167 US20110238948A1 (en) | 2002-08-07 | 2010-11-16 | Method and device for coupling a data processing unit and a data processing array |
US13/177,820 US8869121B2 (en) | 2001-08-16 | 2011-07-07 | Method for the translation of programs for reconfigurable architectures |
US14/162,704 US20140143509A1 (en) | 2002-03-21 | 2014-01-23 | Method and device for data processing |
US14/540,782 US20150074352A1 (en) | 2002-03-21 | 2014-11-13 | Multiprocessor Having Segmented Cache Memory |
US14/923,702 US10579584B2 (en) | 2002-03-21 | 2015-10-27 | Integrated data processing core and array data processor and method for processing algorithms |
Applications Claiming Priority (18)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10139170.6 | 2001-08-16 | ||
DE10139170 | 2001-08-16 | ||
DE10142903 | 2001-09-03 | ||
DE10142903.7 | 2001-09-03 | ||
DE10144732.9 | 2001-09-11 | ||
DE10144732 | 2001-09-11 | ||
DE10145792 | 2001-09-17 | ||
DE10145792.8 | 2001-09-17 | ||
US09/967,847 US7210129B2 (en) | 2001-08-16 | 2001-09-28 | Method for translating programs for reconfigurable architectures |
US09/967,847 | 2001-09-28 | ||
DE10154260.7 | 2001-11-05 | ||
DE10154260 | 2001-11-05 | ||
DE10207225.6 | 2002-02-21 | ||
DE10207225 | 2002-02-21 | ||
PCT/EP2002/002398 WO2002071248A2 (de) | 2001-03-05 | 2002-03-05 | Verfahren und vorrichtungen zur datenbe- und/oder verarbeitung |
EPPCT/EP02/02398 | 2002-03-05 | ||
EPPCT/EP02/09131 | 2002-08-15 | ||
EP0209131 | 2002-08-15 |
Related Child Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10486771 A-371-Of-International | 2002-08-16 | ||
US13/177,820 Continuation US8869121B2 (en) | 2001-08-16 | 2011-07-07 | Method for the translation of programs for reconfigurable architectures |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2003017095A2 true WO2003017095A2 (de) | 2003-02-27 |
WO2003017095A3 WO2003017095A3 (de) | 2004-10-28 |
Family ID=41210636
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2002/010065 WO2003017095A2 (de) | 2001-08-16 | 2002-08-16 | Verfahren zum übersetzen von programmen für rekonfigurierbare architekturen |
Country Status (4)
Country | Link |
---|---|
JP (1) | JP2005508029A (de) |
AU (1) | AU2002340879A1 (de) |
CA (1) | CA2458199A1 (de) |
WO (1) | WO2003017095A2 (de) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AT501479B1 (de) * | 2003-12-17 | 2006-09-15 | On Demand Informationstechnolo | Digitale rechnereinrichtung |
EP2043000A2 (de) | 2002-02-18 | 2009-04-01 | PACT XPP Technologies AG | Bussysteme und Rekonfigurationsverfahren |
US9037807B2 (en) | 2001-03-05 | 2015-05-19 | Pact Xpp Technologies Ag | Processor arrangement on a chip including data processing, memory, and interface elements |
US9075605B2 (en) | 2001-03-05 | 2015-07-07 | Pact Xpp Technologies Ag | Methods and devices for treating and processing data |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6542998B1 (en) | 1997-02-08 | 2003-04-01 | Pact Gmbh | Method of self-synchronization of configurable elements of a programmable module |
US7996827B2 (en) | 2001-08-16 | 2011-08-09 | Martin Vorbach | Method for the translation of programs for reconfigurable architectures |
US8914590B2 (en) | 2002-08-07 | 2014-12-16 | Pact Xpp Technologies Ag | Data processing method and device |
DE102005005073B4 (de) * | 2004-02-13 | 2009-05-07 | Siemens Ag | Rechnereinrichtung mit rekonfigurierbarer Architektur zur parallelen Berechnung beliebiger Algorithmen |
JP5141151B2 (ja) * | 2007-09-20 | 2013-02-13 | 富士通セミコンダクター株式会社 | 動的再構成回路およびループ処理制御方法 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1999000731A1 (en) * | 1997-06-27 | 1999-01-07 | Chameleon Systems, Inc. | Method for compiling high level programming languages |
US6058469A (en) * | 1995-04-17 | 2000-05-02 | Ricoh Corporation | System and method for dynamically reconfigurable computing using a processing unit having changeable internal hardware organization |
2002
- 2002-08-16 CA CA002458199A patent/CA2458199A1/en not_active Abandoned
- 2002-08-16 WO PCT/EP2002/010065 patent/WO2003017095A2/de active Application Filing
- 2002-08-16 JP JP2003521938A patent/JP2005508029A/ja active Pending
- 2002-08-16 AU AU2002340879A patent/AU2002340879A1/en not_active Abandoned
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6058469A (en) * | 1995-04-17 | 2000-05-02 | Ricoh Corporation | System and method for dynamically reconfigurable computing using a processing unit having changeable internal hardware organization |
WO1999000731A1 (en) * | 1997-06-27 | 1999-01-07 | Chameleon Systems, Inc. | Method for compiling high level programming languages |
Non-Patent Citations (5)
Title |
---|
ATHANAS P M ET AL: "An adaptive hardware machine architecture and compiler for dynamic processor reconfiguration" PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN - VLSI IN COMPUTERS AND PROCESSORS. CAMBRIDGE, MA., OCT. 14 - 16, 1991, LOS ALAMITOS, IEEE. COMP. SOC. PRESS, US, October 14, 1991 (1991-10-14), pages 397-400, XP010025243 ISBN: 0-8186-2270-9 * |
BAUMGARTE V ET AL: "PACT XPP - A Self-Reconfigurable Data Processing Architecture", June 25, 2001 (2001-06-25), XP002256066 * |
CARDOSO J M P ET AL: "Macro-based hardware compilation of Java(TM) bytecodes into a dynamic reconfigurable computing system" FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, 1999. FCCM '99. PROCEEDINGS. SEVENTH ANNUAL IEEE SYMPOSIUM ON NAPA VALLEY, CA, USA 21-23 APRIL 1999, LOS ALAMITOS, CA, USA, IEEE COMPUT. SOC, US, April 21, 1999 (1999-04-21), pages 2-11, XP010359161 ISBN: 0-7695-0375-6 * |
WEINHARDT M: "ÜBERSETZUNGSMETHODEN FÜR STRUKTURPROGRAMMIERBARE RECHNER" DISSERTATION UNIVERSITY KARLSRUHE, XX, XX, July 1997 (1997-07), pages 1-134, XP002254220 * |
YE Z A ET AL: "A C COMPILER FOR A PROCESSOR WITH A RECONFIGURABLE FUNCTIONAL UNIT" FPGA'00. ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD PROGRAMMABLE GATE ARRAYS. MONTEREY, CA, FEB. 9 - 11, 2000, ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD PROGRAMMABLE GATE ARRAYS, NEW YORK, NY : ACM, US, Vol. CONF. 8, February 9, 2000 (2000-02-09), pages 95-100, XP000970736 ISBN: 1-58113-193-3 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9037807B2 (en) | 2001-03-05 | 2015-05-19 | Pact Xpp Technologies Ag | Processor arrangement on a chip including data processing, memory, and interface elements |
US9075605B2 (en) | 2001-03-05 | 2015-07-07 | Pact Xpp Technologies Ag | Methods and devices for treating and processing data |
EP2043000A2 (de) | 2002-02-18 | 2009-04-01 | PACT XPP Technologies AG | Bussysteme und Rekonfigurationsverfahren |
AT501479B1 (de) * | 2003-12-17 | 2006-09-15 | On Demand Informationstechnolo | Digitale rechnereinrichtung |
AT501479B8 (de) * | 2003-12-17 | 2007-02-15 | On Demand Informationstechnolo | Digitale rechnereinrichtung |
Also Published As
Publication number | Publication date |
---|---|
CA2458199A1 (en) | 2003-02-27 |
JP2005508029A (ja) | 2005-03-24 |
WO2003017095A3 (de) | 2004-10-28 |
AU2002340879A1 (en) | 2003-03-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2224330B1 (de) | Verfahren und gerät zum partitionieren von grossen rechnerprogrammen | |
EP1228440B1 (de) | Sequenz-partitionierung auf zellstrukturen | |
DE102018005181B4 (de) | Prozessor für einen konfigurierbaren, räumlichen beschleuniger mit leistungs-, richtigkeits- und energiereduktionsmerkmalen | |
US7996827B2 (en) | Method for the translation of programs for reconfigurable architectures | |
DE69826700T2 (de) | Kompilerorientiertes gerät zur parallelkompilation, simulation und ausführung von rechnerprogrammen und hardwaremodellen | |
DE102018005172A1 (de) | Prozessoren, verfahren und systeme mit einem konfigurierbaren räumlichen beschleuniger | |
DE102018126650A1 (de) | Einrichtung, verfahren und systeme für datenspeicherkonsistenz in einem konfigurierbaren räumlichen beschleuniger | |
DE102018130441A1 (de) | Einrichtung, Verfahren und Systeme mit konfigurierbarem räumlichem Beschleuniger | |
DE102018006735A1 (de) | Prozessoren und Verfahren für konfigurierbares Clock-Gating in einem räumlichen Array | |
DE102018006889A1 (de) | Prozessoren und Verfahren für bevorzugte Auslegung in einem räumlichen Array | |
DE102018005216A1 (de) | Prozessoren, Verfahren und Systeme für einen konfigurierbaren, räumlichen Beschleuniger mit Transaktions- und Wiederholungsmerkmalen | |
DE102018005169A1 (de) | Prozessoren und verfahren mit konfigurierbaren netzwerkbasierten datenflussoperatorschaltungen | |
DE10028397A1 (de) | Registrierverfahren | |
US20070271381A1 (en) | Managing computing resources in graph-based computations | |
DE19815865A1 (de) | Kompiliersystem und Verfahren zum rekonfigurierbaren Rechnen | |
DE102005021749A1 (de) | Verfahren und Vorrichtung zur programmgesteuerten Informationsverarbeitung | |
DE202008017916U1 (de) | Virtuelle Architektur und virtueller Befehlssatz für die Berechnung paralleler Befehlsfolgen | |
DE19926538A1 (de) | Hardware und Betriebsverfahren | |
EP1518186A2 (de) | Verfahren und vorrichtung zur datenverarbeitung | |
WO2003017095A2 (de) | Verfahren zum übersetzen von programmen für rekonfigurierbare architekturen | |
WO2000017772A2 (de) | Konfigurierbarer hardware-block | |
EP1483682A2 (de) | Reconfigurierbarer prozessor | |
EP1493084A2 (de) | Verfahren zum übersetzen von programmen für rekonfigurierbare architekturen | |
WO2003071418A2 (de) | Übersetzungsverfahren | |
Luk | Customising processors: design-time and run-time opportunities |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MK MN MW MX MZ NO NZ OM PH PT RO RU SD SE SG SI SK SL TJ TM TN TR TZ UA UG US UZ VC VN YU ZA ZM
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZM ZW AM AZ BY KG KZ RU TJ TM AT BE BG CH CY CZ DK EE ES FI FR GB GR IE IT LU MC PT SE SK TR BF BJ CF CG CI GA GN GQ GW ML MR NE SN TD TG AE AG AL AM AT AZ BA BB BG BR BY BZ CA CH CN CO CR CZ DE DK DM DZ EC EE ES FI GB GD GE GM HR HU ID IL IN IS JP KE KG KP KR KZ LK LR LS LT LU LV MA MD MG MK MN MX MZ NO NZ OM PH PL PT RO RU SD SE SI SK SL TJ TM TN TR TT TZ UA UG UZ VC YU ZA ZM ZW GH GM KE LS MW MZ SL SZ TZ UG ZM ZW
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
REEP | Request for entry into the european phase |
Ref document number: 2002774585 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2002774585 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2458199 Country of ref document: CA Ref document number: 2003521938 Country of ref document: JP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 10486771 Country of ref document: US |
|
WWP | Wipo information: published in national office |
Ref document number: 2002774585 Country of ref document: EP |