EP3555760A1

EP3555760A1 - Parallel processing on demand using partially dynamically reconfigurable FPGA

Info

Publication number: EP3555760A1
Application number: EP17818135.0A
Authority: EP
Inventors: Jean-Luc Dekeyser
Original assignee: Centre National de la Recherche Scientifique CNRS; Universite Lille 2 Droit et Sante
Current assignee: Centre National de la Recherche Scientifique CNRS; Universite Lille 2 Droit et Sante
Priority date: 2016-12-19
Filing date: 2017-12-19
Publication date: 2019-10-23
Also published as: WO2018114957A1

Abstract

Method for performing parallel processing within at least one FPGA chip which comprises one dynamically reconfigurable master unit and at least two dynamically reconfigurable slave units, each slave unit comprising a pool of interconnected slave elements (slave 0, slave 1, slave 2, slave 3), each slave element comprising a plurality of slave IP cores, one slave softcore processor integrating them and an instruction memory, said one master unit comprising one reconfiguration IP core (14), at least one master IP core, one master softcore processor (1) integrating said at least one master IP core and a memory (3) containing a master program for execution by said at least one master IP core, wherein at runtime of the master program, the reconfiguration IP core (14) executes at least one instruction causing the reconfiguration of the pool of slave elements (slave 0, slave 1, slave 2, slave 3) by dynamically varying their number and/or interconnections according to a given configuration defined in said master program and then by storing into each of the instruction memories (102, 202, 302, 402) of the slave elements (slave 0, slave 1, slave 2, slave 3) a corresponding slave program for parallel execution by the slave IP cores.

Description

PARALLEL PROCESSING ON DEMAND USING PARTIALLY DYNAMICALLY RECONFIGURABLE FPGA

The present invention relates to methods for performing parallel processing on demand through dynamic reconfiguration using a dynamically reconfigurable softcore processor for FPGA chips.

High performance reconfigurable systems are being used today in adaptive and intensive application domains. The efficient packing of high logic density, the inherent parallelism of the hardware, and the essential feature of dynamic partial reconfiguration (DPR) make Field Programmable Gate Arrays (FPGA) a highly attractive solution, as explained in the article of Tredennick et al. "The inevitability of reconfigurable systems", Queue, vol. 1, n° 7, pages 34-43, October 2003. Indeed, FPGAs have the benefits of high speed and adaptability to the application constraints with an improved performance-per-watt compared to General Purpose Processors (GPP), as presented for example in the article of Asano et al. "Performance comparison of FPGA, Graphics Processing Unit (GPU) and Central Processing Unit (CPU) in Image Processing", IEEE International Conference on Field Programmable Logic and Applications, 2009.

FPGAs offer inexpensive and fast programmable hardware on some of the most advanced fabrication processes. Methods using reconfiguration technologies of FPGA chips are for example known from WO 2012/100500, US 5 931 959, CN 103677917, CN 103677837, CN 104008024, CN 102819818, WO 2014/120157 and US 7 509 617.

In 2D technology, such high-density devices are reconfigured in a linear and sequential way, which renders the DPR feature of FPGAs inefficient, especially for architectures featuring massively parallel capabilities.

This challenge is tackled by the emergence of 3D-Stacked Integrated Circuits (3D SICs), described in the article of Van Olmen et al. "3D SIC demonstration using a through silicon via first approach", IEEE International Electron Devices Meeting 2008, pages 1-4, that leads to a significant shift in the design of FPGA circuits and to a vast increase of their integration capabilities. As stated in the article of Duranton et al. "The HIPEAC vision for advanced computing in horizon 2020", HIPEAC network of excellence 2013, 3D stacking enables higher levels of integration and reduced costs for off-chip communications, the overall complexity being easily managed due to the separation in different and independently designed matrices. Due to their structural sophistication, compared to 2D FPGAs, monolithic 3D FPGAs allow not only working with parallel architectures at very high speeds but also working with several logic blocks in parallel, as presented in the article of Lin et al. "Performance benefits of monolithically stacked 3D FPGA", Proceedings of the 2006 International Symposium on FPGA, pages 113-122 and the one of Cevrero et al. "3D configuration caching for 2D FPGAs" in the proceedings of the ACM/SIGDA International Symposium on FPGA 2009, pages 286-290.

As shown in Figure 1A, a 2D FPGA chip 100 consists in the duplication of basic building blocks 101, called "tiles", interconnected through regular queues. Each tile is composed of three types of resources, two configurable blocks (CB), a logic block (LB) and a routing block (SB). The objective of the architectural 3D transformation is to reduce the horizontal wired connections to avoid synchronization errors, optimize energy consumption, reduce the needed silicon surface, and improve routing delays. This is achieved by the organization of each type of resource block in a silicon layer, and by superposing them one over the other in the same chip using TSV (Through-Silicon Via) interconnections, as explained in the above-cited article of Lin et al.

Figure 1B illustrates the general structure of a 3D layered FPGA 110. The stacking order plays an important role in ensuring good performances. The configuration memory layer 111 is thus located above the other layers and is built with Static Random Access Memory (SRAM) blocks of two types, LB-SRAM (Logic Block SRAM) and RR-SRAM (Routing Resource SRAM). The middle routing layer 112 is a mesh of nodes, each node being composed of four CBs and four SBs. The 3D layered FPGA 110 can include several routing layers, also called switch or alignment layers, as illustrated in Figure 2. The lower layer 113, called logic layer, is composed of uniformly interconnected LBs. As shown in Figure 2, in a 3D FPGA, several FPGA tiles can be configured in parallel while accessing different configuration memory banks simultaneously.

Incorporating the configuration memory on the top of the FPGA chip, with fast and numerous connections between said memory and elementary logic blocks, makes it possible to obtain dynamically reconfigurable computing platforms with a very high reconfiguration rate, as shown in the article of Sidiropoulos et al. "A novel 3D FPGA architecture targeting communication intensive applications", Journal of Systems Architecture - Embedded Systems Design, 2014, vol. 60, n° 1, pages 32-39. Such a high dynamic reconfiguration rate was not possible with 2D FPGAs, due to the serial nature of the interface between the configuration memory and the design of the FPGA itself.

An IP (Intellectual Property) component refers to a reusable unit of logic, cell, or chip layout design that is a piece of the intellectual property of the chip where it is embedded. The manufacturer Xilinx demonstrated 3600 8-bit PicoBlaze softcore processors, which can be replaced, for specific applications, by specialized hardware accelerators or other IP components, running simultaneously on the Virtex-7 2000T FPGA.

Making it possible for software applications running on a hardware system to efficiently reconfigure said hardware system at runtime, when needed, allows achieving significant savings in circuit space, energy consumption, and execution time, as shown in the article of Dong et al. "Performance and power evaluation of a 3D CMOS/nanomaterial reconfigurable architecture", International Conference on Computer-Aided Design 2007, pages 758-764.

As part of his research, the inventor developed a softcore processor dedicated to the integration of hardware components and Single Program Multiple Data (SPMD) execution model that is based on the master/slaves principle, wherein the program to be executed runs on the master IP cores and when large data is processed this can be done by the slave IP cores.

In the article published by the inventor on April 2, 2015, entitled "When Hardware Meets Software HoMade softcore for massively parallel reflective programming", an IP is changed by another one in the same softcore, either in the master softcore or the slave softcore. IPs are sequentially replaced.

The master/slaves topology, i.e. the number of slaves and the way they are interconnected, used to be fixed for a given application and could not change during runtime for an FPGA. This fact limited optimal usage of resources for the benefit of massively parallel processing.

Besides, if they are not available in the compilation library, IP cores required by the application had to be manually written in VHDL by the designer and instantiated in the softcore. Only IP cores needed by the application should be instantiated, to satisfy what is called Hardware on Demand, and this is true for master IP cores as well as slave ones. There is thus a need to improve dynamic reconfiguration for parallel processing on demand.

One object of the invention, according to a first of its aspects, is a method for performing parallel processing within at least one FPGA chip which comprises one dynamically reconfigurable master unit and at least two dynamically reconfigurable slave units, each slave unit comprising a pool of interconnected slave elements, each slave element comprising a plurality of slave IP cores, one slave softcore processor integrating them and an instruction memory, said one master unit comprising one reconfiguration IP core, at least one master IP core, one master softcore processor integrating said at least one master IP core and a memory containing a master program for execution by said at least one master IP core, wherein at runtime of the master program, the reconfiguration IP core executes at least one instruction causing the reconfiguration of the pool of slave elements by dynamically varying their number and/or interconnections according to a given configuration defined in said master program and then by storing into each of the instruction memories of the slave elements a corresponding slave program for parallel execution by the slave IP cores.

What is meant by interconnection of slave elements is the switch connections established between slave IP cores by the application itself, ensuring communication between them.

Dynamic variation means changing, at runtime, the slave unit that is instantiated by the master program according to a given configuration.

The method according to the invention provides a massively parallel dynamically reconfigurable SPMD execution model adapted to both application requirements and hardware resources. This brings several benefits: a reduced needed silicon surface and a reduced power consumption, because only the IP cores in use are instantiated on the chip with the most optimal master/slaves topology. These savings are important for embedded systems that often need to operate with limited resources.

Thanks to the invention, IP-based systems can be customized at runtime using the DPR feature, and their reconfiguration can benefit from 3D technology.

Parallel dynamic reconfiguration model

A: IP to IP reconfiguration

Said at least one instruction causing the reconfiguration of the pool of slave elements preferably triggers an SPMD instruction. SPMD refers to a technique employed to achieve parallelism: tasks are split up and run simultaneously on multiple processors with different inputs in order to obtain results faster.
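As a rough illustration of the SPMD principle just described, the following Python sketch runs one and the same function on several workers, each receiving a different part of the input. The names `slave_program` and `run_spmd`, the interleaved split and the use of a thread pool are assumptions made for this illustration only; they do not model the softcore processors of the invention.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_program(chunk):
    # Same program on every worker; only the input data differs.
    return sum(x * x for x in chunk)


def run_spmd(data, n_slaves):
    # Split the data into n_slaves interleaved chunks, one per worker.
    chunks = [data[i::n_slaves] for i in range(n_slaves)]
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials = list(pool.map(slave_program, chunks))
    # The caller (playing the role of the master) gathers the partial results.
    return sum(partials)


if __name__ == "__main__":
    print(run_spmd(list(range(10)), 4))
```

The result does not depend on the number of workers, which is what allows the number of slave elements to be varied without changing the program they run.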

The SPMD instruction advantageously triggers a parallel programming preset of the pool of slave elements by loading an initialization program at the address 0 of the slave elements' memories.

The activation of said reconfiguration IP core has effect on the master unit. In fact, an IP core can be replaced by another IP core at runtime among the IP cores of the master.

The reconfiguration IP core can also replace one slave unit by another during runtime. Thanks to 3D technology and with parallel broadcast of bitstream, this replacement can be achieved without any sequentialization of bit transfer. It is especially well-suited for the execution principle of SPMD, able to reconfigure in parallel a subset of parallel computing nodes.

The reconfiguration IP core activated by an IP activation instruction may initiate a parallel dynamic reconfiguration in coordination with the slave IP cores instantiated on slave units for on-the-fly operations on the bitstream. Said reconfiguration IP core activated by the IP activation instruction may be considered as a master, and said instantiated slave IP cores may be considered as slaves. The set of slaves may depend on the application requirements and/or the available resources.

B: Massively parallel processing reconfiguration

Different configurations allowing different parallel processings and corresponding to different master/slaves topologies, i.e. the number of slave elements and their interconnection, are determined in said master program.

To change from one configuration to another, the reconfiguration IP of the master unit is solicited. According to the application, a single bitstream with a new topology replaces the previous one. A new program corresponding to the new topology has to be broadcast to the memories of the slave elements. To ensure data remanence in these memories, gathering/scattering of data from/to the master unit can be useful.

C: Application specification

The master and slave programs may be advantageously written in a DSL (Domain Specific Language), notably in high level structured macro assembler.

The compiler preferably analyzes the programs in order to automatically generate in VHDL all required IP cores for the application if they do not already exist in a basic compilation library of any softcore, containing pre-defined IPs.

This feature avoids development of IP cores in VHDL by the programmer.

From a postfixed expression, the compiler generates the VHDL code according to a graph of dependency.

The generated VHDL code for an IP preferably matches a combinatorial or sequential circuit respecting constraints imposed on the entity structure to enable integration of the IP in the softcore.

Aggregation of all necessary IPs may be done by the compiler ensuring optimal design.

Other parameters than the configurations, e.g. sizes of instruction memories, may be evaluated at compile time.

D: Chip Technology

The level of parallel reconfigurations may depend on the technological features of the chip, mainly its number N of available Internal Configuration Access Ports (ICAPs).

The number N of ICAPs may be greater than the number of zones C to be reconfigured. One ICAP may thus be assigned to each reconfiguration zone, the remaining non assigned ICAPs being in an idle state. This embodiment is the simplest possible situation because all the partial reconfigurations can be performed in a parallel way for sure.
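The assignment of reconfiguration zones to ICAPs discussed in this section, including the variant with fewer ICAPs than zones, can be sketched as follows. This is a minimal sketch that takes the formula of the text literally, with X = C div N; the function name `assign_zones` is hypothetical, and treating N equal to C like the case N greater than C is an assumption.

```python
def assign_zones(c, n):
    """Return the number of zones handled by each of the n ICAPs."""
    if n < 1 or c < 0:
        raise ValueError("need at least one ICAP and a non-negative zone count")
    if n >= c:
        # One ICAP per zone; the remaining ICAPs stay idle (0 zones).
        return [1] * c + [0] * (n - c)
    x = c // n  # X = C div N (integer division)
    # X zones on each of the first N-1 ICAPs, C - X*(N-1) zones on the last one.
    return [x] * (n - 1) + [c - x * (n - 1)]


if __name__ == "__main__":
    print(assign_zones(3, 5))
    print(assign_zones(8, 4))
    print(assign_zones(10, 4))
```

Under this reading the last ICAP receives X + (C mod N) zones. Since the reconfigurations assigned to one ICAP are performed sequentially, the largest entry of the returned list gives the number of sequential reconfiguration steps.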

In a variant, the number N of ICAPs of the chip is smaller than the number of zones C to be reconfigured. Thus, the ICAPs cannot reconfigure all said zones in a parallel manner. In this case, X zones may be assigned to the N-1 first ICAPs and C-(X*(N-1)) zones may be assigned to the last ICAP, X or X+1 depending on the values of C and N, with X = C div N, with div being the integer division. Such an assignment ensures a balanced distribution of the reconfiguration workload over the available ICAPs. In this case, the reconfigurations assigned to a given ICAP are advantageously performed sequentially.

E: Instructions and reconfiguration

The SPMD instruction advantageously triggers the executable code stored in the memories of the slave elements, with the same starting address for the program as the only constraint.

The set of instructions may also comprise a synchronization barrier instruction, called "wait" instruction, causing the master program to wait for each slave element running a slave program to validate termination of a processing.

The IP activation instruction may allow activating other IP cores of the master softcore processor in addition to the reconfiguration IP core.

The master and slave programs may each contain at least one reflection instruction, called WIM (Write In Memory) instruction, for overwriting in said memory and instruction memory to achieve reflective behavior of the master and slave softcore processors, respectively.

Software reflection is the ability of a program to manipulate as data something representing its state during its own execution.

Said WIM instruction advantageously writes in the instructions memory with strict constraints on the write addresses. It allows the executed program to change its content while it is running. It is especially useful for the dynamic reconfiguration "hardware/software" by allowing for example to replace a function call by a particular IP activation instruction. This change is a non-preemptive operation controlled by the softcore processor.
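The reflective replacement of a function call by an IP activation instruction can be pictured with the small Python model below. It is only a sketch: the tuple encoding of instructions, the names `fft_soft` and `fft_hw`, and the representation of the "strict constraints on the write addresses" as a set of permitted addresses are all assumptions made for the illustration.

```python
class InstructionMemory:
    def __init__(self, program, writable):
        self.words = list(program)
        # Addresses that WIM is allowed to overwrite (assumed form of the constraint).
        self.writable = set(writable)

    def wim(self, address, instruction):
        # Refuse any write outside the permitted addresses.
        if address not in self.writable:
            raise PermissionError(f"WIM not allowed at address {address}")
        self.words[address] = instruction


# Hypothetical program: a call to a software routine followed by a halt.
program = [("CALL", "fft_soft"), ("HLT",)]
mem = InstructionMemory(program, writable={0})

# The running program replaces the call by an IP activation instruction.
mem.wim(0, ("IP", "fft_hw"))
```

After the write, the next execution of address 0 activates the hardware IP instead of calling the software routine, while the rest of the program is unchanged.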

The reconfiguration IP core is advantageously activated based on internal monitoring of the master softcore processor. The internal monitoring of the master softcore processor may be based on the activation of at least one dedicated IP core. In a variant, the internal monitoring of the softcore processor is based on extraneous data, as for example the network topology, the power level of the chip battery or data from an obstacle detection sensor. The subset of slave elements activated in parallel before the activation of the reconfiguration IP core may as well be chosen based on internal monitoring of the slave softcore processors.

The master softcore processor may comprise an execution stack and reconfiguration files stored in a reconfiguration memory. To perform the reconfiguration, the reconfiguration IP core, or other master IP cores, may use the stack of the master softcore processor to retrieve its arguments, especially the address of the corresponding reconfiguration file that may be popped from the top of said stack.
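How the reconfiguration IP core may take its argument from the stack can be sketched as below. The Python list standing for the execution stack, the dictionary standing for the reconfiguration memory, the address `0x40` and the byte string used as a bitstream are hypothetical placeholders.

```python
def reconfiguration_ip(stack, reconfiguration_memory):
    # Pop the address of the reconfiguration file from the top of the stack.
    address = stack.pop()
    # Fetch the corresponding reconfiguration file (for example a bitstream).
    return reconfiguration_memory[address]


# The last list element plays the role of the top of the stack.
stack = [7, 0x40]
reconfiguration_memory = {0x40: b"partial-bitstream-A"}

bitstream = reconfiguration_ip(stack, reconfiguration_memory)
```

The IP core therefore needs no operand in its activation instruction: the master program pushes the address beforehand, and any value below it on the stack is left untouched.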

The master and slave softcore processors may comprise a command and control unit (CCU) ensuring the execution flow of said master and slave programs, respectively.

The slave elements may start the execution when a signal is activated in the CCU. The master IP core may send one address, for example a 13-bit address, to the slave elements through a broadcast port, and may then wait for the end of execution of the slaves through the "wait" instruction.

The CCU may handle a number of interrupts equal to 2^p - 1, with p an integer and p > 1.

Interruption of the running of the master program is preferably caused by a single interrupt, and when said interrupt is in process, such interrupt is non-preemptive. This means it should be the interrupt with highest priority among all active ones and no processing of interrupts should be running.

At runtime, the pool of slave elements is advantageously reconfigured at least twice.

At the level of the slave elements, the received address advantageously represents the start of the SPMD model execution. At the end, a HLT (halt) instruction triggers the activation of an ORTREE signal. When all the slave elements assert their ORTREE signal, a signal is activated to allow the master unit to continue its execution.
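The synchronization between the "wait" instruction and the ORTREE signals can be modelled as follows. This is a simplified software sketch: the class `Slave`, the notion of a fixed number of steps per slave and the cycle counter are assumptions, not a description of the actual circuit.

```python
class Slave:
    def __init__(self, steps):
        self.remaining = steps  # cycles of work left before HLT
        self.ortree = False     # raised once HLT has been executed

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            self.ortree = True  # HLT reached: assert the ORTREE signal


def wait(slaves):
    """Block the master until every slave asserts ORTREE; return elapsed cycles."""
    cycles = 0
    while not all(s.ortree for s in slaves):
        for s in slaves:
            s.tick()
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(wait([Slave(3), Slave(5), Slave(1)]))
```

The master resumes only when the slowest slave has halted, so the waiting time is set by the longest slave program.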

During the execution of the parallel dynamic reconfiguration model, only one reconfiguration memory may be available. In this case, the reconfiguration IP core may simultaneously transmit a reconfiguration file, for example in the form of a bitstream, to several zones to be reconfigured, by modifying on the fly the bitstream itself with techniques similar to bitstream reallocation.

In a variant embodiment, several reconfiguration memories are available. Thanks to the parallel activation of IP cores, parallel access to said different reconfiguration memories allows reconfiguring several zones at the same time. The communications between the master and the slave units may be performed through dedicated IP cores.

FPGA chip

Another object of the invention is also an FPGA chip comprising one dynamically reconfigurable master unit and at least two dynamically reconfigurable slave units, each slave unit comprising a pool of interconnected slave elements, each slave element comprising a plurality of slave IP cores, one slave softcore processor integrating them and an instruction memory, said one master unit comprising one reconfiguration IP core, at least one master IP core, one master softcore processor integrating said at least one master IP core and a memory containing a master program for execution by said at least one master IP core, said master program containing at least one instruction executed by the reconfiguration IP core and causing at runtime the reconfiguration of the pool of slave elements by dynamically varying their number and/or interconnections according to a given configuration defined in said master program and then by storing into each of the instruction memories of the slave elements a corresponding slave program for parallel execution by the slave IP cores.

All features defined here above for the method for performing parallel processing apply to the FPGA chip.

Several softcores may be implemented on one FPGA chip, up to several hundreds of softcores.

In the case where the activation of the reconfiguration IP core causes parallel activation of other reconfiguration IP cores of similar softcore processors implemented on other zones of the same chip, the chip may be a 3D FPGA. In a variant embodiment, the chip is a 2.5D FPGA.

The master and slave softcore processors of the FPGA chip according to the invention are preferably of Harvard-type architecture, that is to say that they have separate memories for data and instructions.

The implementation of a softcore processor on a chip may instantiate and integrate the appropriate IP cores adapted to said application, for example IP cores playing the role of one or several registers, and/or one or several data memories, and/or one or several Arithmetic Logic Units (ALU). The IP cores may be conceived to achieve elementary operations, for example additions, data memory access, input/output operations. Some of the IP cores may also be conceived to achieve coarse-grained hardware operations, as for example Fast Fourier Transform (FFT), DSP calculations, convolutions, or low-pass or high-pass filtering. Indeed, as the softcore processor is preferably devoid of calculation units and data memory structures, and as the set of instructions is preferably only dedicated to instructions for control flow, the IP cores are configured to play these roles.

For a given application, the IP cores may be chosen and integrated depending on the required performance, the power budget and the quality of service. The softcore processor thus allows bringing heterogeneity to the chip, in addition to the parallel behavior. Almost all of the known solutions tackle this aspect by integrating different computing nodes which leads to more complexity in the development phase. The IP cores are advantageously all integrated and scheduled in the same way in the softcore processor whatever their granularity, having all a generic interface.

The master softcore processor may comprise a plurality of dynamically reconfigurable IP cores.

The master softcore processor may comprise a plurality of static IP cores, which cannot be reconfigured.

The slave softcore processor could be either identical or similar to the master softcore processor. When they are similar, they may differ by the list of IP cores instantiated. A slave processor should normally instantiate IP cores dedicated to communication exchange, and a master processor usually instantiates IP cores dedicated to communication management.

The FPGA chip according to the invention may comprise static zones, that is to say zones whose functionalities cannot be modified after the initialization of said chip, and reconfigurable zones, that is to say zones whose functionalities can be modified after the initialization of said chip.

The reconfiguration properties of each reconfigurable zone may be synthesized in reconfiguration files stored in a reconfiguration memory. Said reconfiguration memory may be embedded on the chip (on-chip) or may be embedded elsewhere (off-chip), for example on a double data rate (DDR) memory. Said reconfiguration memory may further comprise indirection tables of zones and IP cores. The chip may comprise at least one partial reconfiguration controller, especially a hardware component, in charge of controlling the reconfiguration of said reconfigurable zones at runtime. This controller may be a part of a dedicated IP core that allows the instantiation of IP cores at runtime.

The core of a softcore processor is preferably stack-based and may include a stack and a CCU.

All IP cores of a softcore processor may read and/or write from/to said stack, via some input or output ports. Said IP cores are preferably devoid of address registers for said input or output ports.

At least one IP core of a softcore processor, called "IP stack", may be dedicated to perform elementary operations on said stack.

The stack of a softcore processor may have at least three registers on its top to allow reading or writing in a same clock cycle. Therefore, an IP core may perform at least three "push" operations, that is to say additions of a data item, or three "pop" operations, that is to say removals of a data item, in a same clock cycle. The number of such operations is preferably determined by the CCU of the softcore processor according to the application itself.
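A software picture of this multi-port stack is given below. It is a sketch under assumptions: the limit is fixed at exactly three (the text says at least three), an IP core is represented by a Python function mapping the popped words to the words it pushes back, and the names `ExecutionStack` and `cycle` are hypothetical.

```python
class ExecutionStack:
    PORTS = 3  # top-of-stack registers reachable in one clock cycle (assumed value)

    def __init__(self):
        self.items = []

    def cycle(self, n_pop, ip):
        """One clock cycle: pop n_pop words, let the IP compute, push its results."""
        if n_pop > self.PORTS:
            raise ValueError("at most three pops per cycle")
        if n_pop > len(self.items):
            raise IndexError("stack underflow")
        popped = [self.items.pop() for _ in range(n_pop)]
        pushed = list(ip(popped))
        if len(pushed) > self.PORTS:
            raise ValueError("at most three pushes per cycle")
        self.items.extend(pushed)
        return popped


stack = ExecutionStack()
stack.cycle(0, lambda _: [2, 3])            # an IP pushing two literals
stack.cycle(2, lambda v: [v[0] + v[1]])     # an adder IP: two pops, one push
```

In this model an IP core never addresses the stack explicitly; it only states how many words it consumes and produces, which matches the absence of address registers on the IP ports.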

The CCU of a softcore processor preferably has a specific stack for nested function calls.

Detailed description of figures

The invention will be better understood on reading the following detailed description of non-limiting exemplary embodiments thereof and on examining the appended drawings in which:

- Figure 1A is an example of a 2D FPGA according to the state-of-the-art and previously described;

- Figure 1B is an example of a 3D monolithically stacked FPGA according to the state-of-the-art and previously described;

- Figure 2 is an example of a 3D FPGA according to the state-of-the-art and previously described;

- Figure 3a is a schematic representation of a master unit;

- Figure 3b is an analogous view of Figure 3a providing further details about the inputs/outputs of different blocks;

- Figure 4 is a schematic representation of a parallel architecture based on softcore processors according to the invention;

- Figure 5 depicts an example of dynamic master/slaves reconfiguration; and

- Figure 6 schematically represents the memory mapping of a master core taking into account interrupts.

A master unit 21 is shown in Figure 3a. It may be embedded in an FPGA chip, especially a 3D or 2.5D FPGA chip, and preferably comprises a softcore processor 1, a core 2 and a plurality of IP cores 11, 13, 14.

The master unit 21 comprises preferably a plurality of dynamically reconfigurable IP cores 13. The master unit 21 may comprise a plurality of static IP cores 11.

The core 2 of the softcore processor 1 is preferably stack-based and may include an instructions memory 3, a stack 4 and a CCU 5.

As shown in Figure 3a, the softcore processor 1 is linked with an ICAP 17 of the chip in which it is embedded.

The CCU 5 ensures the execution flow of a master program. Its operating mode may be based on a two-stage pipeline: Fetch-Decode/Execute. Control flow instructions e.g. call, return, branch, may be managed by the CCU. As shown in Figure 3b, interrupts may also directly be taken into account by the CCU. Processing instructions, such as SPMD, IP activation instructions, may be decoded and transmitted to the entire set of deployed IPs. Only one will really start. At the same time, the execution stack is updated by push/pop operations.

The instructions memory 3 may be instantiated for each softcore, master or slave. This memory may be an on-chip PROM containing the program to run. On a Xilinx board, a new program (one for the master and one for all the slaves at the same time) may be fully loaded via a UART port at any time, as depicted in Figure 3b. When the slave topology is reconfigured, a new program must be loaded in the memories of the slave elements.
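The effect of a topology change on the slave instruction memories can be sketched as follows, using the 2x3 and 3x4 grids of Figure 5 as example sizes. The function name `reconfigure_pool`, the dictionary keyed by (row, column) and the instruction mnemonics are hypothetical; only the principle of giving every slave element of the new topology its own copy of the new slave program is taken from the text.

```python
def reconfigure_pool(rows, cols, slave_program):
    """Build the pool for a rows x cols topology and load the program everywhere."""
    # One instruction memory per slave element, addressed by matrix coordinates.
    return {
        (r, c): list(slave_program)
        for r in range(rows)
        for c in range(cols)
    }


# First configuration: 2x3 slave elements running a first program.
pool = reconfigure_pool(2, 3, ["IP1", "HLT"])

# Second configuration: 3x4 slave elements; a new program is loaded in every memory.
pool = reconfigure_pool(3, 4, ["IP2", "IP3", "HLT"])
```

Because the previous memories are discarded in this model, any data that must survive the change would have to be gathered by the master beforehand and scattered again afterwards.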

The execution stack 4 may store data exchanged by IP cores. This stack may be an on-chip memory of a few tens of 32-bit words. The pop and push actions on this stack are preferably separated from IP activation and the same IP can cause different changes on the stack. The top of the stack 4 can be used as a predicate for control flow instructions.

All the IP cores 11, 13 of the softcore processor 1 may at each cycle read and/or write from/to said stack 4, 3 words via some input ports 15 and/or 3 words via output ports 16, which can be joined to form communication buses, 4 Top Reads and 3 Top Writes 32-bit buses, as shown in Figure 3b. At least one IP core of the softcore processor 1, called "IP stack", may be dedicated to perform elementary operations on said stack 4.

As shown in Figure 3a, IP cores 11 use an "IP done" signal to notify their end of execution, while the stack 4 uses an "IP code" signal for the IP cores.

The chip on which the softcore processor 1 is embedded may comprise static zones, and reconfigurable zones which advantageously comprise at least one reconfiguration IP core. The reconfiguration properties of each reconfigurable zone may be synthesized in reconfiguration files stored in a reconfiguration memory which is preferably a local on-chip memory, used by the reconfiguration IP core 14.

The chip may comprise at least one partial reconfiguration controller, especially a hardware component, in charge of controlling the reconfiguration of said reconfigurable zones at runtime.

At least one IP core of the softcore processor 1, the reconfiguration IP core 14, is advantageously dedicated to the dynamic reconfiguration of the other IP cores 13.

The set of instructions of the softcore processor 1 comprises advantageously at least one IP activation instruction for activating said reconfiguration IP core 14. Said IP activation instruction may allow activating IP cores 11, 13 of the softcore processor 1 other than reconfiguration IP cores 14.

The activation of said reconfiguration IP core 14 causes advantageously parallel activation of slave IP cores of similar slave units implemented on other zones of the same chip, allowing deriving a parallel and partial SPMD dynamic reconfiguration model.

The set of instructions has preferably, including said IP activation instruction, at least twelve instructions, for example in 16-bit format. The set of instructions may comprise four "jump" instructions, including an absolute branch instruction, a relative branch instruction and two conditional relative branch instructions, two function instructions, including a "call" instruction and a "return" instruction, an "end of execution" instruction, a SPMD instruction and a "wait" instruction.

The set of instructions may comprise at least one reflection instruction for overwriting in the instructions memory 3 to achieve reflective behavior of the softcore processor 1, said instruction being called WIM (Write In Memory) instruction. Said WIM instruction advantageously contains explicitly the address of the instructions memory 3 where it has to be overwritten, especially the twelve most significant bits of the address.

As shown in Figure 4, a slave unit 20 comprises several slave elements slave 0, slave 1, slave 2, slave 3 interconnected via communication IP cores 18. The reconfiguration IP core 14 of the master unit 21, activated by the IP activation instruction may be considered as a master, and the slave IP cores 114, 214, 314, 414, that it activates may be considered as slaves. Through the SPMD instruction, the master IP core 14 may initiate a parallel dynamic reconfiguration on a set of active slaves 114, 214, 314, 414, depending on the number N of available ICAPs of the chip.

The SPMD instruction of the set of instructions advantageously triggers the executable code stored in the instructions memories 102, 202, 302, 402, of the slaves 114, 214, 314, 414, with the same starting address for the program as the only constraint. The "wait" instruction advantageously waits until all slaves 114, 214, 314, 414 validate the termination of their execution.

The active slaves 114, 214, 314, 414 may start the execution when a signal "StartCPU" is activated in the CCU 5 of the master processor. The master IP core 14 may send an address, for example a 13-bit address, to the slaves 114, 214, 314, 414 through the "StartAddress" port as shown in Figure 4, and may then wait for the end of execution of the slaves through the "wait" instruction.

At the level of the active slaves 114, 214, 314, 414, the received address advantageously represents the start of the SPMD model execution. At the end, a HLT instruction triggers the activation of an ORTREE signal. When all the slaves 114, 214, 314, 414 assert their ORTREE signal, a SPMD done signal is activated to allow the master to continue its execution, as shown in Figure 4.
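The start/wait handshake described above can be illustrated by a small software model. This is only a sketch under stated assumptions, not the hardware implementation: the class Slave, the functions spmd and wait, the start address value and the use of threads to stand in for parallel slave elements are all hypothetical names and choices introduced for illustration.

```python
import threading


class Slave:
    """Hypothetical software stand-in for one slave element."""

    def __init__(self, name, program):
        self.name = name
        self.program = program  # maps a start address to a callable (stand-in for slave code)
        self.ortree = False     # asserted when the slave reaches its HLT instruction
        self.result = None

    def run(self, start_address):
        self.ortree = False
        self.result = self.program[start_address](self.name)
        self.ortree = True      # HLT reached: assert the ORTREE signal


def spmd(slaves, start_address):
    """Model of the SPMD instruction: every active slave starts at the same address."""
    threads = [threading.Thread(target=s.run, args=(start_address,)) for s in slaves]
    for t in threads:
        t.start()
    return threads


def wait(slaves, threads):
    """Model of the "wait" instruction: block until every slave has terminated."""
    for t in threads:
        t.join()
    # "SPMD done" is raised only when all slaves assert ORTREE.
    return all(s.ortree for s in slaves)


START_ADDRESS = 0x0100  # assumed value; it only has to fit in 13 bits
program = {START_ADDRESS: lambda name: name + " finished"}
slaves = [Slave("slave %d" % i, program) for i in range(4)]

threads = spmd(slaves, START_ADDRESS)
spmd_done = wait(slaves, threads)
```

In this model the master resumes only once spmd_done is true, which mirrors the synchronization barrier formed by the "wait" instruction.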

Figure 5 depicts two different configurations of a master core having grids of 2x3 and 3x4 slaves. The slave cores are addressed by matrix coordinates. Running the first or second configuration as active configuration is managed by the assembler code below. Sentences preceded by "--" are comments explaining the corresponding part of code.

-- IP declarations
-- IP1, IP2 and IP3 either exist in the basic library
-- or their functions are directly specified herein to
-- generate their VHDL codes
: IP IP1 ... <code implementing this IP>
: IP IP2 ... <code implementing this IP>
: IP IP3 ... <code implementing this IP>
: IP change_conf ... <code implementing this IP>

-- First slave unit declaration
slave config1 [ 2 3 : 32 ]
-- this slave unit config1 has 2x3 = 6 slave elements
-- network is 2D torus grid
-- communication width is 32 bits

-- Declaration of a parallel component myspmd that will be
-- used in the master core
PC myspmd := IP1
-- IP1 is one slave IP core for this slave unit

-- Second slave unit declaration
slave config2 ] 3 4 : 8 [
-- this second slave unit config2 has 3x4 = 12 slave elements
-- network is 2D grid (no torus)
-- communication width is 8 bits

-- Overloading the same parallel component myspmd
PC myspmd := { IP2 IP3 }
-- IP2 and IP3 are slave IP cores for this slave unit

...

-- Code of the master core
master
-- Triggering a master IP core to choose active slave unit
-- Result of change_conf (True or False) being on the stack
-- Selecting active slave unit by IF statement
loop
    change_conf
    if
        config1 -- config1 is configured
    else
        config2 -- config2 is configured
    endif
    -- Trigger PC myspmd on the active slave unit
    myspmd
again

"IP com" is automatically generated as soon as a grid of slaves is used, to handle communications between different slave nodes.

"IP Reconfig" is implicitly instantiated since a dynamic reconfiguration is used.

As is apparent from this example, a whole structure of slave elements is changed simultaneously. Changing the whole configuration of a slave unit provides a massively parallel, dynamically reconfigurable SPMD execution model adapted to both application requirements and hardware resources.
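To make the difference between the two slave unit topologies concrete, the following sketch models the 2x3 torus grid and the 3x4 non-torus grid and lists the neighbours of a slave addressed by matrix coordinates. It is an illustrative software model only. The names SlaveConfig, neighbours and select are hypothetical, and reading "2 3" as 2 rows by 3 columns and limiting the links to the four nearest neighbours are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveConfig:
    """Hypothetical description of one slave unit configuration."""
    rows: int
    cols: int
    width_bits: int  # communication width
    torus: bool      # True: edges wrap around; False: plain 2D grid

    def elements(self):
        """Matrix coordinates of every slave element."""
        return [(r, c) for r in range(self.rows) for c in range(self.cols)]

    def neighbours(self, r, c):
        """Slaves directly linked to (r, c) (assumed 4-neighbour links)."""
        found = []
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if self.torus:
                nr, nc = nr % self.rows, nc % self.cols  # wrap around the edges
            elif not (0 <= nr < self.rows and 0 <= nc < self.cols):
                continue                                   # no link past a border
            if (nr, nc) != (r, c) and (nr, nc) not in found:
                found.append((nr, nc))
        return found


config1 = SlaveConfig(rows=2, cols=3, width_bits=32, torus=True)
config2 = SlaveConfig(rows=3, cols=4, width_bits=8, torus=False)


def select(change_conf_result):
    """Mirror of the if/else in the master code: True selects config1."""
    return config1 if change_conf_result else config2


active = select(True)
corner_links_torus = active.neighbours(0, 0)
corner_links_grid = select(False).neighbours(0, 0)
```

In this model the corner slave (0, 0) has three distinct neighbours in the 2x3 torus, because the two vertical links wrap onto the same slave, and two neighbours in the 3x4 grid.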

Another example of dynamic master/slaves reconfiguration using slave configurations X and Y corresponds to the assembler code below. Configuration X uses a 10x10 grid of slaves for a 32-bit addition and subtraction. Configuration Y uses a row of 10 slaves to perform a Fast Fourier Transform operation.

slave X [10 10]
PC F1 := {+ - } // parallel function declaration A+B-C <code implementing this IP>

slave Y [10 1]
PC F2 := FFT

master
start
if
    X // new configuration X
    F1
else
    Y // new configuration Y
    F2
endif

Figure 6 represents an example of a program memory for a master IP core, storing 32-bit words. Such a mapping may be proposed by the compiler with respect to hardware constraints related to the WIM and SPMD instructions. The code to be executed on global reset is saved at address 0. Addresses from 4 to 1C (hexadecimal) are reserved for 7 TRAPs that may be triggered by 7 interrupts. The processing statically or dynamically associated with each interrupt is placed at these addresses, which can be accessed by a WIM instruction. There may be 4 words per TRAP, the fourth one corresponding to a HALT.

Addresses from 20 to 3FFC, accessed by a WIM instruction, are reserved for VCs (Virtual Components). A VC is similar to a TRAP but is used explicitly by the application and is not interrupt-dependent. VCs are ended by a RETURN. Definitions of some other words can also be stored at these addresses if there is enough room.

Addresses above 4000 contain declarations of other words.
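The memory map of Figure 6 can be summarized by a short address calculation. The sketch below is illustrative and rests on several assumptions that are not stated in the description: addresses are taken as hexadecimal word addresses, the quoted bounds 1C and 3FFC are read as the first word of the last 4-word slot of their region, the region of other words is taken to start at 4000, and TRAPs are numbered from 1 to 7. The names trap_address and region are hypothetical.

```python
RESET_ADDRESS = 0x0000     # code executed on global reset
TRAP_BASE = 0x0004         # first TRAP slot
WORDS_PER_TRAP = 4         # the fourth word of a TRAP being a HALT
TRAP_COUNT = 7             # one TRAP per interrupt
VC_BASE = 0x0020           # first Virtual Component slot
OTHER_WORDS_BASE = 0x4000  # assumed start of the declarations of other words


def trap_address(n):
    """Start address of TRAP n, with n assumed to run from 1 to 7."""
    if not 1 <= n <= TRAP_COUNT:
        raise ValueError("TRAP number must be between 1 and 7")
    return TRAP_BASE + (n - 1) * WORDS_PER_TRAP


def region(address):
    """Name of the memory region holding a given word address."""
    if address < 0:
        raise ValueError("address must be non-negative")
    if address < TRAP_BASE:
        return "reset"
    if address < VC_BASE:
        return "trap"
    if address < OTHER_WORDS_BASE:
        return "virtual component"
    return "other words"


trap_table = [hex(trap_address(n)) for n in range(1, TRAP_COUNT + 1)]
```

Under these assumptions the seven TRAP slots start at 4, 8, C, 10, 14, 18 and 1C, which matches the range quoted for Figure 6, and the last TRAP slot ends just before the first VC slot at 20.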

The invention is not limited to the examples that have just been described. In particular, features from the embodiments illustrated may be combined within embodiments that are not illustrated.

The method for parallel processing and the FPGA chip according to the invention and as defined above may be used in many different intensive signal processing applications, for example in Software Radio, aeronautics, drones, GPS navigation, applications implementing multigrid methods, or assistance applications. Such applications are very demanding in terms of computing power and work on systems whose complexity is still growing. Furthermore, their behavior is increasingly dynamic, since they need to adapt to the execution environment and to the features of the processed data. This behavior is present not only in the software of these applications but also in the hardware architecture forming their execution support.

The expression "comprising a" or "including a" must be understood as being synonymous with "comprising at least one" or "including at least one", unless specified otherwise.

Claims

1. Method for performing parallel processing within at least one FPGA chip which comprises one dynamically reconfigurable master unit and at least two dynamically reconfigurable slave units, each slave unit comprising a pool of interconnected slave elements (slave 0, slave 1, slave 2, slave 3), each slave element comprising a plurality of slave IP cores, one slave softcore processor integrating them and an instruction memory, said one master unit comprising one reconfiguration IP core (14), at least one master IP core, one master softcore processor (1) integrating said at least one master IP core and a memory (3) containing a master program for execution by said at least one master IP core, wherein at runtime of the master program, the reconfiguration IP core (14) executes at least one instruction causing the reconfiguration of the pool of slave elements (slave 0, slave 1, slave 2, slave 3) by dynamically varying their number and/or interconnections according to a given configuration defined in said master program and then by storing into each of the instruction memories (102, 202, 302, 402) of the slave elements (slave 0, slave 1, slave 2, slave 3) a corresponding slave program for parallel execution by the slave IP cores.

2. Method according to claim 1, said at least one instruction triggering a SPMD instruction.

3. Method according to claim 2, said SPMD instruction triggering a parallel programming preset of the pool of slave elements.

4. Method according to any one of the preceding claims, wherein said master and slave programs are written in DSL, notably in macro assembler.

5. Method according to any one of the preceding claims, wherein said master and slave programs are statically analyzed at compile time to automatically generate in VHDL all required IP cores for the application if they do not already exist in a basic compilation library of said unit.

6. Method according to any one of the preceding claims, wherein the master program contains a synchronization barrier instruction causing the master program to wait for each slave element running a slave program to validate termination of a processing.

7. Method according to any one of the preceding claims, wherein the master and slave programs each contain at least one reflection instruction (WIM) for overwriting in said memory (3) and instruction memory to achieve reflective behavior of the master and slave softcore processors, respectively.

8. Method according to any one of the preceding claims, the master softcore processor comprising a stack (4) and reconfiguration files stored in a reconfiguration memory, wherein the reconfiguration IP core (14) uses the stack (4) to retrieve the address of the selected reconfiguration file, being notably popped from the top of said stack (4).

9. Method according to any one of the preceding claims, the master and slave softcore processors each comprising a command and control unit (5) ensuring an execution flow of said master and slave programs, respectively.

10. Method according to the preceding claim, wherein the command and control unit (5) handles a number of interrupts equal to 2^p - 1, with p an integer and p > 1.

11. Method according to the preceding claim, wherein interruption of the running of the master program is caused by a single interrupt, and wherein when said interrupt is in process, such interrupt is non-preemptive.

12. Method according to any one of the preceding claims, wherein at runtime, the pool of slave elements (slave 0, slave 1, slave 2, slave 3) is reconfigured at least twice.

13. An FPGA chip comprising one dynamically reconfigurable master unit and at least two dynamically reconfigurable slave units, each slave unit comprising a pool of interconnected slave elements (slave 0, slave 1, slave 2, slave 3), each slave element comprising a plurality of slave IP cores, one slave softcore processor integrating them and an instruction memory, said one master unit comprising one reconfiguration IP core (14), at least one master IP core, one master softcore processor (1) integrating said at least one master IP core and a memory (3) containing a master program for execution by said at least one master IP core, said master program containing at least one instruction executed by the reconfiguration IP core (14) and causing at runtime the reconfiguration of the pool of slave elements (slave 0, slave 1, slave 2, slave 3) by dynamically varying their number and/or interconnections according to a given configuration defined in said master program and then by storing into each of the instruction memories (102, 202, 302, 402) of the slave elements (slave 0, slave 1, slave 2, slave 3) a corresponding slave program for parallel execution by the slave IP cores.

14. An FPGA chip according to the preceding claim, being a 2.5D or 3D FPGA.

15. An FPGA chip according to claim 13 or 14, the master and slave softcore processors being of Harvard type architecture, having separated memories for data and instructions.