WO2021014017A1 - A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture - Google Patents

A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture

Info

Publication number
WO2021014017A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
reconfigurable
fus
accordance
memory
Prior art date
Application number
PCT/EP2020/071042
Other languages
French (fr)
Inventor
Mark Wijtvliet
Henk CORPORAAL
Original Assignee
Technische Universiteit Eindhoven
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Technische Universiteit Eindhoven filed Critical Technische Universiteit Eindhoven
Publication of WO2021014017A1 publication Critical patent/WO2021014017A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A reconfigurable architecture, comprising an instruction memory comprising instructions to be executed by a reconfigurable architecture, a data memory comprising data on which the instructions are to be performed, a plurality of Function Units, FUs, wherein each FU is arranged for receiving data from the data memory at an input, processing the received data according to a configured function of the corresponding FU, and for outputting the processed data at an output to the data memory, Instruction Decoders, IDs, arranged for configuring functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory, a reconfigurable data-path network arranged for configuring data paths between inputs and outputs of one or more FUs of the plurality of FUs, a reconfigurable control network, separate from the data-path network, and arranged for enabling the IDs for configuring functions of the plurality of FUs over the reconfigurable control network.

Description

Title
A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture.
Technical field
The present disclosure generally relates to microprocessor architectures. More specifically, it relates to a reconfigurable architecture, for example a coarse-grained reconfigurable architecture.
Background
Modern devices such as smartphones, IoT (Internet of Things) devices and wearable (e.g. medical) devices require high computational performance yet operate on a very constrained energy budget. At the same time, they require a large amount of programming flexibility from the processor. These requirements conflict with each other. A processor architecture that can support all these requirements at the same time will lead to new applications and products in various fields where energy efficiency is key.
The present disclosure is directed to coarse-grained reconfigurable architectures, or CGRAs. Many informal and conflicting definitions of CGRAs exist in the art. The definition that is upheld in the present disclosure is that a CGRA uses hardware flexibility in order to adapt the data-path at run-time to the application. This requires that, before computation can be performed, some sort of configuration has to be processed by the architecture.
Such a configuration is typically static for a particular amount of time, for example for the duration of an application, or a loop kernel within an application.
Further, an architecture is to be defined as a coarse-grained reconfigurable architecture whenever it fulfils two properties: first, a spatial reconfiguration granularity at functional-unit level or above, and second, a temporal reconfiguration granularity at region/loop-nest level or above. Focus should be put on the dominant level when considering these properties. The dominant level is the level at which reconfiguration has the greatest influence on the functionality of the system. Existing CGRAs are often implemented either as coarse-grained Field Programmable Gate Arrays, FPGAs, where execution of an algorithm is performed as a static systolic-array-like structure, or as a network of lightweight processors.
Systolic arrays provide a grid of function units that are configured statically to perform a specific operation for the duration of the application. Although systolic arrays may lead to energy-efficient implementations with good performance, these CGRAs are rather static and do not support operations that influence the control flow of an application, such as branching. The flexibility of these architectures, therefore, is low. Due to the static nature of the implementation there is no time multiplexing of function units, which may lead to a larger area than CGRAs that support function units which are programmable per cycle.
The second method uses a configurable network with multiple processing elements that operate independently or in lock-step. This method can be considered very flexible, but the power draw is much higher. This leads to a lower energy efficiency and is caused by the cycle-based operation of the processing elements and little or no support for spatial execution. Execution control of these processing elements can be performed either by a global 'context' that is switched out when the schedule requires it, or by local decoding per processing element. With local decoding, the execution schedule can be controlled by a global program counter or by local execution control. The latter brings these architectures closer to a networked many-core than to a CGRA.
Traditional CGRAs have a lower energy efficiency and are less flexible than the proposed CGRA.
Summary
It is an object of the present disclosure to provide a reconfigurable architecture that offers high performance as well as power efficiency.
In a first aspect, there is provided a reconfigurable architecture, for example a coarse-grained reconfigurable architecture, comprising:
an instruction memory comprising instructions to be executed by a reconfigurable architecture;
a data memory comprising data on which the instructions are to be performed;
a plurality of Function Units, FUs, wherein each FU is arranged to receive data from the data memory at an input, to process the received data according to a configured function of the corresponding FU, and to output the processed data at an output towards the data memory;
Instruction Decoders, IDs, arranged for configuring functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
a reconfigurable data-path network arranged for configuring data paths between inputs and outputs of one or more FUs of the plurality of FUs;
a reconfigurable control network, separate from the data-path network, and arranged to enable the IDs for configuring functions of the plurality of FUs over the reconfigurable control network.
The above described architecture is able to support flexibility, high performance and high energy/area efficiency all at the same time by allowing run-time construction, by reconfiguration, of application specific processors. In other words, for each application that needs to be executed, a specialized processor can be realized by the architecture. The architecture applies a unique separation between application data and application control over reconfigurable networks. The constructed specialized processors are very energy efficient since the architecture allows good matching between application and processor properties.
The processor architecture of the invention is contrasted with traditional processors, for example VLIW, SIMD and ARM Cortex M0, as well as traditional CGRA methods for reconfiguration, so as to highlight the advantages of the invention.
The processor architecture contains several function units that can perform computation. These units may be heterogeneous and may be able to perform various functions, such as arithmetic and logic operations, multiplication, memory operations, etc. Function units are not restricted to these basic operations but can implement any complex (set of) operation(s). The function units can be connected over a data network to form a specialized data-path. In addition, the control for these functional units, for example what function is performed at a certain moment in time, is performed over a secondary reconfigurable control network. This allows various forms of parallelism in applications to be very efficiently exploited on the architecture, leading to a large energy reduction. Innovative features of embodiments may include a new type of CGRA with a unique reconfiguration flexibility allowing energy-efficient support for pipelining parallelism, operation chaining, data-level parallelism, instruction- and operation-level parallelism and task-level parallelism, and all their combinations, as well as a software framework supporting the architecture.
The processor architecture of the invention is useful in electronic wearable or portable devices such as, but not limited to, communication devices such as smart phones, watches, laptops, computers and brain computer interfaces. Other fields where energy efficiency is of importance, such as digital signal processing and telecommunication networks are also an application area for the invention.
In accordance with the present disclosure, the instruction memory may be prepared by a boot-loader, and may contain the instructions that are to be executed by the reconfigurable architecture.
The data memory may be considered as the global memory and may be either a large on-chip memory or an external memory interfaced to the architecture. An architecture instance may contain one or more load-store units, which is explained in more detail below.
One of the goals of the present architecture is to increase energy efficiency while keeping flexibility. Since energy efficiency is defined as performance divided by power, it can be increased in multiple ways: by speeding up the computation, and by reducing power consumption. Systolic arrays perform well on both of these aspects. Spatial mapping of the computation provides an efficient and pipelined path to compute results, resulting in high performance.
The static nature of the interconnect and operations of systolic arrays provides good power efficiency, which combined with high performance leads to a low energy architecture. However, this only works when there are enough compute resources to spatially map the application to the hardware. Additionally, it is usually not possible to map applications with a complex, or data dependent, control flow due to the lack of flexibility at run-time.
Separation between the data-path network and the control network within the presented architecture is one of the features that allows improvements on both the performance and power aspects. The specialized data-paths in the constructed processor architectures both reduce the number of cycles it takes to execute an application and reduce power by avoiding memory and register file accesses, due to increased bypass opportunities.
In an example, the control network and the data-path network are each configured as a static circuit-switched network. Such a circuit-switched network is typically configured once per application or kernel.
It is further noted that these types of circuit-switched networks may beneficially use switch-boxes to route signals over the network. Switch-boxes may be compared to how a phone network operates: the phone number represents a configuration for the various switches on the line that make a connection between two selected devices, and this connection then becomes a direct connection between these devices. Switch-boxes in the presented architecture may work in a similar manner. A bit-file may provide a configuration for each of the switch-boxes on the network, be it the reconfigurable data-path network and/or the control network. This configuration may contain specifications like: connect the left port of the switch-box to the bottom port of the switch-box.
Thus, any of the reconfigurable data-path and control networks may comprise a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is internally connected to an input.
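As a concrete illustration of the switch-box concept, the following sketch models one statically configured switch box in Python. The class, the port numbering and the bit-file entry format are assumptions made for illustration only, not the implementation of the disclosure.

```python
# Minimal sketch of a circuit-switched switch box: each output is a
# multiplexer that is configured once, e.g. by a bit-file entry, to
# forward exactly one input port. No routing decisions occur at run-time.
class SwitchBox:
    def __init__(self, num_inputs: int, num_outputs: int):
        self.num_inputs = num_inputs
        # select[o] = index of the input internally connected to output o
        self.select = [0] * num_outputs

    def configure(self, output_port: int, input_port: int) -> None:
        """Apply one bit-file entry, e.g. 'connect left port to bottom port'."""
        assert 0 <= input_port < self.num_inputs
        self.select[output_port] = input_port

    def propagate(self, inputs: list) -> list:
        """Combinational pass-through along the configured connections."""
        return [inputs[sel] for sel in self.select]

# Ports 0..3 stand for left, top, right, bottom (hypothetical numbering).
box = SwitchBox(num_inputs=4, num_outputs=4)
box.configure(output_port=3, input_port=0)  # left input -> bottom output
signals = ["L", "T", "R", "B"]
print(box.propagate(signals)[3])            # prints 'L'
```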
It is noted that, in accordance with the present disclosure, the FUs may be heterogeneous, meaning that not all FUs need to perform the same function.
In a further example, the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
An ALU may support arithmetic and logic operations such as addition, subtraction, bit-wise operations, shifting, and comparisons. An ABU supports branching: it evaluates conditions that may be calculated inside the data path and determines whether to branch based on those conditions. The LSU is the interface to the data memory for all other function units in the architecture. The MUL is arranged for performing multiplication operations. These multiplications can be both signed and unsigned and are performed on fixed-point data representations.
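The listed unit types can be pictured with small behavioural models. The sketch below is a hypothetical software model written for this description, not RTL; the operation names and signatures are assumptions.

```python
# Behavioural sketches of three of the FU types named above.
def alu(op: str, a: int, b: int) -> int:
    # Arithmetic and logic operations: addition, subtraction,
    # bit-wise operations, shifting and comparisons.
    ops = {
        "add": lambda: a + b,
        "sub": lambda: a - b,
        "and": lambda: a & b,
        "shl": lambda: a << b,
        "lt":  lambda: int(a < b),
    }
    return ops[op]()

def mul(a: int, b: int) -> int:
    # Signed or unsigned multiplication on fixed-point data reduces to
    # integer multiplication in this toy model (no width or saturation handling).
    return a * b

def abu_branch(condition: int, pc: int, target: int) -> int:
    # The ABU evaluates a condition calculated inside the data path and
    # decides whether the program counter branches.
    return target if condition else pc + 1

print(alu("add", 2, 3), mul(-2, 3), abu_branch(1, 10, 42))  # 5 -6 42
```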
In a further example, each of the IDs is configured per application or kernel.
In accordance with the present disclosure, statically configured means that the networks are configured once per application or kernel and keep this configuration for the duration thereof, similar to the networks used in Field Programmable Gate Arrays, FPGAs. A dynamically configured network, by contrast, can change during program execution; an example thereof is a routing network. Instead of providing a direct path between a source and a sink, the data travels over this type of network as a packet. The packet contains information about the source and destination of the packet, which is processed by routers that provide the connection between network segments. Although dynamic networks are more flexible, since packets can move from any source to any destination, they incur a higher overhead in terms of power, area and latency (cycles).
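The contrast can be made concrete with a minimal sketch: on a statically configured, circuit-switched network only the payload travels, while on a dynamically routed network every transfer carries source and destination information that routers must process. The field names below are illustrative assumptions.

```python
from dataclasses import dataclass

# Static / circuit-switched: the path is fixed by configuration, so a
# transfer is just the data word; there is no per-transfer routing cost.
static_transfer = 0xCAFE

# Dynamic / packet-routed: every transfer carries addressing that each
# router inspects, costing extra power, area and latency (cycles).
@dataclass
class Packet:
    src: int       # source node
    dst: int       # destination node
    payload: int   # the actual data word

dynamic_transfer = Packet(src=2, dst=7, payload=0xCAFE)
print(dynamic_transfer)
```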
Any of the control network and the data-path network may be a statically configured network.
In another example, at least one of the FUs comprises a Load-Store Unit, LSU, wherein the LSU is arranged for interfacing the data memory.
In an even further example, the reconfigurable architecture further comprises a memory arbiter placed in between the data memory and the at least one LSU, wherein the memory arbiter is arranged for orchestrating requests from the at least one LSU towards the data memory.
The presented architecture may comprise one or more load-store units. Each of these units can load data from the data memory, meaning that there can be multiple load and store requests to the global memory within a single clock cycle. To manage these requests, an arbiter is placed between the global memory and the LSUs. When a load or store request arrives, the arbiter may check whether requests can be grouped; e.g., multiple reads of the same memory word can be coalesced into a single memory access. If multiple accesses are required, the arbiter may stall the compute fabric until all data has been collected. Although stalling the computation degrades performance, it may be required to keep the memory accesses in sync with the operations in the compute fabric. In a further example, each LSU may be associated with a local memory for temporarily storing data to be reused.
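A minimal sketch of this arbitration, under simplifying assumptions (word-addressed memory, load requests only, one memory port per cycle): duplicate reads of the same word coalesce into a single access, and any remaining accesses stall the compute fabric.

```python
def arbitrate_loads(requests, words_per_cycle=1):
    """requests: word addresses issued by the LSUs in the same cycle."""
    coalesced = sorted(set(requests))  # reads of the same word are merged
    stall_cycles = max(0, len(coalesced) - words_per_cycle)
    return coalesced, stall_cycles

# Three LSUs read addresses 4, 4 and 8: two distinct words remain after
# coalescing, so the fabric stalls for one extra cycle.
print(arbitrate_loads([4, 4, 8]))  # ([4, 8], 1)
```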
In a second aspect of the present disclosure, there is provided a method of operating a reconfigurable architecture in accordance with any of the previous examples, wherein the method comprises the steps of:
configuring, by the IDs, over the reconfigurable control network, functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
routing, by the reconfigurable data-path network, data between the data memory and the inputs and outputs of one or more FUs of the plurality of FUs using configured data paths.
It is noted that the advantages as provided with respect to the first aspect, being the reconfigurable architecture, are also applicable for the second aspect of the present disclosure, being the method of operating the reconfigurable architecture.
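To make the two method steps tangible, the toy model below first configures a function unit over a (here trivialized) control path and then streams operands over the data path. All classes and names are hypothetical stand-ins, not components defined by the disclosure.

```python
class FU:
    """Toy function unit: its function is set by configuration, not hard-wired."""
    def __init__(self):
        self.func = None

    def configure(self, func):
        self.func = func            # step 1: function configured by an ID

    def fire(self, a, b):
        return self.func(a, b)      # step 2: compute on routed data

def run_kernel(fu, instruction, operand_pairs):
    fu.configure(instruction)                          # over the control network
    return [fu.fire(a, b) for a, b in operand_pairs]   # over the data-path network

print(run_kernel(FU(), lambda a, b: a + b, [(1, 2), (3, 4)]))  # [3, 7]
```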
In an example, any of the control network and the data-path network is a circuit-switched network.
In a further example, the FUs are heterogeneous.
In yet another example, any of the data-path network and the control network comprises a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is connected to an input.
In yet another example, any of the statically reconfigurable data-path network and the control network can change its configuration quickly by switching between multiple, locally stored, configuration contexts.
In another example, the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
In an example, each of the IDs is configured per application or kernel.
The above mentioned and other features and advantages of the disclosure will be best understood from the following description referring to the attached drawings. In the drawings, like reference numerals denote identical parts or parts performing an identical or comparable function or operation.
Brief description of the Drawings
Fig. 1 shows an example of the coarse-grained reconfigurable architecture in accordance with the present disclosure.
Fig. 2 shows an example of a function unit with four inputs and two outputs.
Fig. 3 shows an example of a switch box as may be used by any of the data-path network and the control network in accordance with the present disclosure.
Detailed description
The presented architecture comprises separate programmable, i.e. reconfigurable, control and data paths. The architecture allows light-weight instruction fetcher & instruction decoder units (IFIDs) to be arbitrarily connected to one or more functional units (FUs) over a statically, i.e. per application or kernel, configured interconnect, which can be reconfigured at run-time. Since IFIDs can be connected to more than one FU, a true SIMD processor can be constructed, leading to energy reductions.
The FUs can also be connected to each other over a similar static network, separate from the control network. This allows the architecture to provide spatial application layout, like FPGAs, as well as cycle-based VLIW-like instructions. The fabric may be realized by two circuit-switched networks operating on data buses instead of individual wires; both are usually configured once per application and are then static during program execution.
The data-path network allows the inputs and outputs of functional units to be connected, allowing direct data transfer between FUs. The second network is the control network. This network allows IFIDs to be connected to one or more functional units to create SIMD processors. By connecting multiple IFIDs to the same program counter, generated by the accumulate & branch unit (ABU), VLIW-SIMD processors can be instantiated. A mix between these types of processors is also possible. If there is more than one ABU present within an architecture, there can be multiple independently operating processors instantiated on a single fabric.
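The processor styles described above follow directly from how the control network binds IFIDs, FUs and ABUs. The dictionaries below are an illustrative notation for such bindings, not a configuration format defined by the disclosure.

```python
# One IFID driving several FUs -> SIMD: one instruction, many data lanes.
simd = {"ifid0": ["fu0", "fu1", "fu2", "fu3"]}

# Several IFIDs sharing one ABU-generated program counter -> VLIW-SIMD:
# independent instructions issued in lock-step.
vliw_simd = {"abu0_pc": ["ifid0", "ifid1", "ifid2"]}

# Two ABUs -> two independently operating processors on a single fabric.
two_processors = {"abu0_pc": ["ifid0"], "abu1_pc": ["ifid1", "ifid2"]}

print(simd, vliw_simd, two_processors)
```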
Any system with a small energy budget (e.g. battery-powered devices) that has to perform a significant amount of computation can benefit from the presented architecture, especially when flexibility is needed so that applications can change (through updates) during the product lifetime, or when the system has to run different applications.
In most processor architectures instruction fetching and decoding contributes a significant fraction of the total power. This is especially true for traditional CGRA architectures, where each functional unit typically has a local instruction memory and decoder, or is connected to a global configuration context. Some CGRAs even have functional units that approach light-weight processor cores, with all the corresponding instruction fetch and decoding overhead. CGRAs are often used to perform various forms of digital signal processing, which contain a significant amount of data-level parallelism and can therefore exploit SIMD programming models.
As noted above, the architecture shown in figure 1 may be realized by two FPGA-like networks operating on data buses instead of individual wires, both configured once per application and static during program execution. If there is more than one ABU present within an architecture, there can be multiple independently operating processors, simultaneously running multiple processes or tasks, on the fabric.
The instructions may have a total width of 16 bits; the upper four bits select the type of FU the IFID is controlling and are statically configured by the bit-stream. The lower 12 bits may be controlled per cycle and are loaded from the instruction memories.
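A small decoding sketch for this 16-bit format (the field names are assumptions): the upper nibble selects the FU type and is fixed by the bit-stream, while the lower 12 bits form the per-cycle instruction field.

```python
def decode(instr: int):
    assert 0 <= instr < (1 << 16)   # 16-bit instruction word
    fu_type = (instr >> 12) & 0xF   # upper four bits, static per bit-stream
    per_cycle = instr & 0xFFF       # lower 12 bits, loaded every cycle
    return fu_type, per_cycle

print(decode(0x3A05))  # (3, 2565): FU type 3, per-cycle field 0xA05
```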
The architecture may feature several types of functional units, each responsible for providing different functionality to the configured processor. Most functional units have multiple input ports and output ports connected to the data-path network, which are selected by the instruction. Function units may contain internal registers for buffering inputs and/or outputs. The instruction is provided by the IFID connected to the FU over the control network.
Figure 1 shows an illustrative example of an architecture with different types of functional units, as well as the memory hierarchy around the architecture.
The architecture may comprise a large 32-bit wide global data memory (GM), wherein the LSUs are connected to this memory via a memory arbiter. The arbiter may serve memory requests on a round-robin basis and detect coalesced memory accesses. Besides the global data memory, every LSU has a local memory (LM). Due to the small size of these memories, they may be implemented as low-power register file macros from a commercial vendor, which internally consist of flip-flops. The local memories are private to an LSU and therefore never cause processor stalls. When data reuse is possible inside a kernel, the local memories may be used to store intermediate results if they cannot be kept inside the processing pipeline.
During configuration, the instructions for each IFID may be loaded, together with the bit-stream, from a memory shared between the host processor and the presented architecture. A loader may move the instructions from the binary to the relevant instruction memories; each IFID has its own small instruction memory, implemented as a low-power register file memory macro. This allows loading instructions in parallel during execution.
The structure of the presented architecture, as shown in figure 1, may require that every instruction decoder can control every type of function unit. Alternatively, it would have been possible to implement instruction decoders that are specialized per function unit. However, doing so may reduce the flexibility of the architecture and increase the complexity of the networks, because the total number of required instruction decoders may be higher to provide the same configuration opportunities. For example, two generic instruction decoders can control an ALU and a multiplier, or, in another configuration, an ALU and a load-store unit (LSU); to do the same with specialized instruction decoders would require three decoders.
Fig. 2 shows an example of a function unit with four inputs and two outputs.
The function unit may have four inputs, wherein each input may be connected to the data-path network. The function unit may perform any type of function on the received data and may output the processed data via two outputs, which are likewise connected to the data-path network.
Finally, the function unit may comprise an instruction input, wherein the instruction input may be connected to the control network. The instruction input is used for controlling the function unit, i.e. for providing instructions from the instruction memory, via the instruction decoder, to the function unit.
Fig. 3 shows an example of a switch box as may be used may any of the data-path network and the control network in accordance with the present disclosure.
The switch box may, for example, be utilized in a data-path network. In the present scenario two switch boxes are shown, wherein each switch box may have a plurality of inputs and a plurality of outputs. Using multiplexers, a particular input may be connected to a particular output.
The present disclosure is not limited to the examples disclosed above, and can be modified and enhanced by those skilled in the art, without having to apply inventive skill, within the scope of the present disclosure as set out in the appended claims, for use in any data communication, data exchange and data processing environment, system or network.

Claims

1. A reconfigurable architecture, comprising:
an instruction memory comprising instructions to be executed by a reconfigurable architecture;
a data memory comprising data on which the instructions are to be performed;
a plurality of Function Units, FUs, wherein each FU is arranged to receive data from the data memory at an input, process the received data according to a configured function of the corresponding FU, and output the processed data at an output to the data memory;
Instruction Decoders, IDs, arranged for configuring functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
a reconfigurable data-path network arranged to configure data paths between inputs and outputs of one or more FUs of the plurality of FUs;
a reconfigurable control network, separate from the data-path network, arranged to enable the IDs for configuring functions of the plurality of FUs over the reconfigurable control network.
2. A reconfigurable architecture in accordance with claim 1, wherein any of the control network and the data-path network is a circuit-switched network.
3. A reconfigurable architecture in accordance with any of the previous claims, wherein the FUs are heterogeneous.
4. A reconfigurable architecture in accordance with any of the previous claims, wherein any of the reconfigurable data-path network and the control network comprises a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is connected to an input.
5. A reconfigurable architecture in accordance with any of the previous claims, wherein the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
6. A reconfigurable architecture in accordance with any of the previous claims, wherein each of the IDs is configured per application or kernel.
7. A reconfigurable architecture in accordance with any of the previous claims, wherein at least one of the FUs comprises a Load-Store Unit, LSU, wherein the LSU is arranged for interfacing the data memory.
8. A reconfigurable architecture in accordance with claim 7, wherein the reconfigurable architecture further comprises a memory arbiter configured between the data memory and the at least one LSU, wherein the arbiter is arranged for orchestrating requests from the at least one LSU to access the data memory.
9. A reconfigurable architecture in accordance with any of the claims 7 - 8, wherein each LSU is associated with a local memory for temporarily storing data to be reused.
10. A method of operating a reconfigurable architecture in accordance with any of the previous claims, wherein the method comprises the steps of:
configuring, by the IDs, over the reconfigurable control network, functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
routing, by the reconfigurable data-path network, data between the data memory and the inputs and outputs of one or more FUs of the plurality of FUs using configured data paths.
11. A method in accordance with claim 10, wherein any of the control network and the data-path network is a circuit-switched network.
12. A method in accordance with any of the claims 10 - 11, wherein the FUs are heterogeneous.
13. A method in accordance with any of the claims 10 - 12, wherein any of the reconfigurable data-path network and the control network comprises a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is connected to an input.
14. A method in accordance with any of the claims 10 - 13, wherein the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
15. A method in accordance with any of the claims 10 - 14, wherein each of the IDs is configured per application or kernel.
PCT/EP2020/071042 2019-07-25 2020-07-24 A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture WO2021014017A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962878406P 2019-07-25 2019-07-25
US62/878,406 2019-07-25

Publications (1)

Publication Number Publication Date
WO2021014017A1 true WO2021014017A1 (en) 2021-01-28

Family

ID=71842671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/071042 WO2021014017A1 (en) 2019-07-25 2020-07-24 A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture

Country Status (1)

Country Link
WO (1) WO2021014017A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083313A1 (en) * 2015-09-22 2017-03-23 Qualcomm Incorporated CONFIGURING COARSE-GRAINED RECONFIGURABLE ARRAYS (CGRAs) FOR DATAFLOW INSTRUCTION BLOCK EXECUTION IN BLOCK-BASED DATAFLOW INSTRUCTION SET ARCHITECTURES (ISAs)
US20180189063A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator
US20180267929A1 (en) * 2017-03-14 2018-09-20 Yuan Li Reconfigurable Parallel Processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20746944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20746944

Country of ref document: EP

Kind code of ref document: A1