WO2021014017A1 - A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture - Google Patents

A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture

Info

Publication number
WO2021014017A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
reconfigurable
fus
accordance
memory
Prior art date
Application number
PCT/EP2020/071042
Other languages
French (fr)
Inventor
Mark Wijtvliet
Henk CORPORAAL
Original Assignee
Technische Universiteit Eindhoven
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Technische Universiteit Eindhoven filed Critical Technische Universiteit Eindhoven
Publication of WO2021014017A1 publication Critical patent/WO2021014017A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A reconfigurable architecture, comprising an instruction memory comprising instructions to be executed by a reconfigurable architecture, a data memory comprising data on which the instructions are to be performed, a plurality of Function Units, FUs, wherein each FU is arranged for receiving data from the data memory at an input, processing the received data according to a configured function of the corresponding FU, and for outputting the processed data at an output to the data memory, Instruction Decoders, IDs, arranged for configuring functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory, a reconfigurable data-path network arranged for configuring data paths between inputs and outputs of one or more FUs of the plurality of FUs, a reconfigurable control network, separate from the data-path network, and arranged for enabling the IDs for configuring functions of the plurality of FUs over the reconfigurable control network.

Description

Title
A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture.
Technical field
The present disclosure generally relates to microprocessor architectures. More specifically, it relates to a reconfigurable architecture, for example a coarse-grained reconfigurable architecture.
Background
Modern devices such as smartphones, IoT (Internet of Things) devices and wearable (e.g. medical) devices require high computational performance yet operate on a very constrained energy budget. At the same time, they require a large amount of programming flexibility from the processor. These requirements conflict with each other. A processor architecture that can support all these requirements at the same time will lead to new applications and products in various fields where energy efficiency is key.
The present disclosure is directed to coarse-grained reconfigurable architectures, or CGRAs. Many informal and conflicting definitions of CGRAs exist in the art. The definition that is upheld in the present disclosure is that a CGRA uses hardware flexibility in order to adapt the data-path at run-time to the application. This requires that, before computation can be performed, some sort of configuration has to be processed by the architecture.
Such a configuration is typically static for a particular amount of time, for example for the duration of an application, or a loop kernel within an application.
Further, an architecture is to be defined as a coarse-grained reconfigurable architecture whenever it fulfils two properties: first, a spatial reconfiguration granularity at functional-unit level or above, and second, a temporal reconfiguration granularity at region/loop-nest level or above. Focus should be put on the dominant level when considering these properties. The dominant level is the level at which reconfiguration has the greatest influence on the functionality of the system. Existing CGRAs are often implemented either as coarse-grained Field Programmable Gate Arrays, FPGAs, where execution of an algorithm is performed as a static systolic-array-like structure, or as a network of lightweight processors.
Systolic arrays provide a grid of function units that are configured statically to perform a specific operation for the duration of the application. Although systolic arrays may lead to energy-efficient implementations with good performance, these CGRAs are rather static and do not support operations that influence the control flow of an application, such as branching. The flexibility of these architectures, therefore, is low. Due to the static nature of the implementation there is no time multiplexing of function units, which may lead to a larger area than CGRAs that support function units which are programmable per cycle.
The second method uses a configurable network with multiple processing elements that operate independently or in lock-step. This method can be considered very flexible, but the power draw is much higher. This leads to a lower energy efficiency and is caused by the cycle-based operation of the processing elements and little or no support for spatial execution. Execution control of these processing elements can be performed either by a global 'context' that is switched out when the schedule requires it, or by local decoding per processing element. With local decoding, the execution schedule can be controlled by a global program counter or by local execution control. The latter brings these architectures closer to a networked many-core than to a CGRA.
Traditional CGRAs have a lower energy efficiency and are less flexible than the proposed CGRA.
Summary
It is an object of the present disclosure to provide a reconfigurable architecture that offers high performance as well as power efficiency.
In a first aspect, there is provided a reconfigurable architecture, for example a coarse-grained reconfigurable architecture, comprising:
an instruction memory comprising instructions to be executed by a reconfigurable architecture;
a data memory comprising data on which the instructions are to be performed;
a plurality of Function Units, FUs, wherein each FU is arranged to receive data from the data memory at an input, to process the received data according to a configured function of the corresponding FU, and to output the processed data at an output towards the data memory;
Instruction Decoders, IDs, arranged for configuring functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
a reconfigurable data-path network arranged for configuring data paths between inputs and outputs of one or more FUs of the plurality of FUs;
a reconfigurable control network, separate from the data-path network, and arranged to enable the IDs for configuring functions of the plurality of FUs over the reconfigurable control network.
The above described architecture is able to support flexibility, high performance and high energy/area efficiency all at the same time by allowing run-time construction, by reconfiguration, of application specific processors. In other words, for each application that needs to be executed, a specialized processor can be realized by the architecture. The architecture applies a unique separation between application data and application control over reconfigurable networks. The constructed specialized processors are very energy efficient since the architecture allows good matching between application and processor properties.
The processor architecture of the invention is contrasted with traditional processors, for example VLIW, SIMD and ARM Cortex M0, as well as traditional CGRA methods for reconfiguration, so as to highlight the advantages of the invention.
The processor architecture contains several function units that can perform computation. These units may be heterogeneous and may be able to perform various functions, such as arithmetic and logic operations, multiplication, memory operations, etc. Function units are not restricted to these basic operations but can implement any complex (set of) operation(s). The function units can be connected over a data network to form a specialized data-path. In addition, the control for these functional units, for example what function is performed at a certain moment in time, is performed over a secondary reconfigurable control network. This allows various forms of parallelism in applications to be very efficiently exploited on the architecture, leading to a large energy reduction. Innovative features of embodiments may include a new type of CGRA with a unique reconfiguration flexibility allowing energy-efficient support for pipelining parallelism, operation chaining, data-level parallelism, instruction- and operation-level parallelism and task-level parallelism, and all their combinations, as well as a software framework supporting the architecture.
The processor architecture of the invention is useful in electronic wearable or portable devices such as, but not limited to, communication devices such as smart phones, watches, laptops, computers and brain computer interfaces. Other fields where energy efficiency is of importance, such as digital signal processing and telecommunication networks are also an application area for the invention.
In accordance with the present disclosure, the instruction memory may be prepared by a boot-loader, and may contain the instructions that are to be executed by the reconfigurable architecture.
The data memory may be considered as the global memory and may be either a large on-chip memory or an external memory interfaced to the architecture. An architecture instance may contain one or more load-store units, which is explained in more detail below.
One of the goals of the present architecture is to increase energy efficiency while keeping flexibility. Since energy efficiency is defined as performance divided by power, it can be increased in multiple ways: by speeding up the computation, and by reducing power consumption. Systolic arrays perform well on both of these aspects. Spatial mapping of the computation provides an efficient and pipelined path to compute results, resulting in high performance.
The static nature of the interconnect and operations of systolic arrays provides good power efficiency, which combined with high performance leads to a low energy architecture. However, this only works when there are enough compute resources to spatially map the application to the hardware. Additionally, it is usually not possible to map applications with a complex, or data dependent, control flow due to the lack of flexibility at run-time.
Separation between the data-path network and the control network within the presented architecture is one of the features that allows improvements on both the performance and power aspects. The specialized data-paths in the constructed processor architectures both reduce the number of cycles it takes to execute an application and reduce power by avoiding memory and register file accesses, due to increased bypass opportunities.
In an example, the control network and the data-path network are each configured as a static circuit-switched network. Such a circuit-switched network is typically configured once per application or kernel.
It is further noted that these types of circuit-switched networks may beneficially use switch-boxes to route signals over the network. Switch-boxes may be compared to how a phone network operates: the phone number represents a configuration for the various switches on the line that make a connection between two selected devices, and this connection then becomes a direct connection between these devices. Switch-boxes in the presented architecture may work in a similar manner. A bit-file may provide a configuration for each of the switch-boxes on the network, be it the reconfigurable data-path network and/or the control network. This configuration may contain specifications like: connect the left port of the switch-box to the bottom port of the switch-box.
Thus, any of the reconfigurable data-path and control networks may comprise a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is internally connected to an input.
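As a concrete illustration of the switch-box concept, the following sketch models one statically configured switch box in Python. The class, the port numbering and the bit-file entry format are assumptions made for illustration only, not the implementation of the disclosure.

```python
# Minimal sketch of a circuit-switched switch box: each output is a
# multiplexer that is configured once, e.g. by a bit-file entry, to
# forward exactly one input port. No routing decisions occur at run-time.
class SwitchBox:
    def __init__(self, num_inputs: int, num_outputs: int):
        self.num_inputs = num_inputs
        # select[o] = index of the input internally connected to output o
        self.select = [0] * num_outputs

    def configure(self, output_port: int, input_port: int) -> None:
        """Apply one bit-file entry, e.g. 'connect left port to bottom port'."""
        assert 0 <= input_port < self.num_inputs
        self.select[output_port] = input_port

    def propagate(self, inputs: list) -> list:
        """Combinational pass-through along the configured connections."""
        return [inputs[sel] for sel in self.select]

# Ports 0..3 stand for left, top, right, bottom (hypothetical numbering).
box = SwitchBox(num_inputs=4, num_outputs=4)
box.configure(output_port=3, input_port=0)  # left input -> bottom output
signals = ["L", "T", "R", "B"]
print(box.propagate(signals)[3])            # prints 'L'
```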
It is noted that, in accordance with the present disclosure, the FUs may be heterogeneous, meaning that not all FUs need to perform the same function.
In a further example, the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
An ALU may support arithmetic and logic operations such as addition, subtraction, bit-wise operations, shifting, and comparisons. An ABU supports branching: it evaluates conditions that may be calculated inside the data path and determines whether to branch based on those conditions. The LSU is the interface to the data memory for all other function units in the architecture. The MUL is arranged for performing multiplication operations. These multiplications can be both signed and unsigned and are performed on fixed-point data representations.
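The listed unit types can be pictured with small behavioural models. The sketch below is a hypothetical software model written for this description, not RTL; the operation names and signatures are assumptions.

```python
# Behavioural sketches of three of the FU types named above.
def alu(op: str, a: int, b: int) -> int:
    # Arithmetic and logic operations: addition, subtraction,
    # bit-wise operations, shifting and comparisons.
    ops = {
        "add": lambda: a + b,
        "sub": lambda: a - b,
        "and": lambda: a & b,
        "shl": lambda: a << b,
        "lt":  lambda: int(a < b),
    }
    return ops[op]()

def mul(a: int, b: int) -> int:
    # Signed or unsigned multiplication on fixed-point data reduces to
    # integer multiplication in this toy model (no width or saturation handling).
    return a * b

def abu_branch(condition: int, pc: int, target: int) -> int:
    # The ABU evaluates a condition calculated inside the data path and
    # decides whether the program counter branches.
    return target if condition else pc + 1

print(alu("add", 2, 3), mul(-2, 3), abu_branch(1, 10, 42))  # 5 -6 42
```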
In a further example, each of the IDs is configured per application or kernel.
In accordance with the present disclosure, statically configured means that the networks are configured once per application or kernel and keep this configuration for the duration thereof, similar to the networks used in Field Programmable Gate Arrays, FPGAs. A dynamically configured network, by contrast, can change during program execution; an example thereof is a routing network. Instead of providing a direct path between a source and a sink, the data travels over this type of network as a packet. The packet contains information about the source and destination of the packet, which is processed by routers that provide the connection between network segments. Although dynamic networks are more flexible, since packets can move from any source to any destination, they incur a higher overhead in terms of power, area and latency (cycles).
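The contrast can be made concrete with a minimal sketch: on a statically configured, circuit-switched network only the payload travels, while on a dynamically routed network every transfer carries source and destination information that routers must process. The field names below are illustrative assumptions.

```python
from dataclasses import dataclass

# Static / circuit-switched: the path is fixed by configuration, so a
# transfer is just the data word; there is no per-transfer routing cost.
static_transfer = 0xCAFE

# Dynamic / packet-routed: every transfer carries addressing that each
# router inspects, costing extra power, area and latency (cycles).
@dataclass
class Packet:
    src: int       # source node
    dst: int       # destination node
    payload: int   # the actual data word

dynamic_transfer = Packet(src=2, dst=7, payload=0xCAFE)
print(dynamic_transfer)
```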
Any of the control network and the data-path network may be a statically configured network.
In another example, at least one of the FUs comprises a Load-Store Unit, LSU, wherein the LSU is arranged for interfacing the data memory.
In an even further example, the reconfigurable architecture further comprises a memory arbiter placed in between the data memory and the at least one LSU, wherein the memory arbiter is arranged for orchestrating requests from the at least one LSU towards the data memory.
The presented architecture may comprise one or more load-store units. Each of these units can load data from the data memory, meaning that there can be multiple load and store requests to the global memory within a single clock cycle. To manage these requests, an arbiter is placed between the global memory and the LSUs. When a load or store request arrives, the arbiter may check whether requests can be grouped; e.g., multiple reads of the same memory word can be coalesced into a single memory access. If multiple accesses are required, the arbiter may stall the compute fabric until all data has been collected. Although stalling the computation degrades performance, it may be required to keep the memory accesses in sync with the operations in the compute fabric. In a further example, each LSU may be associated with a local memory for temporarily storing data to be reused.
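A minimal sketch of this arbitration, under simplifying assumptions (word-addressed memory, load requests only, one memory port per cycle): duplicate reads of the same word coalesce into a single access, and any remaining accesses stall the compute fabric.

```python
def arbitrate_loads(requests, words_per_cycle=1):
    """requests: word addresses issued by the LSUs in the same cycle."""
    coalesced = sorted(set(requests))  # reads of the same word are merged
    stall_cycles = max(0, len(coalesced) - words_per_cycle)
    return coalesced, stall_cycles

# Three LSUs read addresses 4, 4 and 8: two distinct words remain after
# coalescing, so the fabric stalls for one extra cycle.
print(arbitrate_loads([4, 4, 8]))  # ([4, 8], 1)
```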
In a second aspect of the present disclosure, there is provided a method of operating a reconfigurable architecture in accordance with any of the previous examples, wherein the method comprises the steps of:
configuring, by the IDs, over the reconfigurable control network, functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
routing, by the reconfigurable data-path network, data between the data memory and the inputs and outputs of one or more FUs of the plurality of FUs using configured data paths.
It is noted that the advantages as provided with respect to the first aspect, being the reconfigurable architecture, are also applicable for the second aspect of the present disclosure, being the method of operating the reconfigurable architecture.
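To make the two method steps tangible, the toy model below first configures a function unit over a (here trivialized) control path and then streams operands over the data path. All classes and names are hypothetical stand-ins, not components defined by the disclosure.

```python
class FU:
    """Toy function unit: its function is set by configuration, not hard-wired."""
    def __init__(self):
        self.func = None

    def configure(self, func):
        self.func = func            # step 1: function configured by an ID

    def fire(self, a, b):
        return self.func(a, b)      # step 2: compute on routed data

def run_kernel(fu, instruction, operand_pairs):
    fu.configure(instruction)                          # over the control network
    return [fu.fire(a, b) for a, b in operand_pairs]   # over the data-path network

print(run_kernel(FU(), lambda a, b: a + b, [(1, 2), (3, 4)]))  # [3, 7]
```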
In an example, any of the control network and the data-path network is a circuit-switched network.
In a further example, the FUs are heterogeneous.
In yet another example, any of the data-path network and the control network comprises a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is connected to an input.
In yet another example, any of the statically reconfigurable data-path network and the control network can change its configuration quickly by switching between multiple, locally stored, configuration contexts.
In another example, the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
In an example, each of the IDs is configured per application or kernel.
The above mentioned and other features and advantages of the disclosure will be best understood from the following description referring to the attached drawings. In the drawings, like reference numerals denote identical parts or parts performing an identical or comparable function or operation.
Brief description of the Drawings
Fig. 1 shows an example of the coarse-grained reconfigurable architecture in accordance with the present disclosure.
Fig. 2 shows an example of a function unit with four inputs and two outputs.
Fig. 3 shows an example of a switch box as may be used by any of the data-path network and the control network in accordance with the present disclosure.
Detailed description
The presented architecture comprises separate programmable, i.e. reconfigurable, control and data paths. The architecture allows light-weight instruction fetcher & instruction decoder units (IFIDs) to be arbitrarily connected to one or more functional units (FUs) over a statically, i.e. per application or kernel, configured interconnect, which can be reconfigured at run-time. Since IFIDs can be connected to more than one FU, a true SIMD processor can be constructed, leading to energy reductions.
The FUs can also be connected to each other over a similar static network, separate from the control network. This allows the architecture to provide spatial application layout, like FPGAs, as well as cycle-based VLIW-like instructions. The fabric may be realized by two circuit-switched networks operating on data buses instead of individual wires; both are usually configured once per application and are then static during program execution.
The data-path network allows the inputs and outputs of functional units to be connected, allowing direct data transfer between FUs. The second network is the control network. This network allows IFIDs to be connected to one or more functional units to create SIMD processors. By connecting multiple IFIDs to the same program counter, generated by the accumulate & branch unit (ABU), VLIW-SIMD processors can be instantiated. A mix between these types of processors is also possible. If there is more than one ABU present within an architecture, there can be multiple independently operating processors instantiated on a single fabric.
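The processor styles described above follow directly from how the control network binds IFIDs, FUs and ABUs. The dictionaries below are an illustrative notation for such bindings, not a configuration format defined by the disclosure.

```python
# One IFID driving several FUs -> SIMD: one instruction, many data lanes.
simd = {"ifid0": ["fu0", "fu1", "fu2", "fu3"]}

# Several IFIDs sharing one ABU-generated program counter -> VLIW-SIMD:
# independent instructions issued in lock-step.
vliw_simd = {"abu0_pc": ["ifid0", "ifid1", "ifid2"]}

# Two ABUs -> two independently operating processors on a single fabric.
two_processors = {"abu0_pc": ["ifid0"], "abu1_pc": ["ifid1", "ifid2"]}

print(simd, vliw_simd, two_processors)
```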
Any system with a small energy budget (e.g. battery-powered devices) that has to perform a significant amount of computation can benefit from the presented architecture, especially when flexibility is needed so that applications can change (through updates) during the product lifetime, or when the system has to run different applications.
In most processor architectures instruction fetching and decoding contributes a significant fraction of the total power. This is especially true for traditional CGRA architectures, where each functional unit typically has a local instruction memory and decoder, or is connected to a global configuration context. Some CGRAs even have functional units that approach light-weight processor cores, with all the corresponding instruction fetch and decoding overhead. CGRAs are often used to perform various forms of digital signal processing, which contain a significant amount of data-level parallelism and can therefore exploit SIMD programming models.
As noted above, the architecture shown in figure 1 may be realized by two FPGA-like networks operating on data buses instead of individual wires, both configured once per application and static during program execution. If there is more than one ABU present within an architecture, there can be multiple independently operating processors, simultaneously running multiple processes or tasks, on the fabric.
The instructions may have a total width of 16 bits; the upper four bits select the type of FU the IFID is controlling and are statically configured by the bit-stream. The lower 12 bits may be controlled per cycle and are loaded from the instruction memories.
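A small decoding sketch for this 16-bit format (the field names are assumptions): the upper nibble selects the FU type and is fixed by the bit-stream, while the lower 12 bits form the per-cycle instruction field.

```python
def decode(instr: int):
    assert 0 <= instr < (1 << 16)   # 16-bit instruction word
    fu_type = (instr >> 12) & 0xF   # upper four bits, static per bit-stream
    per_cycle = instr & 0xFFF       # lower 12 bits, loaded every cycle
    return fu_type, per_cycle

print(decode(0x3A05))  # (3, 2565): FU type 3, per-cycle field 0xA05
```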
The architecture may feature several types of functional units, each responsible for providing different functionality to the configured processor. Most functional units have multiple input ports and output ports connected to the data-path network, which are selected by the instruction. Function units may contain internal registers for buffering inputs and/or outputs. The instruction is provided by the IFID connected to the FU over the control network.
Figure 1 shows an illustrative example of an architecture with different types of functional units, as well as the memory hierarchy around the architecture.
The architecture may comprise a large 32-bit wide global data memory (GM), wherein the LSUs are connected to this memory via a memory arbiter. The arbiter may serve memory requests on a round-robin basis and detect coalesced memory accesses. Besides the global data memory, every LSU has a local memory (LM). Due to the small size of these memories, they may be implemented as low-power register file macros from a commercial vendor, which internally consist of flip-flops. The local memories are private to an LSU and therefore never cause processor stalls. When data reuse is possible inside a kernel, the local memories may be used to store intermediate results if they cannot be kept inside the processing pipeline.
During configuration, the instructions for each IFID may be loaded, together with the bit-stream, from a memory shared between the host processor and the presented architecture. A loader may move the instructions from the binary to the relevant instruction memories; each IFID has its own small instruction memory, implemented as a low-power register file memory macro. This allows loading instructions in parallel during execution.
The structure of the presented architecture, as shown in figure 1, may require that every instruction decoder can control every type of function unit. Alternatively, it would have been possible to implement instruction decoders that are specialized per function unit. However, doing so may reduce the flexibility of the architecture and increase the complexity of the networks, because the total number of required instruction decoders may be higher to provide the same configuration opportunities. For example, two generic instruction decoders can control an ALU and a multiplier, or, in another configuration, an ALU and a load-store unit (LSU); to do the same with specialized instruction decoders would require three decoders.
Fig. 2 shows an example of a function unit with four inputs and two outputs.
The function unit may have four inputs, wherein each input may be connected to the data-path network. The function unit may perform any type of function on the received data and may output the processed data via two outputs, which are likewise connected to the data-path network.
Finally, the function unit may comprise an instruction input, wherein the instruction input may be connected to the control network. The instruction input is used for controlling the function unit, i.e. for providing instructions from the instruction memory, via the instruction decoder, to the function unit.
Fig. 3 shows an example of a switch box as may be used may any of the data-path network and the control network in accordance with the present disclosure.
The switch box may, for example, be utilized in a data-path network. In the present scenario two switch boxes are shown, wherein each switch box may have a plurality of inputs and a plurality of outputs. Using multiplexers, a particular input may be connected to a particular output.
The present disclosure is not limited to the examples disclosed above, and can be modified and enhanced by those skilled in the art, without having to apply inventive skill, within the scope of the present disclosure as set out in the appended claims, for use in any data communication, data exchange and data processing environment, system or network.

Claims

1. A reconfigurable architecture, comprising:
an instruction memory comprising instructions to be executed by a reconfigurable architecture;
a data memory comprising data on which the instructions are to be performed;
a plurality of Function Units, FUs, wherein each FU is arranged to receive data from the data memory at an input, process the received data according to a configured function of the corresponding FU, and output the processed data at an output to the data memory;
Instruction Decoders, IDs, arranged for configuring functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
a reconfigurable data-path network arranged to configure data paths between inputs and outputs of one or more FUs of the plurality of FUs;
a reconfigurable control network, separate from the data-path network, arranged to enable the IDs for configuring functions of the plurality of FUs over the reconfigurable control network.
2. A reconfigurable architecture in accordance with claim 1, wherein any of the control network and the data-path network is a circuit-switched network.
3. A reconfigurable architecture in accordance with any of the previous claims, wherein the FUs are heterogeneous.
4. A reconfigurable architecture in accordance with any of the previous claims, wherein any of the reconfigurable data-path network and the control network comprises a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is connected to an input.
5. A reconfigurable architecture in accordance with any of the previous claims, wherein the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
6. A reconfigurable architecture in accordance with any of the previous claims, wherein each of the IDs is configured per application or kernel.
7. A reconfigurable architecture in accordance with any of the previous claims, wherein at least one of the FUs comprises a Load-Store Unit, LSU, wherein the LSU is arranged for interfacing the data memory.
8. A reconfigurable architecture in accordance with claim 7, wherein the reconfigurable architecture further comprises a memory arbiter configured between the data memory and the at least one LSU, wherein the arbiter is arranged for orchestrating requests from the at least one LSU to access the data memory.
9. A reconfigurable architecture in accordance with any of the claims 7 - 8, wherein each LSU is associated with a local memory for temporarily storing data to be reused.
10. A method of operating a reconfigurable architecture in accordance with any of the previous claims, wherein the method comprises the steps of:
configuring, by the IDs, over the reconfigurable control network, functions of the plurality of FUs in accordance with the instructions comprised by the instruction memory;
routing, by the reconfigurable data-path network, data between the data memory and the inputs and outputs of one or more FUs of the plurality of FUs using configured data paths.
11. A method in accordance with claim 10, wherein any of the control network and the data-path network is a circuit-switched network.
12. A method in accordance with any of the claims 10 - 11, wherein the FUs are heterogeneous.
13. A method in accordance with any of the claims 10 - 12, wherein any of the reconfigurable data-path network and the control network comprises a plurality of switch boxes, wherein each switch box has a plurality of inputs and a plurality of outputs, and wherein each output is connected to an input.
14. A method in accordance with any of the claims 10 - 13, wherein the plurality of FUs comprise any of:
an Arithmetic Logic Unit, ALU;
an Accumulate and Branch Unit, ABU;
a Load-Store Unit, LSU;
a multiplier unit, MUL;
a Register File, RF;
an Immediate Unit, IU.
15. A method in accordance with any of the claims 10 - 14, wherein each of the IDs is configured per application or kernel.
PCT/EP2020/071042 2019-07-25 2020-07-24 A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture WO2021014017A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962878406P 2019-07-25 2019-07-25
US62/878,406 2019-07-25

Publications (1)

Publication Number Publication Date
WO2021014017A1 true WO2021014017A1 (en) 2021-01-28

Family

ID=71842671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/071042 WO2021014017A1 (en) 2019-07-25 2020-07-24 A reconfigurable architecture, for example a coarse-grained reconfigurable architecture as well as a corresponding method of operating such a reconfigurable architecture

Country Status (1)

Country Link
WO (1) WO2021014017A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083313A1 (en) * 2015-09-22 2017-03-23 Qualcomm Incorporated CONFIGURING COARSE-GRAINED RECONFIGURABLE ARRAYS (CGRAs) FOR DATAFLOW INSTRUCTION BLOCK EXECUTION IN BLOCK-BASED DATAFLOW INSTRUCTION SET ARCHITECTURES (ISAs)
US20180189063A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator
US20180267929A1 (en) * 2017-03-14 2018-09-20 Yuan Li Reconfigurable Parallel Processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20746944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20746944

Country of ref document: EP

Kind code of ref document: A1