WO2016159765A1 - Many-core processor architecture and many-core operating system - Google Patents

Info

Publication number: WO2016159765A1
Authority: WIPO (PCT)
Application number: PCT/NL2016/050214
Other languages: French (fr)
Inventor: Gerhardus Keimpe RAUWERDA
Original assignee: Recore Systems B.V.
Priority claimed from NL2014533A (NL2014533B1) and NL2014534A (NL2014534B1)
Application filed by Recore Systems B.V.

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 - Interprocessor communication
    • G06F15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 - Indirect interconnection networks
    • G06F15/17362 - Indirect interconnection networks with hierarchical topologies

Definitions

  • The present invention generally relates to semiconductor structures, and more particularly to an architecture and an operating system for a semiconductor processor employing a distributed, heterogeneous or homogeneous many-core architecture.
  • The term multi-core processor is typically used for a single computing processor with two or a few independent central processing units, i.e. cores, which are used for reading and executing program instructions.
  • These cores are typically integrated onto a single integrated circuit or onto multiple dies in a single chip package. Both cases are referred to as a processor chip.
  • a many-core processor is typically used for a single computing processor with many independent central processing units, i.e. cores, such as tens, hundreds or even thousands of cores.
  • a many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient.
  • the need for more processing power was initially fulfilled by increasing the speed with which instructions on the processor could be executed.
  • This speed, i.e. the clock speed of the processor, cannot however be increased indefinitely, because physical constraints then play a major role and can even become problematic. Further, and no less important, the energy consumption of a processor increases rapidly at such high clock speeds.
  • SMP systems are systems in which a processor comprises two or more identical, i.e. homogeneous, processor cores. These cores are treated equally and none of them has a specific purpose; they are all general purpose processor cores.
  • One of the advantages of these SMP multi-core processors, when it comes to design and cost for example, is that they share resources. That, however, is also a major drawback. Performance can degrade, or at least not reach the level of the sum of the individual cores, due to the sharing of memory for example. Memory latency and bandwidth can cause problems, and the latency and bandwidth problems of one core can affect the others.
  • One of the objects of the present invention is to provide for an efficient many-core processor chip architecture such that the above mentioned challenges can be effectively tackled. It is another object of the invention to provide for a many-core system comprising a many-core processor architecture according to the invention.
  • a many-core processor architecture having a distributed shared memory, wherein said distributed shared memory may be addressed as one logically shared address space, said architecture comprising a plurality of clusters, wherein each of said plurality of clusters comprises:
  • a private memory arranged to be accessed by each of said plurality of arithmetic cores comprised in a single, corresponding cluster,
  • a shared memory being part of said distributed shared memory, and arranged to be accessed by a plurality of arithmetic cores comprised by each of said plurality of clusters;
  • an input-output, IO, interface arranged for inter-connecting said plurality of clusters and for address space translation for translating a logic address to a physical address for access to a shared memory, being part of said distributed shared memory, of any of said clusters of said plurality of clusters, and
  • a control core which is arranged for locally scheduling tasks over said arithmetic cores within a same cluster, for controlling communication between said plurality of clusters via said IO interface, and for controlling access to said private memory and said shared memory.
  • Each cluster, more particularly each IO interface thereof, should be responsible for translating the logical addresses for accessing the shared memory to actual physical addresses. This reduces the overhead in accessing the shared memory.
  • a cluster typically, does not need to access the entire distributed shared memory. Usually, a cluster only needs to have access to the shared memory of clusters it communicates with.
  • the IO interface may be arranged for address space translation for translating a logic address to a physical address to a shared memory of a cluster with which said control core has set up, i.e. controlled, communication. This reduces the size of the used address space significantly.
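The address space translation described above can be illustrated with a short sketch. This is not the patent's implementation; the window size, the class name and the method names are illustrative assumptions.

```python
# Hedged sketch of per-cluster logical-to-physical address translation.
# WINDOW_SIZE and all names are illustrative assumptions, not the patented design.

WINDOW_SIZE = 0x1000  # assumed size of one cluster's shared-memory segment


class IOInterface:
    """Translates logical addresses only for peer clusters with an open channel."""

    def __init__(self):
        self._peers = {}  # peer cluster id -> physical base of its shared memory

    def open_channel(self, cluster_id, physical_base):
        # The control core registers a peer when it sets up communication.
        self._peers[cluster_id] = physical_base

    def translate(self, cluster_id, logical_offset):
        # Only established peers are translatable, which keeps the locally
        # used address space small, as described above.
        if cluster_id not in self._peers:
            raise PermissionError(f"no channel to cluster {cluster_id}")
        if not 0 <= logical_offset < WINDOW_SIZE:
            raise ValueError("offset outside shared-memory window")
        return self._peers[cluster_id] + logical_offset


io = IOInterface()
io.open_channel(cluster_id=3, physical_base=0x8000_0000)
```

Translating an offset for cluster 3 then yields a physical address inside that cluster's window, while a cluster without an open channel is simply not addressable.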
  • The units comprised in the many-core processor architecture, e.g. the IO interface, the control core, the memory, etc., are functional units. These functional units may be present within the architecture as physically separated units, but may also be integrated into combined hardware units.
  • The IO interface, and its function of address space translation as indicated above, may be functionally implemented within the control core itself.
  • a many-core processor architecture is obtained that is constructed from a plurality of multi-core subsystems, i.e. clusters, wherein each multi-core subsystem is responsible for distributing, or allocating, tasks within that particular cluster.
  • each of the clusters comprises a private memory arranged to be accessed by each of the plurality of cores of the corresponding cluster.
  • Each of the arithmetic cores may further comprise their own specific memory, for example cache memory, for data processing.
  • Each cluster further comprises an IO interface such that data as well as control signals can be communicated to other clusters, for example by using message queues implemented in either hardware or software.
  • The control core, for example a General Purpose Processor, is arranged for controlling access to the private memory and the shared memory. Access to the shared memory is, for example, controlled via the IO interface such that arithmetic cores of a further cluster are able to access, i.e. read and write, into the shared memory of a first cluster. In such a way, it is made possible to exchange data between clusters.
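The message queues mentioned above can, when implemented in software, be sketched as a small bounded queue per channel. The class and method names below are illustrative assumptions, not an API from the patent.

```python
from collections import deque


class ClusterMessageQueue:
    """Hedged software sketch of an inter-cluster message queue (illustrative API)."""

    def __init__(self, capacity=16):
        self._q = deque()
        self._capacity = capacity

    def send(self, payload):
        # Bounded capacity models back-pressure on the sending cluster.
        if len(self._q) >= self._capacity:
            return False  # sender must retry later
        self._q.append(payload)
        return True

    def receive(self):
        # Returns None when no message is pending.
        return self._q.popleft() if self._q else None


q = ClusterMessageQueue(capacity=2)
```

A hardware implementation would behave analogously, with the full/empty status exposed to the control cores instead of return values.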
  • One of the advantages of the present many-core processor architecture is that the challenge of parallel processing of tasks can be more easily tackled.
  • tasks may be executed in parallel and may thus be assigned to different cores in case these tasks do not comprise any mutual dependency.
  • the architecture of the present invention provides the advantage that a set of tasks may be assigned to a particular cluster in case the tasks within that set do not comprise any mutual dependency with a task not belonging to that set.
  • Mutual dependency between tasks in a same set of tasks assigned to a cluster will then be tackled by the control core comprised in the cluster, i.e. the control core will schedule these tasks over the plurality of arithmetic cores in that cluster.
  • the many-core processor architecture may further comprise a many-core operating system arranged for distributing tasks over the clusters, and wherein the many-core operating system comprises a plurality of cooperating microkernels distributed over each of the clusters, and arranged for running in a kernel space of the control core of each of the clusters for locally scheduling of the tasks over the arithmetic cores within a same cluster.
  • the control core may, for example, schedule the tasks over the different arithmetic cores in a corresponding cluster based on first in first out, earliest deadline first, shortest remaining time, shortest job first, fixed priority pre-emptive, round-robin or multilevel queue scheduling of tasks over the arithmetic cores within said same cluster.
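As an illustration of one of the listed policies, earliest deadline first simply orders runnable tasks by their deadlines. The sketch below is a minimal illustration under that assumption, not the control core's actual scheduler; the task names are made up.

```python
import heapq


def edf_order(tasks):
    """Return task names in earliest-deadline-first execution order.

    tasks: iterable of (name, deadline) pairs; names are illustrative.
    """
    heap = [(deadline, name) for name, deadline in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)  # pop the task with the nearest deadline
        order.append(name)
    return order
```

For example, three tasks with deadlines 30, 10 and 20 would be executed in the order of deadlines 10, 20, 30.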
  • the input-output, IO, interface may be arranged for inter-connecting said plurality of clusters, for communication between the plurality of clusters in the form of message queuing and for address space translation for access to the shared memory.
  • the IO interface is arranged for address space translation for translating a logic address to a physical address for access to a shared memory of a cluster of said plurality of clusters, wherein said control core has controlled communication with said cluster of said plurality of clusters.
  • the IO interface is arranged to perform said address space translation only for shared memories of clusters of said plurality of clusters with which said control core has already established communication.
  • the many-core processor architecture comprises a distributed many-core operating system arranged for distributing tasks over said plurality of clusters, and wherein said operating system comprises distributed micro-kernels, each of said distributed micro-kernels running within a kernel space of said control core of a respective cluster and arranged for local scheduling of said tasks within said cluster by one or more of said plurality of said arithmetic cores, and arranged for scheduling of tasks outside said cluster by one or more of said plurality of arithmetic cores of a further cluster of said plurality of clusters.
  • Multi-core processors like SMP processors, and in particular processors with even more cores, i.e. many-core systems, have the potential to provide high-performance processing power with a large number of cores; however, increasing the number of cores also increases the complexity of the software management thereof.
  • Many-core processing chips having a plurality of clusters are able to provide such high processing power, but are also considered complex to manage and difficult to exploit to their full potential. With the increase in cores comes an increase in the complexity of efficient control or management of these cores and the tasks performed by each core. Many-core operating systems thus need to cope with this problem.
  • the operating system managing the applications must support multiprocessing, otherwise the additional processors remain idle and the system in principle functions as a uniprocessor system.
  • both an operating system and the applications running thereon which are designed for a multiprocessor architecture are considerably more complex, as regards instruction sets for example.
  • Homogeneous processor systems can require different processor registers for dedicated instructions in order to be able to efficiently run such a multiprocessor system.
  • Heterogeneous systems on the other hand can implement different types of hardware for handling different instructions, instruction-sets or uses.
  • SMP systems share a single shared system bus or the like for accessing shared memory.
  • When memory banks are dedicated to different processors, which increases memory bandwidth, different problems arise, for example in inter-process communication when data is to be transferred from one processor to another.
  • each of the clusters comprises a plurality of arithmetic cores, a shared memory arranged to be accessed by each of said plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting the plurality of clusters, and a control core arranged for locally scheduling tasks over the arithmetic cores within a same cluster; in principle at least some of the above identified problems of prior art SMP architectures can be solved. For example, due to the dedicated distributed shared memory the amount of intra-cluster data throughput can be decreased significantly, hence reducing memory bandwidth problems within the chip.
  • an operating system is running on a General Purpose Processor, GPP, outside of the clusters.
  • This GPP comprises an operating system and, most importantly, the kernel of the operating system.
  • the operating system can however also run within one or more of the clusters, which are especially allocated for the operating system.
  • the kernel is the core of the operating system and handles amongst others the management of the memory, interrupts and scheduling of the processes or tasks over the one or more processor cores. Most kernels also handle input/output of the processes and most control of the hardware within the computer, such as a network interface, graphical processing unit and the like.
  • the operating system and in particular the kernel thereof can efficiently manage its tasks.
  • although the processing power can in principle be very high, a lack of efficient management of such a system by an operating system prevents maximum employment of the system.
  • a many-core processor architecture comprising a plurality of heterogeneous clusters.
  • Each of these clusters comprises a plurality of heterogeneous arithmetic cores, a shared distributed memory arranged to be accessed by each of the plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting the plurality of clusters, and a control core arranged for locally scheduling tasks over the arithmetic cores within a same cluster.
  • each cluster comprises both its own dedicated private memory, located physically within the cluster, for handling all tasks within the cluster, and its own control core for managing these tasks and this memory; these are, in principle, arranged for local scheduling of the tasks.
  • each control core of a cluster could comprise a full operating system. Such would however require too high a resource footprint.
  • managing such a system is complex as well since these operating systems have to be controlled as if it were a cluster of computers.
  • the footprint is reduced, but still too large for efficient use.
  • a microkernel that is sufficiently small to be comprised in a control core of a cluster of a many-core processing architecture without the need for the control core to increase in resource footprint.
  • the microkernel is not a duplicated or replicated version of a kernel running in a further cluster but is a distributed microkernel which enables inter-process communication between or over the microkernels and hence the clusters.
  • the operating system, and in particular the user space applications, can be executed from a central general purpose processor on the architecture, outside of the clusters, while the core of the operating system, the kernel, in particular the microkernel or distributed microkernel, can be distributed as a plurality of cooperating microkernels running in a kernel space of the control cores of each of the clusters for locally scheduling the tasks over the arithmetic cores within a same cluster.
  • control core is arranged for scheduling said tasks outside said cluster by one or more of said plurality of arithmetic cores of a further cluster of said plurality of clusters through a shared distributed memory which can be addressed by said cluster and said further cluster.
  • the many-core architecture functions optimally and maximizes its capabilities when the required performance can be scaled. By increasing or decreasing the number of cores needed, or by using dedicated application- or function-specific cores, the capabilities of the architecture are fully utilized. Scaling is to be interpreted very broadly. Preferably the scaling is performed with some level of automation.
  • the application developers preferably need to be able to develop applications in a way they are familiar with, and in programming languages and models also used when programming for conventional single- or multi-core (non many-core) architectures.
  • the operating system of the many-core architecture is able to cope with conventional programming models and languages, wherein an additional intermediate many-core programming platform is introduced which is able to receive conventional programming models and programming language code on the one hand, and to output application tasks or instructions optimized for the many-core architecture on the other.
  • the many-core architecture enables this by providing a microkernel which is allocated to every control core of a cluster.
  • the microkernel offers basic functionality, including, but not limited to, inter-process communication, inter-core message passing, thread scheduling and thread management.
  • the kernel is kept small, only employing the minimum amount of functions and can thus be kept very compact in size. Such a microkernel thus has little overhead.
  • Tasks that would normally be handled by the kernel are, in a microkernel according to the invention, performed outside of the microkernel. Examples of these tasks are the middleware components of the system, such as file services, handling of the network communication (in/out of the system, not in between the clusters), and communication with and control of the peripherals (device drivers). These tasks can be handled outside of the clusters' microkernels, by the operating system.
  • the scheduler of each of said distributed micro-kernels is arranged for any one or more of first in first out, earliest deadline first, shortest remaining time, shortest job first, fixed priority pre-emptive, round-robin or multilevel queue scheduling of tasks over the arithmetic cores within said same cluster.
  • the skilled person will appreciate that the invention is not limited to the aforementioned scheduling policies. Other scheduling policies could also be used.
  • heterogeneous many-core architectures benefit from employing different types of scheduling.
  • efficient use of all capabilities of many-core architectures is challenging.
  • these tasks can be grouped by for example whether or not to be handled real-time, or to match a certain scheduling model like round-robin, shortest job first, etc.
  • These tasks could be assigned to clusters that match the scheduling model.
  • the other way round is more preferred, i.e. wherein the scheduling policy/model in a cluster can be adapted respecting the needs of the application.
  • some clusters could be arranged for real-time instruction handling, others for high resource but non-real-time instruction handling.
  • the system is arranged for a programming model wherein the application developer does not have to take care of all handling of the instructions by certain dedicated clusters or cores within the cluster.
  • the system could be arranged to determine from the application which scheduling model fits best. These processes thus operate autonomously but could in an example also be manipulated by a user input, e.g. by the application developer.
  • the system could be arranged to comprise certain registers, or flags which indicate a certain required or preferred handling of an instruction or a task, e.g. real-time, before a certain deadline, with certain priority over other task, or at a moment another task has been completed, etc. etc.
  • Some tasks require a certain handling and could therefore require a certain cluster layout.
  • the operating system is arranged to detect whether this is the case and which of the available clusters correspond to the demands of handling the task, and thus direct the task to be handled and executed by the microkernel of that particular cluster or any cluster having the required layout/functionality. If the task requires a certain cluster layout or functionality and/or a certain scheduler, and the operating system is arranged to detect this prescribed requirement, the task could be considered to be bound to a certain core, e.g. core-binding.
  • the tasks could also comprise additional arguments, for example in the form of meta-data or more specific in the form of a flag or register data type. The additional arguments or meta-data indicates a single preference, or multiple subsequent preferences.
  • the preferences can relate to a single preferred or preferred order of executing the task on a particular cluster and/or a particular core within a cluster and/or a matching cluster type or core type and/or a cluster running a microkernel with a certain scheduling policy.
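Matching a task's declared preferences, e.g. flags carried in its meta-data, against the available cluster layouts can be sketched as a simple first-fit search. The attribute names and cluster names below are illustrative assumptions, not part of the invention as claimed.

```python
def pick_cluster(task_flags, clusters):
    """Return the first cluster whose attributes satisfy all task flags.

    task_flags: list of required attributes from the task's meta-data.
    clusters: dict mapping cluster name -> set of attributes it offers.
    """
    for cluster, attrs in clusters.items():
        if all(flag in attrs for flag in task_flags):
            return cluster
    return None  # no match: fall back to default scheduling


# Illustrative cluster layouts: one round-robin cluster, one real-time cluster.
clusters = {
    "cluster0": {"round-robin"},
    "cluster1": {"real-time", "earliest-deadline-first"},
}
```

A real system would additionally honor an ordered list of preferences, trying each in turn, as the text above suggests.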
  • Tasks that are executed over multiple clusters could require inter-cluster communication via the IO interface through the distributed shared memory.
  • Allocating only part of the shared memory and translating the logical addresses for accessing the shared memory to actual physical addresses helps reduce the memory bandwidth overhead compared with conventional shared-memory multi-core architectures, while still enabling communication between the microkernels at kernel level through inter-process communication.
  • the many-core processor architecture can be arranged to perform single tasks multiple times, for example subsequently, or in parallel. By running single tasks multiple times, the system is better protected against errors. Only if the outcome, i.e. the result, of each of the parallel tasks is the same will the result be used, with increased reliability. This could also be used for error correction, or for a fault-tolerant system, in which a result deviating from the majority over multiple (at least three) parallel or serially executed tasks is detected.
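The redundant-execution scheme above amounts to majority voting over the replica results. A minimal sketch, assuming the replica results are directly comparable values (the function name is illustrative):

```python
from collections import Counter


def vote(results):
    """Majority vote over N >= 3 redundant executions of the same task.

    Returns (value, trusted): trusted is True only if a strict majority
    of the replicas agree, mirroring the fault-tolerance idea above.
    """
    value, count = Counter(results).most_common(1)[0]
    return value, count > len(results) // 2
```

With three replicas, a single faulty core is outvoted by the other two; if no strict majority exists, the result is flagged as untrusted rather than used.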
  • the many-core processor architecture comprises a plurality of heterogeneous clusters.
  • a homogeneous architecture with shared global memory is easier to program for parallelism, that is, when a program can make use of the whole core, compared to a heterogeneous architecture where the cores do not have the same instruction set. It was the insight of the inventors, however, that in the case of an application which lends itself to being partitioned into long-lived threads of control with little or regular communication, it makes sense to manually and/or automatically put the partitions onto cores that are specialized for that specific task.
  • a heterogeneous architecture is in the context of the invention to be interpreted in its broadest sense. The heterogeneousness can be present at cluster level, hence, different types of clusters could exist beside each other and/or at core level, hence, different types of cores are integrated within a single cluster. The invention is not restricted to any of the particular types but is suitable for both.
  • the plurality of arithmetic cores comprises any of a Digital Signal Processor, DSP, a Hardware Accelerator core and a Processing General Purpose Processor, GPP.
  • each cluster is able to effectively handle any task, as each cluster may be equipped with different types of arithmetic cores suitable for a specific task.
  • each of the clusters comprises between 8 and 128 arithmetic cores, preferably between 32 and 64 arithmetic cores.
  • the IO interface comprises a Network-on-Chip, NOC, interface for communication between the plurality of clusters.
  • the many-core processor architecture comprises between 50 and 1000 clusters, preferably between 100 and 400 clusters, even more preferably between 200 and 300 clusters.
  • control core is a General Purpose Processor.
  • a computing system comprising a many-core processor architecture, or a plurality of many-core processor architectures, according to any of the examples provided above.
  • the computing system may comprise a hierarchical structure of clusters, said computing system comprising a plurality of main clusters, each of said main clusters comprises a plurality of clusters, wherein one of said plurality of clusters in a main cluster is arranged for inter-connecting said plurality of main clusters, and arranged for locally scheduling of tasks over said clusters within a same main cluster, for controlling communication between said plurality of main clusters via said IO interface.
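The hierarchical structure above implies two-level communication: traffic within a main cluster stays local, while traffic between main clusters passes through the sub-cluster arranged for inter-connecting them. A sketch, assuming addresses are (main cluster, sub-cluster) pairs and sub-cluster 0 is the designated one (both assumptions are illustrative):

```python
def route(src, dst):
    """Return the hop sequence from src to dst in a two-level cluster hierarchy.

    Addresses are (main_cluster, sub_cluster) pairs; sub-cluster 0 of each
    main cluster is assumed to be the one arranged for inter-connecting
    the main clusters.
    """
    src_main = src[0]
    dst_main = dst[0]
    if src_main == dst_main:
        return [src, dst]  # same main cluster: direct local hop
    # Different main clusters: go via each side's designated sub-cluster.
    return [src, (src_main, 0), (dst_main, 0), dst]
```

The same pattern extends recursively to deeper hierarchies, with one designated cluster per level.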
  • the invention is not limited to the particular examples disclosed below in connection with a particular type of many-core processor architecture or with a particular many-core processor system.
  • This invention is not limited to chip implementations in CMOS.
  • the architecture is technology agnostic, and can be implemented in FPGA and/or ASIC. So, according to all examples of the invention, a many-core processor architecture is considered to be implemented in one or more chips, such as FPGA or as ASIC.
  • Figure 1 is a schematic view of a topology of a many-core processor architecture according to the present invention.
  • Figure 2 is a schematic view of a topology of a computing system comprising a plurality of many-core processor architectures according to the present invention.
  • Figure 3 is a schematic view of a cluster layout according to the present invention.
  • Figure 4 is a schematic view of a topology of a many-core processor architecture with an operating system implementation according to an aspect of the present invention.
  • Figure 5 is a schematic view of a kernel layout for a many-core processor architecture according to an aspect of the invention.
  • Figure 6 is a schematic view of assigning tasks of an application to cores of the many-core processor architecture according to an aspect of the invention.
  • Figure 1 is a schematic view of a topology of a many-core processor architecture 1 according to the present invention.
  • the many-core processor architecture 1 comprises a plurality of clusters 2, of which only four are shown.
  • the architecture 1 comprises many more clusters, for example, 100 - 1000 clusters.
  • the present invention is not limited in the number of clusters to be incorporated in the many-core architecture 1, as the topology of the present invention is scalable, i.e. it is applicable for just a few clusters 2 up to many hundreds or even thousands of clusters 2.
  • a cluster 2 comprises a plurality of arithmetic cores 5, 6, 7, for example Digital Signal Processors, DSPs, 5 Hardware Accelerator cores 6 and Processing General Purpose Processors, GPPs, 7.
  • the clusters 2 may be heterogeneous clusters 2 as they comprise a variety of different types of arithmetic cores 5, 6, 7.
  • Each cluster 2 of the many-core processor 1 may further comprise different types of arithmetic cores 5, 6, 7.
  • each cluster 2 may have their own set of various arithmetic cores 5, 6, 7.
  • Each cluster 2 typically comprises a memory 3, which is a physical piece of hardware under control by a control core 4.
  • the memory may be split in a private memory and a shared memory (not shown). The difference between these types of memories is that a private memory comprised in a particular cluster 2 may only be accessed by the arithmetic cores 5, 6, 7 of that same cluster.
  • the shared memory comprised in a cluster is a part of a larger distributed shared memory, wherein each of the shared memories of each of the clusters 2 may be addressed as one logically shared address space.
  • the term shared thus does not mean that there is a single centralized memory in the architecture 1 , but shared means that the address space of all the physical shared memories of each of the clusters 2 is shared.
  • The access to the memory 3, i.e. the local memory as well as the shared memory, is controlled by a control core 4.
  • a control core 4 comprised by a particular cluster 2 is responsible for controlling access to the local memory and the shared memory comprised by that particular cluster 2.
  • Each cluster further comprises an input/output, IO, interface 9 arranged for inter-connecting the plurality of clusters 2 together. This means that the clusters 2 can communicate with each other via their corresponding input/output interface 9.
  • the control core 4 is also arranged to control the communication at the IO interface 9, i.e. the communication between a plurality of clusters 2.
  • the client application core/system 10 is then also connected to the clusters 2 via a same type of IO interface 9.
  • the IO interface 9 comprises a Network-on-Chip, NOC, interface for communication between said plurality of clusters 2.
  • There are several downsides to using a bus: the bandwidth is limited and shared, the speed goes down as the number of clusters 2 grows, there is no concurrency, pipelining is tough, there is a central arbitration and there are no layers of abstraction.
  • The advantages of using a NOC are that the aggregate bandwidth grows, that the link speed is unaffected by the number of clusters 2, that there is concurrent spatial reuse, that pipelining is built-in, that arbitration is distributed and that abstraction layers are separated.
  • Figure 1 shows a topology of a many-core processor architecture which is a homogeneous topology.
  • the topology may also be a heterogeneous topology, wherein the layout of each cluster may be different.
  • Figure 2 is a schematic view of a topology of a computing system 50 comprising a plurality of many-core processor architectures according to the present invention.
  • a computing system 50 is displayed having a hierarchical structure of clusters.
  • the computing system 50 comprises a plurality of main clusters 61, 62, 63, 64, each of said main clusters comprising a plurality of clusters 51, 52, 53, 54, i.e. sub-clusters.
  • One of the plurality of clusters 51, 52, 53, 54, i.e. sub-clusters, of a single main cluster 61, 62, 63, 64 is appointed as the responsible sub-cluster for the corresponding main cluster 61, 62, 63, 64.
  • that sub-cluster is arranged to, for example, distribute tasks among the sub-clusters and for communication between the sub-clusters, etc.
  • the input-output, IO, interface may be arranged for inter-connecting said plurality of clusters, for communication between the plurality of clusters in the form of message queuing and for address space translation for access to the shared memory.
  • address space translation according to the present invention is especially useful for reducing the size of the total address space, thereby reducing the overhead.
  • a cluster is able to access only a part of the shared memory at once. That is, only that part of the shared memory physically present in the clusters with which the control core has initiated communication. This is advantageous as in such a case, the cluster does not need to utilize the total address space available for the shared memory, but only needs to utilize the address space for the shared memories in the clusters with which a communication is established.
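The partial address-space mapping described above can be sketched as follows. This is an illustrative sketch only; the class, method names and window size are assumptions for illustration and not taken from the patent:

```python
# Illustrative sketch of per-cluster address space translation: a cluster
# maps logical addresses only for the shared memories of clusters with
# which communication has been established, keeping its translation table
# (and hence the used address space) small.

class IOInterface:
    WINDOW = 0x1000  # assumed size of each remote shared-memory window

    def __init__(self):
        self.windows = {}   # logical base -> (remote cluster id, physical base)
        self.next_base = 0

    def open_channel(self, cluster_id, physical_base):
        """Map a remote cluster's shared memory into the local logical space."""
        base = self.next_base
        self.windows[base] = (cluster_id, physical_base)
        self.next_base += self.WINDOW
        return base

    def translate(self, logical_addr):
        """Translate a logical address to (cluster id, physical address)."""
        base = logical_addr - (logical_addr % self.WINDOW)
        if base not in self.windows:
            raise ValueError("no communication established for this address")
        cluster_id, physical_base = self.windows[base]
        return cluster_id, physical_base + (logical_addr - base)

io = IOInterface()
base = io.open_channel(cluster_id=3, physical_base=0x8000)
assert io.translate(base + 0x10) == (3, 0x8010)
```

Only windows for established communication channels exist, so a cluster never needs to reserve logical address space for the entire distributed shared memory.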
  • FIG. 3 is a schematic view of a layout of a cluster 101 according to the present invention.
  • the cluster 101 comprises a plurality of arithmetic cores 102, 104, 105.
  • a variety of different cores may exist, for example three Digital Signal Processors, DSPs, 102, seven hardware accelerators 104 and six General Purpose Processors, GPPs, 105.
  • Each of these arithmetic cores 102, 104, 105 may in addition to the memory 106 of the cluster 101 comprise its own memory 112.
  • the local core memory or tightly coupled memory 112 could be used for many purposes, for example as a cache, as a scratch pad, or as random access memory for the local core.
  • a cache memory 112 is thus intended to be used by specifically one arithmetic core; hence, it is a tightly coupled memory, i.e. memory which is specifically coupled to a single core.
  • the cluster 101 comprises one physical hardware memory chip 106, which is further divided into a private memory 107 arranged to be accessed by each of said plurality of arithmetic cores comprised in that single, corresponding cluster 101, and a shared memory 108 being part of a distributed shared memory, and arranged to be accessed by a plurality of arithmetic cores comprised by each of a plurality of other clusters.
  • a memory architecture wherein the physically separate memories 108 can be addressed as one logically shared address space.
  • shared does not mean that there is a single centralized memory but shared essentially means that the address space is shared.
  • the cluster 101 further comprises an input-output, IO, interface 109 arranged for inter-connecting said plurality of clusters 101, also via the connecting lines 111, and a control core 110 which is arranged for locally scheduling of tasks over said arithmetic cores within a same cluster, for controlling communication between said plurality of clusters via said IO interface, and for controlling access to said private memory and said shared memory.
  • the control core 110 is thus responsible for the scheduling part, i.e. scheduling tasks over the plurality of arithmetic cores 102, 104, 105, the communication between clusters, and for the memory 106 access.
  • the architecture 1, 50 described in the figures, also referred to as Multi Processor System on Chip, MPSoC, or Multi Core System on Chip, MCSoC, comprises a plurality of arithmetic cores 5, 6, 7, distributed over the many clusters 2.
  • on these cores 5, 6, 7 the actual processing of the tasks of the application running on the Operating System, OS, is performed.
  • in Figure 4 the Operating System, OS, 410 of the many-core system 400 is shown.
  • the OS is comprised of two parts, the part 411 running on the client application core and the part 412 running on the individual subsystems, i.e. clusters. Processing of tasks is under control of the OS.
  • the OS of a computer system has the role, amongst others, of managing resources of the hardware platform. The resources are made available to the application or applications running on the OS. Examples of the resources that are managed by the OS include the instructions that are to be executed on the processor core(s), the Input/Output, IO devices, the memory (allocation), interrupts, etc.
  • a modern OS allows the computer system to run multiple programs or threads of a program concurrently; hence it is called a multi-tasking OS.
  • a multi-tasking OS may be achieved by running concurrent tasks in a time-sharing manner, and/or by dividing tasks over multiple execution units, e.g. in a dual, multi or many core computer system. By employing time-sharing, the available resources of the processor core are divided between multiple processes which are interrupted repeatedly in time-slices by a task scheduler which forms part of the OS.
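The time-sharing principle can be illustrated with a minimal round-robin time-slice sketch. The task representation and slice granularity are assumptions for illustration only:

```python
from collections import deque

def round_robin(tasks, slice_units=2):
    """Run tasks in fixed time slices; each task is (name, remaining_units).
    Returns the order in which time slices were granted."""
    queue = deque(tasks)
    trace = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(slice_units, remaining)   # run for at most one time slice
        trace.append((name, ran))
        if remaining - ran > 0:
            queue.append((name, remaining - ran))  # pre-empted, re-queued
    return trace

trace = round_robin([("A", 3), ("B", 1)])
# A runs 2 units and is pre-empted, B runs 1 unit and finishes,
# then A runs its last unit
assert trace == [("A", 2), ("B", 1), ("A", 1)]
```

Each process thus receives a regular slice of processor time, which is the guarantee the pre-emptive scheduling discussion below relies on.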
  • pre-emptive multitasking may be employed.
  • with pre-emptive scheduling any process executed on a processor core can be interrupted by the scheduler and suspended in favour of a different process that is to be invoked and executed on the processor core. It is up to the scheduler to determine when the current process is suspended and which process is to be executed next. This can be employed according to different types of scheduling regimes.
  • Pre-emptive scheduling has some advantages as compared to a cooperative multi-tasking OS wherein each process must explicitly be programmed to define if and when it may be suspended. With pre-emptive scheduling all processes will get some amount of CPU time at any given time. This makes the computer system more reliable by guaranteeing each process a regular slice of processor core time. Moreover, if for example data is received from an IO device that requires immediate attention, processor core time is made available and timely processing of the data can be guaranteed.
  • the OS consists of several components that all have their own function and work together to execute applications on the OS. All applications need to make use of the OS in order to access any hardware in the computer system. The components of the OS operate at different levels. These can be represented in a model with the following parts in the order from high level to low level: user applications, shared libraries, device drivers, kernel, hardware. In such a model, the levels above the kernel are also called user-space, or are said to run in user-mode.
  • One of the most important parts of an OS is the kernel, which provides basic control over the hardware in the computer system. Memory, CPU, IO, peripherals, etc. are all under control of the kernel. The applications running on the OS can thus make use of these parts of the computer via the kernel.
  • a monolithic kernel is a kernel in which most of the processes are handled in a supervisor-mode and form part of the kernel itself.
  • Microkernels, ⁇ -kernel, as shown in figure 5 are kernels in which most of the processes are handled in a user-mode, i.e. in user-space, and thus do not directly form part of the kernel. These processes can in them self communicate with each other directly, such without interference of the kernel.
  • Microkernels are relatively small in size since they comprise only the most fundamental parts of a kernel such as the scheduler, memory management and Inter-Process Communication, IPC, 510, 520, 530. These parts are controlled from the supervisor-mode, i.e. the kernel-space, with high restrictions; all other parts are controlled from a user-mode, with lower restrictions. As indicated above, microkernels generally provide a multi-level security hierarchy with a supervisor/kernel-mode at one end and a user-mode at the other end.
  • An application or task running in supervisor-mode has privileges necessary to obtain access to the resources of the computer system, i.e. the hardware. This access can be considered unsafe since misuse or an error can result in system failure.
  • an application or task running in user-mode can only access those parts of the OS that are considered safe, i.e. virtual memory, threads, processes, etc.
  • In a microkernel the access to these unsafe parts can be given to an application or task running in user-mode. Much of the functionality that in a monolithic kernel resides inside the kernel is in a microkernel moved to the user-space, i.e. running in user-mode.
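The separation sketched above, a small kernel providing only message routing while services run in user mode, can be illustrated as follows. All class, service and handler names are hypothetical and serve only to contrast the microkernel structure with a monolithic one:

```python
# Minimal sketch of microkernel-style IPC: the "kernel" only routes
# messages between registered user-mode services; the services themselves
# host the functionality that a monolithic kernel would keep internal.

class MicrokernelIPC:
    def __init__(self):
        self.services = {}  # service name -> handler (a user-mode process)

    def register(self, name, handler):
        self.services[name] = handler

    def send(self, service, message):
        """Route a message to a user-mode service and return its reply."""
        if service not in self.services:
            raise KeyError("unknown service: " + service)
        return self.services[service](message)

kernel = MicrokernelIPC()
store = {}  # backing state of a hypothetical user-mode file service

def fs_handler(msg):
    op, path = msg[0], msg[1]
    if op == "write":
        store[path] = msg[2]
        return "ok"
    return store.get(path)

kernel.register("fs", fs_handler)  # the "file system" lives in user space

assert kernel.send("fs", ("write", "/tmp/x", "data")) == "ok"
assert kernel.send("fs", ("read", "/tmp/x")) == "data"
```

The design choice illustrated: a fault in `fs_handler` can only corrupt its own user-space state, not the routing kernel, which is the reliability argument usually made for microkernels.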
  • MPSoCs however have some constraints. Execution of a few concurrent applications on a computer system with a single processor core can be performed simply by a conventional multi-tasking OS. Execution of plural concurrent applications on a computer system with many processor cores is, at least in an efficient manner, challenging. This makes the usability of such systems non-optimal as extensive knowledge is needed for effective use of such systems. Conventional applications may not perform to the optimal potential of an MPSoC. Applications should be able to cope with high amounts of parallelism and employ most of the cores of the system.
  • a many-core OS 410 is proposed as illustrated in Figure 4 which consists of at least two parts, being the part 411 running on the client application core/system and the part 412 running over the multiple subsystems/clusters.
  • the first can be a conventional OS with a conventional kernel comprised therein. Examples thereof are Linux, Unix or Windows based OSs. This could however also be a custom OS, designed from scratch or more likely, Linux, Unix or Windows based.
  • the application runs on the client application side 410 of the system, as shown in Figure 4.
  • the OS running in the client general purpose processor is arranged to cooperate with the distributed OS 412 running over the total of clusters.
  • the path of execution can be considered to start at the client system side, via the (conventional) OS 411 running thereon, towards a distributed OS 412 at a lower hierarchy level.
  • the assigning of the tasks towards the clusters is handled by the general purpose processor of the client system, and hence by the kernel of the OS running thereon.
  • the individual general purpose processors 421, 422, 423, 424 thereof are arranged to handle the individual tasks within the cluster.
  • In order for the kernel of the OS on the client system to determine which tasks are to be assigned to which cluster, the kernel requires additional information to base that decision on. To that end it is proposed to assign such information to the tasks such that the kernel 415 can assign each task in an effective manner and not merely distribute the tasks evenly over the clusters. Due to the heterogeneous configuration of the system the clusters can have different components. Some clusters are for example arranged for more DSP processing while others are better arranged for other hardware acceleration tasks.
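One possible way to use such per-task information, sketched here purely for illustration (the capability names, cost model and tie-breaking rule are assumptions, not taken from the patent), is to match a task's required capability against what each heterogeneous cluster offers:

```python
# Hypothetical sketch: assign tasks to clusters by matching a task's
# required capability against each heterogeneous cluster's components,
# breaking ties (or falling back) by current load.

clusters = {
    "c0": {"caps": {"dsp"}, "load": 0},          # DSP-oriented cluster
    "c1": {"caps": {"accelerator"}, "load": 0},  # hardware-acceleration cluster
    "c2": {"caps": {"gpp"}, "load": 0},          # general purpose cluster
}

def assign(task):
    # prefer clusters offering the required capability; least-loaded wins
    capable = [c for c, info in clusters.items() if task["needs"] in info["caps"]]
    pool = capable or list(clusters)  # no capable cluster: fall back to all
    chosen = min(pool, key=lambda c: clusters[c]["load"])
    clusters[chosen]["load"] += task["cost"]
    return chosen

assert assign({"needs": "dsp", "cost": 2}) == "c0"
assert assign({"needs": "accelerator", "cost": 1}) == "c1"
# a second DSP task still goes to c0 despite its higher load,
# because capability matching takes precedence over even distribution
assert assign({"needs": "dsp", "cost": 1}) == "c0"
```

This contrasts with distributing tasks evenly: the capability match decides first, and load only breaks ties within equally suitable clusters.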
  • Activities are a means to specify for which cores the application has implemented its functions, which kernels and configurations are available for hardware accelerators, etc.
  • the application is not divided into instructions executed as threads but decomposed into parts, i.e. tasks, and assigned with additional scheduling information for handling by the scheduler of the kernel on the basis of, for example, responsiveness, input-and-output, resource budget, performance budget, etc.
  • an application can consist for example roughly of Graphical User Interface, GUI, instructions and data processing instructions as well as data input- output instructions.
  • GUI Graphical User Interface
  • In accordance with the invention it is proposed to decompose the application into individual tasks that are arranged to be executed within a cluster of the many-core system.
  • These tasks have additional scheduling information on the basis of which the scheduler of the kernel 415 running in the general purpose processor of the client application system can decide to which subsystem/cluster each task is to be assigned. Communication thereof is arranged via network-on-chip communication units 441, 442, 443, 444 and the use of the shared address space within the distributed shared memory of the memory units of the clusters 431, 432, 433, 434.
  • GUI instructions have other requirements, such as responsiveness, than I/O instructions, which rely more on resource usage.
  • a GUI task comprised of GUI-related instructions is more likely to be assigned a responsiveness profile and scheduled on a subsystem/cluster that comprises an architecture that is arranged to that end.
  • the local microkernel 421, 422, 423, 424 of the distributed OS 412 is arranged to locally schedule the tasks over the individual arithmetic cores such as the DSPs and HW accelerators.
  • the general purpose processor of the cluster can determine an over- or under-capacity within the cluster. This depends on the architecture of the cluster, e.g. the amount of cores, types of cores, etc., as well as the amount of tasks assigned to the individual cores. If the micro-kernel determines a resource shortage within the cluster, it can issue a resource request to other (neighbouring) clusters to determine if tasks can be handed over to one of these other clusters.
  • the overall resource capacity of the system is used more efficiently.
  • the micro-kernel can also determine inefficient use of local resources due to a low instruction queue, for example. In that case the micro-kernel can signal its free capacity to other clusters, either directly via inter-cluster communication via the network-on-chips 441, 442, 443, 444 or via the higher-level coordinating scheduler of the kernel 415 in the client system.
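The resource-balancing behaviour described in the preceding points can be sketched as a simple hand-over protocol between micro-kernels. The capacity model and names below are illustrative assumptions only:

```python
# Illustrative sketch of inter-cluster load balancing: a micro-kernel that
# detects a resource shortage queries neighbouring clusters and hands the
# task over to the first one reporting free capacity.

class ClusterKernel:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # assumed number of task slots
        self.queue = []

    def free_slots(self):
        return self.capacity - len(self.queue)

    def submit(self, task, neighbours=()):
        if self.free_slots() > 0:
            self.queue.append(task)  # enough local resources: keep the task
            return self.name
        # resource shortage: query neighbouring clusters for free capacity
        for n in neighbours:
            if n.free_slots() > 0:
                n.queue.append(task)  # hand the task over
                return n.name
        raise RuntimeError("no capacity in any queried cluster")

a = ClusterKernel("a", capacity=1)
b = ClusterKernel("b", capacity=2)
assert a.submit("t1", neighbours=[b]) == "a"  # fits locally
assert a.submit("t2", neighbours=[b]) == "b"  # handed over to neighbour b
```

The same structure covers the reverse direction: a cluster with a short queue can advertise `free_slots()` to its neighbours instead of waiting to be queried.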
  • the use of the task-based programming model and the decomposition of the application into parts, i.e. tasks that can eventually be assigned to cores of the cluster, is illustrated in Figure 6.
  • an application is, according to an aspect of the invention, modelled as a plurality of tasks. The application is thus defined by multiple tasks, which may or may not rely on high amounts of communication.
  • the communication between the tasks is known as the channels.
  • the channels between these tasks thus have to guarantee sufficient bandwidth at low latency.
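Such channels are commonly realised as bounded message queues. The following sketch (an illustrative assumption, not the patent's mechanism) shows how a bounded channel exerts back-pressure so a producer cannot exceed the channel's capacity:

```python
from collections import deque

class Channel:
    """Bounded point-to-point channel between two tasks (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def send(self, item):
        if len(self.buffer) >= self.capacity:
            return False  # back-pressure: producer must retry later
        self.buffer.append(item)
        return True

    def receive(self):
        return self.buffer.popleft() if self.buffer else None

ch = Channel(capacity=2)
assert ch.send("a") and ch.send("b")
assert not ch.send("c")        # channel full, send refused
assert ch.receive() == "a"     # consumer frees a slot
assert ch.send("c")            # producer can proceed again
```

Bounding the buffer is what lets the system reason about latency: a message waits behind at most `capacity - 1` others.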
  • not all schedulers function optimally, i.e. utilize the resources in an efficient manner. It is thus proposed to use a distributed many-core OS that is comprised of plural microkernels that control tasks in kernel mode such that they can be executed on the different cores 5, 6, 7, 102, 104, 105 of the cluster.
  • the microkernel comprises a scheduler selected to efficiently employ the execution of the tasks on the individual cores.
  • Such a scheduler is arranged for any one or more of the group of first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first, or fixed priority pre-emptive scheduling of tasks over the arithmetic cores within said same cluster.
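As one example of the listed policies, earliest-deadline-first selection can be sketched as follows. The task representation (deadline, name pairs) is an assumption for illustration:

```python
import heapq

def edf_order(tasks):
    """Return the execution order under earliest-deadline-first.
    tasks: iterable of (deadline, name) pairs for ready tasks."""
    heap = list(tasks)
    heapq.heapify(heap)  # min-heap keyed on deadline
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# three ready tasks with different deadlines: the tightest deadline runs first
assert edf_order([(30, "filter"), (10, "io"), (20, "gui")]) == ["io", "gui", "filter"]
```

Any of the other listed policies fits the same shape by changing the ordering key, e.g. remaining execution time for shortest-remaining-time, or a fixed priority field for fixed-priority pre-emptive scheduling.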
  • Each of the GPPs runs a microkernel (although not every GPP of a cluster needs to run a microkernel; the cluster could comprise multiple GPPs, with one of the GPPs running the microkernel) and each microkernel is arranged for controlling IPC, scheduling of the tasks, and memory management, e.g. private memory and/or shared memory.
  • the microkernel running in the GPP of the cluster thus controls the local resources of the cluster and runs in kernel-mode.
  • a many-core architecture is also to be understood as a many-core processor chip.


Abstract

Many-core processor architecture and many-core operating system comprising a plurality of homogeneous and/or heterogeneous clusters, wherein each of said clusters comprises a plurality of arithmetic cores, a local memory arranged to be accessed by each of said plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting said plurality of clusters, and a control core which is arranged for locally scheduling of tasks over said arithmetic cores within a same cluster, and for controlling communication between said plurality of homogeneous and/or heterogeneous clusters via said IO interface.

Description

Title: MANY-CORE PROCESSOR ARCHITECTURE AND MANY-CORE
OPERATING SYSTEM
TECHNICAL FIELD
The present invention generally relates to semiconductor structures and more particularly to an architecture and an operating system for semiconductor processor architecture employing a distributed and heterogeneous or homogeneous many-core architecture.
BACKGROUND OF THE INVENTION
The term multi-core processor is typically used for a single computing processor with two or a few independent central processing units, i.e. cores, which are used for reading and executing program instructions. By incorporating multiple cores on a single chip, multiple instructions can be executed at the same time, increasing overall speed for programs amenable to parallel computing. These cores are typically integrated onto a single integrated circuit or onto multiple dies in a single chip package. Both cases are referred to as a processor chip.
The terms many-core, or massively multi-core, are sometimes used to describe multi-core architectures with an especially high number of cores. As such, a many-core processor is typically used for a single computing processor with many independent central processing units, i.e. cores, such as tens, hundreds or even thousands of cores.
Several limitations of multi-core processor chips exist which led to the development of many-core processors, such as imperfect scaling, difficulties in software optimization and maintaining concurrency over a number of cores. As such, a many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient. In the course of time, the need for more processing power was initially fulfilled by increasing the speed with which instructions on the processor could be executed, hence increasing the clock speed of the processor. This speed, however, cannot be increased indefinitely because then physical constraints play a major role and can even be problematic. Further, and not less important, energy consumption of a processor with such a high clock speed increases rapidly.
A more recent development is therefore towards parallel execution of instructions by multiple processors or cores thereof. Among others, Intel has played a large role in the emergence of Symmetric Multicore Processing, SMP, systems in which a processor comprises two or more identical, i.e. homogeneous, processor cores. These cores are treated equally and none of them has a specific purpose. These are all general purpose processor cores. One of the advantages of these SMP multicore processors, when it comes to design and costs for example, is that they share resources. That is, however, also a major drawback. Performance can degrade or at least not be at the level of the sum of the individual cores due to the sharing of memory for example. Memory latency and bandwidth can cause problems and these latency and bandwidth problems of one core can affect the other.
One of the challenges in many-core processor chips is directed to its architecture. These many-core architectures pose research problems varying from how to model, design, evaluate and verify many-core hardware to how to model, write, and verify software that has to execute on a variety of processing elements. Further, current multi-core architectures are putting pressure to write concurrent software and this trend will only get stronger with many-core. As such, there is a need for exploiting the parallel processing cores in a many-core architecture for data-parallelism.
One of the objects of the present invention is to provide for an efficient many-core processor chip architecture such that the above mentioned challenges can be effectively tackled. It is another object of the invention to provide for a many-core system comprising a many-core processor architecture according to the invention.
SUMMARY OF THE INVENTION
In a first aspect of the invention, there is provided a many-core processor architecture having a distributed shared memory, wherein said distributed shared memory may be addressed as one logically shared address space, said architecture comprising a plurality of clusters, wherein each of said plurality of clusters comprises:
a plurality of arithmetic cores,
a private memory arranged to be accessed by each of said plurality of arithmetic cores comprised in a single, corresponding cluster,
a shared memory being part of said distributed shared memory, and arranged to be accessed by a plurality of arithmetic cores comprised by each of said plurality of clusters;
an input-output, IO, interface arranged for inter-connecting said plurality of clusters and for address space translation for translating a logic address to a physical address for access to a shared memory, being part of said distributed shared memory, of any of said clusters of said plurality of clusters, and
a control core which is arranged for locally scheduling of tasks over said arithmetic cores within a same cluster, for controlling communication between said plurality of clusters via said IO interface, and for controlling access to said private memory and said shared memory.
It was the insight of the inventors that the scheduling of tasks over the plurality of cores in a many-core processor architecture is made more efficient in case the cores are grouped in clusters, wherein each cluster comprises a control core for locally scheduling the tasks over the available arithmetic cores in that same cluster.
It was another insight of the inventors that the size of the address space for access to the shared memory is an issue when the number of clusters, or arithmetic cores, increases, as is the case for many-core architectures. The overhead of using such a large sized address space for access to the shared memory would have a severe impact on the performance of the many-core processor architecture. As such, the inventors have found that each cluster, more particularly each IO interface thereof, should be responsible for translating the logical addresses for accessing the shared memory to actual physical addresses. This reduces the overhead in accessing the shared memory.
It was noted that a cluster, typically, does not need to access the entire distributed shared memory. Usually, a cluster only needs to have access to the shared memory of clusters it communicates with. As such, the IO interface may be arranged for address space translation for translating a logic address to a physical address to a shared memory of a cluster with which said control core has set up, i.e. controlled, communication with. This reduces the size of the used address space significantly.
The units comprised in the many-core processor architecture, e.g. the IO interface, the control core, memory, etc., are to be interpreted in their broadest sense. These functional units may be present within the architecture as physically separated units, but may also be integrated into combined hardware units. For example, the IO interface, and its function of address space translation as indicated above, may be functionally implemented within the control core itself.
Following the invention, a many-core processor architecture is obtained that is constructed of a plurality of multi-core subsystems, i.e. clusters, wherein each multi-core subsystem is responsible for distributing, or allocating, tasks within that particular cluster.
In order for the different arithmetic cores in a particular cluster to share data, each of the clusters comprises a private memory arranged to be accessed by each of the plurality of cores of the corresponding cluster. Each of the arithmetic cores may further comprise its own specific memory, for example cache memory, for data processing. Each cluster further comprises an IO interface such that data as well as control signals can be communicated to other clusters, for example by using message queues implemented in either hardware or software.
The control core, for example a General Purpose Processor, is arranged for controlling access to the private memory and the shared memory. Access to the shared memory is, for example, controlled via the IO interface such that arithmetic cores of a further cluster are able to access, i.e. read and write into, the shared memory of a first cluster. In such a way, it is made possible to exchange data between clusters.
One of the advantages of the present many-core processor architecture is that the challenge of parallel processing of tasks can be more easily tackled. In prior art systems, tasks may be executed in parallel and may thus be assigned to different cores in case these tasks do not comprise any mutual dependency. The architecture of the present invention provides the advantage that a set of tasks may be assigned to a particular cluster in case the tasks within that set do not comprise any mutual dependency with a task not belonging to that set. Mutual dependency between tasks in a same set of tasks assigned to a cluster will then be tackled by the control core comprised in the cluster, i.e. the control core will schedule these tasks over the plurality of arithmetic cores in that cluster.
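The grouping just described, sets of tasks whose mutual dependencies stay within the set, corresponds to the connected components of the task dependency graph. The following union-find sketch is an illustration under assumed task names, not the patent's method:

```python
def task_groups(tasks, deps):
    """Partition tasks into groups with no dependencies across groups.
    deps: set of (task, task) dependency pairs (direction irrelevant here,
    since any dependency forces both tasks into the same group)."""
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in deps:
        parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for t in tasks:
        groups.setdefault(find(t), set()).add(t)
    return sorted(groups.values(), key=lambda g: sorted(g))

groups = task_groups(["t1", "t2", "t3", "t4"], {("t1", "t2"), ("t3", "t4")})
# {t1, t2} and {t3, t4} have no cross dependencies, so each set can be
# assigned to a different cluster; dependencies inside a set are then
# resolved by that cluster's control core
assert groups == [{"t1", "t2"}, {"t3", "t4"}]
```

Each resulting group can be handed to one cluster wholesale, leaving intra-group scheduling to the local control core as described above.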
The many-core processor architecture may further comprise a many-core operating system arranged for distributing tasks over the clusters, and wherein the many-core operating system comprises a plurality of cooperating microkernels distributed over each of the clusters, and arranged for running in a kernel space of the control core of each of the clusters for locally scheduling of the tasks over the arithmetic cores within a same cluster.
The control core may, for example, schedule the tasks over the different arithmetic cores in a corresponding cluster based on a first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first, or fixed priority pre-emptive scheduling of tasks over the arithmetic cores within said same cluster. The input-output, IO, interface may be arranged for inter-connecting said plurality of clusters, for communication between the plurality of clusters in the form of message queuing and for address space translation for access to the shared memory.
In an example of the invention, the IO interface is arranged for address space translation for translating a logic address to a physical address for access to a shared memory of a cluster of said plurality of clusters, wherein said control core has controlled communication with said cluster of said plurality of clusters.
More specifically, the IO interface is arranged to perform said address space translation only for shared memories of clusters of said plurality of clusters with which said control core has already established communication with.
In an example of the invention, the many-core processor architecture comprises a distributed many-core operating system arranged for distributing tasks over said plurality of clusters, and wherein said operating system comprises distributed micro-kernels, each of said distributed micro-kernels running within a kernel space of said control core of a respective cluster and arranged for local scheduling of said tasks within said cluster by one or more of said plurality of said arithmetic cores, and arranged for scheduling of tasks outside said cluster by one or more of said plurality of arithmetic cores of a further cluster of said plurality of clusters.
Multicore processors like SMP and in particular processors with even more cores, i.e. many-core systems, have the potential to provide high performance processing power with a large number of cores; however, increasing the amount of cores also increases the complexity of software management thereof. Many-core processing chips having a plurality of clusters are able to provide such high processing power, but are also considered complex to manage and difficult to maximize to their full potential. With the increase of cores comes an increase in complexity of efficient control or management of these cores and the tasks performed by each core. Many-core operating systems thus need to cope with this problem.
Applications running on multiprocessing architectures such as SMP architectures require different programming methods to utilize the maximum performance potential of the architecture. Some performance increase can be reached if applications which are not particularly programmed for these architectures, i.e. for uniprocessor systems, are run on SMP architectures. Hardware interrupts which suspend application tasks can be executed on an idle processor instead, hence, resulting in at least some performance increase.
However, the best performance increase is achieved if the application is programmed in a way it can easily be executed on multiprocessing architectures. There are even applications that run very well on multiprocessing architectures with a performance increase of nearly the sum of the number of individual processors.
The operating system managing the applications must support multiprocessing, otherwise the additional processors remain idle and the system in principle functions as a uniprocessor system. Both an operating system and the applications running thereon which are designed for a multiprocessor architecture are considerably more complex, with regard to instruction sets for example. Homogeneous processor systems can require different processor registers for dedicated instructions in order to be able to efficiently run such a multiprocessor system. Heterogeneous systems on the other hand can implement different types of hardware for handling different instructions, instruction-sets or uses.
As already briefly mentioned above, SMP systems share a single shared system bus or the like for accessing shared memory. Although there are alternatives in which, for example, different memory banks are dedicated to different processors, which increases memory bandwidth, different problems arise, for example in inter-process communication when data is to be transferred from one processor to another. With a clustered architecture in which a plurality of clusters exist and wherein each of the clusters comprises a plurality of arithmetic cores, a shared memory arranged to be accessed by each of said plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting the plurality of clusters, and a control core arranged for locally scheduling of tasks over the arithmetic cores within a same cluster, in principle at least some of the above identified problems of prior art SMP architectures can be solved. For example, due to the dedicated distributed shared memory the amount of intra-cluster data throughput can be decreased significantly, hence reducing memory bandwidth problems within the chip.
A different problem however then arises in managing the tasks within the architecture, i.e. over the whole of the clusters. The bottleneck of the system is most likely in managing these tasks, i.e. in efficient use of the processes of the operating system. In common implementations an operating system is running on a General Purpose Processor, GPP, outside of the clusters. This GPP comprises an operating system and, most importantly, the kernel of the operating system. According to an example of the invention, the operating system can however also run within one or more of the clusters, which are especially allocated for the operating system. The kernel is the core of the operating system and handles amongst others the management of the memory, interrupts and scheduling of the processes or tasks over the one or more processor cores. Most kernels also handle input/output of the processes and most control of the hardware within the computer, such as a network interface, graphical processing unit and the like.
In a single-core architecture, or even in multiprocessing architectures having a few cores, the operating system and in particular the kernel thereof can efficiently manage its tasks. Although with many-core systems, wherein the amount of cores is substantially higher, the processing power can in principle be very high, the difficulty of efficiently managing such a system by an operating system prevents maximum employment of the system.
In an aspect of the invention there is provided a many-core processor architecture comprising a plurality of heterogeneous clusters. Each of these clusters comprises a plurality of heterogeneous arithmetic cores, a shared distributed memory arranged to be accessed by each of the plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting the plurality of clusters, and a control core arranged for local scheduling of tasks over the arithmetic cores within a same cluster.
Since each cluster comprises both its own dedicated private memory, located physically within the cluster, for handling all tasks within the cluster, and its own control core for managing these tasks and the memory, the clusters are, in principle, arranged for local scheduling of the tasks. As such, each control core of a cluster could comprise a full operating system. That would however require too large a resource footprint. Moreover, managing such a system is complex as well, since these operating systems would have to be controlled as if they were a cluster of computers. On the other hand, when only the kernel is comprised in the cluster, the footprint is reduced, but still too large for efficient use. As such, there is provided in an aspect of the invention a microkernel that is sufficiently small to be comprised in a control core of a cluster of a many-core processing architecture without the need for the control core to increase its resource footprint. The microkernel is not a duplicated or replicated version of a kernel running in a further cluster but a distributed microkernel which enables inter-process communication between or over the microkernels and hence the clusters.
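The inter-process communication between cooperating microkernels can be pictured with a small sketch. The class and method names below are illustrative assumptions, not part of the invention, and a Python dictionary stands in for the on-chip interconnect between clusters.

```python
from collections import deque

class Microkernel:
    """Minimal sketch of a per-cluster microkernel offering only IPC and a
    local message queue; all names are illustrative, not from the patent."""
    def __init__(self, cluster_id, fabric):
        self.cluster_id = cluster_id
        self.inbox = deque()          # messages from other microkernels
        self.fabric = fabric          # shared dict standing in for the interconnect
        fabric[cluster_id] = self.inbox

    def send(self, dest_cluster, payload):
        # Inter-process communication across clusters: deliver directly
        # into the destination microkernel's inbox via the interconnect.
        self.fabric[dest_cluster].append((self.cluster_id, payload))

    def receive(self):
        # Return the oldest pending message, or None when the inbox is empty.
        return self.inbox.popleft() if self.inbox else None

fabric = {}
k0, k1 = Microkernel(0, fabric), Microkernel(1, fabric)
k0.send(1, "task-result")
sender, payload = k1.receive()
```

In a real chip the "fabric" would be the NoC and the distributed shared memory; the sketch only shows that each microkernel exposes the same small IPC surface rather than a full kernel.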
The operating system, and in particular the user space applications, can be executed on a central general purpose processor of the architecture, outside of the clusters, while the core of the operating system, the kernel, in particular the microkernel or distributed microkernel, can be distributed as a plurality of cooperating microkernels running in the kernel space of the control cores of each of the clusters for local scheduling of the tasks over the arithmetic cores within a same cluster.
In yet another example of the invention, the control core is arranged for scheduling said tasks outside said cluster by one or more of said plurality of arithmetic cores of a further cluster of said plurality of clusters through a shared distributed memory which can be addressed by said cluster and said further cluster. The many-core architecture functions optimally and maximizes its capabilities when the required performance can be scaled. By increasing or decreasing the number of cores used, or by using dedicated application or function specific cores, the capabilities of the architecture are fully utilized. Scaling is to be interpreted very broadly. Preferably the scaling is performed with some level of automation. What is meant is that application developers preferably need to be able to develop applications in a way they are familiar with, and in programming languages and models also used when programming for conventional single or multi-core (non many-core) architectures. In an example, the operating system of the many-core architecture is able to cope with conventional programming models and languages, wherein an additional intermediate many-core programming platform is introduced which is able to receive conventional programming models and programming language code on the one hand, and to output application tasks or instructions optimized for the many-core architecture on the other hand. The many-core architecture according to the invention enables this by providing a microkernel which is allocated to every control core of a cluster. The microkernel offers basic functionality, including, but not limited to, inter-process communication, inter-core message passing, thread scheduling and thread management. The kernel is kept small, employing only the minimum amount of functions, and can thus be kept very compact in size. Such a microkernel thus has little overhead.
Tasks that would normally be handled by the kernel are, in a microkernel according to the invention, performed outside of the microkernel. Examples of these tasks are the middleware components of the system, such as file services, handling of the network communication (in/out of the system, not in between the clusters) and communication with and control of the peripherals (device drivers). These tasks can be handled outside of the clusters' microkernels, by the operating system. In a further example of the invention, the scheduler of each of said distributed microkernels is arranged for any one or more of the group of first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first and fixed priority pre-emptive scheduling of tasks over the arithmetic cores within said same cluster. The skilled person will appreciate that the invention is not limited to the aforementioned scheduling policies. Other scheduling policies could also be used.
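As an illustration of how different microkernels could apply different policies to the same task set, the sketch below orders hypothetical tasks under three of the named policies. The task fields (arrival, deadline, burst) are assumptions made for the example, not data structures prescribed by the invention.

```python
# Three hypothetical tasks with illustrative timing attributes.
tasks = [
    {"name": "a", "arrival": 0, "deadline": 9, "burst": 4},
    {"name": "b", "arrival": 1, "deadline": 3, "burst": 1},
    {"name": "c", "arrival": 2, "deadline": 6, "burst": 2},
]

def schedule(tasks, policy):
    """Return task names in the order a given policy would dispatch them."""
    keys = {
        "fifo": lambda t: t["arrival"],    # first in first out
        "edf":  lambda t: t["deadline"],   # earliest deadline first
        "sjf":  lambda t: t["burst"],      # shortest job first
    }
    return [t["name"] for t in sorted(tasks, key=keys[policy])]
```

The point of the sketch is that the same task set yields a different dispatch order per cluster, depending on which policy that cluster's microkernel runs.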
In particular heterogeneous many-core architectures benefit from employing different types of scheduling. As mentioned, efficient use of all capabilities of many-core architectures is challenging. By dividing applications not only into conventional low-level instructions, which can only to a certain degree make efficient use of parallelism, but into intermediate tasks, these tasks can be grouped by, for example, whether or not they are to be handled in real-time, or to match a certain scheduling model like round-robin, shortest job first, etc. These tasks could be assigned to clusters that match the scheduling model. The other way round is however more preferred, i.e. wherein the scheduling policy/model in a cluster can be adapted to the needs of the application. In an example, some clusters could be arranged for real-time instruction handling, others for high-resource but non-real-time instruction handling. In particular, the system is arranged for a programming model wherein the application developer does not have to take care of all handling of the instructions by certain dedicated clusters or cores within the cluster. The system could be arranged to determine from the application which scheduling model fits best. These processes thus operate autonomously but could in an example also be manipulated by a user input, e.g. by the application developer. The system could be arranged to comprise certain registers or flags which indicate a certain required or preferred handling of an instruction or a task, e.g. in real-time, before a certain deadline, with a certain priority over other tasks, or at the moment another task has been completed, and so on.
Some tasks require a certain handling and could therefore require a certain cluster layout. The operating system is arranged to detect whether this is the case and which of the available clusters correspond to the demands of handling the task, and thus to direct the task to be handled and executed by the microkernel of that particular cluster or any cluster having the required layout/functionality. If the task requires a certain cluster layout or functionality and/or a certain scheduler, and the operating system is arranged to detect this prescribed requirement, the task could be considered to be bound to a certain core, i.e. core-binding. Alternatively, the tasks could also comprise additional arguments, for example in the form of meta-data or, more specifically, in the form of a flag or register data type. The additional arguments or meta-data indicate a single preference, or multiple subsequent preferences. The preferences can relate to a single preferred cluster or a preferred order of executing the task on a particular cluster and/or a particular core within a cluster and/or a matching cluster type or core type and/or a cluster running a microkernel with a certain scheduling policy.
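The flag-based preference handling described above might be sketched as follows. The flag names, capability strings and the `resolve_cluster` helper are hypothetical, chosen only to illustrate core-binding versus preference matching.

```python
# Hypothetical flag bits a task may carry as meta-data; the operating
# system inspects them to decide where the task may run.
REALTIME   = 1 << 0   # must run on a cluster with a real-time scheduler
CORE_BOUND = 1 << 1   # core-binding: must run on one specific cluster

def resolve_cluster(task, clusters):
    """Return the id of the first cluster satisfying the task's flags.
    `clusters` maps cluster id -> set of capability strings."""
    if task["flags"] & CORE_BOUND:
        return task["bound_to"]               # core-binding leaves no choice
    for cid, caps in clusters.items():
        if task["flags"] & REALTIME and "realtime" not in caps:
            continue                          # this cluster cannot guarantee it
        return cid
    return None                               # no cluster matches the demands

clusters = {0: {"batch"}, 1: {"realtime"}}
rt_task = {"flags": REALTIME, "bound_to": None}
bound   = {"flags": CORE_BOUND, "bound_to": 0}
```

A real implementation would consult richer meta-data (deadlines, priorities, ordered preference lists); the sketch only shows the detection step the operating system performs before dispatch.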
Tasks that are executed over multiple clusters could require inter-cluster communication via the IO interface through the distributed shared memory. Allocating only part of the shared memory and translating the logical addresses for accessing the shared memory to actual physical addresses helps reduce overhead in memory bandwidth compared with conventional shared memory multi-core architectures, while still enabling communication between the microkernels at kernel level through inter-process communication.
In an example the many-core processor architecture can be arranged to perform single tasks multiple times, for example subsequently or in parallel. By running single tasks multiple times, the system is better protected against errors. Only if the outcome, i.e. the result, of each of the parallel tasks is the same will the result be used, with increased reliability. This could also be used for error correction, or a fault-tolerant system, in which a result deviating from the majority of multiple (at least three) parallel or serially executed tasks is discarded. In another example of the invention, the many-core processor architecture comprises a plurality of heterogeneous clusters.
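A minimal sketch of such majority voting over replicated task results, assuming the results are directly comparable values; the `vote` helper is illustrative, not a prescribed implementation.

```python
from collections import Counter

def vote(results):
    """Fault-tolerant majority vote over the results of the same task
    executed multiple (at least three) times; returns the majority value,
    or None when no strict majority exists."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None
```

With three replicas, one deviating result (e.g. caused by a transient fault in one core) is outvoted; if all three disagree, no result is trusted.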
A homogeneous architecture with shared global memory is easier to program for parallelism, that is, when a program can make use of all the cores, compared to a heterogeneous architecture where the cores do not have the same instruction set. It was the insight of the inventors, however, that in the case of an application which lends itself to being partitioned into long-lived threads of control with little or regular communication, it makes sense to manually and/or automatically put the partitions onto cores that are specialized for that specific task. A heterogeneous architecture is, in the context of the invention, to be interpreted in its broadest sense. The heterogeneity can be present at cluster level, hence different types of clusters could exist beside each other, and/or at core level, hence different types of cores are integrated within a single cluster. The invention is not restricted to any of these particular types but is suitable for both.
In another example of the invention the plurality of arithmetic cores comprises any of a Digital Signal Processor, DSP, a Hardware Accelerator core and a Processing General Purpose Processor, GPP.
The advantage hereof is that each cluster is able to effectively handle any task, as each cluster may be equipped with different types of arithmetic cores suitable for a specific task. In a further example, each of the clusters comprises between 8 and 128 arithmetic cores, preferably between 32 and 64 arithmetic cores.
In yet a further example, the IO interface comprises a Network-on-Chip, NOC, interface for communication between the plurality of clusters.
In an even further example, the many-core processor architecture comprises between 50 and 1000 clusters, preferably between 100 and 400 clusters, even more preferably between 200 and 300 clusters. The inventors noted that it could be advantageous if there is a trade-off between the total number of cores in the many-core processor architecture and the number of cores in each cluster.
In another example, the control core is a General Purpose Processor.
In a second aspect of the invention, there is provided a computing system comprising one or a plurality of many-core processor architectures according to any of the examples provided above. The computing system may comprise a hierarchical structure of clusters, said computing system comprising a plurality of main clusters, each of said main clusters comprising a plurality of clusters, wherein one of said plurality of clusters in a main cluster is arranged for inter-connecting said plurality of main clusters, for locally scheduling tasks over said clusters within a same main cluster, and for controlling communication between said plurality of main clusters via said IO interface.
The expressions, i.e. the wording, of the different aspects comprised by the many-core processor architecture according to the present invention should not be taken literally. The wording of the aspects is merely chosen to accurately express the rationale behind the actual function of the aspects.
The above-mentioned and other features and advantages of the invention will be best understood from the following description referring to the attached drawings. In the drawings, like reference numerals denote identical parts or parts performing an identical or comparable function or operation.
The invention is not limited to the particular examples disclosed below in connection with a particular type of many-core processor architecture or with a particular many-core processor system.
This invention is not limited to chip implementations in CMOS. The architecture is technology agnostic, and can be implemented in FPGA and/or ASIC. So, according to all examples of the invention, a many-core processor architecture is considered to be implemented in one or more chips, such as FPGA or as ASIC.
Brief description of the drawings
Figure 1 is a schematic view of a topology of a many-core processor architecture according to the present invention.
Figure 2 is a schematic view of a topology of a computing system comprising a plurality of many-core processor architectures according to the present invention.
Figure 3 is a schematic view of a cluster layout according to the present invention.
Figure 4 is a schematic view of a topology of a many-core processor architecture with an operating system implementation according to an aspect of the present invention.
Figure 5 is a schematic view of a kernel layout for a many-core processor architecture according to an aspect of the invention.
Figure 6 is a schematic view of assigning tasks of an application to cores of the many-core processor architecture according to an aspect of the invention.
Detailed description of the drawings
Figure 1 is a schematic view of a topology of a many-core processor architecture 1 according to the present invention. Here, the many-core processor architecture 1 comprises a plurality of clusters 2, of which only four are shown. Typically, the architecture 1 comprises many more clusters, for example 100 - 1000 clusters. The present invention is not limited in the number of clusters to be incorporated in the many-core architecture 1, as the topology of the present invention is scalable, i.e. it is applicable for just a few clusters 2 up to many hundreds or even thousands of clusters 2. A cluster 2 comprises a plurality of arithmetic cores 5, 6, 7, for example Digital Signal Processors, DSPs, 5, Hardware Accelerator cores 6 and Processing General Purpose Processors, GPPs, 7. As such, the clusters 2 may be heterogeneous clusters 2 as they comprise a variety of different types of arithmetic cores 5, 6, 7. Each cluster 2 of the many-core processor 1 may further comprise different types of arithmetic cores 5, 6, 7. In other words, the topology of each of the clusters 2 does not need to be the same; each cluster 2 may have its own set of various arithmetic cores 5, 6, 7. Each cluster 2 typically comprises a memory 3, which is a physical piece of hardware under control of a control core 4. The memory may be split into a private memory and a shared memory (not shown). The difference between these types of memories is that a private memory comprised in a particular cluster 2 may only be accessed by the arithmetic cores 5, 6, 7 of that same cluster.
According to the present invention, the shared memory comprised in a cluster is part of a larger distributed shared memory, wherein the shared memories of each of the clusters 2 may be addressed as one logically shared address space. The term shared thus does not mean that there is a single centralized memory in the architecture 1; shared means that the address space of all the physical shared memories of each of the clusters 2 is shared.
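One common way to realise such a logically shared address space over physically distributed memories, shown here only as a hedged sketch, is to let the upper bits of a logical address select the cluster and the lower bits the offset within that cluster's shared memory. The bit widths are illustrative assumptions, not taken from the description.

```python
# Assumed layout: 64 KiB of shared memory per cluster, so 16 offset bits.
OFFSET_BITS = 16

def decode(logical_addr):
    """Split a global logical address into (cluster, local offset)."""
    cluster = logical_addr >> OFFSET_BITS
    offset = logical_addr & ((1 << OFFSET_BITS) - 1)
    return cluster, offset

def encode(cluster, offset):
    """Build the global logical address of a location in a cluster's memory."""
    return (cluster << OFFSET_BITS) | offset
```

Under this scheme every core can name any shared location in the architecture with a single flat address, while the memory behind each address range remains physically local to one cluster.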
The access to the memory 3, i.e. the local memory as well as the shared memory, is controlled by a control core 4. In other words, a control core 4 comprised by a particular cluster 2 is responsible for controlling access to the local memory and the shared memory comprised by that particular cluster 2.
Each cluster further comprises an input/output, IO, interface 9 arranged for inter-connecting the plurality of clusters 2. This means that the clusters 2 can communicate with each other via their corresponding input/output interfaces 9. The control core 4 is also arranged to control the communication at the IO interface 9, i.e. the communication between a plurality of clusters 2. The client application core/system 10 is then also connected to the clusters 2 via a same type of IO interface 9.
Typically, the IO interface 9 comprises a Network-on-Chip, NOC, interface for communication between said plurality of clusters 2. The inventors found that the use of a bus between the clusters 2 may not be beneficial, especially in cases where many clusters 2 are comprised in a single architecture 1. Using a bus has several downsides: the bandwidth is limited and shared, the speed goes down as the number of clusters 2 grows, there is no concurrency, pipelining is tough, there is central arbitration and there are no layers of abstraction. The advantages of using a NOC are that the aggregate bandwidth grows, the link speed is unaffected by the number of clusters 2, there is concurrent spatial reuse, pipelining is built-in, arbitration is distributed and abstraction layers are separated.
Disclosed in figure 1 is a topology of a many-core processor architecture, which is a homogeneous topology. However, according to the present invention, the topology may also be a heterogeneous topology, wherein the layout of each cluster may be different.
Figure 2 is a schematic view of a topology of a computing system 50 comprising a plurality of many-core processor architectures according to the present invention.
Here, a computing system 50 is displayed having a hierarchical structure of clusters. The computing system 50 comprises a plurality of main clusters 61, 62, 63, 64, each of said main clusters comprising a plurality of clusters 51, 52, 53, 54, i.e. sub-clusters.
One of the plurality of clusters 51, 52, 53, 54, i.e. sub-clusters, of a single main cluster 61, 62, 63, 64 is appointed as the responsible sub-cluster for the corresponding main cluster 61, 62, 63, 64. As such, that sub-cluster is arranged to, for example, distribute tasks among the sub-clusters and to handle communication between the sub-clusters, etc.
The input-output, IO, interface may be arranged for inter-connecting said plurality of clusters, for communication between the plurality of clusters in the form of message queuing and for address space translation for access to the shared memory. As mentioned before, address space translation according to the present invention is especially useful for reducing the size of the total address space, thereby reducing the overhead. According to the present invention, in an embodiment thereof, by using the address space translation, a cluster is able to access only a part of the shared memory at once. That is, only that part of the shared memory physically present in the clusters with which the control core has initiated communication. This is advantageous as in such a case, the cluster does not need to utilize the total address space available for the shared memory, but only needs to utilize the address space for the shared memories in the clusters with which a communication is established.
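The windowed address space translation described above could be sketched as follows; the `TranslationWindow` class, its window size and base address are illustrative assumptions, showing only that a cluster addresses a small mapped part of the distributed shared memory rather than the whole address space.

```python
class TranslationWindow:
    """Sketch: a cluster maps only one window of the global shared address
    space at a time, namely the shared memory of the cluster it is
    currently communicating with."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.base = None                 # global base of the open window

    def open(self, global_base):
        # Called when the control core initiates communication with a
        # cluster whose shared memory starts at `global_base`.
        self.base = global_base

    def to_physical(self, local_addr):
        # Translate a small local (logical) address to the actual
        # physical address inside the remote cluster's shared memory.
        if self.base is None or not (0 <= local_addr < self.window_size):
            raise ValueError("address outside the mapped window")
        return self.base + local_addr

win = TranslationWindow(window_size=0x1000)
win.open(global_base=0x40000)
```

Because local addresses only need to span the window (here 4 KiB) instead of the total distributed memory, the address space each cluster must handle, and hence the overhead, stays small.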
Figure 3 is a schematic view of a layout of a cluster 101 according to the present invention. The cluster 101 comprises a plurality of arithmetic cores 102, 104, 105. As shown in figure 3, a variety of different cores may exist, for example three Digital Signal Processors, DSPs, 102, seven hardware accelerators 104 and six processing General Purpose Processors, GPPs, 105. Each of these arithmetic cores 102, 104, 105 may, in addition to the memory 106 of the cluster 101, comprise its own memory 112. This local core memory or tightly coupled memory 112 could be used for many purposes, for example as cache, as scratch pad, or as random access memory. A cache memory 112 is intended to be used by specifically one arithmetic core; hence it is a tightly coupled memory, i.e. memory which is specifically coupled to a single core.
In this example, the cluster 101 comprises one physical hardware memory chip 106, which is further divided into a private memory 107 arranged to be accessed by each of said plurality of arithmetic cores comprised in that single, corresponding cluster 101, and a shared memory 108 being part of a distributed shared memory and arranged to be accessed by a plurality of arithmetic cores comprised by each of a plurality of other clusters. This thus implies a memory architecture wherein the physically separate memories 108 can be addressed as one logically shared address space. Here, the term shared does not mean that there is a single centralized memory, but essentially that the address space is shared.
The cluster 101 further comprises an input-output, IO, interface 109 arranged for inter-connecting said plurality of clusters 101, also via the connecting lines 111, and a control core 110 which is arranged for locally scheduling tasks over said arithmetic cores within a same cluster, for controlling communication between said plurality of clusters via said IO interface, and for controlling access to said private memory and said shared memory. The control core 110 is thus responsible for the scheduling, i.e. scheduling tasks over the plurality of arithmetic cores 102, 104, 105, for the communication between clusters, and for access to the memory 106.
The architecture 1, 50 described in the figures, also referred to as Multi-Processor System-on-Chip, MPSoC, or Multi-Core System-on-Chip, MCSoC, comprises a plurality of arithmetic cores 5, 6, 7 distributed over the many clusters 2. In these cores 5, 6, 7 the actual processing of the tasks of the application running on the Operating System, OS, is performed. In, for example, a Digital Signal Processing, DSP, application, the cores 5, 6, 7 each process part of the data of the DSP application.
In Figure 4 the Operating System, OS, 410 of the many-core system 400 is shown. The OS comprises two parts: the part 411 running on the client application core and the part 412 running on the individual subsystems, i.e. clusters. Processing of tasks is under control of the OS. The OS of a computer system has the role, amongst others, of managing the resources of the hardware platform. The resources are made available to the application or applications running on the OS. Examples of the resources that are managed by the OS include the instructions that are to be executed on the processor core(s), the Input/Output, IO, devices, the memory (allocation), interrupts, etc.
The simplest form of an OS can only run one single program at a time, hence referred to as a single-task OS. A modern OS however allows the computer system to run multiple programs, or threads of a program, concurrently, hence called a multi-tasking OS. Multi-tasking may be achieved by running concurrent tasks in a time-sharing manner, and/or by dividing tasks over multiple execution units, e.g. in a dual-, multi- or many-core computer system. By employing time-sharing, the available resources of the processor core are divided between multiple processes which are interrupted repeatedly in time-slices by a task scheduler which forms part of the OS.
In an example pre-emptive multitasking may be employed. With pre-emptive scheduling any process executed on a processor core can be interrupted and suspended by the scheduler in favour of a different process that is to be invoked and executed on the processor core. It is up to the scheduler to determine when the current process is suspended and which process is to be executed next. This can be employed according to different types of scheduling regimes.
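The time-slicing described above can be sketched with a simple round-robin model: the scheduler interrupts the running process after each time-slice and re-queues it, so every process is guaranteed a regular share of processor core time. Process names and the quantum below are hypothetical.

```python
from collections import deque

def round_robin(processes, quantum):
    """processes: list of (name, remaining_time) pairs.
    Returns the execution log as (name, time_ran) slices."""
    queue, log = deque(processes), []
    while queue:
        name, remaining = queue.popleft()
        ran = min(quantum, remaining)
        log.append((name, ran))                    # process runs one slice
        if remaining - ran > 0:
            queue.append((name, remaining - ran))  # pre-empted, re-queued
    return log
```

For example, with a quantum of 2, a process needing 3 units is pre-empted after 2 units, a shorter process runs to completion, and the first process then finishes its remaining unit.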
Pre-emptive scheduling has some advantages compared to a cooperative multi-tasking OS, wherein each process must explicitly be programmed to define if and when it may be suspended. With pre-emptive scheduling all processes will get some amount of CPU time at any given time. This makes the computer system more reliable by guaranteeing each process a regular slice of processor core time. Moreover, if, for example, data is received from an IO device that requires immediate attention, processor core time is made available and timely processing of the data can be guaranteed. The OS consists of several components that all have their own function and work together to execute applications on the OS. All applications need to make use of the OS in order to access any hardware in the computer system. The components of the OS operate at different levels. These can be represented in a model with the following parts, in order from high level to low level: user applications, shared libraries, device drivers, kernel, hardware. In such a model, the levels above the kernel are also said to constitute user-space, or to run in user-mode.
One of the most important parts of an OS is the kernel which provides basic control over the hardware in the computer system. Memory, CPU, IO, peripherals, etc. are all under control of the kernel. The applications running on the OS can thus make use of these parts of the computer via the kernel.
Although hybrid forms also exist, in general two types of kernels can be recognised: monolithic kernels and microkernels. A monolithic kernel is a kernel in which most of the processes are handled in a supervisor-mode and form part of the kernel itself. Microkernels, μ-kernels, as shown in figure 5, are kernels in which most of the processes are handled in a user-mode, i.e. in user-space, and thus do not directly form part of the kernel. These processes can themselves communicate with each other directly, without interference of the kernel.
Microkernels are relatively small in size since they comprise only the most fundamental parts of a kernel, such as the scheduler, memory management and Inter-Process Communication, IPC, 510, 520, 530. These parts are controlled from the supervisor-mode, i.e. the kernel-space, with high restrictions; all other parts are controlled from a user-mode, with lower restrictions. As indicated above, microkernels generally provide a multi-level security hierarchy with a supervisor/kernel-mode at one end and a user-mode at the other end.
An application or task running in supervisor-mode has the privileges necessary to obtain access to the resources of the computer system, i.e. the hardware. This access can be considered unsafe since misuse or an error can result in system failure. On the other hand, an application or task running in user-mode can only access those parts of the OS that are considered safe, i.e. virtual memory, threads, processes, etc.
In a microkernel, access to these unsafe parts can be given to an application or task running in user-mode. Much of the functionality that in a monolithic kernel resides inside the kernel is in a microkernel moved to user-space, i.e. running in user-mode.
As indicated above, in the course of time, the need for more processing power shifted from an increase in clock speed of the (single-core) processor towards parallel execution of instructions on plural cores, i.e. multi-core or even many-core systems with high numbers of arithmetic cores. Heterogeneous systems on an architecture with a high number of cores, i.e. MPSoCs, have the potential to outperform single-core or homogeneous many-core systems. It is a preferred architecture when there is a demand for high performance at low power.
MPSoCs however have some constraints. Execution of a few concurrent applications on a computer system with a single processor core can be performed simply by a conventional multi-tasking OS. Execution of plural concurrent applications on a computer system with many processor cores is, at least in an efficient manner, challenging. This makes the usability of such systems non-optimal, as extensive knowledge is needed for effective use of such systems. Conventional applications may not perform to the optimal potential of an MPSoC. Applications should be able to cope with high amounts of parallelism and employ most of the cores of the system.
To that end conventional applications have to be programmed, converted or redesigned to allow the application to be divided into (sub-)tasks dedicated to be executed on those many cores concurrently.
It is proposed to divide applications into tasks instead of threads, wherein the tasks are, in particular, small enough to be executed in a short amount of time on one of the cores and wherein, in particular, these tasks can depend on any one or more of the trade-offs between performance, resources, latency and energy budgets.
In principle, a task requiring high performance will not be able to perform that task at low energy consumption. Vice versa, low energy consumption will have a negative impact on performance. The same reasoning applies for resources and latency. Low latency will most likely require a large resource footprint.
For specific applications or specific parts of applications, i.e. tasks within the applications, it is important to have guarantees on performance. The kernel has to assure these performance guarantees. To this end a many-core OS 410 is proposed, as illustrated in Figure 4, which consists of at least two parts, being the part 411 running on the client application core/system and the part 412 running over the multiple subsystems/clusters. The first can be a conventional OS with a conventional kernel comprised therein. Examples thereof are Linux, Unix or Windows based OSs. This could however also be a custom OS, designed from scratch or, more likely, Linux, Unix or Windows based.
In an example the application runs on the client application side 410 of Figure 4 of the system. Thus the applications can be programmed for any of these common OS types. The OS running on the client general purpose processor is arranged to cooperate with the distributed OS 412 running over the total of clusters. Thus in this example, the path of execution can be considered to start at the client system side, via the (conventional) OS 411 running thereon, towards a distributed OS 412 at a lower hierarchy level. The assigning of the tasks towards the clusters is thus handled by the general purpose processor of the client system, and hence by the kernel of the OS running thereon.
Once the tasks are received by the subsystem, the individual general purpose processors 421, 422, 423, 424 thereof are arranged to handle the individual tasks within the cluster.
In order for the kernel of the OS on the client system to determine which tasks are to be assigned to which cluster, the kernel requires additional information to base that decision on. To that end it is proposed to assign such information to the tasks such that the kernel 415 can assign each task in an effective manner and not merely distribute the tasks evenly over the clusters. Due to the heterogeneous configuration of the system the clusters can have different components. Some clusters are, for example, arranged for more DSP processing while others are better arranged for other hardware acceleration tasks.
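A hedged sketch of such information-driven assignment follows. The cluster descriptions and the task attribute `kind` are assumptions made for the example, not fields defined by the invention; the point is only that the kernel matches per-task information against the heterogeneous cluster layouts instead of distributing tasks evenly.

```python
# Hypothetical description of two heterogeneous clusters: one DSP-heavy,
# one equipped with hardware accelerators.
clusters = [
    {"id": 0, "dsp_cores": 8, "accel_cores": 0},
    {"id": 1, "dsp_cores": 0, "accel_cores": 8},
]

def assign(task):
    """Return the id of the cluster best matching the task's kind."""
    want = "dsp_cores" if task["kind"] == "dsp" else "accel_cores"
    # Pick the cluster with the most cores of the required type.
    return max(clusters, key=lambda c: c[want])["id"]
```

A DSP-heavy task thus lands on the DSP cluster and an acceleration task on the accelerator cluster, regardless of the current even-distribution count.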
In known multi-core systems the applications are divided into conventional low-level threads that are executed on an individual processor core on a time-scheduling basis, e.g. by pre-emptive scheduling. In the architecture and OS according to the invention a different programming model is used. Although from a programmer's perspective a homogeneous multi- or many-core system with shared memory is the most convenient, it is ineffective on a many-core scale due to the scalability issues at these amounts of cores. Examples of these scalability issues are increased energy consumption and fault tolerance. Therefore a more abstract model of programming is proposed that is arranged for the execution of activities, i.e. operations, instead of conventional instructions on a thread basis.
Activities are a means to specify for which cores the application implements its functions, which kernels and configurations are available for hardware accelerators, etc. As such, the application is not divided into instructions executed as threads, but decomposed into parts, i.e. tasks, assigned with additional scheduling information for handling by the scheduler of the kernel on the basis of, for example, responsibility, input-output, resource budget, performance budget, etc. Thus an application can roughly consist, for example, of Graphical User Interface, GUI, instructions and data processing instructions, as well as data input-output instructions. In accordance with the invention it is proposed to decompose the application into individual tasks that are arranged to be executed within a cluster of the many-core system. These tasks carry additional scheduling information on the basis of which the scheduler of the kernel 415 running in the general purpose processor of the client application system can decide to which subsystem/cluster each task is to be assigned. Communication thereof is arranged via network-on-chip communication units 441, 442, 443, 444 and the use of the shared address space within the distributed shared memory of the memory units of the clusters 431, 432, 433, 434. For example, GUI instructions have different requirements, like responsiveness, than I/O instructions, which rely more on resource usage. Thus a GUI task comprised of GUI-related instructions is more likely to be assigned a responsiveness profile and scheduled on a subsystem/cluster that comprises an architecture arranged to that end.
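The decomposition into annotated tasks can be illustrated with a small sketch. The field names follow the annotation categories named above (responsibility, input-output, resource budget, performance budget), but the concrete values and the deadline-first ordering are assumptions made for the example, not details from the patent.

```python
# Sketch of the task-based programming model: an application is decomposed
# into tasks, each carrying the scheduling information the text names.
def make_task(name, responsibility, io, resource_budget, performance_budget):
    return {
        "name": name,
        "responsibility": responsibility,          # e.g. "responsiveness"
        "io": io,                                  # declared inputs/outputs
        "resource_budget": resource_budget,        # e.g. max cores needed
        "performance_budget": performance_budget,  # e.g. deadline in ms
    }


# Decompose a rough application into GUI, processing and I/O tasks:
app = [
    make_task("gui", "responsiveness", ["events"], 1, 16),
    make_task("dsp-filter", "throughput", ["samples"], 4, 100),
    make_task("file-io", "bandwidth", ["disk"], 1, 500),
]

# A coordinating scheduler can then order tasks by how tight their
# performance budget is (tightest first):
ordered = sorted(app, key=lambda t: t["performance_budget"])
print([t["name"] for t in ordered])   # → ['gui', 'dsp-filter', 'file-io']
```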
Once a task is assigned to any of the subsystems, the local microkernel 421, 422, 423, 424 of the distributed OS 412 is arranged to locally schedule the tasks over the individual arithmetic cores, such as the DSPs and HW accelerators. On a cluster level the general purpose processor of the cluster can determine an over- or under-capacity within the cluster. This depends on the architecture of the cluster, e.g. the amount of cores, types of cores, etc., as well as the amount of tasks assigned to the individual cores. If the micro-kernel determines a resource shortage within the cluster, it can issue a resource request to other (neighbouring) clusters to determine whether tasks can be handed over to one of these other clusters. Herewith the overall resource capacity of the system is used more efficiently. The other way round works as well: the micro-kernel can also determine inefficient use of local resources, due to a low instruction queue for example. In that case the micro-kernel can signal its free capacity to other clusters, either directly via inter-cluster communication over the networks-on-chip 441, 442, 443, 444, or via the higher-level coordinating scheduler of the kernel 415 in the client system.

The use of the task-based programming model and the decomposition of the application into parts, i.e. tasks that can eventually be assigned to cores of a cluster, is illustrated in Figure 6. In Figure 6 an application is, according to an aspect of the invention, modelled as a plurality of tasks. The application is thus defined by multiple tasks, which may or may not rely on high amounts of communication. The communication between the tasks is known as the channels. The channels between these tasks thus have to guarantee a sufficient amount of bandwidth at low latency. As such, not all schedulers function optimally in utilizing the resources in an efficient manner.
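The inter-cluster balancing described above can be sketched as follows. This is a simplified single-threaded model with hypothetical names; the real mechanism would run over the network-on-chip, and the capacity thresholds are assumptions for illustration.

```python
# Sketch: a micro-kernel that detects a resource shortage hands surplus
# tasks to neighbouring clusters that signal free capacity.
class MicroKernel:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # cores available in this cluster
        self.queue = []            # tasks scheduled locally

    def overloaded(self):
        return len(self.queue) > self.capacity

    def free_slots(self):
        return max(0, self.capacity - len(self.queue))

    def rebalance(self, neighbours):
        """Hand surplus tasks over to neighbours with free capacity."""
        moved = 0
        for other in neighbours:
            while self.overloaded() and other.free_slots() > 0:
                other.queue.append(self.queue.pop())
                moved += 1
        return moved


a = MicroKernel("cluster-a", capacity=2)
b = MicroKernel("cluster-b", capacity=4)
a.queue = ["t1", "t2", "t3", "t4"]        # shortage: 4 tasks, 2 cores
moved = a.rebalance([b])
print(moved, len(a.queue), len(b.queue))  # → 2 2 2
```

The same objects can model the reverse direction: a cluster with a low queue simply reports a positive `free_slots()` value to its neighbours.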
It is thus proposed to use a distributed many-core OS that comprises plural microkernels that control tasks in kernel mode such that they can be executed on the different cores 5, 6, 7, 102, 104, 105 of the cluster. The microkernel comprises a scheduler selected to efficiently employ the execution of the tasks on the individual cores. Such a scheduler is arranged for any one or more of the group of first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first or fixed priority pre-emptive scheduling of tasks over the arithmetic cores within said same cluster.
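As a minimal sketch of one of the listed policies, earliest deadline first (EDF) can be expressed with a priority queue; the task tuples and deadlines below are illustrative, not from the patent.

```python
# Sketch: earliest-deadline-first ordering of tasks within a cluster.
import heapq


def edf_order(tasks):
    """Return task names in earliest-deadline-first order.

    `tasks` is a list of (name, deadline) tuples; a min-heap on the
    deadline yields the EDF dispatch order.
    """
    heap = [(deadline, name) for name, deadline in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order


tasks = [("render", 30), ("sensor", 10), ("log", 100)]
print(edf_order(tasks))   # → ['sensor', 'render', 'log']
```

The other listed policies (FIFO, round-robin, shortest job first, etc.) differ only in the key used to pick the next task, so a micro-kernel can select among them per cluster.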
Each of the GPPs runs a microkernel (although not every GPP of a cluster needs to run a microkernel; the cluster could comprise multiple GPPs, with one of the GPPs running the microkernel) and each microkernel is arranged for controlling IPC, scheduling of the tasks, and memory management, e.g. private memory and/or shared memory. The microkernel running in the GPP of the cluster thus controls the local resources of the cluster and runs in kernel mode.
In accordance with all examples of the present invention, a many-core architecture is also to be understood as a many-core processor chip.
The present invention is not limited to the embodiments as disclosed above, and can be modified and enhanced by those skilled in the art, without having to apply inventive skill, without departing from the scope of the present invention as disclosed in the appended claims.

Claims

1. Many-core processor architecture having a distributed shared memory, wherein said distributed shared memory may be addressed as one logically shared address space, said architecture comprising a plurality of clusters, wherein each of said plurality of clusters comprises:
a plurality of arithmetic cores,
a private memory arranged to be accessed by each of said plurality of arithmetic cores comprised in a single, corresponding cluster,
- a shared memory being part of said distributed shared memory, and arranged to be accessed by a plurality of arithmetic cores comprised by each of said plurality of clusters;
an input-output, IO, interface arranged for inter-connecting said plurality of clusters and for address space translation for translating a logic address to a physical address for access to a shared memory, being part of said distributed shared memory, of any of said clusters of said plurality of clusters, and
a control core which is arranged for locally scheduling of tasks over said arithmetic cores within a same cluster, for controlling communication between said plurality of clusters via said IO interface, and for controlling access to said private memory and said shared memory.
2. Many-core processor architecture according to claim 1, wherein said IO interface is arranged for address space translation for translating a logic address to a physical address for access to a shared memory of a cluster of said plurality of clusters, wherein said control core has controlled communication with said cluster of said plurality of clusters.
3. Many-core processor architecture according to claim 2, wherein said IO interface is arranged to perform said address space translation only for shared memories of clusters of said plurality of clusters with which said control core has already established communication.
4. Many-core processor architecture according to any of the previous claims, wherein said many-core processor architecture comprises a distributed many-core operating system arranged for distributing tasks over said plurality of clusters, and wherein said operating system comprises distributed micro-kernels, each of said distributed micro-kernels running within a kernel space of said control core of a respective cluster and arranged for local scheduling of said tasks within said cluster by one or more of said plurality of said arithmetic cores, and arranged for scheduling of tasks outside said cluster by one or more of said plurality of arithmetic cores of a further cluster of said plurality of clusters.
5. Many-core processor architecture according to any of the previous claims, wherein said control core is further arranged for scheduling said tasks outside said cluster by one or more of said plurality of arithmetic cores of a further cluster of said plurality of clusters through a shared distributed memory which can be addressed by said cluster and said further cluster, and wherein said control core is preferably further arranged for I/O and memory management.
6. Many-core processor architecture according to any of the claims 4 - 5, wherein a scheduler of each of said distributed micro-kernels is arranged for any one or more of the group of first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first or fixed priority pre-emptive scheduling of tasks over the arithmetic cores within said same cluster.
7. Many-core processor architecture according to any of the previous claims, wherein said many-core processor architecture comprises a plurality of heterogeneous clusters.
8. Many-core processor architecture according to any of the previous claims, wherein said plurality of arithmetic cores comprises any of a Digital Signal Processor, DSP, a Hardware Accelerator core and a General Purpose Processor, GPP, or any combination thereof.
9. Many-core processor architecture according to any of the previous claims, wherein each of said clusters comprises between 8 and 128 arithmetic cores, preferably between 32 and 64 arithmetic cores.
10. Many-core processor architecture according to any of the previous claims, wherein said IO interface comprises a Network-on-Chip, NOC, interface for communication between said plurality of clusters.
11. Many-core processor architecture according to any of the previous claims, wherein said many-core processor architecture comprises between 50 and 1000 clusters, preferably between 100 and 400 clusters, even more preferably between 200 and 300 clusters.
12. Many-core processor architecture according to any of the previous claims, wherein said control core is a General Purpose Processor.
13. A computing system comprising a many-core processor architecture according to any of the previous claims.
14. A computing system according to claim 13, wherein said system comprises a plurality of many-core architectures according to any of the claims 1 - 12.
15. A computing system according to claim 14, wherein said computing system comprises a hierarchical structure of clusters, said computing system comprising a plurality of main clusters, each of said main clusters comprising a plurality of clusters, wherein one of said plurality of clusters in a main cluster is arranged for interconnecting said plurality of main clusters, for locally scheduling tasks over said clusters within a same main cluster, and for controlling communication between said plurality of main clusters via said IO interface.
16. A method of programming an application for a computing system according to any of the claims 13 - 15, comprising at least one many-core processor architecture according to any of the claims 1 - 12, wherein said method comprises the step of defining a plurality of tasks from said application for distributing said tasks over said at least one many-core processor architecture.
17. The method according to claim 16, wherein to each of said tasks can be assigned any one or more of the group comprising a responsibility, input-output, resource budget and performance budget.
PCT/NL2016/050214 2015-03-27 2016-03-29 Many-core processor architecture and many-core operating system WO2016159765A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
NL2014533A NL2014533B1 (en) 2015-03-27 2015-03-27 Many-core processor architecture.
NL2014533 2015-03-27
NL2014534 2015-03-27
NL2014534A NL2014534B1 (en) 2015-03-27 2015-03-27 Many-core operating system.

Publications (1)

Publication Number Publication Date
WO2016159765A1 true WO2016159765A1 (en) 2016-10-06

Family

ID=56101773

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NL2016/050214 WO2016159765A1 (en) 2015-03-27 2016-03-29 Many-core processor architecture and many-core operating system

Country Status (1)

Country Link
WO (1) WO2016159765A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5590301A (en) * 1995-10-06 1996-12-31 Bull Hn Information Systems Inc. Address transformation in a cluster computer system
US20020010844A1 (en) * 1998-06-10 2002-01-24 Karen L. Noel Method and apparatus for dynamically sharing memory in a multiprocessor system

Non-Patent Citations (4)

Title
"Grid and cooperative computing - GCC 2004: third international conference, Wuhan, China, October 21-24, 2004. In: Lecture notes in computer science, ISSN 0302-9743; Vol. 3251", vol. 8374, 1 June 2013, SPRINGER VERLAG, DE, ISBN: 978-3-642-24711-8, ISSN: 0302-9743, article AARON LANDWEHR ET AL: "Toward a Self-aware System for Exascale Architectures", pages: 812 - 822, XP055236828, 032548, DOI: 10.1007/978-3-642-54420-0_79 *
ANTONIO BARBALACE ET AL: "Popcorn: a replicated-kernel OS based on Linux", THE 2014 OTTAWA LINUX SYMPOSIUM (OLS '14), 1 July 2014 (2014-07-01), New York, New York, USA, pages 123 - 137, XP055236751, ISBN: 978-1-4503-3238-5, DOI: 10.1145/2741948.2741962 *
BENJAMIN H SHELTON: "Popcorn Linux: enabling efficient inter-core communication in a Linux-based multikernel operating system", 2 May 2013 (2013-05-02), pages 1 - 95, XP055236806, Retrieved from the Internet <URL:https://vtechworks.lib.vt.edu/bitstream/handle/10919/23119/Shelton_BH_T_2013.pdf?sequence=1&isAllowed=y> [retrieved on 20151216] *
PAUL DUBRULLE ET AL: "A Dedicated Micro-Kernel to Combine Real-Time and Stream Applications on Embedded Manycores", PROCEDIA COMPUTER SCIENCE, vol. 18, 1 January 2013 (2013-01-01), AMSTERDAM, NL, pages 1634 - 1643, XP055236836, ISSN: 1877-0509, DOI: 10.1016/j.procs.2013.05.331 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2562520A (en) * 2017-05-17 2018-11-21 John Hamlin Derrick Digital processing connectivity
CN109558248A (en) * 2018-12-11 2019-04-02 中国海洋大学 A kind of method and system for the determining resource allocation parameters calculated towards ocean model
CN109558248B (en) * 2018-12-11 2020-06-02 中国海洋大学 Method and system for determining resource allocation parameters for ocean mode calculation
CN112631593B (en) * 2019-09-24 2022-10-04 无锡江南计算技术研究所 Many-core distributed shared SPM (remote management application) implementation method based on RMA (remote management architecture)
CN112631593A (en) * 2019-09-24 2021-04-09 无锡江南计算技术研究所 Many-core distributed shared SPM (remote management application) implementation method based on RMA (remote management architecture)
CN112738142B (en) * 2019-10-14 2022-11-25 无锡江南计算技术研究所 Data efficient transmission support method for many-core multi-layer storage system
CN112738142A (en) * 2019-10-14 2021-04-30 无锡江南计算技术研究所 Data efficient transmission support method for many-core multi-layer storage system
CN112306663A (en) * 2020-11-12 2021-02-02 山东云海国创云计算装备产业创新中心有限公司 Parallel computing accelerator and embedded system
CN112491426B (en) * 2020-11-17 2022-05-10 中国人民解放军战略支援部队信息工程大学 Service assembly communication architecture and task scheduling and data interaction method facing multi-core DSP
CN112491426A (en) * 2020-11-17 2021-03-12 中国人民解放军战略支援部队信息工程大学 Service assembly communication architecture and task scheduling and data interaction method facing multi-core DSP
CN112463723A (en) * 2020-12-17 2021-03-09 王志平 Method for realizing microkernel array
CN112804335A (en) * 2021-01-18 2021-05-14 中国邮政储蓄银行股份有限公司 Data processing method, data processing device, computer readable storage medium and processor
CN113704169A (en) * 2021-08-12 2021-11-26 北京时代民芯科技有限公司 Embedded configurable many-core processor
WO2023015656A1 (en) * 2021-08-12 2023-02-16 北京微电子技术研究所 Embedded-oriented configurable manycore processor
CN113704169B (en) * 2021-08-12 2024-05-28 北京时代民芯科技有限公司 Embedded configurable many-core processor
WO2023092620A1 (en) * 2021-11-29 2023-06-01 山东领能电子科技有限公司 Risc-v-based three-dimensional interconnection many-core processor architecture and operating method therefor
US11714649B2 (en) 2021-11-29 2023-08-01 Shandong Lingneng Electronic Technology Co., Ltd. RISC-V-based 3D interconnected multi-core processor architecture and working method thereof
CN116185937A (en) * 2022-11-29 2023-05-30 之江实验室 Binary operation memory access optimization method and device based on multi-layer interconnection architecture of many-core processor
CN116185937B (en) * 2022-11-29 2023-11-21 之江实验室 Binary operation memory access optimization method and device based on multi-layer interconnection architecture of many-core processor
CN115933494A (en) * 2022-12-28 2023-04-07 睿尔曼智能科技(北京)有限公司 Robot-oriented embedded isomorphic multi-core control system
CN115933494B (en) * 2022-12-28 2023-11-07 睿尔曼智能科技(北京)有限公司 Robot-oriented embedded isomorphic multi-core control system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16727242

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16727242

Country of ref document: EP

Kind code of ref document: A1