CN118227278A - Scheduling of duplicate threads - Google Patents

Scheduling of duplicate threads

Info

Publication number
CN118227278A
Authority
CN
China
Prior art keywords
thread
critical
result
idle
execution
Prior art date
Legal status
Pending
Application number
CN202311680425.7A
Other languages
Chinese (zh)
Inventor
R·卡马拉吉
V·卡迪亚拉
J·安德鲁
O·奥库尔特
Current Assignee
Imagination Technologies Ltd
Original Assignee
Imagination Technologies Ltd
Priority date
Filing date
Publication date
Application filed by Imagination Technologies Ltd filed Critical Imagination Technologies Ltd
Publication of CN118227278A publication Critical patent/CN118227278A/en

Abstract

The invention relates to scheduling of duplicate threads. A processing system comprises a secure thread scheduling circuit that schedules a checking thread, which is a copy of a critical thread, to execute on a second execution unit of a plurality of parallel execution units, other than the first execution unit on which the critical thread is running. The processing system also includes a comparison circuit that compares the results of the critical thread with the results of the checking thread and generates an error signal if the results do not match. The secure thread scheduling circuit is configured to detect when one of the execution units is idle and, if none of the execution units is detected to be idle at the expiration of a secure time window, to interrupt a non-critical thread executing on a non-idle one of the execution units and select that non-idle execution unit as the second execution unit, so that the checking thread executes in place of the interrupted thread.

Description

Scheduling of duplicate threads
Cross Reference to Related Applications
The present application claims priority from United Kingdom patent applications GB2219357.7 and GB2219359.3, filed on 21 December 2022, which are incorporated herein by reference in their entirety.
Technical Field
The invention relates to scheduling of duplicate threads.
Background
Many modern processing systems have two or more parallel execution units, e.g. independent CPU cores on the same chip, enabling parallel execution of corresponding instruction sequences on different execution units. Each instruction sequence may include one or more threads. Typically, this arrangement is used to run different instruction sequences on each of the execution units in parallel with each other in order to increase the number of unique processing operations that can be performed per unit time.
However, it is also possible to use a second one of the execution units (e.g., cores) to run a duplicate instance of a "critical" thread running on a first one of the execution units. This may be done as a "security check" to check whether the system has executed the critical thread without error. That is, in a processing system having at least two parallel execution units, in addition to parallel execution of different threads, one execution unit may be used to run a redundant instance of a thread running on another execution unit, in order to check whether the system is working as intended (note that parallel execution does not refer herein to interleaving threads on the same execution unit in a time-shared manner, but rather to execution on separate hardware). If both instances of the thread produce the same result, this gives confidence that the processing system is working correctly; if not, it indicates that there is a hardware or software fault (an actual fault in the operation of the processing system, not merely a bug introduced by the developer). For example, a hardware fault may have occurred in the hardware of the processing system, or a random bit flip (e.g., caused by cosmic radiation) may have occurred in the data or code while it was held in memory or registers.
Example applications for such checking may be found in processing systems in autonomous or semi-autonomous vehicles (e.g., automobiles, airplanes, or trains), where the thread to be checked may be configured to perform operations critical to controlling movement of the vehicle, or to outputting critical information (e.g., speed or engine temperature) through a user interface of the vehicle (e.g., through a head-up display, HUD, of the vehicle). Another example application may be found in medical robotics, where a critical thread may be configured to perform operations for controlling a robot to perform actions (e.g., surgical actions) on a human or other living being. In such applications, a cosmic-ray bit flip or hardware fault could have a catastrophic effect.
Some standards may in fact require that, for certain processes, a duplicate process be run on parallel hardware. For example, ASIL-D (Automotive Safety Integrity Level D) is a classification under the international standard ISO 26262 that defines certain safety measures that must be taken in automotive systems for controlling road vehicles.
More generally, "critical" for the purposes of this disclosure means critical to the desired application for which the thread in question is running. In particular, "critical" may refer herein to any thread that desires to run a repeated instance (inspection thread) and at least one result of at least one operation performed by the critical thread is checked against the corresponding result of the same operation performed by the inspection thread. Similarly, the term "security" or "security check" in this context refers only to preventing errors in critical thread execution by means of repeated execution and comparison results. The terms "critical" and "safety" as used herein do not necessarily mean that the safety of humans or other living beings is at risk, although these are certainly examples of safety critical applications.
The processing system may run both critical and non-critical threads at different times. A non-critical thread is a thread for which it is not necessary to run a redundant instance. Critical threads are scheduled in amongst non-critical threads at different times across the multiple execution units (e.g., cores) of the system.
Conventionally, in order to perform a check on a critical thread, the two threads (the critical thread and its checking thread) are executed in "lockstep", meaning that they are executed at exactly the same time. Executing both threads in lockstep helps to capture errors that occur at the same time and at the same point in the code. In practice, the two threads may be offset from each other by a small predetermined number of clock cycles: executing at exactly the same time is fine under the assumption that an error only occurs in one execution unit, but if the same error were to hit both execution units at the same time and in the same state, the error would be masked, which is why in practice one of the threads is typically slightly delayed relative to the other.
Lockstep execution may be set up by an Operating System (OS) running on the processing system, or may be implemented by means of dedicated hardware that locksteps the cores.
Disclosure of Invention
However, a problem with lockstep execution of two threads is that setting up lockstep execution introduces latency. Two instances of the critical thread can only be scheduled to begin executing at the same time (or within a few cycles of each other) if two idle execution units are available at the same time. However, if most or all of the execution units of the processing system are busy executing non-critical threads, it is unlikely that two of the execution units will simply happen to become idle at the same time. In that case, conventionally, the OS or lockstep hardware would have to force two execution units to become idle at the same time. Since execution of different non-critical threads on different execution units is unlikely to be aligned in time, this means that scheduling on one execution unit is deliberately stalled, i.e. it is deliberately kept in an idle state after the end of its last non-critical thread while waiting for the other execution unit to become idle, in order to artificially create a window in which both execution units are idle and the two instances of the critical thread can start at the same time. This therefore results in the introduction of additional idle time, reducing the overall processing throughput of the processing system.
It is desirable to provide a system that enables scheduling of critical threads and their corresponding inspection threads without introducing a strict lockstep idle time penalty, but while ensuring that the inspection threads still execute within some safe time window of the critical threads.
According to a first aspect disclosed herein, there is provided a processing system comprising a plurality of parallel execution units, each operable to execute a respective series of threads, wherein at least some of the threads executed by at least some of the execution units are non-critical threads not specified as critical. The processing system further includes a request buffer operable to receive a request indicating that one of the threads in the respective series executed by the first one of the execution units is designated as a critical thread; and a secure thread scheduling circuit arranged to read requests from the request buffer store and, in response, schedule the inspection thread as a copy of the critical thread to execute on a second execution unit of the plurality of execution units other than the first execution unit. The processing system further comprises a result buffer storage arranged to buffer one of: a first result, the first result being a result of executing a critical thread on a first execution unit; and a second result, the second result being a result of executing the inspection thread on the second execution unit; and a comparison circuit arranged to compare the one of the first and second results from the result buffer with the other of the first and second results and to generate an error signal if the first and second results do not match in dependence on the comparison. The request includes an indication of a secure time window. The secure thread scheduling circuit is configured to detect when at least one of the execution units is idle, and if none of the execution units is detected to be idle at the expiration of the secure time window, interrupt one of the non-critical threads executing on the non-idle one of the execution units, and select the non-idle execution unit as the second execution unit to execute the inspection thread in place of the interrupted thread.
The disclosed system thus allows the main instance of the critical thread to begin executing before its duplicate instance (the checking thread). The disclosed system thereby eliminates the need to artificially hold one execution unit in an idle state waiting for another execution unit to become idle at the same time. This is possible because, rather than requiring lockstep, the security window protects against the situation in which another execution unit does not become idle soon enough for the results of the critical thread to be checked in time. In other words, the ability to interrupt non-critical threads ensures that the checking thread will always be scheduled within a sufficient amount of time. The value of the secure time window may depend on the particular application in question and may be programmable.
If (as will typically be the case) a second execution unit is not immediately available when the main instance of the critical thread begins executing, then, depending on the specific implementation, the checking thread (the secondary or duplicate instance of the critical thread) may interrupt one of the non-critical threads at any time up to the end of the time window; if no execution unit has naturally become available by the end of the window, the checking thread must interrupt a non-critical thread at that point. Preferably, the secure thread scheduling circuitry will schedule the checking thread as soon as an idle execution unit becomes available, or when the secure window expires, whichever is earlier.
In an embodiment, the secure thread scheduling circuitry may be configured to: i) If one of the execution units is detected as idle when the request is read from the request buffer store, then selecting the idle one of the execution units as a second execution unit to begin executing the inspection thread; but ii) if none of the execution units are detected as idle at the time of the read request, waiting and detecting if one of the execution units becomes newly idle before expiration of the secure time window, and if one of the execution units becomes newly idle, then selecting the newly idle execution unit as a second execution unit to start executing the inspection thread; and iii) if none of the execution units becomes idle at the expiration of the secure time window, executing said interrupt to one of the non-critical threads executing on the non-idle ones of the execution units, and selecting the non-idle execution unit as a second execution unit to begin executing the checking thread in place of the interrupted non-idle thread.
If the checking thread begins executing before the security window expires, it may be described herein as being executed "eagerly". Some embodiments disclosed herein may allow additional flexibility regarding how eagerly executed checking threads are handled.
In an embodiment, the secure thread scheduling circuitry may be configured to allow the checking thread to be interrupted if it is being executed eagerly, the checking thread being executed eagerly if its execution starts before the expiration of the secure time window. The interruption of the checking thread may comprise scheduling one or more additional critical or non-critical threads to execute on the second execution unit in place of the checking thread.
In an embodiment, the secure thread scheduling circuitry may be configured to: if the one or more additional threads have not completed at the expiration of a rescheduling time period, interrupt one of the one or more additional threads and resume execution of the checking thread on the second execution unit; otherwise resume execution of the checking thread after the one or more additional threads have completed. The rescheduling time period may expire at: a) the expiration of the secure time window, b) the expiration of the secure time window plus any time already spent executing the checking thread, or c) a time between a) and b).
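By way of illustration only, the following sketch (not limiting, and using hypothetical names) shows how such a rescheduling deadline could be computed for each of options a) to c) above, assuming times are measured in cycles relative to the start of the critical thread:

```cpp
#include <cstdint>

// Illustrative sketch only: computing the rescheduling deadline for an eagerly
// executed (and then interrupted) checking thread. All names are hypothetical.
enum class RescheduleOption { WindowExpiry, WindowExpiryPlusProgress, Intermediate };

// critical_start : cycle at which the main (critical) thread started
// safe_window    : length of the secure time window (safe window, SW), in cycles
// check_progress : cycles the checking thread has already spent executing eagerly
uint64_t reschedule_deadline(uint64_t critical_start, uint64_t safe_window,
                             uint64_t check_progress, RescheduleOption opt) {
    const uint64_t a = critical_start + safe_window;  // option a)
    const uint64_t b = a + check_progress;            // option b)
    switch (opt) {
        case RescheduleOption::WindowExpiry:             return a;
        case RescheduleOption::WindowExpiryPlusProgress: return b;
        case RescheduleOption::Intermediate:             return a + (b - a) / 2;  // any point in [a, b]
    }
    return a;
}
```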
In an embodiment, the secure thread scheduling circuitry may be configured to: when the eagerly executed checking thread is interrupted by another thread, migrate the checking thread to another one of the execution units, other than the first execution unit and the second execution unit.
The concept of eager execution can also be used independently, regardless of how the inspection thread was originally scheduled. For example, regardless of whether the inspection thread starts executing in lockstep with its corresponding critical thread or is delayed until the execution unit becomes idle, the principles of eager execution may be applied to subsequently allow the inspection thread to be interrupted after it has started, so long as the interrupt is before the rescheduling time limit expires. For example, in an embodiment, the system may wait for two execution units to become available at the same time, as in the prior art. Or the checking thread may even interrupt another non-critical thread immediately at the beginning of the corresponding critical thread (or just a few cycles after the beginning of the corresponding critical thread) so that they begin executing in a lockstep fashion. Even in this scenario, this does not exclude that the inspection thread may then be interrupted by another thread, if there is still enough time available (according to the restrictions defined by the application) to reschedule the inspection thread and obtain its results. Having the ability to interrupt such a check thread that is "eager" to execute (i.e., a check thread that starts executing earlier than necessary) would advantageously allow more opportunities to schedule the thread, and thus more flexibility in thread scheduling in a processing system that includes multiple execution units (e.g., multiple cores), such as allowing more opportunities to hide the latency, regardless of whether the check thread originally interrupted another thread.
Thus, according to a second aspect disclosed herein, which may be used independently of or in combination with the first aspect, there is provided a secure thread scheduler configured to schedule a checking thread for a critical thread running on one of a plurality of execution units, the checking thread being a copy of the critical thread. The secure thread scheduler may be configured to schedule the checking thread to begin running on a second execution unit of the plurality of execution units before the end of the secure time window for scheduling the checking thread. Further, the secure thread scheduler may be configured to allow the checking thread to be interrupted by another thread while the checking thread runs on the second execution unit, and to reschedule the checking thread to resume no later than when the rescheduling time limit expires.
In an embodiment, the interrupted inspection thread may be rescheduled again on the same second execution unit (e.g., if another thread completes before the rescheduling time limit ends, or if another thread is interrupted to resume the inspection thread). Alternatively, the inspection thread may be rescheduled by migrating the inspection thread to a third execution unit other than the first execution unit and the second execution unit.
The secure thread scheduler or processing system may be embodied in hardware on an integrated circuit. A method of manufacturing, in an integrated circuit manufacturing system, a processing system according to any of the embodiments disclosed herein may be provided. An integrated circuit definition dataset may be provided that, when processed in an integrated circuit manufacturing system, configures the system to manufacture a processing system. A non-transitory computer readable storage medium may be provided having stored thereon a computer readable description of a processing system that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying a processing system according to any of the embodiments disclosed herein.
An integrated circuit manufacturing system may be provided, the integrated circuit manufacturing system comprising: a non-transitory computer readable storage medium having stored thereon a computer readable description of a processing system; a layout processing system configured to process the computer-readable description to generate a circuit layout description of an integrated circuit embodying the processing system; and an integrated circuit generation system configured to fabricate the processing system according to the circuit layout description.
Computer program code for performing any of the methods described herein may be provided. A non-transitory computer readable storage medium may be provided having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform any of the methods described herein.
As will be apparent to those skilled in the art, the above features may be suitably combined, and may be combined with any of the aspects of the examples described herein.
This summary is provided merely to illustrate some of the concepts disclosed herein and the specific implementations thereof that are possible. Not all statements in this section are intended to limit the scope of the present disclosure. Rather, the scope of the present disclosure is limited only by the claims.
Drawings
Examples will now be described in detail with reference to the accompanying drawings, in which:
Figure 1 is a schematic block diagram of a processing system including more than one execution unit,
Figure 2 is a timing diagram illustrating an example of scheduling of non-critical threads,
Figure 3 is a timing diagram illustrating an example of a conventional method of scheduling non-critical and critical threads,
Figure 4 is a timing diagram illustrating an example of a method of scheduling non-critical and critical threads according to embodiments disclosed herein,
Figure 5 is a timing diagram illustrating an example of another method of scheduling non-critical and critical threads according to embodiments disclosed herein,
Figure 6 is a schematic block diagram of a processing system according to an embodiment disclosed herein,
Figure 7 is another schematic block diagram of a processing system according to embodiments disclosed herein,
Figure 8 is a schematic block diagram of a computer system in which a graphics processing system is implemented,
Figure 9 is a schematic block diagram of an integrated circuit manufacturing system for generating an integrated circuit implementing a graphics processing system,
FIG. 10 is a schematic block diagram of an example arrangement of a memory access queue, and
FIG. 11 is a schematic block diagram of an example of an alternative arrangement of memory access queues according to certain embodiments disclosed herein.
The figures illustrate various examples. Skilled artisans will appreciate that element boundaries (e.g., blocks, groups of blocks, or other shapes) illustrated in the figures represent one example of boundaries. In some examples, it may be the case that one element may be designed as a plurality of elements, or that a plurality of elements may be designed as one element. Where appropriate, common reference numerals have been used throughout the various figures to indicate like features.
Detailed Description
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
Embodiments will now be described by way of example only.
Consider a chip with two (or more) identical CPUs (named CPU 0 and CPU 1), where it is desirable to check that the output of software running on one of these CPUs does not contain a hardware-induced fault. The current state of the art is to run the same software on both sets of hardware simultaneously and compare the outputs to check that they are the same. This may help to meet the ASIL-D safety standard, for example. However, a limitation of this approach is that it requires duplicated hardware to achieve the desired result, and the additional hardware is not used to do useful work.
In accordance with the present disclosure, a new component, nominally referred to as a Secure Thread Scheduler (STS), is provided to balance the need to run useful work against the need to check for hardware faults within a defined time (which may be programmable). It is common for a CPU to have idle time. Thus, rather than scheduling two identical threads in parallel on two identical pieces of hardware, one of the threads may be delayed until there is free time available for the safety check, provided the "secure thread scheduler" is able to ensure that the check is completed within an acceptable amount of time from the execution of the first thread.
To facilitate this, the inputs to and outputs from the safety-critical thread on CPU 0 may be captured, for example in main memory or a private memory, so that the inputs can be replayed identically to CPU 1 and the outputs can be compared with the outputs of CPU 1. In some embodiments, the inputs and/or outputs may be stored and compared in compressed form. The inputs/outputs of CPU 0 and CPU 1 are unlikely to change frequently, and thus it should be possible to achieve high compression of the inputs and outputs, reducing the space they occupy in memory and reducing the area impact.
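Purely as an illustrative sketch (not part of the claimed hardware, and with all names hypothetical), capturing the outputs as a list of memory accesses and reducing them to a compact signature for comparison could look as follows:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record of one memory access made by the safety-critical thread.
struct MemAccess { uint64_t address; uint64_t data; };

// Reduce a captured sequence of accesses to a compact signature (FNV-1a style
// running hash), so that only the signature needs to be stored and compared.
uint64_t signature(const std::vector<MemAccess>& accesses) {
    uint64_t h = 1469598103934665603ull;             // FNV offset basis (64-bit)
    for (const MemAccess& a : accesses) {
        h = (h ^ a.address) * 1099511628211ull;      // FNV prime
        h = (h ^ a.data)    * 1099511628211ull;
    }
    return h;
}

// Compare the captured output of CPU 0 against the replayed run on CPU 1.
bool outputs_match(const std::vector<MemAccess>& cpu0_out,
                   const std::vector<MemAccess>& cpu1_out) {
    return signature(cpu0_out) == signature(cpu1_out);
}
```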
FIG. 1 illustrates, more generally, a schematic block diagram of a processing system 100 including a plurality of parallel execution units. The processing system 100 comprises at least two parallel execution units 102_0, 102_1; and more generally may include any plural number K of parallel execution units 102_0 … 102_K-1, as shown later in FIG. 6. In an embodiment, each execution unit may include a respective pipeline. Preferably, the execution units 102_0 … 102_K-1 are identical in structure to each other.
The processing system 100 further includes a program memory 104 and a data memory 105 to which each of the execution units 102_0 … 102 _102_k-1 is operatively coupled. Processing system 100 may all be integrated onto the same Integrated Circuit (IC), i.e., the same die (chip); or alternatively may be distributed over more than one IC in the same IC package, or even over different IC packages on the same circuit board or on different circuit boards. The execution units 102_0 … 120_k-1 may be implemented on the same IC or on different ICs, or some on the same IC and some on different ICs. Each of the program memory 104 and the data memory 105 may be implemented in any one or more memory devices employing one or more storage media, such as electronic media, for example, RAM (static or dynamic), ROM, EEPROM, flash memory, or a solid state drive; or a magnetic medium such as a hard disk drive. The program memory 104 may be implemented on the same IC as one, more or all of the execution units 102_0 … 102 _102_k-1, or external, or in combination with some memory and some external memory on the same IC. Similarly, the data memory 105 may be implemented on the same IC as one, more or all of the execution units 102_0 … 102 _102_k-1, or externally, or in combination with some memory and some external memory on the same IC. Program memory 104 and data memory 105 may be implemented in different memory devices from each other, or may be implemented as different regions in the same memory device, or as a combination of the same and different devices. Note also that either or both of the memories 104, 105 may include one or more levels of cache.
In one embodiment, the different execution units 102_0 … 102 _102_k-1 are different CPU cores of the same multi-core processor integrated into one IC package. For convenience, the execution unit 102_0 … 102 _102_k-1 may be referred to hereinafter as a "core," although it should be understood that this is not limiting, and that any reference to a core may be more generally replaced with an "execution unit" in the description of any embodiment. In an embodiment, the program memory 104 and the data memory 105 may be implemented in a local memory integrated into the same IC as the core 102_0 … 102 _102_k-1, but this is again not limiting. Further, note again that the illustration of separate program and data memories for purposes of illustration does not necessarily mean that they are implemented as distinct blocks: this is of course a possibility, or alternatively they may be just different regions in the same physical block of memory (where each region may comprise a contiguous or non-contiguous set of addresses in the address space).
In operation, each of the cores 102_0 … 102_K-1 is arranged to execute a respective sequence of instructions fetched from the program memory 104. Each sequence is formed of one or more threads (each thread comprising a sub-sequence of instructions), such that each core 102_0 … 102_K-1 executes a respective series of threads. The cores (or, more generally, execution units) 102_0 … 102_K-1 are described as being parallel to one another because each core 102_k is operable to execute its respective series of threads in parallel with each of the other cores 102_0 … 102_k-1, 102_k+1 … 102_K-1. That is, parallel means that the series executed by each execution unit may be executed in parallel (at least partially overlapping in time) with the series executed by each other execution unit of the plurality of execution units.
Each core 102_0 … 102_K-1, upon executing the instructions of each thread in its respective series of threads, may load respective data from the data memory 105, operate on the data, and store data to the data memory 105. The loading and storing may be accomplished by executing load and store instructions referencing source and destination addresses, respectively, for loading data from the data memory 105 and storing data into the data memory 105.
One or more of the cores 102_0 … 102_k-1, or a separate host or main CPU (not shown), or a combination thereof, may run an Operating System (OS). In general, in a multi-core processor, the OS may be responsible for scheduling threads to run on different cores 102_0 … 102 _102_K-1 (i.e., selecting which thread runs on which core and when). Alternatively, scheduling may be accomplished by dedicated scheduling hardware or a combination of OS and dedicated scheduling hardware. In accordance with the disclosure herein, at least the scheduling of inspection threads (i.e., repeated instances of critical threads) is accomplished by a dedicated hardware thread scheduler, as will be discussed in more detail later.
The processing system 100 runs one or more threads that are designated as critical, from among a plurality of non-critical threads, which are any threads that are not designated as critical. In general, critical and non-critical threads may run on any of the cores 102_0 … 102_K-1 in any combination. Thus, each core executes a respective series of threads comprising a plurality of non-critical threads, and the series executed by at least one of the cores comprises at least one critical thread. Furthermore, a copy of each critical thread (i.e., its checking thread) needs to be scheduled to execute on a different core than the corresponding main instance of that critical thread.
By way of example, FIG. 2 shows a scenario in which each of the two cores 102_0 and 102_1 executes a respective series of threads which, up to the point shown in FIG. 2, includes only non-critical (NC) threads. In the example shown, the series executed by core 0 includes non-critical threads NC_m,0; NC_m+1,0; and NC_m+2,0. The series executed by core 1 includes non-critical threads NC_n,1; and NC_n+1,1. In each series, the threads are scheduled one after another, i.e., idle time between non-critical threads on either of the cores 102_0, 102_1 is avoided, in order to maximize the processing throughput of the system. That is, in the series on core 0, once NC_m,0 has completed execution, the next thread NC_m+1,0 starts executing as soon as possible, preferably in the next processing cycle or only a few cycles later. Similarly, in the series executed by core 1, once NC_n,1 has completed execution, the next thread NC_n+1,1 begins as soon as possible thereafter, and so on.
This arrangement is effective at avoiding idle time between threads while non-critical threads are executing. However, a problem arises when a critical thread needs to be scheduled for execution, because a duplicate instance of the critical thread then needs to be scheduled on another core so that its output can be checked against that of the main instance of the critical thread. Conventionally, to do this, the operating system on the processor must schedule the two instances of a given critical thread at exactly the same time (or, in practice, usually offset from each other by a small predetermined number of cycles), i.e. in so-called "lockstep".
However, scheduling two threads in lockstep requires that two cores be available at the same (or nearly the same) time. Assuming the processing system 100 is typically kept busy, the likelihood that two cores simply happen to become idle at the same time is low (i.e., execution of different non-critical threads on different cores is unlikely to be aligned in time). Thus, to be able to schedule a critical thread and its duplicate instance (checking thread) in lockstep, the operating system or scheduling hardware must deliberately keep one core in an idle state for a certain period after it completes its most recent thread, while waiting for the other core to complete its own current thread and also become idle. This results in wasted processing resource on the core that is kept idle.
This problem is illustrated by way of example in FIG. 3. For example, suppose that a critical thread CTp is scheduled to run after non-critical thread NC_m+2,0 on the first core (core 0 (102_0)). The instance of the critical thread CTp that is scheduled autonomously by the OS as part of an application (as opposed to the duplicate instance that is scheduled for checking purposes only) may be referred to herein as the "main instance" of the critical thread, or simply the "critical thread". Furthermore, the OS or scheduling hardware must schedule the duplicate instance CTs of the critical thread to run on a different core than the main instance CTp. The duplicate instance CTs may also be referred to herein as the "secondary instance" of the critical thread, or the "checking thread". For example, if there are only two cores as shown in FIG. 1, then the secondary instance (checking thread) CTs would have to run on the second core (core 1 (102_1)). The processing system 100 has plenty of work to do, so if no safety check were required, non-critical (NC) threads would be scheduled one after another as shown in FIG. 2. However, when a checking thread is to be scheduled, which conventionally has to be executed in lockstep with the critical thread (i.e., the main instance), the OS or scheduling hardware has to ensure that both cores 102_0, 102_1 become idle at the same time. In the example shown, the end of the thread NC_m+2,0 scheduled immediately before the critical thread CTp on the first core 102_0 does not coincide in time with the end of any of the threads NC_n,1; NC_n+1,1; … running on the second core 102_1 (i.e., the two cores do not simply happen to become idle at the same time; this could occur, but is unlikely in any given case). Thus, instead, the OS or scheduling hardware has to artificially hold the second core 102_1 in the idle state for a certain period 302 after its previous thread NC_n+1,1, until the first core 102_0 also becomes idle after NC_m+2,0 has completed. Then the main instance CTp and the secondary instance CTs of the critical thread can start in lockstep on the two cores 102_0, 102_1.
In other words, the OS or scheduling hardware must deliberately keep one core idle for a period of time 302 while the OS or scheduling hardware waits for the other core to complete what it is doing so that both cores can execute two instances of a critical thread in lockstep.
Thus, conventionally, executing a critical thread may result in additional forced idle time 302, which introduces undesirable latency to the processing by the second core 102_1 and thus reduces the processing throughput of the system 100. In other words, it wastes potentially available processing resources of the second core. In a pipelined execution unit, this may also be described as a "pipeline bubble". It is desirable to avoid or at least mitigate this problem.
One of the non-critical threads could instead be interrupted in order to run the checking thread in its place. For example, in the example of FIG. 3, the thread to be interrupted might be a hypothetical thread NC_n+2,1 on core 102_1, which is not shown in the example of FIG. 3, but which follows NC_n+1,1 and would otherwise still be running when core 102_0 finishes running thread NC_m+2,0. However, this would require saving the program state of the interrupted non-critical thread to memory and then, after the checking thread has completed, reloading the program state and resuming execution of the interrupted non-critical thread. This would incur an undesirable amount of additional software overhead.
If there are more than two cores 102_0 … 102_K-1, the OS or scheduling hardware may select any of a number of other cores (other than the core on which the main instance CTp is running) to execute the checking thread CTs, without having to select the second core 102_1. But assuming that these cores are all currently busy, the OS or scheduling hardware would still have to intervene on one of those cores to run the checking thread CTs, and thus the same problem would occur.
Fig. 6 shows an improved design of a processing system 100 according to the present disclosure that solves the above-mentioned problems.
The disclosed processing system 100 includes a plurality of cores 102_0 … 102_k-1 as described with respect to fig. 1, as well as a data memory 105 and a program memory 104 (not shown in fig. 6). The processing system 100 also includes a request buffer 602, a secure thread scheduling circuit 608, a result buffer 610, and a comparison circuit 612.
Each of the cores 102_0 … 102_k-1 is operatively coupled to the data store 105 and the result buffer storage 610, for example, via a suitable interconnect 606, such as a cross-interconnect. Each of the cores 102_0 … 102_k-1 is also operatively coupled to the request buffer 602. The secure thread scheduling circuit 608 is operatively coupled to the request buffer 602. The comparison circuit 612 is operatively coupled to the result buffer memory 610.
The secure thread scheduling circuit 608 is implemented in dedicated fixed function hardware. For brevity, it may also be referred to herein as a Secure Thread Scheduler (STS). The comparison circuit 612 is also implemented in dedicated fixed function hardware. It may also be referred to herein as comparison logic. Request buffer 602 may be implemented in any form of one or more temporary storage devices, such as RAM or one or more hardware buffers. The request buffer 602 may be part of the general purpose data store 105 or a stand alone device such as a dedicated RAM or a combination thereof. In an embodiment, the request buffer storage 602 may include separate individual request buffers 602_0 … 602_k-1 for each of the cores 102_0 … 102_k-1, respectively. The results buffer storage 610 may also be implemented in any suitable form of one or more memory devices, such as RAM or one or more hardware buffers. The results buffer 610 may be part of the general purpose data store 105 or a stand alone device such as a dedicated RAM or a combination thereof. It may be part of the same memory device as the request buffer 602, either alone or in combination. In an embodiment, the result buffer storage 610 may include separate individual result buffers 610_0 … 610_k-1 for each core 102_0 … 102_k-1, respectively. In some such embodiments, the result buffer 610_0 … 610 _610_k-1 for each core 102_0 … 102_k-1 includes a respective Memory Access Queue (MAQ) for the respective core.
In operation, the OS (or potentially some other scheduling software or component) may schedule non-critical threads and critical threads (main instances) across the cores 102_0 … 102_K-1 in the normal manner. In an embodiment, little modification to the OS software is required. However, in addition, when a critical thread CTp is scheduled, a request is written to the request buffer store 602 to indicate that the thread is to be treated as critical. The request may be received from software executing on one of the cores 102_0 … 102_K-1 or from another execution unit (not shown) that is part of the processing system, such as a host or main CPU. In an embodiment, the request is written by the OS, but the request could potentially be written by some other scheduling software or component. In an embodiment, the OS writes the request by calling a special function, which may be named _Safe.
In an embodiment, the request is written to the separate individual request buffer 602_k associated with the respective core 102_k on which the main instance of the critical thread CTp is to run. However, alternative implementations in which requests for multiple critical threads across multiple cores are buffered in a common request buffer are not excluded.
Regardless of the manner in which the request is written, the request in the request buffer 602 identifies the thread CTp that is to be treated as critical, e.g. by means of a thread ID and/or an entry PC (program counter value). It may also include any other starting register values for the thread in question. In addition, the request includes an indication of the duration of the secure time window, which for brevity may also be referred to herein as the "safe window" (SW). The safe window is defined as the maximum time after the start of the main instance CTp by which the duplicate instance CTs of the critical thread (i.e., the checking thread) must start, as will be discussed in more detail later.
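Purely by way of illustration (the field names and widths here are assumptions, not taken from the embodiments above), an entry in the request buffer could be modelled as follows:

```cpp
#include <cstdint>

// Hypothetical layout of one request buffer entry. The entry identifies the
// critical thread (e.g. by thread ID and/or entry program counter), optionally
// carries starting register values, and indicates the safe window duration.
struct SafeThreadRequest {
    uint32_t thread_id;      // identifies the thread designated as critical
    uint64_t entry_pc;       // entry program counter for the checking thread
    uint64_t start_regs[8];  // optional starting register values (count is illustrative)
    uint32_t safe_window;    // safe window (SW) duration, e.g. in cycles, measured
                             // from the start of the main instance CTp
    uint8_t  primary_core;   // core on which the main instance is scheduled, so the
                             // STS avoids scheduling the checking thread there
};
```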
The Secure Thread Scheduler (STS) 608 is arranged to read requests from the request buffer 602. In response, based on the information in the request, the STS 608 will schedule the corresponding checking thread CTs (i.e., the secondary, duplicate instance of the critical thread CTp identified in the request). In other words, the request informs the STS 608 that it should attempt to schedule the checking thread from that point in time onwards. It also informs the STS 608 of the core on which the main thread is scheduled, so that the STS avoids scheduling the checking thread on the same core. Based on this information, the secure thread scheduler 608 will schedule the checking thread CTs to execute on one of the cores 102_0 … 102_K-1 other than the core on which the main instance CTp is scheduled by the OS to execute (the STS 608 has hardware that keeps track of where the main thread is scheduled; e.g. this information may be written by the OS into STS registers). Thus, if the core on which the main instance CTp is scheduled is referred to as the first core (e.g., 102_0), the STS 608 schedules the corresponding checking thread CTs to run on a second one of the cores (e.g., 102_1).
The processing system 100 also includes hardware 103_0 … 103_K-1 for each of one, some, or all of the cores 102_0 … 102_K-1, respectively, to indicate to the STS 608 when the respective core is idle. This may be described as an idle flag for each core, although the term is not intended to be necessarily limited to a single bit or any particular form of signal (although only a single bit is required). For example, in an embodiment, the idle flag 103_0 … 103_K-1 for each core 102_0 … 102_K-1 may be implemented in one or more memory-mapped registers accessed by the STS 608. An alternative is to provide dedicated signal lines from each core 102_0 … 102_K-1 to the STS 608.
Based on the idle flags 103_0 … 103_K-1, the Secure Thread Scheduler (STS) 608 is configured to detect when at least one of the cores 102_0 … 102_K-1 is idle (i.e., to monitor whether a core is idle). In an embodiment, the STS 608 is configured to detect when each of some or all of the cores 102_0 … 102_K-1 is idle (i.e., to monitor whether each individual core is idle).
The processing system 100 also includes a timer 614 operatively coupled to the STS 608. The timer 614 may be a dedicated hardware timer for scheduling the checking threads CTs, or a general-purpose hardware timer shared with other functions, or a dedicated or general-purpose software timer running on one or more of the cores 102_0 … 102_K-1 or on a separate host or main CPU, or a combination thereof. Regardless of the manner in which it is implemented, the timer 614 enables the STS 608 to determine the current time and, thus, to compare the current time with the time at which a specified security window SW expires. Any scheduling described herein in relation to the security window will be understood to be performed by reference to the timer 614, and for brevity this will not be repeated each time the security window is referred to.
Based on the information from the flag 103_0 … 103_k-1 and the timer 614, the STS may determine whether any of the cores 102_0 … 102_k-1 is idle by the time the security window expires. If not, STS 608 will interrupt one of the non-critical threads executing on the non-idle one of the cores and schedule the checking thread CTs to execute on that core in place of the interrupted thread.
The security window SW runs from the beginning of (the main instance of) the critical thread CTp. STS 608 thus determines whether a second one of cores 102_0 … 102 _102_K-1 has become available (e.g., idle) to begin execution of the corresponding inspection thread CTs within a secure window that runs from the beginning of execution of the main instance of critical thread CTp on the first core. As described above, the security window is defined as the maximum time that the checking thread CTs must start after the start of the master instance. In practice, it is important that the repeated instance is completed no later than the end of a period of time having the same length as the security window and starting from the completion of the main instance, since in practice there will be a time when the result of the checking thread CTs has to become available in order to perform a check on the result of the main instance CTp, according to the requirements of the security critical application in question. However, since STS 608 needs to know the latest time to begin checking thread CTs, it works according to the start time, assuming that the execution of the threads is deterministic, and thus the total time taken to execute each thread is fixed. The value (i.e., duration) of the security window may depend on the application or the particular thread in question. The selected duration may be programmed with requests written to the request buffer 602 on a thread-by-thread basis for each critical thread.
If a core 102_k becomes available (i.e., idle) before the security window SW expires, the STS 608 may, in principle, schedule the checking thread CTs to start on that core at any time between the core becoming idle and the expiration of the corresponding security window, depending on the particular implementation.
However, preferably, the STS 608 may be configured to select an idle one of the cores 102_0 … 102_K-1 to begin executing the checking thread if one of the cores is detected as idle upon reading the request from the request buffer 602. Alternatively, it is not excluded that the STS may wait even where an idle execution unit is currently available when the request is read from the request buffer store, for example in order to keep that unit idle for some other upcoming purpose (if known), or in order to avoid unnecessarily delaying other threads, where it is expected (or desired) that it will still be possible to execute the checking thread on a core that has no other work to do before the end of the security window.
If none of the cores 102_0 … 102_K-1 is immediately available at the time the request is read, the STS 608 will wait until a core becomes idle. If a core does become idle before the end of the security window, the checking thread may be scheduled to run on that core at any time between the core becoming idle and the expiration of the security window. Preferably the checking thread is started as soon as a suitable core becomes available, but it is also not excluded that the STS may wait until closer to the expiry of the security window.
If no execution units do become idle before the end of the secure window, the secure thread scheduler will interrupt one of the execution units (if possible).
Most preferably, the STS 608 is configured to: i) if one of the cores 102_0 … 102_K-1 is detected as idle when the request is read from the request buffer 602, select the idle core to begin executing the checking thread; but ii) if none of the cores 102_0 … 102_K-1 is detected as idle when the request is read, wait and detect whether one of the cores becomes newly idle before the expiration of the security window, and if so, select the newly idle core to begin executing the checking thread; and only iii) if none of the cores 102_0 … 102_K-1 has become idle by the expiration of the security window, interrupt one of the non-critical threads executing on a non-idle one of the cores and select that non-idle core to begin executing the checking thread in place of the interrupted non-critical thread.
In other words, the STS 608 attempts to pick an idle core if one is available, but otherwise may interrupt a non-critical thread. If no idle core is currently available for the duplicate instance of the critical thread, the STS will wait until an idle core becomes available; but if no idle core becomes available before the end of the security window, the STS will interrupt another non-critical thread.
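A minimal behavioural sketch of this selection policy is given below (for illustration only: the STS is fixed-function hardware, and the structure and names used here are assumptions rather than a description of any particular embodiment):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Per-core state visible to the scheduler (e.g. via the idle flags 103_x).
struct CoreState { bool idle; bool running_non_critical; };

// Returns the core chosen for the checking thread, or nothing if the decision
// must be deferred (keep waiting for an idle core until the window expires).
std::optional<int> choose_check_core(const std::vector<CoreState>& cores,
                                     int primary_core,      // core running the main instance
                                     uint64_t now,
                                     uint64_t window_expiry) {
    // i) / ii): prefer any core, other than the primary, that is already idle.
    for (int k = 0; k < static_cast<int>(cores.size()); ++k) {
        if (k != primary_core && cores[k].idle) return k;
    }
    if (now < window_expiry) return std::nullopt;  // keep waiting for an idle core

    // iii): the security window has expired, so interrupt a non-critical thread.
    for (int k = 0; k < static_cast<int>(cores.size()); ++k) {
        if (k != primary_core && cores[k].running_non_critical) return k;
    }
    return std::nullopt;  // no interruptible core: the emergency case discussed later
}
```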
The effect of the described system is shown by way of example in FIG. 4 (compare the conventional case of FIG. 3). This advantageously hides the idle time, since the STS 608 may delay the scheduling of the duplicate instance (i.e., the checking thread) CTs and exploit an opportunity to schedule the checking thread when the STS sees one.
The program state of any interrupted thread is saved to memory (e.g., SRAM) so that the program state can be restored again later. Once the checking thread is complete, the interrupted non-critical thread may resume on the same second core (e.g., 102_1) as the core on which the checking thread CTs was run. Alternatively, if and when another core becomes available, the interrupted non-critical thread may resume on another one of the cores 102_0 … 102_K-1.
The results buffer storage 610 is arranged to receive and buffer either or both of: at least one first result, which is the result of executing (a main instance of) the critical thread CTp on the first core (e.g. 102_0); and at least one second result, which is the result of executing the corresponding inspection thread CTs on the second core (e.g., 102_1). In an embodiment, the results of only one of the two threads (e.g., the main thread) need be buffered in the results buffer storage 610; and compare logic 612 compares the buffered results of one thread (e.g., the main thread CTp) with the incoming results of another thread (e.g., the secondary checking thread CTs) received via interconnect 606. Alternatively, in other implementations, it is not precluded that the results of both threads may be buffered and compared from the results buffer 610.
The result of the two thread instances CTp, CTs may be their respective memory accesses, i.e. reads and/or writes to memory addressed locations in the data memory 105, or signatures based thereon. This may mean that the result includes either the address or the payload data of the load/store, or both. Preferably, at least addresses of both loads and stores are captured. Alternatively or additionally, the result may include one or more other pieces of information, such as operand data generated by the operation or operations performed by the threads CTp, CTs; or architectural state such as one or more register values resulting from execution of the thread. The results are automatically written out to the results buffer 610. In an embodiment, the results of each thread instance CTp, CTs to be buffered are written to a respective result buffer associated with the respective core on which the thread instance is executing, such as a Memory Access Queue (MAQ), in a separate result buffer 610_0 … 610_k-1. In an embodiment, a direct connection 601_k from each core 102_k to its respective result buffer storage 610_k may be provided to allow it to buffer its results directly, rather than via the interconnect 606.
Note that, in an embodiment, the checking thread CTs does not actually access the memory itself; only the main thread CTp accesses the memory, and the core running the checking thread then fetches data from the memory access queue (a FIFO). For example, suppose the main thread and the checking thread both access memory location 0x5: the actual access to 0x5 is performed by the primary core (such as CPU 0) and the result (as expected) is loaded into the memory queue. The checking thread (e.g., on CPU 1) simply re-uses the data in the memory queue, rather than accessing memory location 0x5 again. This is further aided by an embodiment in which the checking thread is scheduled at least a few cycles later than the main thread.
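A simplified software model of this behaviour, with an illustrative interface and hypothetical error handling (the real mechanism is implemented in hardware), might look like the following:

```cpp
#include <cstdint>
#include <deque>
#include <stdexcept>

// Model of a memory access queue (FIFO): the main thread performs the real
// load and records the returned data; the checking thread, running a few
// cycles behind, reuses that data instead of re-accessing memory.
class MemoryAccessQueue {
public:
    // Called on behalf of the main (critical) thread after a real memory load.
    void record_load(uint64_t address, uint64_t data) {
        fifo_.push_back({address, data});
    }

    // Called on behalf of the checking thread for the corresponding load; the
    // address is checked against the recorded one and the recorded data reused.
    uint64_t replay_load(uint64_t address) {
        if (fifo_.empty())
            throw std::runtime_error("checking thread ran ahead of main thread");
        Entry e = fifo_.front();
        fifo_.pop_front();
        if (e.address != address)
            throw std::runtime_error("address mismatch: raise the error signal");
        return e.data;
    }

private:
    struct Entry { uint64_t address; uint64_t data; };
    std::deque<Entry> fifo_;
};
```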
In one particular implementation, the OS needs to call another function, which may be named _Unsafe, to tell the STS 608 to stop recording results to the result buffer 610 (e.g., a memory access queue). In such an embodiment, the STS 608 needs to be informed by the OS when the main instance of the critical thread CTp has completed, because the STS itself has no concept of a thread; it simply keeps duplicating the instruction sequence from the main instance until it is told to stop. Alternatively, however, it is not precluded that in other implementations the STS 608 may be configured to automatically detect when the main instance has completed and automatically stop recording results in response.
Regardless of the manner in which the results are collected, the comparison logic 612 is arranged to compare the first result and the second result based on the result buffer storage 610 and to generate an error signal if the first result and the second result do not match according to the comparison. In an embodiment, only the results of one of the primary and secondary threads (e.g., the primary thread CTp) are buffered in the results buffer 610 (e.g., in the respective results buffer 610—k (such as MAQ) associated with the core 102—k running that thread). Comparison logic 612 then compares the buffered results of the thread with the incoming results of the other of the two threads (e.g., secondary thread CTs), which may be received by comparison logic 612 from core 102 running the other thread via interconnect 606. For example, in an embodiment, only the main thread CTp needs to be buffered, and the inspection thread CTs is inspected when the transaction is generated in real-time (or near real-time). However, in alternative implementations, it is not precluded that the results of both the primary and secondary (e.g., check) threads may be buffered, and comparison logic 612 compares the buffered results from both threads of result buffer store 610.
In any event, the comparison is done in hardware by comparison logic 612. If there is no match, an error signal is generated. Depending on the particular implementation, the error signal may be a signal to software (e.g., an OS or an application) or a hardware error processor (not shown) or both. In an embodiment, the error signal may include an exception issued to an exception handler (not shown) (e.g., interrupt controller) so that the system 100 reacts accordingly. The exception handler may be implemented in hardware or software (e.g., as part of an OS) or a combination thereof.
For example, the processing system 100 (e.g., via an exception handler) may be configured to perform any of the following in response to an error signal (e.g., an exception). It may output a warning through a user interface to which the processing system 100 is connected. Alternatively or additionally, the processing system 100 may disable one or both of the cores 102_0, 102_1 on which the primary and secondary instance CTp, CTs of the critical thread are executed, but continue executing the thread on at least one remaining core. Alternatively, it may stop execution across the entire processing system 100 (i.e., across all cores 102_0 … 102 _k-1). Another possibility is that the system 100 schedules the critical thread CTp and the inspection thread CTs to execute again at least once on the same or different one of the cores and repeats the comparison each time. Then, if an error signal is still obtained, this may indicate a hardware failure, so the system may take action, such as outputting a warning or disabling one or more of the cores or the entire system. But if an error signal is no longer encountered after one or more iterations, this may indicate that the error is due to random bit flipping (e.g., due to cosmic radiation), so the system may continue normal execution once successful iterations of the critical thread are achieved.
In some embodiments, the system 100 may run the critical thread CTp and the check thread CTs on different combinations of cores in order to attempt to track down which core is faulty. If only one pair of threads is run on a single pair of cores, it cannot be determined whether the first core or the second core experienced the failure, or even whether both did. However, if the system runs CTp and CTs on the first core 102_0 and the second core 102_1, respectively, and gets an error, then, for example, tries again running CTp on the first core but CTs on a third core 102_2, and no error is encountered, this may indicate that the second core 102_1 has experienced a failure. Similarly, if the system tries again with CTs on the same second core but CTp on a new third core, and no error is obtained, this indicates that the fault is on the first core. If the conclusion is that a faulty core has been found (possibly after further attempts to exclude random bit flipping), the faulty core may be shut down. The retry may be performed automatically by STS 608, or in alternative implementations explicitly by the OS or other software, or by a combination of the STS and the OS/software.
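A hedged sketch of this core-permutation diagnosis is given below; run_pair is a placeholder for whatever retry mechanism the STS or the OS actually provides, and the core indices follow the example above:

    #include <optional>

    // Hypothetical helper: runs the critical thread on core `primary` and its
    // check thread on core `checker`, returning true if the two results matched.
    bool run_pair(int primary, int checker);

    // Attempt to localise a fault after a mismatch between core 0 and core 1.
    // Returns the index of the suspect core, or nothing if inconclusive.
    std::optional<int> localise_fault() {
        if (run_pair(/*primary=*/0, /*checker=*/1)) {
            return std::nullopt;   // no longer reproduced: possibly a random bit flip
        }
        if (run_pair(/*primary=*/0, /*checker=*/2)) {
            return 1;              // swapping out core 1 removed the error: suspect core 1
        }
        if (run_pair(/*primary=*/2, /*checker=*/1)) {
            return 0;              // swapping out core 0 removed the error: suspect core 0
        }
        return std::nullopt;       // still failing: further attempts needed
    }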
The response to the error signal may depend on the particular implementation and/or the particular situation.
In an embodiment, STS 608 may be configured to raise an emergency if no cores become idle to execute a check thread when security window SW expires, and no cores are running non-critical threads that may be interrupted. In other words, an emergency situation may arise if all cores are busy executing an uninterruptible thread, such as a critical thread or a checking thread for which the security window has expired. An emergency situation is another type of error signal and may also be referred to as an emergency signal. The response to the emergency situation may be the same or different from the situation in which the error signal is generated. For example, system 100 may output an alert via a user interface, or stop execution on one or more cores or the entire system, or run an emergency exception handling routine.
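The emergency condition itself amounts to a simple check over the per-core states, sketched below; the CoreState enumeration is an assumption made purely for this illustration, and in the embodiments the decision is taken by the STS hardware:

    // Sketch of the emergency condition: every core is busy with a thread that
    // may not be interrupted.
    enum class CoreState { Idle, RunningNonCritical, RunningCritical, RunningExpiredCheck };

    bool must_raise_emergency(const CoreState* core_state, int num_cores) {
        for (int k = 0; k < num_cores; ++k) {
            if (core_state[k] == CoreState::Idle ||
                core_state[k] == CoreState::RunningNonCritical) {
                return false;   // this core can run, or be interrupted for, the check thread
            }
        }
        return true;            // no idle core and no interruptible non-critical thread
    }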
As a further alternative or additional feature, in embodiments, request buffer store 602 may hold a plurality of requests for different critical threads, and STS 608 may service the requests in order of priority, where the priority is determined based on how close each respective security window is to expiring. This reduces the chance that no available core can be found to service a given inspection thread by the end of its respective security window (and thus reduces the chance that an emergency will be raised).
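For illustration, this priority rule could be modelled in software as a queue ordered by how soon each security window expires; the CheckRequest structure and its field names below are assumptions made only for the sketch:

    #include <cstdint>
    #include <queue>
    #include <vector>

    // Hypothetical entry in the request buffer: one pending check-thread request.
    struct CheckRequest {
        int           critical_thread_id;
        std::uint64_t window_expiry;   // absolute time at which the security window expires
    };

    // Service the request whose security window expires soonest first.
    struct SoonerDeadlineFirst {
        bool operator()(const CheckRequest& a, const CheckRequest& b) const {
            return a.window_expiry > b.window_expiry;   // makes the priority_queue a min-heap
        }
    };

    using PendingChecks =
        std::priority_queue<CheckRequest, std::vector<CheckRequest>, SoonerDeadlineFirst>;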
With respect to capturing results, in an implementation, all output in terms of memory accesses performed by each of the primary and secondary instances of the critical thread is captured in the result buffer storage 610. Alternatively, however, the results may be stored in a condensed or alternative form to reduce the storage requirements of the result buffer storage 610. This may include, for example, storing the results in a compressed form and/or storing the architectural register state (e.g., program counter and/or other register value(s)) of the core in question instead of the complete memory accesses. An example of compression is a hash of the memory accesses. The results of the two thread instances CTp, CTs running on the two cores 102_0, 102_1 may then also be compared in their condensed (e.g., compressed) form, without needing to decompress or expand them. For example, in one particular implementation, the results may be stored and compared in the form of architectural state, such as the PC and register values, plus a hash of the memory accesses.
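As one possible, purely illustrative realisation of such a condensed result, a running hash could be accumulated over the memory accesses of each instance and the two hashes compared directly; the FNV-1a constants below are just one choice of hash and are not taken from the embodiments:

    #include <cstdint>

    // Running hash accumulated over the memory accesses of one thread instance,
    // so that only a single word per instance needs to be buffered and compared.
    struct AccessHash {
        std::uint64_t state = 14695981039346656037ull;   // FNV-1a offset basis

        void add_word(std::uint64_t word) {
            for (int i = 0; i < 8; ++i) {
                state ^= (word >> (8 * i)) & 0xffu;
                state *= 1099511628211ull;                // FNV-1a prime
            }
        }

        void add(std::uint64_t address, std::uint64_t data) {
            add_word(address);
            add_word(data);
        }
    };

    // The condensed results of the two instances are compared directly, without
    // expanding them back into the individual accesses.
    inline bool condensed_results_match(const AccessHash& primary, const AccessHash& check) {
        return primary.state == check.state;
    }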
As previously described, in an embodiment, both the inputs to and the outputs from the core (e.g., core 0) on which the safety-critical thread CTp is running may be captured, for example in main memory or a private memory, such that the inputs can be replayed identically to core 1 and the outputs can be compared with the outputs of core 1. The inputs/outputs of core 0 and core 1 are unlikely to change frequently, so it should be possible to achieve high compression of the inputs and outputs, reducing the space they occupy in memory and the area impact. In an embodiment, some or all of the outputs may be stored in compressed form in the result buffer storage 610. Regarding possible compression of the inputs, in an embodiment some inputs may be compressed and others not. For example, interrupt signalling may not need to be compressed, whereas data read from memory into the master core (which is also an input) may be compressed before being stored in the memory access queue.
Cores (such as CPUs) typically have many outputs, and capturing these outputs over many cycles may require significant memory, depending on the application and the compression available. Thus, by capturing the architectural register state when a thread completes, or at some other checkpoint, the amount of memory required by the system may be significantly reduced.
If the architectural state is used instead of the outputs of the core, the check is performed on that state. This provides many options for the PPA (power, performance, area) trade-off of core 1. For example, the microarchitecture and technology node of core 1 may be selected to minimize the power and area impact of fault detection.
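A minimal sketch of checking on architectural state rather than raw outputs is given below; the ArchState layout (PC, 32 general-purpose registers, plus an optional hash of memory traffic) is an assumption for illustration only:

    #include <array>
    #include <cstdint>

    // Hypothetical snapshot of the architectural state captured at a checkpoint
    // (e.g. on completion of the thread), used in place of the full output stream.
    struct ArchState {
        std::uint64_t pc = 0;                        // program counter
        std::array<std::uint64_t, 32> regs{};        // general-purpose register file
        std::uint64_t memory_access_hash = 0;        // condensed record of memory traffic

        bool operator==(const ArchState& other) const {
            return pc == other.pc && regs == other.regs &&
                   memory_access_hash == other.memory_access_hash;
        }
    };

    // The check then reduces to a single comparison per checkpoint.
    inline bool checkpoint_matches(const ArchState& primary, const ArchState& check) {
        return primary == check;
    }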
Fig. 5 shows, by way of example, another optional additional concept which may be used in combination with the basic concepts described in relation to figs. 4 and 6, either in addition to or independently of any of the further optional features discussed later (e.g. the options for responding to error signals, the possibility of raising an emergency signal, the prioritization of service requests, and the different options for the form in which results are stored and compared). The additional concept here is as follows. If a check thread CTs is scheduled to start executing before its respective security window SW expires, the check thread may be described as "eager" or "eagerly" executed, either because an idle core happened to be available to execute the check thread immediately when the corresponding master instance CTp began executing, or because a core became available later but still before the security window expired. In accordance with embodiments disclosed herein, the Secure Thread Scheduler (STS) 608 may allow additional flexibility for scheduling further threads around the execution of the eager check thread.
Fig. 5 shows an example of this. Here, STS 608 will allow the OS to interrupt the check thread CTs in order to run one or more further critical or non-critical threads (NC_N+2,1 in this example), provided that the interrupt occurs before the end of the security window SW and provided that the check thread is resumed again before the end of the security window. Or, in fact, the time limit for resuming execution of the check thread CTs may be extended beyond the security window SW by the amount of time the secondary instance of the critical thread has already taken to execute before being interrupted; i.e. the time limit for resumption is SW+τ (or in principle any time between SW and SW+τ). However, for simplicity of implementation, the time limit for resuming the check thread CTs may simply be taken as the expiry of the original security window SW.
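This rescheduling time limit can be expressed as a small calculation; the function and parameter names below are invented for the sketch, and the choice between options (a) and (b) is shown as a flag:

    #include <cstdint>

    // Rescheduling time limit for an eagerly executed check thread that has been
    // interrupted. `extend_by_progress` selects option (b), i.e. SW plus the time
    // the check thread had already executed before being interrupted.
    std::uint64_t resume_deadline(std::uint64_t critical_start,       // start of CTp
                                  std::uint64_t safety_window,        // SW
                                  std::uint64_t check_time_executed,  // tau
                                  bool extend_by_progress) {
        std::uint64_t deadline = critical_start + safety_window;      // option (a): SW
        if (extend_by_progress) {
            deadline += check_time_executed;                          // option (b): SW + tau
        }
        return deadline;   // any value between (a) and (b) is also permissible
    }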
The OS may interrupt the check thread CTs because the OS is not aware that the check thread has been scheduled. In an embodiment, the OS is only aware of the main thread. This assumption simplifies the OS design and allows an existing OS to be migrated to this system without much modification. In such an embodiment, STS 608, implemented in hardware, has the ability to interrupt both an existing check thread CTs and a non-critical thread NC. However, the STS cannot interrupt the main thread CTp. In a preferred embodiment, as much functionality as possible is located in hardware rather than in the OS or other software. Note that, if necessary, it is the OS that decides whether to interrupt the main thread.
Alternatively or additionally, a check thread CTs running on one core (e.g., 102_1) may be interrupted so as to run one or more further critical or non-critical threads on that core in its place, and the interrupted check thread may be migrated to run on a third core (e.g., 102_2). The third core 102_2 may be an idle core, or a core running another non-critical thread that can be interrupted in order to run the migrated check thread. In an embodiment, STS 608 may be configured to perform the migration automatically if the OS interrupts the check thread by scheduling another thread on the same core 102_1. Alternatively, the migration may be performed by the OS.
The reason for allowing an eager check thread to be interrupted in order to run another thread, rather than simply running that other thread on another core, is to provide flexibility to the OS. As described above, in an embodiment the OS is not aware of the check thread, and may actually need the resources on which the check thread is currently running. Furthermore, this gives more flexibility to move the check thread back and forth. Migration is possible because an interface is provided on the core that allows the STS to capture all of the CPU state; another interface allows the CPU state to be overridden. These special interfaces on the CPU allow the automatic switching to take place.
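A rough software model of the capture/override interfaces and the resulting migration is sketched below; the CpuState and CoreInterface types are assumptions made for illustration, whereas in the embodiments these are hardware interfaces on the core:

    #include <cstdint>
    #include <vector>

    // Hypothetical model of the interfaces described above: one to capture the
    // full CPU state of a core, and one to override the state of another core.
    struct CpuState {
        std::uint64_t pc = 0;
        std::vector<std::uint64_t> registers;
    };

    class CoreInterface {
    public:
        virtual ~CoreInterface() = default;
        virtual CpuState capture_state() = 0;                    // read out all CPU state
        virtual void override_state(const CpuState& state) = 0;  // load state into the core
        virtual void halt() = 0;
        virtual void resume() = 0;
    };

    // Migrate an eagerly executed check thread from one core to another.
    void migrate_check_thread(CoreInterface& from, CoreInterface& to) {
        from.halt();
        CpuState state = from.capture_state();
        to.override_state(state);
        to.resume();   // the check thread continues on the destination core
    }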
In an embodiment, rules for both eager and non-eager inspection threads may be summarized as follows.
The check thread CTs may be scheduled eagerly, i.e. before its scheduling timer expires. When this is done, the secure thread scheduler portion 608_k on which the check thread is scheduled becomes eager.
The eager thread may be interrupted by a non-critical/main/non-eager inspection thread.
The non-eager inspection thread is an inspection thread that is scheduled after the schedule timer expires.
The inspection thread may be partially executed in the eager mode and partially executed in the non-eager mode. There may be a gap between the two modes. Gaps may exist within the eager mode.
The eager mode may move across the CPU cores 102_0 … 102_K-1. If CPU 0 is eagerly executing a check thread and the OS causes it to exit the idle state, the STS automatically moves the thread to another idle CPU, or waits for an idle CPU to be found, or waits for the timer to expire.
An idle CPU becomes the target of eager execution.
The non-eager inspection thread cannot be interrupted.
Note that while the concept of eager execution, and the ability to interrupt an eager thread, have been described above in connection with the principle of delaying the initial scheduling of the check thread, this is not limiting. Whether the execution of the check thread is delayed by waiting for an idle execution unit, or the check thread starts in lockstep with its corresponding critical thread (possibly even in a system like that of fig. 2 in which the availability of two idle execution units is created artificially), or the check thread starts by interrupting another thread, the principle of interrupting an eagerly executed check thread can still be applied in order to create more scheduling opportunities after the initial scheduling of the check thread.
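Very roughly, the rules above could be modelled as the following scheduling sketch; all names (CheckMode, CheckThreadState, choose_migration_target, etc.) are invented for illustration, and in the embodiments the STS is a hardware block rather than software:

    #include <cstdint>
    #include <optional>

    enum class CheckMode { NotStarted, Eager, NonEager, Done };

    // Hypothetical per-request state tracked by a secure thread scheduler block.
    struct CheckThreadState {
        CheckMode     mode = CheckMode::NotStarted;
        std::uint64_t window_expiry = 0;   // end of the safety window SW
    };

    // A check thread scheduled before its timer expires runs eagerly; once the
    // timer has expired it is scheduled non-eagerly and may not be interrupted.
    void on_scheduled(CheckThreadState& s, std::uint64_t now) {
        s.mode = (now < s.window_expiry) ? CheckMode::Eager : CheckMode::NonEager;
    }

    bool may_interrupt(const CheckThreadState& s, std::uint64_t now) {
        return s.mode == CheckMode::Eager && now < s.window_expiry;
    }

    // When the OS takes the current core out of the idle state, an eager check
    // thread is moved to another idle core if one exists; otherwise it waits for
    // an idle core or for its timer to expire.
    std::optional<int> choose_migration_target(const bool* core_idle, int num_cores,
                                               int current_core) {
        for (int k = 0; k < num_cores; ++k) {
            if (k != current_core && core_idle[k]) {
                return k;
            }
        }
        return std::nullopt;   // no idle core: wait, or schedule non-eagerly at expiry
    }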
FIG. 7 illustrates a particular example implementation of the processing system of FIG. 6. Here, STS 608 comprises a separate respective STS block 608_0 … 608_K-1 for each core 102_0 … 102_K-1, and comparison logic 612 comprises a separate respective comparison block 612_0 … 612_K-1 associated with each core 102_0 … 102_K-1. The result buffer storage 610_0 … 610_K-1 comprises a memory access queue (MAQ) associated with each respective core. The data store 105 may comprise RAM and optionally an associated cache. The line 702_k shown from each comparison logic block 612_k to the connection between its respective core 102_k and interconnect 606 represents a bus interface. As in the case of fig. 6, a direct connection 601_k is also provided between each core 102_k and its corresponding result buffer storage (e.g. MAQ) 610_k, without the need to communicate via the interconnect 606. The core 102_k running the main thread CTp buffers its own respective results via its direct connection 601_k to its own result buffer storage 610_k, and the comparison logic 612_k associated with that core then compares the results of the main thread with the results of the secondary/check thread CTs received via the interconnect 606 (which are not necessarily buffered).
Fig. 11 illustrates a further optional extension of the disclosed principles, which may be used in combination with or independently of any of the other optional features disclosed herein.
In contrast, FIG. 10 shows a simpler implementation. In the arrangement of fig. 10, the result buffer storage 610 comprises a memory access queue in the form of a first-in-first-out (FIFO) buffer FIFO_0→1 from one core 102_0 to another core 102_1, and similarly FIFO_1→0 in the other direction. These are particular implementations of the MAQs 610_0 … 610_K-1 of FIG. 6. FIFO_0→1 buffers results from core 0 (102_0) for the comparison logic 612_1 of core 1 (102_1), and vice versa if needed in the opposite direction. Assume that core 0 executes a first critical thread CTp_A and then a second, subsequent critical thread CTp_B, and that the corresponding first check thread CTs_A and second check thread CTs_B are to run on core 1. In the arrangement shown in fig. 10 there is only a single queue in a given direction, which means that the results of CTp_B will be buffered in the queue behind the results of CTp_A, and thus the comparison logic block 612_1 will not be able to compare the results of the check thread CTs_B with those of CTp_B before the results of CTs_A have been compared with those of CTp_A. Thus, the second check thread CTs_B cannot be scheduled before the first, CTs_A. However, the second check thread CTs_B may have a much shorter corresponding security window, and thus it may be desirable to be able to prioritize the execution of CTs_B, and the checking of its results against CTp_B, over the execution of CTs_A and the checking of its results against CTp_A.
Fig. 11 shows an alternative implementation that allows this. Here, the memory access queue in a given direction between at least one pair of cores (e.g., from core 0 to core 1 in the illustrated example) comprises at least two parallel FIFOs: FIFO_0,0→1 and FIFO_1,0→1. In this way, results from CTp_B may be buffered in parallel with results from CTp_A, and it therefore becomes possible to run and check CTs_B before CTs_A. In other words, the check thread CTs_B corresponding to the later critical thread CTp_B may "overtake" the check thread CTs_A corresponding to the earlier critical thread CTp_A if the security window of CTs_B is shorter (i.e., if checking CTp_B is more urgent than checking CTp_A, even though CTp_A started earlier). The principle of parallel queues may be replicated in both directions and across all cores, but for simplicity of illustration only one direction between one pair of cores is shown in fig. 11.
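A minimal software model of this parallel-FIFO memory access queue is sketched below, assuming invented Fifo and ParallelMaq types; it only illustrates how the results of different critical threads can be buffered side by side rather than in a single queue:

    #include <cstdint>
    #include <deque>
    #include <vector>

    // One FIFO holds the (possibly condensed) results of a single critical thread.
    struct Fifo {
        int thread_id = -1;                    // which critical thread this FIFO holds
        std::deque<std::uint64_t> results;
    };

    // Memory access queue in one direction, split into several parallel FIFOs so
    // that results of a later critical thread (e.g. CTp_B) can be buffered
    // alongside those of an earlier one (CTp_A) and checked first if required.
    class ParallelMaq {
    public:
        explicit ParallelMaq(int num_fifos) : fifos_(num_fifos) {}

        // Results of different critical threads go into different FIFOs, so one
        // thread's results no longer queue behind another's.
        Fifo* fifo_for(int thread_id) {
            for (auto& f : fifos_) {
                if (f.thread_id == thread_id) return &f;
            }
            for (auto& f : fifos_) {
                if (f.thread_id < 0) { f.thread_id = thread_id; return &f; }
            }
            return nullptr;   // no free FIFO: the new thread's results must wait
        }

    private:
        std::vector<Fifo> fifos_;
    };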
FIG. 8 illustrates a computer system in which the graphics processing system described herein may be implemented. The computer system comprises a CPU 802 and memory 806, and may also comprise a GPU 804, a Neural Network Accelerator (NNA) 808, and/or other devices 814 such as a display 816, speakers 818 and a camera 819. A processing block 810 (corresponding to processing block 600 in fig. 6 or fig. 7) is implemented on the CPU 802. In other examples, one or more of the depicted components may be omitted from the system, and/or the processing block 810 may be implemented on the GPU 804 or the NNA 808. The components of the computer system can communicate with each other via a communications bus 820. A storage area 812 is implemented as part of the memory 806. The memory 806 or storage 812 in fig. 8 may represent one or more memory devices employing one or more storage media, e.g. electronic media such as ROM, EEPROM, flash memory, RAM or solid-state drives, or magnetic media such as hard disk drives. Some or all of the program memory 104, from which the program threads are retrieved, may be implemented in the storage area 812, or in an internal memory of the processing block 810, or a combination thereof. Similarly, the data store 105 may be implemented in the storage area 812, or in an internal memory of the processing block 810, or a combination thereof. The request buffer 602 and result buffer 610 are preferably implemented in an internal memory of the processing block 810, such as a local RAM or dedicated memory, but could in principle be implemented in the storage 812, or a combination thereof.
The processing systems of figs. 1 and 6 to 8 are shown as comprising a number of functional blocks. This is schematic only and is not intended to define a strict division between different logic elements of such entities. Each functional block may be provided in any suitable manner. It is to be understood that intermediate values described herein as being formed by a processing system need not be physically generated by the processing system at any point, and may merely represent logical values which conveniently describe the processing performed by the processing system between its inputs and outputs.
The processing systems described herein may be embodied in hardware on an integrated circuit. The processing systems described herein may be configured to perform any of the methods described herein. In general, unless otherwise indicated, any of the functions, methods, techniques or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry) or any combination thereof. In general, the terms "module," "functionality," "component," "element," "unit," "block," and "logic" may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs specified tasks when executed on a processor. The algorithms and methods described herein may be executed by one or more processors executing code that causes the processors to perform the algorithms/methods. Examples of a computer-readable storage medium include Random Access Memory (RAM), read-only memory (ROM), optical disks, flash memory, hard disk memory, and other memory devices that can store instructions or other data using magnetic, optical, and other techniques and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for a processor, including code expressed in a machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, byte code, code defining an integrated circuit (e.g., a hardware description language or netlist), and code expressed in programming language code such as C, java or OpenCL. The executable code may be, for example, any kind of software, firmware, script, module, or library that, when properly executed, handled, interpreted, compiled, run in a virtual machine or other software environment, causes the processor of the computer system supporting the executable code to perform the tasks specified by the code.
The processor, computer, or computer system may be any kind of device, machine, or special purpose circuit, or a collection or portion thereof, that has processing capabilities such that instructions can be executed. The processor may be any kind of general purpose or special purpose processor such as a CPU, GPU, system on a chip, state machine, media processor, application Specific Integrated Circuit (ASIC), programmable logic array, field Programmable Gate Array (FPGA), etc. The computer or computer system may include one or more processors.
The present invention is also intended to cover software defining the configuration of hardware as described herein, such as Hardware Description Language (HDL) software, for designing integrated circuits or for configuring programmable chips to perform desired functions. That is, a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition data set, which when processed (i.e., run) in an integrated circuit manufacturing system, configures the system to manufacture a processing system configured to perform any of the methods described herein, or to manufacture a processing system comprising any of the devices described herein, may be provided. The integrated circuit definition data set may be, for example, an integrated circuit description.
Accordingly, a method of manufacturing a processing system as described herein at an integrated circuit manufacturing system may be provided. Furthermore, an integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, causes a method of manufacturing the processing system to be performed.
The integrated circuit definition data set may be in the form of computer code, for example, as a netlist, code for configuring a programmable chip, as a hardware description language defining a hardware suitable for fabrication at any level in an integrated circuit, including as Register Transfer Level (RTL) code, as a high-level circuit representation (such as Verilog or VHDL), and as a low-level circuit representation (such as OASIS (RTM) and GDSII). A higher-level representation (e.g., RTL) that logically defines hardware suitable for fabrication in an integrated circuit may be processed at a computer system configured to generate fabrication definitions for the integrated circuit in the context of a software environment that includes definitions of circuit elements and rules for combining these elements to generate fabrication definitions for the integrated circuit so defined by the representation. As is typically the case when software is executed at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required to configure the computer system to generate a manufacturing definition for an integrated circuit to execute code that defines the integrated circuit to generate the manufacturing definition for the integrated circuit.
An example of processing an integrated circuit definition data set at an integrated circuit manufacturing system to configure the system as a manufacturing processing system will now be described with reference to fig. 9.
Fig. 9 illustrates an example of an Integrated Circuit (IC) fabrication system 902 configured to fabricate a processing system as described in any of the examples herein. In particular, IC manufacturing system 902 includes layout processing system 904 and integrated circuit generation system 906. The IC fabrication system 902 is configured to receive an IC definition data set (e.g., defining a processing system as described in any of the examples herein), process the IC definition data set, and generate an IC (e.g., embodying the processing system as described in any of the examples herein) from the IC definition data set. Processing of the IC definition data set configures IC fabrication system 902 to fabricate an integrated circuit embodying the processing system as described in any of the examples herein.
Layout processing system 904 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and may for example involve synthesising RTL code to determine a gate-level representation of the circuit to be generated, e.g. in terms of logical components (e.g. NAND, NOR, AND, OR, MUX and FLIP-FLOP components). A circuit layout can be determined from the gate-level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimise the circuit layout. When the layout processing system 904 has determined the circuit layout, it may output a circuit layout definition to the IC generation system 906. A circuit layout definition may be, for example, a circuit layout description.
As is known in the art, the IC generation system 906 generates ICs according to a circuit layout definition. For example, the IC generation system 906 may implement a semiconductor device fabrication process to generate ICs, which may involve a multi-step sequence of photolithography and chemical processing steps during which electronic circuits are gradually formed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask that may be used in a lithographic process to generate an IC from the circuit definition. Alternatively, the circuit layout definitions provided to the IC generation system 906 may be in the form of computer readable code that the IC generation system 906 may use to form an appropriate mask for generating the IC.
The different processes performed by IC fabrication system 902 may all be implemented in one location, e.g., by a party. Alternatively, IC fabrication system 902 may be a distributed system such that some processes may be performed at different locations and by different parties. For example, some of the following phases may be performed at different locations and/or by different parties: (i) Synthesizing an RTL code representing the IC definition dataset to form a gate level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate level representation; (iii) forming a mask according to the circuit layout; and (iv) using the mask to fabricate the integrated circuit.
In other examples, processing of the integrated circuit definition data set at the integrated circuit manufacturing system may configure the system to manufacture the processing system without processing the integrated circuit definition data set to determine the circuit layout. For example, an integrated circuit definition dataset may define a configuration of a reconfigurable processor, such as an FPGA, and processing of the dataset may configure the IC manufacturing system to generate (e.g., by loading configuration data into the FPGA) the reconfigurable processor having the defined configuration.
In some embodiments, the integrated circuit manufacturing definition data set, when processed in the integrated circuit manufacturing system, may cause the integrated circuit manufacturing system to produce an apparatus as described herein. For example, configuration of an integrated circuit manufacturing system by an integrated circuit manufacturing definition dataset in the manner described above with reference to fig. 9 may cause an apparatus as described herein to be manufactured.
In some examples, the integrated circuit definition dataset may include software running on or in combination with hardware defined at the dataset. In the example illustrated in fig. 9, the IC generation system may also be configured by the integrated circuit definition data set to load firmware onto the integrated circuit in accordance with program code defined at the integrated circuit definition data set at the time of manufacturing the integrated circuit, or otherwise provide the integrated circuit with program code for use with the integrated circuit.
The implementation of the concepts set forth in the present disclosure in devices, apparatuses, modules, and/or systems (and in methods implemented herein) may result in performance improvements when compared to known implementations. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During the manufacture of such devices, apparatuses, modules and systems (e.g., in integrated circuits), a tradeoff may be made between performance improvements and physical implementation, thereby improving the manufacturing method. For example, a tradeoff can be made between performance improvement and layout area, matching the performance of a known implementation, but using less silicon. This may be accomplished, for example, by reusing the functional blocks in a serial fashion or sharing the functional blocks among elements of an apparatus, device, module, and/or system. Rather, the concepts described herein that lead to improvements in the physical implementation of devices, apparatus, modules and systems (e.g., reduced silicon area) can be weighed against performance improvements. This may be accomplished, for example, by fabricating multiple instances of the module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
According to one aspect disclosed herein, a processing system is provided.
The processing system includes: a plurality of parallel execution units, each parallel execution unit operable to execute a respective series of threads, wherein at least some of the threads executed by at least some of the execution units are non-critical threads that are not designated as critical; a request buffer operable to receive a request indicating that one of the threads in a respective series executed by a first one of the execution units is designated as a critical thread; a secure thread scheduling circuit arranged to read the request from the request buffer store and, in response, schedule an inspection thread that is a copy of the critical thread to execute on a second execution unit of the plurality of execution units other than the first execution unit; a result buffer storage arranged to buffer one of: a first result, the first result being a result of executing the critical thread on the first execution unit; and a second result, the second result being a result of executing the inspection thread on the second execution unit; and a comparison circuit arranged to compare the one of the first and second results from the result buffer with the other of the first and second results and to generate an error signal if the first and second results do not match in accordance with the comparison; wherein the request includes an indication of a secure time window; and wherein the secure thread scheduling circuit is configured to detect when at least one of the execution units is idle and if none of the execution units is detected as idle at the expiration of the secure time window, to interrupt one of the non-critical threads executing on a non-idle one of the execution units and to select the non-idle execution unit as the second execution unit to execute the checking thread instead of the interrupted thread.
In an embodiment, the secure thread scheduling circuitry may be configured to: if one of the execution units is detected as idle when the request is read from the request buffer store, then immediately selecting the idle one of the execution units as the second execution unit to begin executing the inspection thread; however, if none of the execution units is detected as idle upon reading the request, waiting and detecting if one of the execution units becomes newly idle before the expiration of the secure time window, and if one of the execution units becomes newly idle, then selecting the newly idle execution unit as the second execution unit to begin executing the inspection thread; but if none of the execution units becomes idle at the expiration of the secure time window, the interrupt to one of the non-critical threads executing on a non-idle one of the execution units is executed and the non-idle execution unit is selected as the second execution unit to begin executing the check thread instead of the interrupted non-idle thread.
In an embodiment, the secure thread scheduling circuitry may be configured to: allowing the inspection thread to be interrupted if the inspection thread is executed urgently, and allowing the inspection thread to be executed urgently if the execution of the inspection thread begins before the secure time window expires; the interrupting of the inspection thread includes scheduling one or more additional critical or non-critical threads to execute on the second execution unit in place of the inspection thread.
In an embodiment, the secure thread scheduling circuitry may be configured to: interrupting one of the one or more further threads by resuming the execution of the inspection thread on the second execution unit if the one or more further threads have not completed at the expiration of the rescheduling time limit, otherwise resuming the execution of the inspection thread after the one or more further threads have completed; the rescheduling time limit is: a) expiration of a secure time window, b) expiration of a secure time window plus any time it has taken to execute the inspection thread, or c) a time between a) and b).
In an embodiment, the secure thread scheduling circuitry may be configured to: when the urgently executed inspection thread is interrupted by another thread, the urgently executed inspection thread is migrated to another execution unit other than the first execution unit and the second execution unit among the execution units.
In an embodiment, the secure thread scheduling circuitry may be configured to raise an emergency situation if, before the secure time window expires, none of the execution units becomes idle nor is any execution unit found to be executing a non-critical thread.
In an embodiment, the first result may include an indication of a memory access performed by the critical thread and the second result includes an indication of a memory access performed by the inspection thread.
In embodiments, the first result and the second result may be output and compared in compressed form.
In an embodiment, the processing system may be configured to perform any of the following in response to the error signal: A) output a warning through a user interface; or B) disable the first execution unit and/or the second execution unit, but continue to execute threads on at least one remaining one of the execution units; or C) stop execution across the entire processing system; or D) execute the critical thread and the inspection thread again, at least once, on the same or different ones of the execution units and repeat the comparison each time, and then perform one of A)-C) if the repeated comparison still generates the error signal.
In an embodiment, the request buffer store may be operable to buffer a plurality of requests, each request indicating that a respective one of the threads executing on a respective first one of the execution units is to be classified as a critical thread, wherein each request includes an indication of a respective secure time window; wherein the thread scheduling circuitry is configured to schedule execution of a respective inspection thread on a respective second one of the execution units other than the respective first execution unit, the respective inspection thread being a copy of a respective critical thread; the result buffer storage is arranged to buffer at least a respective first result of each respective critical thread or a respective second result of each respective inspection thread; the comparison circuit is configured to compare each second result with a respective first result and to generate an error signal if the respective second result does not match the respective first result; and the secure thread scheduling circuitry is configured to schedule the inspection threads in order of priority, wherein the priority is determined according to the extent to which the respective secure time window is near expired.
In an embodiment, the result buffer storage may comprise a respective memory access queue for each of the plurality of execution units, and each memory access queue comprises a plurality of FIFOs for buffering the results of different threads executing on the same execution unit.
According to another aspect disclosed herein, there may be provided a method comprising: scheduling a respective thread series to execute on each of a plurality of parallel execution units, wherein at least some of the threads executed by at least some of the execution units are non-critical threads that are not designated as critical; receiving a request indicating that one of the threads in the respective series executed by a first one of the execution units is designated as a critical thread; scheduling, in response to the request, a check thread to execute on a second execution unit of the plurality of execution units other than the first execution unit, the check thread being a copy of the critical thread; buffering one of the following: a first result, the first result being a result of executing the critical thread on the first execution unit; and a second result, the second result being a result of executing the inspection thread on the second execution unit; comparing a first result with a second result, the first result being a result of executing the critical thread on the first execution unit and the second result being a result of executing the inspection thread on the second execution unit; and detecting whether the first result and the second result match based on the comparison; wherein the request includes an indication of a secure time window; and wherein the method further comprises: detecting when at least one of the execution units is idle and interrupting one of the non-critical threads executing on non-idle ones of the execution units when none of the execution units is detected to be idle upon expiration of the secure time window, and selecting the non-idle execution unit as the second execution unit to execute the inspection thread in place of the interrupted thread.
According to another aspect disclosed herein, there is provided a secure thread scheduler configured to schedule an inspection thread for a critical thread running on one of a plurality of execution units, the inspection thread being a copy of the critical thread; the secure thread scheduler is configured to schedule the inspection thread to begin running on a second execution unit of the plurality of execution units before a secure time window for scheduling the inspection thread ends; and the secure thread scheduler is further configured to allow the inspection thread to be interrupted by another thread when the inspection thread is running on the second execution unit of the plurality of execution units, and to reschedule the inspection thread to resume when a reschedule time limit expires.
In an embodiment, the secure thread scheduler may be configured to resume the execution of the inspection thread by interrupting a further thread, other than the critical thread, running on an execution unit other than the first execution unit, thereby resuming the inspection thread on the execution unit on which that further thread was running.
In an embodiment, the secure thread scheduler may be operable such that the interrupted further thread is a thread on the second execution unit, such that the inspection thread resumes on the second execution unit.
In an embodiment, the secure thread scheduler may be operable such that the interrupted further thread is a thread, other than the critical thread and said other thread, running on a third one of the execution units other than the first and second execution units, such that the inspection thread is migrated to the third execution unit.
In an embodiment, the rescheduling time period may be: a) expiration of a secure time window of a predetermined length running from a point in time when execution of the critical thread begins, or b) expiration of the secure time window plus any time it has taken to execute the inspection thread, or c) a time between a) and b).
In an embodiment, the secure thread scheduler may be operable to initially schedule the inspection thread to begin execution at any time between the point in time when execution of the critical thread begins and the expiration of the secure time window before being interrupted by the other thread.
In an embodiment, the secure thread scheduler may be configured to perform an initial scheduling of the inspection thread by: i) If at least one of the execution units is detected as idle at a point in time when execution of the critical thread starts, selecting an idle one of the execution units as the second execution unit to start executing the inspection thread; but ii) if none of the execution units are detected as idle at the point in time when execution of the critical thread starts, waiting and detecting if one of the execution units becomes newly idle before the expiration of the secure time window, and if one of the execution units becomes newly idle, then immediately selecting the newly idle execution unit as the second execution unit to start executing the inspection thread; but iii) if none of the execution units becomes idle upon expiration of the secure time window, interrupting a non-critical thread executing on a non-idle one of the execution units and selecting the non-idle execution unit as the second execution unit to begin executing the inspection thread.
In an embodiment, the secure thread scheduler may be configured to raise an emergency situation if, before the secure time window expires, none of the execution units becomes idle nor is any execution unit found to be executing a non-critical thread.
In an embodiment, the secure thread scheduler may be configured to: the rescheduling of the inspection thread is performed by migrating the inspection thread to a third execution unit of the execution units other than the first execution unit and the second execution unit, thereby resuming execution of the inspection thread on the third execution unit.
In an embodiment, the other thread may be a thread other than the inspection thread.
In an embodiment, the other thread may be a critical thread. Alternatively, the other thread may be a non-critical thread.
In an embodiment, a processing system may be provided that includes the execution unit and the secure thread scheduler.
The processing system may further include a comparison circuit configured to compare a first result with a second result, the first result being a result of executing the critical thread on the first execution unit and the second result being a result of executing the inspection thread on the second execution unit; wherein the comparison circuit is configured to generate an error signal if the first result and the second result do not match in accordance with the comparison.
In an embodiment, the first result may include an indication of a memory access performed by the critical thread, and the second result may include an indication of a memory access performed by the inspection thread.
In embodiments, the first result and the second result may be output and compared in compressed form.
In an embodiment, the processing system may be configured to perform any of the following in response to the error signal: A) output a warning through a user interface; or B) disable the first execution unit and/or the second execution unit, but continue to execute threads on at least one remaining one of the execution units; or C) stop execution across the entire processing system; or D) execute the critical thread and the inspection thread again, at least once, on the same or different ones of the execution units and repeat the comparison each time, and then perform one of A)-C) if the repeated comparison still generates the error signal.
According to another aspect disclosed herein, there may be provided a method comprising: scheduling a check thread for a critical thread running on one of a plurality of execution units, the check thread being a copy of the critical thread, wherein scheduling the check thread includes scheduling the check thread to begin running on a second of the plurality of execution units before a secure time window for scheduling the check thread ends; and interrupting the inspection thread by another thread when the inspection thread is running on the second execution unit of the plurality of execution units, and rescheduling the inspection thread resumes when a rescheduling time limit expires.
According to further aspects disclosed herein, a corresponding method of operating a processing system, and a corresponding computer program configured to operate a processing system, may be provided. According to yet further aspects, a corresponding method of manufacturing a processing system, a corresponding manufacturing facility arranged to manufacture a processing system, and a corresponding circuit design dataset embodied on a computer readable storage device may be provided.
For example, according to one aspect, a non-transitory computer-readable storage medium may be provided having stored thereon a computer-readable description of a processing system of any embodiment herein, which when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to: processing a computer readable description of a processing system using a layout processing system to generate a circuit layout description of an integrated circuit embodying the processing system; and using the integrated circuit generation system, manufacturing the processing system according to the circuit layout description.
According to another aspect, there may be provided an integrated circuit manufacturing system including: a non-transitory computer readable storage medium having stored thereon a computer readable description of a processing system of any of the embodiments disclosed herein; a layout processing system configured to process a computer readable description to generate a circuit layout description of an integrated circuit embodying the processing system; and an integrated circuit generation system configured to manufacture the processing system according to the circuit layout description.
According to another aspect, there may be provided a method of manufacturing a processing system of any of the embodiments disclosed herein using an integrated circuit manufacturing system, the method comprising: processing the computer readable description of the circuit using a layout processing system to generate a circuit layout description of an integrated circuit embodying the processing system; and using the integrated circuit generation system, manufacturing the processing system according to the circuit layout description.
According to another aspect, a layout processing system may be provided that is configured to determine location information of logic components of a circuit derived from an integrated circuit description in order to generate a circuit layout description of an integrated circuit embodying the processing system of any of the embodiments disclosed herein.
Other variations, implementations, and/or applications of the disclosed techniques may become apparent to those skilled in the art once the disclosure is presented herein. The scope of the present disclosure is not limited by the embodiments described above, but is only limited by the claims.

Claims (19)

1. A processing system, the processing system comprising:
A plurality of parallel execution units, each parallel execution unit operable to execute a respective series of threads, wherein at least some of the threads executed by at least some of the execution units are non-critical threads that are not designated as critical;
A request buffer operable to receive a request indicating that one of the threads in a respective series executed by a first one of the execution units is designated as a critical thread;
A secure thread scheduling circuit arranged to read the request from the request buffer store and, in response, schedule an inspection thread that is a copy of the critical thread to execute on a second execution unit of the plurality of execution units other than the first execution unit;
A result buffer storage arranged to buffer one of: a first result, the first result being a result of executing the critical thread on the first execution unit; and a second result, the second result being a result of executing the inspection thread on the second execution unit; and
A comparison circuit arranged to compare the one of the first and second results from the result buffer with the other of the first and second results and to generate an error signal if the first and second results do not match in accordance with the comparison;
wherein the request includes an indication of a secure time window; and
Wherein the secure thread scheduling circuit is configured to detect when at least one of the execution units is idle and if none of the execution units is detected as idle at the expiration of the secure time window, to interrupt one of the non-critical threads executing on a non-idle one of the execution units and to select the non-idle execution unit as the second execution unit to execute the checking thread instead of the interrupted thread.
2. The processing system of claim 1, wherein the secure thread scheduling circuitry is configured to: if one of the execution units is detected as idle when the request is read from the request buffer store, then immediately selecting the idle one of the execution units as the second execution unit to begin executing the inspection thread; however, if none of the execution units is detected as idle upon reading the request, waiting and detecting if one of the execution units becomes newly idle before the expiration of the secure time window, and if one of the execution units becomes newly idle, then selecting the newly idle execution unit as the second execution unit to begin executing the inspection thread; but if none of the execution units becomes idle at the expiration of the secure time window, the interrupt to one of the non-critical threads executing on a non-idle one of the execution units is executed and the non-idle execution unit is selected as the second execution unit to begin executing the check thread instead of the interrupted non-idle thread.
3. The processing system of claim 2, wherein the secure thread scheduling circuitry is configured to: allowing the inspection thread to be interrupted if the inspection thread is executed urgently, and allowing the inspection thread to be executed urgently if the execution of the inspection thread begins before the secure time window expires; the interrupting of the inspection thread includes scheduling one or more additional critical or non-critical threads to execute on the second execution unit in place of the inspection thread.
4. The processing system of claim 3, wherein the secure thread scheduling circuitry is configured to: interrupting one of the one or more further threads by resuming the execution of the inspection thread on the second execution unit if the one or more further threads have not completed at the expiration of the rescheduling time limit, otherwise resuming the execution of the inspection thread after the one or more further threads have completed; the rescheduling time limit is:
a) expiration of the secure time window,
b) expiration of the secure time window plus any time it has taken to execute the inspection thread, or
c) a time between a) and b).
5. The processing system of claim 3, wherein the secure thread scheduling circuitry is configured to: and when the urgently executed checking thread is interrupted by another thread, migrating the urgently executed checking thread to another executing unit except the first executing unit and the second executing unit in the executing units.
6. The processing system of any of claims 2 to 5, wherein the secure thread scheduling circuitry is configured to: raise an emergency situation if, before the secure time window expires, none of the execution units becomes idle nor is any execution unit found to be executing a non-critical thread.
7. The processing system of any of claims 1 to 5, wherein the first result comprises an indication of a memory access performed by the critical thread and the second result comprises an indication of a memory access performed by the inspection thread.
8. The processing system of any of claims 1 to 5, wherein the first result and the second result are output and compared in compressed form.
9. The processing system of any of claims 1 to 5, the processing system configured to, in response to the error signal:
A) Outputting a warning through a user interface, or
B) Disabling the first execution unit and/or the second execution unit but continuing to execute threads on at least one remaining execution unit of the execution units, or
C) Stopping execution across the entire processing system, or
D) Executing the critical thread and the inspection thread again at least once on the same or different ones of the execution units and repeating the comparison each time, and then executing one of A)-C) if the repeated comparison still produces the error signal.
10. The processing system of any of claims 1 to 5, wherein:
The request buffer is operable to buffer a plurality of requests, each request indicating that a respective one of the threads executing on a respective first one of the execution units is to be classified as a critical thread, wherein each request includes an indication of a respective secure time window;
the thread scheduling circuitry is configured to schedule execution of a respective inspection thread on a respective second one of the execution units other than the respective first execution unit, the respective inspection thread being a copy of a respective critical thread;
The result buffer storage is arranged to buffer at least a respective first result of each respective critical thread or a respective second result of each respective inspection thread;
The comparison circuit is configured to compare each second result with a respective first result and to generate an error signal if the respective second result does not match the respective first result; and
The secure thread scheduling circuitry is configured to schedule the inspection threads in order of priority, wherein the priority is determined according to a degree to which the respective secure time window is near expired.
11. The processing system of claim 10, wherein the result buffer storage includes a respective memory access queue for each of the plurality of execution units, and each memory access queue includes a plurality of FIFOs for buffering results of different threads executing on the same execution unit.
12. The processing system of any of claims 1 to 5, wherein the processing system is embodied in hardware on an integrated circuit.
13. A method of manufacturing the processing system of any of claims 1 to 5 using an integrated circuit manufacturing system.
14. An integrated circuit definition data set that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture the processing system of any of claims 1 to 5.
15. A computer readable storage medium having stored thereon a computer readable description of a processing system according to any of claims 1 to 5, which when processed in an integrated circuit manufacturing system causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the processing system.
16. An integrated circuit fabrication system configured to fabricate the processing system of any of claims 1 to 5.
17. A method, the method comprising:
Scheduling a respective series of threads to execute on each of a plurality of parallel execution units, wherein at least some of the threads executed by at least some of the execution units are non-critical threads that are not designated as critical;
Receiving a request indicating that one of the threads in the respective series executed by a first one of the execution units is designated as a critical thread;
Scheduling, in response to the request, a check thread to execute on a second execution unit of the plurality of execution units other than the first execution unit, the check thread being a copy of the critical thread;
Buffering one of the following: a first result, the first result being a result of executing the critical thread on the first execution unit; and a second result, the second result being a result of executing the inspection thread on the second execution unit;
Comparing the first result with the second result; and
Detecting whether the first result and the second result match based on the comparison;
wherein the request includes an indication of a secure time window; and
Wherein the method further comprises: detecting when at least one of the execution units is idle; and, when none of the execution units is detected to be idle upon expiration of the secure time window, interrupting one of the non-critical threads executing on a non-idle one of the execution units and selecting that non-idle execution unit as the second execution unit to execute the inspection thread in place of the interrupted thread.
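For illustration only, the overall selection step of the method of claim 17 could look like the following software sketch: the inspection thread runs on an idle unit if one appears, and otherwise a non-critical thread is preempted once the secure time window has expired. All structure and function names are assumptions of this sketch and do not appear in the claims.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct Unit {
    bool idle = true;
    bool running_non_critical = false;
};

struct CheckRequest {
    std::size_t first_unit;     // unit executing the critical thread
    uint64_t    deadline;       // request time + secure time window
};

// Called on each scheduling event with the current time; returns the unit on
// which to run the inspection thread, or nothing if the scheduler should wait.
std::optional<std::size_t> pick_second_unit(const std::vector<Unit>& units,
                                            const CheckRequest& req, uint64_t now) {
    // Prefer an idle unit other than the one running the critical thread.
    for (std::size_t i = 0; i < units.size(); ++i)
        if (i != req.first_unit && units[i].idle)
            return i;

    // No idle unit: only once the secure window has expired, preempt a
    // non-critical thread and take over its unit for the inspection thread.
    if (now >= req.deadline) {
        for (std::size_t i = 0; i < units.size(); ++i)
            if (i != req.first_unit && units[i].running_non_critical)
                return i;   // caller interrupts this unit's non-critical thread
    }
    return std::nullopt;    // keep waiting for an idle unit
}
```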
18. A graphics processing system configured to perform the method of claim 17.
19. A computer readable storage medium having computer readable code encoded thereon, the computer readable code configured to cause the method of claim 17 to be performed when the code is run.
CN202311680425.7A 2022-12-21 2023-12-08 Scheduling of duplicate threads Pending CN118227278A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2219359.3 2022-12-21
GB2219357.7 2022-12-21

Publications (1)

Publication Number Publication Date
CN118227278A (en) 2024-06-21

Similar Documents

Publication Publication Date Title
US8695002B2 (en) Multi-threaded processors and multi-processor systems comprising shared resources
Kritikakou et al. Distributed run-time WCET controller for concurrent critical tasks in mixed-critical systems
KR20210011451A (en) Embedded scheduling of hardware resources for hardware acceleration
US20040243738A1 (en) Method for asynchronous DMA command completion notification
RU2437144C2 (en) Method to eliminate exception condition in one of nuclei of multinuclear system
US10120712B2 (en) Instruction pre-fetching
JP2005056067A (en) Dma transfer controller
US7617389B2 (en) Event notifying method, event notifying device and processor system permitting inconsistent state of a counter managing number of non-notified events
US10545890B2 (en) Information processing device, information processing method, and program
JP2014191655A (en) Multiprocessor, electronic control device, and program
CN104850788A (en) Multiprocessor system
CN105474174B (en) Controlling time-intensive instructions
JP2003029988A (en) Task scheduling system and method, program
JP5699896B2 (en) Information processing apparatus and abnormality determination method
EP4390686A1 (en) Scheduling of duplicate threads
US11940866B2 (en) Verifying processing logic of a graphics processing unit
CN118227278A (en) Scheduling of duplicate threads
CN118227279A (en) Scheduling of duplicate threads
GB2619990A (en) Scheduling of duplicate threads
GB2619989A (en) Scheduling of duplicate threads
JP7204443B2 (en) VEHICLE CONTROL DEVICE AND PROGRAM EXECUTION METHOD
WO2019188177A1 (en) Information processing device
US20080282072A1 (en) Executing Software Within Real-Time Hardware Constraints Using Functionally Programmable Branch Table
Tabish Next-generation safety-critical systems using COTS based homogeneous multi-core processors and heterogeneous MPSoCS
US11461134B2 (en) Apparatus and method for deferral scheduling of tasks for operating system on multi-core processor

Legal Events

Date Code Title Description
PB01 Publication