WO2022218510A1 - Method for scheduling software tasks on at least one heterogeneous processing system using a backend computer system, wherein each processing system is situated in a respective vehicle of a vehicle fleet, and control framework for at least one vehicle - Google Patents


Info

Publication number
WO2022218510A1
Authority
WO
WIPO (PCT)
Prior art keywords
tasks
data
processing
scheduler
processing system
Prior art date
Application number
PCT/EP2021/059568
Other languages
French (fr)
Inventor
Malek Naffati
Original Assignee
Cariad Se
Priority date
Filing date
Publication date
Application filed by Cariad Se filed Critical Cariad Se
Priority to EP21719102.2A (EP4298513A1)
Priority to PCT/EP2021/059568 (WO2022218510A1)
Publication of WO2022218510A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Definitions

  • One embodiment comprises that a process model of the executed tasks is generated that describes or is adapted to the analysis data.
  • the model may provide a statistical description (e.g. mean value and variance) for the respective performance value of at least one performance attribute (e.g. performance duration and/or power consumption and/or number of hazards) for the respective task.
  • a software developer may be supported in developing an update of the at least one software application program.
  • a simulator of the system type is configured using the analysis data and a simulation of the execution of the tasks is performed and estimates for the analysis data are generated for at least one predefined situation data set that describes a respective driving situation.
  • the simulator can simulate a driving situation that, for example, has not been experienced by any of the real processing systems (or by the single real processing system, if only one is used). This makes it possible to optimize or adapt or update the scheduling plan even for driving situations that the respective vehicle has not experienced so far. This prevents difficulties, as the scheduling plan can be prepared for such a driving situation before the vehicle actually experiences it.
  • the scheduling plan is generated as an acyclic graph from the task data.
  • Such an acyclic graph has proven to be a very reliable means for defining the scheduling plan.
  • the edges of the graph can comprise weights that are set according to the performance values as they are described by the analysis data.
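As an illustration only (not taken from the claimed method), the following minimal Python sketch builds such an acyclic graph for an exemplary ingest/process/output chain and derives a valid execution order; the task names and weight values are invented, and the weights stand in for measured performance values from the analysis data.

```python
from graphlib import TopologicalSorter

# Node -> set of predecessors, following the concatenation of the tasks.
graph = {"process": {"ingest"}, "output": {"process"}}

# Edge weights as they might be set from measured performance values
# (e.g. execution durations in milliseconds described by the analysis data).
weights = {("ingest", "process"): 4.2, ("process", "output"): 1.3}

# A topological order of the acyclic graph is a valid execution order.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['ingest', 'process', 'output']
```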
  • One embodiment comprises that the tasks are executed asynchronously on the respective processing unit. This makes it possible to handle a heterogeneous processing system on the basis of the described method.
  • the execution is thus triggered to run asynchronously (i.e. independently from a central CPU), but the scheduler may receive feedback from the respective processing unit and/or task when it has completed, for example in order to tell whether the time constraints were met, or to tell that the processing unit / task did not crash (thus the scheduler might also function as a watchdog or provide information to a central watchdog). A function to query the task "are you still working?" is beneficial, too.
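A minimal sketch of this feedback idea, assuming a thread stands in for an asynchronously executing processing unit; the time budget and the task body are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def task():
    time.sleep(0.05)   # stand-in for work done asynchronously on a processing unit
    return "done"

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(task)   # triggered once, then runs without the scheduler's attention

# Scheduler-side check, playing the role of the "are you still working?" query:
try:
    print("completed:", future.result(timeout=0.2))  # time constraint met
except TimeoutError:
    print("task overran its budget or crashed")      # escalate to a (central) watchdog
```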
  • One embodiment comprises that the at least one performance attribute comprises: latency, throughput, power consumption, temperature, access collisions. Monitoring one or more or all of these performance attributes has proven to result in a reliable or robust scheduling plan that yields a robust or real-time performance ability in a processing system of a vehicle in varying or different driving situations.
  • One embodiment comprises that different operating modes, in particular a low power mode and/or a high performance mode, are pre-defined for the given system type as respective processing constraints, and at least one version of the scheduling plan is evaluated with regard to its suitability for the different processing constraints on the basis of the analysis data. If a specific version of the scheduling plan fulfills a predefined suitability criterion (with respect to the processing constraint), plan data describing that version of the scheduling plan are stored and later used as the scheduling plan when the corresponding operating mode is activated in a processing system.
  • Although a current version of a scheduling plan might not meet the requirements of a current driving situation (for example, the performance of the tasks might be too slow), that scheduling plan might have the advantage that another goal or condition is met; for example, the power consumption might be lower than for other versions of the scheduling plan.
  • If the plan data of the scheduling plan are saved, the scheduling plan can be reused or applied in a different driving situation, when a specific processing constraint, for example, a low power mode, is needed.
  • different scheduling plans can be derived on the basis of observing the tasks in different driving situations and collecting the corresponding analysis data. If one of the operating modes is demanded, for example, by a corresponding mode signal, the corresponding scheduling plan can be implemented or activated in the respective processing system.
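The mode-specific storage and activation of plan versions could look like the following sketch (illustrative only; the suitability test and the plan objects are placeholders, not defined by the invention):

```python
plans_by_mode = {}   # operating mode -> stored plan data

def evaluate_version(plan, mode, fulfills_criterion):
    # Store the plan data if this version fulfills the predefined
    # suitability criterion for the mode's processing constraints.
    if fulfills_criterion(plan, mode):
        plans_by_mode[mode] = plan

def on_mode_signal(mode, current_plan):
    # Activate the stored plan for the demanded operating mode, if any.
    return plans_by_mode.get(mode, current_plan)
```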
  • One embodiment comprises that a cyclic execution of the tasks is performed and for some or each cycle or loop of the cyclic execution, individual analysis data are provided and the scheduling plan is iteratively adapted to a respective current driving situation by performing the update after each cycle or after each predefined number of cycles.
  • the scheduling plan is adapted or updated while the data stream continues streaming into the processing system.
  • the scheduling plan is therefore adapted to the data stream and therefore dynamically adapts to the current driving situation.
  • the cyclic updating of the scheduling plan may be continued throughout the whole driving situation and/or the whole data stream, or it may be stopped once a predefined convergence criterion is fulfilled for the scheduling plan, for example, when the change achieved in the optimization of the at least one performance attribute falls below a specific percentage value (for example, less than a 10% or less than a 5% further reduction of the execution time is achieved).
  • the iterative updating can then be interrupted.
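A condensed sketch of this iterative loop, under the assumption that one aggregate performance value (e.g. total execution time per cycle) is monitored; run_cycle and apply_mitigations are hypothetical placeholders:

```python
def adapt_until_converged(plan, run_cycle, apply_mitigations, epsilon=0.05):
    """Update the plan after each cycle until the further reduction of the
    monitored performance value falls below epsilon (e.g. 5%)."""
    previous = run_cycle(plan)                 # e.g. measured execution time
    while True:
        plan = apply_mitigations(plan, previous)   # apply mitigation measures
        current = run_cycle(plan)
        if previous - current < epsilon * previous:
            return plan                        # convergence criterion fulfilled
        previous = current
```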
  • One embodiment comprises that the tasks are designed for processing stream data of a data stream (i.e. a flow of data, especially video data, measurement data) and the tasks process the data stream in real-time, that is at the speed that new stream data arrive and processed stream data are passed on, wherein the processing is done in each cycle on a subset of the overall stream data at a time.
  • exemplary tasks for processing a data stream are an ingest task, a processing task, and an output task.
  • the invention also provides a control framework for at least one vehicle, wherein for each vehicle a processing system comprising processing units is provided and the control framework comprises at least one processor circuitry that is designed to provide a local scheduler in the respective processing system of the vehicles or a master scheduler in a stationary backend computing system or a distributed scheduler comprising the respective local scheduler and the master scheduler, wherein the at least one processor circuitry is designed to perform an embodiment of the inventive method.
  • the control framework can be designed for one single vehicle, that is all the components are included in the vehicle, for example, in the processor circuitry of an electronic control unit and/or a central computing unit (head unit) of the vehicle.
  • the vehicle may be connected to a stationary backend computing system, for example, a cloud computer or a computer server that may be operated in the internet.
  • a link between the vehicle and the backend computer system may be provided on the basis of an internet connection and/or a wireless connection based on, for example, WiFi technology or mobile communication technology, e.g. 5G.
  • the control framework may also comprise several vehicles that may be linked to the stationary backend computing system, such that the vehicles may be operated on the basis of a common scheduling plan and the analysis data of each of the vehicles may be gathered or combined by the master scheduler in the backend computing system to update the scheduling plan for all of the vehicles.
  • the respective vehicle is preferably designed as a motor vehicle, in particular as a passenger vehicle or a truck, or as a bus or a motorcycle.
  • the invention also comprises the combinations of the features of the different embodiments.
  • Fig. 1 a schematic illustration of an embodiment of the inventive control framework with a respective processing system in at least one vehicle and a backend computing system;
  • Fig. 2 a schematic illustration of the control framework after an update of a scheduling plan based on a mitigation measure;
  • Fig. 3 a sketch for illustrating the effect of the mitigation measure.
  • Fig. 1 shows a control framework 10 that may comprise at least one vehicle 11 and a stationary backend computing system 12.
  • the backend computing system 12 can be an internet server and/or a cloud server operated in the internet 13.
  • the vehicle 11 can be, for example, a motor vehicle, e.g. a passenger vehicle or a truck.
  • a processing system 14 may be provided for processing a data stream 15 that may be provided in chunks or frames 16 to the processing system 14 and for each frame 16, a predefined application program 17 (Prog) is applied to the stream data 18 of the frames 16 such that an output data stream 19 is generated.
  • the data stream may be comprising raw image data (.raw) and from the raw image data an encoded image data stream (.avi) may be generated as output data stream 19.
  • the stream data 18 may be stored in a storage 20, for example, based on the technology NVMe (non-volatile memory express).
  • the application program 17 may define or demand several tasks 21 that make up or result in the overall functionality of the application program 17.
  • the tasks 21 may be performed by processing resources or processing units 22 that are available in the processing system 14.
  • Fig. 1 shows as an example a CPU with processing units 22 in the form of kernels C, an encryption module Cryp and a video encoder Vid, which are exemplary processing units 22.
  • At least one processing unit 22 may be provided in the form of a CPU and/or a GPU.
  • Shared memory SHM and dedicated memory DEM for the GPU may also be processing units 22 or resources of the processing system 14.
  • a direct memory access controller for transferring data from the shared memory SHM to the dedicated memory DEM and in the opposite direction may also be a processing unit 22.
  • a memory controller for transferring data between the storage 20 and the shared memory SHM may also be a processing unit.
  • the processing units may perform a respective task 21 once they have been initiated or programmed based on software data from the application program 17. For processing the incoming stream data 18 in the correct order and/or one after another, a specific order or concatenation of the tasks 21 needs to be considered.
  • a local scheduler 23 of the processing system 14 may gather or receive task data 24 regarding the application program 17 and may set up a scheduling plan 25 indicating how the tasks 21 shall interact and/or when the tasks 21 shall be triggered, in other words, which task 21 shall be associated or assigned to which processing unit 22.
  • the tasks may be linked or interconnected by respective buffers 26 for using output data 27 of one task 21 as corresponding input data 28 of the next or following task.
  • the tasks 21 are executed in a cyclic execution, i.e. they are repeated for each chunk or frame 16.
  • respective analysis data 30 may be measured or observed and provided to the backend computing system 12.
  • each of the vehicles may send analysis data 30 to the backend computing system 12.
  • the scheduling plan 25 may additionally or alternatively to the local scheduler 23 be derived by a master scheduler 31 based on task data 24. The scheduling plan 25 may then be provided to each of the vehicles 11.
  • the scheduling plan 25 may be updated, if from the analysis data 30 it becomes apparent that at least one performance attribute 32 results in a performance value for at least one task 21 that indicates a non-real-time performance of the application software 17 in at least one driving situation 33.
  • a mitigation measure MM may be applied to the scheduling plan 25 and the updated scheduling plan may be performed or used in the processing system 14 instead of the previous scheduling plan 25.
  • Fig. 2 and Fig. 3 illustrate possible mitigation measures.
  • the following comparison shows the scheduling plan 25 in its original version as shown in Fig. 1 and the updated scheduling plan.
  • the exemplary performance attributes bandwidth and duration may be measured, resulting in task-specific performance values x, y that may be provided as part of the analysis data.
  • Fig. 3 illustrates that the mitigation measure may avoid access collisions or hazards 35 that may be indicated by the analysis data 30 for the buffers 26.
  • the initial scheduling plan 25 is applied, and after an update 36 the updated scheduling plan 25’ with introduced mitigation measures MM is shown.
  • an additional remap task for introducing, for example, a double-buffer logic can be applied such that the output data 27 may already be written to a buffer 26 while the next task 21 reads in the input data 28 from the previous cycle.
  • Central processing units (CPUs) and additional processing units may handle, for example, 3D graphics and video (multimedia in general) customer applications and at the same time serve the increased demand for mobile computing (laptops and smartphones) with limited battery capacity.
  • Application developers manage the input, processing and output of their applications, running in the so-called user-space of an operating system on the CPU; an environment with fewer privileges than the operating system's kernel and drivers.
  • Data that is written to, or read from, input/output peripherals or special purpose processors is copied from one device's address space to the other - even when using the same physical memory. This may or may not take place efficiently depending on the operating system's subsystems for the management of the peripherals, which may or may not make use of hardware accelerated transfers. Hence, performance is heavily reliant on the system configuration at hand.
  • the lack of standardized interconnects and subsystems for peripheral memory management may come at the price of very slow memory transfers between devices through the user-space application on the CPU.
  • a scheduler is needed to manage resources and resource usage and to mitigate problems, while still allowing for a simultaneous execution of multiple programs.
  • the execution of a program's individual tasks in such a heterogeneous system will be managed by the scheduler and take place asynchronously, i.e. will not need the scheduler's attention while executing.
  • the statistics of the execution of tasks and programs, the schedules/pipelines derived, and the intermediate buffers needed shall be sent to a backend. Data exchanged (input and output) between tasks can be recorded to find edge cases if the processing of a task took exceptionally long, i.e. longer than expected.
  • a schedule can be derived in which multiple programs can schedule and execute tasks in parallel and asynchronously in a heterogeneous system whilst providing a high reliability with regard to execution time and power consumption (if needed).
  • the precision is further increased by aggregating this data in a backend from multiple installations of the same system. System- and application developers can utilize this information to further tweak individual tasks or programs.
  • a preferred implementation of the control framework comprises the following components:
  • multiprocessor, multicore, specialized processing units such as GPUs, TPU/DLAs, video encoders/decoders, encryption/decryption accelerators, compression/decompression accelerators
  • Tasks are managed by the scheduler and run asynchronously in said heterogeneous system.
  • - Applications are managed by the scheduler or by the operating system.
  • the scheduler may be embedded in the operating system or be an independent user-space application (daemon), and may or may not be embedded into a framework providing further functionality to application developers.
  • Furthermore, said scheduler provides means to measure the execution duration of tasks and the duration to exchange data between executing tasks, and also to measure attributes such as the power consumption / scaling of processors or processing units.
  • the scheduler may insert buffers automatically to decouple the execution time of concatenated tasks, when access conflicts are detected (hazards).
  • said scheduler sends the measured attributes to a backend.
  • the attributes can be used to derive a schedule when all programs (concatenation of tasks) are known, or programs are added.
  • the derived schedule will provide the estimated execution time and power consumption of a number of programs/tasks running distributed in the heterogeneous system (in parallel and asynchronously) but managed through the scheduler.
  • tasks running outside of the confines of the estimated schedule can be analyzed in the backend by transmitting the task's input data and the generated schedule to the backend.
  • the estimations aggregated in the backend can then be adapted or the task can be optimized to better handle certain conditions / exceptions.
  • aggregated attribute-data is used to provide an accurate simulation of the execution within the system.
  • Tasks have dependencies on each other and on resources needed to execute.
  • the scheduler maintains a list of consecutive tasks, i.e. the order of execution is determined by the precondition of input data. Any data that is the output of another task must be available before the task can be executed in every cycle.
  • Tasks carry information about the needed/reserved hardware resources (execution units, bandwidth).
  • the scheduler will only trigger the execution of a task if the required resources are available, and will reserve these resources until the execution of the task is complete.
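A sketch of this trigger-and-reserve logic (illustrative only; the resource names are invented):

```python
class ResourceGate:
    def __init__(self, resources):
        self.free = set(resources)

    def try_trigger(self, required):
        # Trigger only if all required resources are available ...
        if not set(required) <= self.free:
            return False            # task stays pending
        self.free -= set(required)  # ... and reserve them
        return True

    def on_complete(self, required):
        self.free |= set(required)  # release after the task has finished

gate = ResourceGate({"GPU", "DLA", "DMA"})
assert gate.try_trigger({"GPU", "DMA"})   # runs; GPU and DMA are reserved
assert not gate.try_trigger({"GPU"})      # blocked until completion
gate.on_complete({"GPU", "DMA"})
```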
  • the schedule is described by the start point of execution as well as the execution time. Metrics such as CPU usage, memory bandwidth and power consumption are also tracked.
  • the measured data becomes more representative of an individual task's execution time and contribution to resource usage. This is especially true if applications and their tasks are executed in different combinations with other applications.
  • - Resulting schedules may be uploaded to a backend. It shall be possible to narrow / filter data specifically for worst-case results measured or an arbitrary deviation from average values. Input data used for the execution of an individual task, or a chain of tasks, may also be uploaded to the backend, to recreate the situation encountered in the field.
  • - Mitigation measures for updating a dynamic scheduling plan may include at least one of the following:
    o Scoreboarding,
    o the Tomasulo algorithm,
    o elimination of task processing / data transfers when the task inputs were unchanged (propagation of unchanged outputs),
    o re-ordering of tasks / attempts to optimize a schedule by applying known patterns to it whilst considering predefined profiles/constraints (e.g. optimize the task order, or task variants, according to the “Low Power Usage” profile).
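Of these measures, the propagation of unchanged outputs can be sketched compactly: fingerprint the task inputs and, if they match the previous cycle, skip the execution and data transfer and re-use the cached output. This is a sketch under assumptions; the hashing scheme is invented for illustration.

```python
import hashlib
import pickle

_cache = {}  # task name -> (input fingerprint, previous output)

def run_or_propagate(name, fn, *inputs):
    key = hashlib.sha256(pickle.dumps(inputs)).hexdigest()
    cached = _cache.get(name)
    if cached and cached[0] == key:
        return cached[1]            # inputs unchanged: propagate previous output
    output = fn(*inputs)            # inputs changed: execute the task
    _cache[name] = (key, output)
    return output
```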

Abstract

The invention is concerned with a method for scheduling software tasks (21) on at least one processing system (14) of a given system type, wherein each processing system (14) is situated in a respective vehicle (11) and the tasks (21) are executed in different driving situations (33). The invention comprises that a scheduling plan (25) is provided to the at least one processing system (14) for executing a respective instance of the tasks (21) in the respective processing system (14) and from the at least one processing system (14) respective analysis data (30) are received and the scheduler (23) generates an update of the scheduling plan (25) by applying at least one mitigation measure, wherein the respective mitigation measure is applied if at least one performance attribute (32) is improved according to a predefined optimization criterion as compared to the received analysis data (30).

Description

Method for scheduling software tasks on at least one heterogeneous processing system using a backend computer system, wherein each processing system is situated in a respective vehicle of a vehicle fleet, and control framework for at least one vehicle
DESCRIPTION:
The invention is concerned with a method for scheduling software tasks on at least one processing system. Each processing system is situated in a respective vehicle and the tasks are executed in different driving situations, such that their behavior with regard to at least one predefined performance attribute may vary. Each processing system executes its own instances of the tasks. The invention is also concerned with a control framework for at least one vehicle for executing the tasks according to a scheduling plan.
In a vehicle, an electronic control unit or a network of several electronic control units may constitute a so-called heterogeneous processing system of the vehicle, i.e. a processing system that comprises several different processing units of different technology. For example, a processing system may comprise at least one CPU (central processing unit) with one or more processing kernels and/or at least one GPU (graphics processing unit) and/or one or more TPUs (tensor processing units) and/or DLAs (deep learning accelerators) and/or at least one DMA controller (DMA - direct memory access) and/or a NIC (network interface controller) and/or a RAM memory controller (RAM - random access memory), just to name examples. The processing units may run autonomously or asynchronously, that is, the processing units may perform or execute their respective task in parallel or at the same time. However, some or all of the tasks may belong to one specific application program of the processing system, i.e. these tasks in combination fulfill or perform a specific function as defined by the application program.
To this end, the tasks interact in that the output data of one task constitute the input data of another task. In other words, the tasks are concatenated and their execution needs to be coordinated such that the described concatenation of the tasks is obtained in that one task produces its output data fast enough to supply the next task in the concatenation with input data. The other way round, a task should not produce its output data faster than they can be taken in by the next task. The order in which the tasks are triggered or executed on the respective processing units of a processing system can be coordinated or defined by a scheduling plan that associates the tasks with the available processing units and defines the condition for triggering or executing the tasks (e.g. triggering a task whenever new input data is available in a buffer).
An additional degree of complexity arises when a data stream is processed by such a processing system as such a data stream provides more stream data than the processing system can process by executing the tasks only once. For example, the data stream may provide a continuous flow of stream data, as can be the case for processing camera images of a video camera for surveilling or observing the environment of the vehicle, or processing sensor data for controlling an engine of a vehicle, both of which are controlling application programs that may last or continue throughout a whole driving phase of a vehicle.
In this case, a processing system may process the data stream in chunks or frames, that is, a subset of the stream data is received, processed and forwarded at a time, wherein the throughput or processing speed or data rate of this processing of chunks or frames must meet real-time conditions or online conditions, that is, the rate of processing the stream data must be the same as or larger than the data rate of the data stream itself in order to prevent an overrun. For the described frame-wise processing of stream data, the tasks are executed repeatedly in a cyclic execution (e.g. one cycle per frame), which introduces the additional complexity that a task that produces output data for consumption by the next task as input data should not overwrite its previous output data of the previous cycle if the next task has not yet read in or accepted the previous output data.
An additional problem arises when the processing system is operated in different or changing driving situations, as this may result in different processing speeds of the tasks. For example, if the vehicle is operated in a driving situation where the air temperature is high (for example, above 40° C), the clock frequency of a processing unit may be reduced or lowered in order to prevent overheating of that processing unit. The corresponding task executed by that processing unit therefore processes data at a lower data rate than might be considered or planned in the scheduling plan. Document US 4,972,314 A describes a data flow signal processor that is optimized for processing a data stream. This processor coordinates tasks on the basis of a graph scheduling process that assigns the tasks to available processing units. However, the processor relies on a specific architecture such that adapting the processing system by, for example, equipping it with more GPUs, demands a complete re-design of the processor and the graph scheduling process.
Document WO 2009 / 029549 A2 describes a method for performance management of a computer system. The method assigns a specific time slot to individual processing tasks executed by a processing unit. At the end of each time slot, the next task is assigned to that processing unit for execution. In other words, tasks may be interrupted while processing stream data if they take longer than originally planned. This may have an unforeseen side effect on the coordination of the tasks.
Document WO 2014 / 207759 A2 describes a scheduler that may schedule the same task on one out of several different processing units depending on a respective constraint of a processing unit, for example, a constraint regarding bandwidth, energy or computation capability. The method is applied to a network of computing units that may comprise smartphones and other so-called edge devices. Instead of processing a data stream, this method is used for processing data that may be processed independently on one of the edge devices. The processed data may then be re-collected by a master node that combines them into an overall processing result.
When developing a software application program of a heterogeneous processing system with a reliability requirement, to process large, continuous data streams of stream data, like video data or sensor data, it is important to determine what the limits / confines are in worst-case driving situations, if the application programs must deliver a result on a known schedule or within a predefined time limit (e.g. real-time conditions).
It is an objective of the present invention to provide a scheduling of tasks in a processing system of a vehicle such that at least one predefined performance attribute does not deteriorate even when the driving situation changes.
The objective is accomplished by the subject matter of the independent claims. Advantageous developments with convenient and non-trivial further embodiments of the invention are specified in the following description, the dependent claims and the figures.
As a solution the invention provides a method for scheduling software tasks on at least one processing system of a given common system type, wherein each processing system is situated in a respective vehicle and the tasks are executed in different or varying driving situations. Within each processing system, the tasks may be executed on a respective processing unit. Each processing system may have several processing units of different technology or type. Such a processing unit can be, for example, a CPU kernel, a hardware encoder, a hardware decoder, a memory storage controller, GPU-x, DLA-x, a NIC (network interface controller) or one of the already described exemplary processing units, just to name examples. In general, a processing unit is a hardware module or an integrated circuit that may perform or execute the task asynchronously or independently from the other processing units. It is noted though that the tasks themselves may depend on each other in that the tasks may be concatenated in that one task may rely on the output data of a preceding task as input data for itself. A task can be, for example, a thread or a job for a DMA transfer or a matrix operation or a SIMD-instruction (single instruction multiple data), just to name examples. Data transfers (DMA / interconnects) and also setup/context-switches of said processing units add further costs on top of executing a task. These costs can be considered part of the task execution time; strictly speaking, they are not part of the task itself, but rather something the scheduler must also consider for the sake of deterministic execution on such a processing unit. The execution time of an individual task would vary more if setup/teardown were considered part of it, because these costs depend on what other tasks are running simultaneously. They are therefore something the scheduler might want to measure independently. The method considers one single processing system or several processing systems that each are of the same common system type. The system type may be defined as the specific combination of processing units built into that processing system. In other words, the respective processing system is preferably a heterogeneous system as has already been described. The tasks may belong to one software application program or several software application programs. An example for such a software application program can be, for example, an object recognition based on an image data stream of a camera or an engine control for a combustion engine or an electric engine based on sensor data of at least one sensor of the engine.
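To make the separately measurable setup/transfer costs concrete, the following is a minimal timing sketch (illustrative only; the setup and execute hooks are hypothetical placeholders, not an API defined by the invention):

```python
import time

def run_instrumented(task, input_data, record):
    """Time setup/data transfer and execution separately, so the scheduler can
    account for context-switch and DMA costs independently of the task itself."""
    t0 = time.perf_counter()
    task.setup(input_data)        # data transfer / context switch (hypothetical hook)
    t1 = time.perf_counter()
    result = task.execute()       # the actual computation (hypothetical hook)
    t2 = time.perf_counter()
    record.append({"setup_s": t1 - t0, "exec_s": t2 - t1})
    return result
```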
For coordinating the execution of the tasks in the respective processing system, a scheduler determines a concatenation of the tasks on the basis of task data, wherein
- the task data at least describe respective input data needed by some or all of the tasks for being executed and respective output data produced by some or all of the tasks when executed, wherein
- the tasks are concatenated in that the output data of at least some of the tasks correspond to the input data of at least one respective other task. The task data may be provided with each task or each software application program. The task data may be provided by a developer of the respective software application program. For example, the task data may be part of the source code and/or the binary code of the software application program. For each task, the task data may provide a data type and/or a buffer size of the input data and/or output data as they may be processed by the tasks in one execution cycle.
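As a sketch of how a scheduler might derive the concatenation from such task data, the following matches declared output names against declared input names (the task names and data labels are invented for illustration):

```python
def derive_concatenation(task_data):
    """task_data: {task: {"inputs": [...], "outputs": [...]}}.
    Returns (producer, consumer) edges wherever an output feeds an input."""
    producer_of = {out: task
                   for task, td in task_data.items()
                   for out in td["outputs"]}
    return [(producer_of[inp], task)
            for task, td in task_data.items()
            for inp in td["inputs"] if inp in producer_of]

tasks = {
    "ingest":  {"inputs": ["camera.raw"], "outputs": ["frame"]},
    "process": {"inputs": ["frame"],      "outputs": ["objects"]},
    "output":  {"inputs": ["objects"],    "outputs": ["result.avi"]},
}
print(derive_concatenation(tasks))  # [('ingest', 'process'), ('process', 'output')]
```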
The scheduler determines the scheduling plan that allocates the tasks to processing units of the respective system for executing the tasks, wherein the allocation corresponds to the concatenation of the tasks. The first version or an initial version of the scheduling plan can be derived on the basis of a method according to the prior art. However, in order to ensure that the scheduling plan runs the tasks such that they provide their output data reliably in different or changing driving situations, the scheduling plan is provided to the at least one processing system. Each processing system (i.e. each vehicle) then executes a respective instance of the tasks according to the scheduling plan. From the at least one processing system respective analysis data are received. These analysis data describe a respective performance value of at least one predefined performance attribute (e.g. execution time) for some or each of the tasks and/or for the processing system. It is noted here that in the case of several vehicles, an according number of processing systems exists such that each task exists in each processing system, i.e. several “instances” of the task may exist. In other words, an instance of the task is the specific task running on a specific processing system. After implementing the scheduling plan in the respective processing system and running or executing the tasks according to the scheduling plan, a respective performance value for at least one performance attribute is measured or determined for the respective tasks. Such a performance attribute can be, for example, an execution time or duration for executing the task and/or an energy consumption. In other words, the analysis data describe how much of the respective performance attribute is required by the respective task, for example, a performance attribute of time duration or power consumption or access collisions when two tasks try to access the same resource at the same time.
The scheduler then generates an update of the scheduling plan by applying at least one mitigation measure that adapts the allocation of the tasks to the processing units in dependence on the analysis data, wherein the respective mitigation measure is applied if the at least one performance attribute is improved according to a predefined optimization criterion as compared to the received analysis data. Therefore, on the basis of the analysis data, it is observed or monitored if or how well the scheduling plan fits a performance requirement or a respective threshold for the performance value of the at least one predefined performance attribute for the one driving situation or the several driving situations that have been experienced by the respective processing system while the scheduling plan was in use. If a possibility for improving the at least one performance attribute is given, for example, reducing the execution time duration and/or power consumption and/or number of access collisions, the respective mitigation measure is applied to the scheduling data of the scheduling plan such that when the scheduling plan is updated and used in the respective processing system, for a future execution of the tasks in the respective processing system, an improvement with regard to the at least one predefined performance attribute is achieved.
The method provides the advantage that the scheduling plan is adapted or improved while the at least one vehicle experiences different driving situations (e.g. driving in different air temperatures) such that a scheduling plan is iteratively developed or derived that will cope with an increased number of driving situations. A driving situation may be defined by a respective interval for values of one or more driving parameters, for example, driving speed, engine speed, air temperature, number of traffic participants around the vehicle, just to name examples.
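For illustration, a driving situation defined by parameter intervals could be encoded as follows (situation names, parameters and bounds are invented, not taken from the invention):

```python
SITUATIONS = {
    "hot_highway": {"air_temp_c": (35.0, 60.0), "speed_kmh": (100.0, 250.0)},
    "cold_city":   {"air_temp_c": (-30.0, 5.0), "speed_kmh": (0.0, 60.0)},
}

def classify(params):
    # A situation matches if every driving parameter lies in its interval.
    for name, intervals in SITUATIONS.items():
        if all(lo <= params[p] <= hi for p, (lo, hi) in intervals.items()):
            return name
    return "unclassified"

print(classify({"air_temp_c": 41.0, "speed_kmh": 130.0}))  # hot_highway
```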
The invention also comprises embodiments that provide features which afford additional technical advantages. One embodiment comprises that, as one mitigation measure, the scheduler links at least two tasks through a queue buffer and/or a double-buffer for providing the output data of one of these tasks as input data to the other one of these tasks, if an access conflict (so-called hazard) is detected. In other words, by this specific mitigation measure it is dynamically determined if and where a buffer, for example, a so-called FIFO (first in first out), is needed. Accordingly, a respective buffer is inserted to decouple the execution times of the concatenated tasks. The need for a buffer is detected on the basis of the detection or observation of access conflicts, which are also termed hazards. Introducing a buffer provides the advantage that a first task can already be re-executed or repeated for generating new output data while the next, second task is still reading or processing the previous input data. The first task need not wait until the second task has finished reading or processing the previous input data. Thus, a scheduling plan results that is optimized or minimized in terms of waiting time for the first task.
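A minimal sketch of such decoupling with a bounded FIFO between two concatenated tasks (the queue size and frame count are arbitrary choices for illustration):

```python
import queue
import threading

buf = queue.Queue(maxsize=2)   # the inserted FIFO buffer between the two tasks

def first_task():
    for frame in range(5):
        buf.put(f"output-{frame}")   # blocks only when the FIFO is full, so the
                                     # producer may run ahead of the consumer

def second_task():
    for _ in range(5):
        data = buf.get()             # triggered whenever new input data arrives
        buf.task_done()              # stand-in for processing the data

t1 = threading.Thread(target=first_task)
t2 = threading.Thread(target=second_task)
t1.start(); t2.start(); t1.join(); t2.join()
```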
One embodiment comprises that the at least one performance attribute comprises a power consumption and/or a temperature, and as one mitigation measure the scheduler transfers the execution of at least one of the tasks from a first processing unit to another, second processing unit for reducing a load on the first processing unit and/or for switching off the first processing unit. If two processing units are available that provide the same functionality (although they may have different processing power), it can be advantageous to transfer a task from one processing unit to another one, if this allows energy consumption and/or dissipation of heat to be reduced by deactivating the first processing unit. Thus, a scheduling plan results that is optimized or minimized in terms of power consumption and/or temperature generation or heat generation.
One embodiment comprises that the at least one processing attribute comprises a memory bandwidth for a data throughput (duration of data transfer) between at least two tasks, and as one mitigation measure the scheduler introduces memory remapping for avoiding copy operations of data between these tasks. Instead of copying the data from one memory section to another, a task may read its input data via a memory pointer that is re-assigned to the respective new input data whenever they become available. This removes the need to copy these data into a specific input data buffer; instead, the pointer is set to the respective data that shall be used as new input data.
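The difference between copying and remapping can be sketched as follows (a hypothetical illustration; buffer sizes and names are assumptions): the copy variant moves every byte in every cycle, while the remap variant only re-assigns a reference:

```python
# Data produced by a first task in the current cycle (placeholder content).
new_output = bytearray(b"\x00" * 1_000_000)

# Copy variant: O(n) memory traffic per cycle.
input_buffer = bytearray(len(new_output))
input_buffer[:] = new_output             # full copy into the input buffer

# Remap variant: O(1) reference re-assignment; the consuming task reads
# the producer's buffer directly through a memoryview (no bytes move).
task_b_input = memoryview(new_output)
assert task_b_input.obj is new_output    # same physical memory, no copy
```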
The respective processing attribute may be monitored or measured using at least one hardware component of the processing system, for example, a temperature sensor and/or a timing clock and/or a register for counting clock ticks of a clock of the processing system and/or an interrupt counter (for example, for detecting hazards).
For deriving a scheduling plan according to a method of the prior art, it is generally assumed that a task will need or require a specific performance value of the at least one performance attribute, e.g. an average execution time. In general, such assumptions are made based on the experience of the developer of the scheduling algorithm. This makes it possible to estimate run-time confines for the tasks. Accordingly, one embodiment comprises that confines for the analysis data are estimated on the basis of a predefined system model for the system type that is used in the at least one vehicle. On the basis of the received analysis data, at least one task is identified that is running outside the confines. In order to identify the reason, input data exchanged between the tasks and/or situation data describing the current driving situation (e.g. air temperature and/or engine speed) are recorded. If a task runs outside its confines, the input data that caused this behavior of the task are provided to the at least one mitigation measure. If, in a specific processing system, a task produces an outlier regarding the range of values for the respective performance value, the input data and/or the information about the driving situation that caused this outlier are available and may be considered in the update of the scheduling plan. Thus, experience or information gained from different driving situations can be considered in the scheduling plan.
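As a hedged illustration of such a confines check (the three-standard-deviation band, the durations and the field names are assumptions, not values from the method), an outlier could be detected and its triggering inputs recorded like this:

```python
import statistics

# Past execution durations (ms) of one task, e.g. taken from the system model.
history = [10.2, 9.8, 10.1, 10.4, 9.9]
mean = statistics.mean(history)
std = statistics.stdev(history)
lower, upper = mean - 3 * std, mean + 3 * std   # assumed confines

def check_execution(duration_ms, input_data, situation, outlier_log):
    """Record what caused an out-of-confines run for later mitigation."""
    if not (lower <= duration_ms <= upper):
        outlier_log.append({
            "duration_ms": duration_ms,
            "input": input_data,          # data that triggered the outlier
            "situation": situation,       # e.g. current driving situation
        })

log = []
check_execution(22.7, input_data=b"raw-frame-bytes",
                situation={"air_temp_c": 38}, outlier_log=log)
print(log)
```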
One embodiment comprises that the scheduler is a master scheduler in a backend computing system. Alternatively, a distributed scheduler comprising the master scheduler and a respective local scheduler in each of the processing systems is provided. Several processing systems in different vehicles may be provided with the same scheduling plan, and corresponding analysis data are received from several vehicles, such that the update of the scheduling plan considers driving situations that not all of the vehicles have experienced. This provides the advantage that an update of the scheduling plan may also be based on experience or information regarding a specific driving situation that a specific vehicle has not experienced itself, because the corresponding analysis data were received from another vehicle. The scheduling plan will thus converge faster towards a scheduling plan covering different driving situations, as several vehicles may be operated in different driving situations at the same time.
One embodiment comprises that the optimization criterion comprises that, in comparison to the analysis data, a number of outliers and/or an average of the performance values of at least one performance attribute is reduced. For example, if the analysis data describe that the execution duration of a task is twice as long as expected due to the reduced clock frequency of an over-heated processing unit, then, by applying the mitigation measure of transferring the task to a cooler processing unit, it can be expected that with the updated scheduling plan the task will not be executed on the hot processing unit the next time, such that the outlier is prevented. The respective effect or outcome of each mitigation measure can be estimated on the basis of a model of a processing system of the given system type. Such models are available in the prior art.
One embodiment comprises that a process model of the executed tasks is generated that describes or is adapted to the analysis data. In other words, the operation of the processing system is described as a model. The model may provide a statistical description (e.g. mean value and variance) of the respective performance value of at least one performance attribute (e.g. execution duration and/or power consumption and/or number of hazards) for the respective task. On the basis of such a model, a software developer may be supported in developing an update of the at least one software application program. Additionally or alternatively, a simulator of the system type is configured using the analysis data, a simulation of the execution of the tasks is performed, and estimates for the analysis data are generated for at least one predefined situation data set that describes a respective driving situation. Once the simulator is configured on the basis of the analysis data, for example as a process model, it can simulate a driving situation that has not yet been experienced by any of the real processing systems (or by the single real processing system, if only one is used). This makes it possible to optimize, adapt or update the scheduling plan even for driving situations that the respective vehicle has not experienced so far. This prevents difficulties, as the scheduling plan can be prepared for a specific driving situation before the vehicle actually experiences it.
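A minimal sketch of such a simulator (all task names, distribution parameters and the situation scaling factor are illustrative assumptions): per-task duration distributions fitted from analysis data are replayed to estimate the cycle time of the plan for a driving situation not yet encountered:

```python
import random

# (mean_ms, stddev_ms) per task, as fitted from analysis data (assumed values).
process_model = {
    "ingest": (2.0, 0.2),
    "crop":   (5.0, 1.0),
    "encode": (12.0, 3.0),
}

def simulate_cycle(model, situation_scale=1.0, runs=1000):
    """Monte-Carlo estimate of the total cycle time under a scaled situation."""
    totals = []
    for _ in range(runs):
        total = sum(max(0.0, random.gauss(m * situation_scale, s))
                    for m, s in model.values())
        totals.append(total)
    totals.sort()
    return totals[len(totals) // 2], totals[int(len(totals) * 0.99)]

# A hypothetical, harsher driving situation slows all tasks by 30 %.
median_ms, p99_ms = simulate_cycle(process_model, situation_scale=1.3)
print(f"estimated cycle time: median {median_ms:.1f} ms, p99 {p99_ms:.1f} ms")
```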
One embodiment comprises that the scheduling plan is generated as an acyclic graph from the task data. Such an acyclic graph has proven to be a very reliable means for defining the scheduling plan. In the graph, the edges can carry weights that are set according to the performance values as described by the analysis data.
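As an illustration (the task names are taken loosely from the worked example below; the use of Python's graphlib is an assumption, not part of the method), a scheduling plan could be represented as such an acyclic graph and traversed in a dependency-respecting order:

```python
from graphlib import TopologicalSorter

# Scheduling plan as a DAG: each task maps to the set of tasks whose
# output data it needs as input. Edge weights (measured durations from
# the analysis data) would be attached per edge in a fuller model.
plan = {
    "copy_to_shared": set(),
    "copy_to_dedicated": {"copy_to_shared"},
    "crop": {"copy_to_dedicated"},
    "copy_to_encode": {"crop"},
    "encode": {"copy_to_encode"},
    "write_nvme": {"encode"},
}

# A topological order is one valid execution order that respects all
# input/output dependencies of the concatenated tasks.
print(list(TopologicalSorter(plan).static_order()))
```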
One embodiment comprises that the tasks are executed asynchronously on the respective processing unit. This allows a heterogeneous processing system to be handled on the basis of the described method. The execution is triggered to run asynchronously (i.e. independently from a central CPU), but the scheduler may receive feedback from the respective processing unit and/or task upon completion, for example in order to tell whether the time constraints were met, or to confirm that the processing unit or task did not crash (the scheduler may thus also function as a watchdog or provide information to a central watchdog). A function to query a task ("are you still working?") is therefore beneficial as well.
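A minimal sketch of asynchronous dispatch with completion feedback and a liveness query (the thread-pool realization is an assumption for illustration; in the real system the work would run on a specialized processing unit):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def encode_task(frame):
    time.sleep(0.05)            # simulate work on a specialized unit
    return f"{frame} encoded"

with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(encode_task, "frame-0")   # trigger asynchronously
    # ... the scheduler continues with other work here ...
    if not future.done():
        print("task still working")                # watchdog-style liveness query
    # Completion feedback with a deadline: raises if the constraint is missed.
    print(future.result(timeout=1.0))
```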
One embodiment comprises that the at least one performance attribute comprises: latency, throughput, power consumption, temperature, access collisions. Monitoring one or more or all of these performance attributes has proven to result in a reliable or robust scheduling plan that yields robust or real-time performance in a processing system of a vehicle in varying driving situations.

One embodiment comprises that different operating modes, in particular a low power mode and/or a high performance mode, are pre-defined for the given system type as respective processing constraints, and at least one version of the scheduling plan is evaluated with regard to its suitability for the different processing constraints on the basis of the analysis data. If a specific version of the scheduling plan fulfills a predefined suitability criterion (with respect to the processing constraint), plan data describing that version of the scheduling plan are stored and later used as the scheduling plan when the corresponding operating mode is activated in a processing system. Thus, although a current version of a scheduling plan might not meet the requirements of a current driving situation (for example, the execution of the tasks might be too slow), it might have the advantage that another goal or condition is met; for example, its power consumption might be lower than that of other versions of the scheduling plan. If the plan data of the scheduling plan are saved, the scheduling plan can be reused or applied in a different driving situation, when a specific processing constraint, for example a low power mode, is needed. Thus, different scheduling plans can be derived on the basis of observing the tasks in different driving situations and collecting the corresponding analysis data. If one of the operating modes is demanded, for example by a corresponding mode signal, the corresponding scheduling plan can be implemented or activated in the respective processing system.
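A sketch of how stored plan versions could be selected per operating mode (plan identifiers, power and cycle-time values are purely illustrative assumptions):

```python
# Hypothetical store of evaluated plan versions, keyed by operating mode;
# each entry holds the plan data saved for that processing constraint.
plan_store = {
    "low_power":        {"plan_id": "v3", "est_power_w": 14.0, "est_cycle_ms": 40.0},
    "high_performance": {"plan_id": "v7", "est_power_w": 35.0, "est_cycle_ms": 18.0},
}

def activate_mode(mode, store):
    version = store[mode]      # plan version previously found suitable for this mode
    print(f"activating plan {version['plan_id']} for mode '{mode}'")
    return version

activate_mode("low_power", plan_store)   # e.g. triggered by a mode signal
```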
One embodiment comprises that a cyclic execution of the tasks is performed, and for some or each cycle or loop of the cyclic execution, individual analysis data are provided and the scheduling plan is iteratively adapted to a respective current driving situation by performing the update after each cycle or after each predefined number of cycles. In other words, the scheduling plan is adapted or updated while the data stream continues streaming into the processing system. The scheduling plan is therefore adapted to the data stream and dynamically adapts to the current driving situation. The cyclic updating of the scheduling plan may be continued throughout the whole driving situation and/or the whole data stream, or it may be stopped once a predefined convergence criterion is fulfilled for the scheduling plan, for example, once the change achieved in the optimization of the at least one performance attribute falls below a specific percentage value (for example, less than a 10% or less than a 5% further reduction of the task execution time). The iterative updating can then be interrupted.
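The convergence criterion can be illustrated with a short sketch (the measured cycle times and the 5 % threshold are assumptions):

```python
# Cycle times (ms) measured after each plan update (illustrative values).
cycle_times_ms = [40.0, 34.0, 30.5, 29.4, 29.1]

previous = None
for i, current in enumerate(cycle_times_ms):
    if previous is not None:
        improvement = (previous - current) / previous
        print(f"cycle {i}: {improvement:.1%} improvement")
        if improvement < 0.05:               # convergence criterion
            print("convergence criterion met; pausing plan updates")
            break
    previous = current
```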
One embodiment comprises that the tasks are designed for processing stream data of a data stream (i.e. a flow of data, especially video data, measurement data) and the tasks process the data stream in real-time, that is at the speed that new stream data arrive and processed stream data are passed on, wherein the processing is done in each cycle on a subset of the overall stream data at a time. Exemplary tasks for processing a data stream are an ingest task, a processing task, and an output task.
The invention also provides a control framework for at least one vehicle, wherein for each vehicle a processing system comprising processing units is provided and the control framework comprises at least one processor circuitry that is designed to provide a local scheduler in the respective processing system of the vehicles or a master scheduler in a stationary backend computing system or a distributed scheduler comprising the respective local scheduler and the master scheduler, wherein the at least one processor circuitry is designed to perform an embodiment of the inventive method. The control framework can be designed for one single vehicle, that is, all components are included in the vehicle, for example in the processor circuitry of an electronic control unit and/or a central computing unit (head unit) of the vehicle. Alternatively, the vehicle may be connected to a stationary backend computing system, for example a cloud computer or a computer server operated in the internet. A link between the vehicle and the backend computing system may be provided on the basis of an internet connection and/or a wireless connection based on, for example, WiFi technology or mobile communication technology, e.g. 5G. The control framework may also comprise several vehicles that may be linked to the stationary backend computing system, such that the vehicles may be operated on the basis of a common scheduling plan and the analysis data of each of the vehicles may be gathered or combined by the master scheduler in the backend computing system to update the scheduling plan for all of the vehicles.
The respective vehicle is preferably designed as a motor vehicle, in particular as a passenger vehicle or a truck, or as a bus or a motorcycle.
The invention also comprises the combinations of the features of the different embodiments.
In the following an exemplary implementation of the invention is described. The figures show:
Fig. 1 a schematic illustration of an embodiment of the inventive control framework with a respective processing system in at least one vehicle and a backend computing system;
Fig. 2 a schematic illustration of the control framework after an update of a scheduling plan based on a mitigation measure; and
Fig. 3 a sketch for illustrating the effect of the mitigation measure.
The embodiment explained in the following is a preferred embodiment of the invention. However, in the embodiment, the described components of the embodiment each represent individual features of the invention which are to be considered independently of each other and which each develop the invention also independently of each other and thereby are also to be regarded as a component of the invention in individual manner or in another than the shown combination. Furthermore, the described embodiment can also be supplemented by further features of the invention already described. In the figures identical reference signs indicate elements that provide the same function.
Fig. 1 shows a control framework 10 that may comprise at least one vehicle 11 and a stationary backend computing system 12. The backend computing system 12 can be an internet server and/or a cloud server operated in the internet 13. The vehicle 11 can be, for example, a motor vehicle, e.g. a passenger vehicle or a truck. In the vehicle 11, a processing system 14 may be provided for processing a data stream 15 that may be provided in chunks or frames 16 to the processing system 14; for each frame 16, a predefined application program 17 (Prog) is applied to the stream data 18 of the frames 16 such that an output data stream 19 is generated. For example, the data stream may comprise raw image data (.raw), and from the raw image data an encoded image data stream (.avi) may be generated as output data stream 19. The stream data 18 may be stored in a storage 20, for example based on NVMe technology (non-volatile memory express). The application program 17 is run by the processing system 14. The application program 17 may define or demand several tasks 21 that make up or result in the overall functionality of the application program 17. The tasks 21 may be performed by processing resources or processing units 22 that are available in the processing system 14. Fig. 1 shows as an example a CPU with processing units 22 in the form of kernels C, an encryption module Cryp and a video encoder Vid, which are exemplary processing units 22. At least one processing unit 22 may be provided in the form of a CPU and/or a GPU. Shared memory SHM and dedicated memory DEM for the GPU may also be processing units 22 or resources of the processing system 14. A direct memory access controller for transferring data from the shared memory SHM to the dedicated memory DEM and in the opposite direction may also be a processing unit 22, as may a memory controller for transferring data between the storage 20 and the shared memory SHM. The processing units may perform a respective task 21 once they have been initiated or programmed based on software data from the application program 17. For processing the incoming stream data 18 in the correct order and/or one after another, a specific order or concatenation of the tasks 21 needs to be considered. To this end, a local scheduler 23 of the processing system 14 may gather or receive task data 24 regarding the application program 17 and may set up a scheduling plan 25 indicating how the tasks 21 shall interact and/or when the tasks 21 shall be triggered, in other words, which task 21 shall be associated or assigned to which processing unit 22. The tasks may be linked or interconnected by respective buffers 26 for using output data 27 of one task 21 as corresponding input data 28 of the next or following task.
As several chunks or frames 16 need to be processed, the tasks 21 are executed in a cyclic execution, i.e. they are repeated for each chunk or frame 16. For each or some of the cycles, respective analysis data 30 may be measured or observed and provided to the backend computing system 12. If several vehicles 11 are provided, each of the vehicles may send analysis data 30 to the backend computing system 12. The scheduling plan 25 may, additionally or alternatively to the local scheduler 23, be derived by a master scheduler 31 based on the task data 24. The scheduling plan 25 may then be provided to each of the vehicles 11. In the local scheduler 23 and/or the master scheduler 31, the scheduling plan 25 may be updated if the analysis data 30 make it apparent that at least one performance attribute 32 results in a performance value for at least one task 21 that indicates a non-real-time performance of the application program 17 in at least one driving situation 33. In this case, a mitigation measure MM may be applied to the scheduling plan 25, and the updated scheduling plan may be used in the processing system 14 instead of the previous scheduling plan 25.
Fig. 2 and Fig. 3 illustrate possible mitigation measures. The following comparison shows the scheduling plan 25 in its original version as shown in Fig. 1 and the updated scheduling plan.
Initial scheduling plan 25 (Fig. 1):

#1 Copy RAW image to shared memory buffer; performance values: Uses DDR interface, DDR bandwidth=x, duration=y

#2 Copy RAW image buffer to dedicated memory buffer; performance values: Uses shared and dedicated memory DDR interfaces and PCIe, DDR/PCIe bandwidth=x, duration=y

#3 Execute crop task on crop task buffer; performance values: Uses dedicated DDR interface, dedicated DDR bandwidth=x, duration=y

#4 Copy output buffer to encode task buffer; performance values: Uses shared and dedicated DDR interfaces and PCIe, DDR/PCIe bandwidth=x, duration=y

#5 Execute encode task on encode task buffer; performance values: Uses video encoder unit and shared DDR interfaces, DDR bandwidth=x, duration=y

#6 Copy output buffer to NVMe; performance values: Uses shared DDR interfaces and PCIe, DDR/PCIe bandwidth=x, duration=y
The exemplary performance attributes bandwidth and duration may be measured, resulting in task-specific performance values x, y that may be provided as part of the analysis data.
The analysis data of consecutive executions of the same task of a program indicate that certain resources can and should be duplicated in order to allow for hazard-free pipelining. In contrast to execution units, buffers can be inserted at very little cost where they are needed, and remapping memory is a very inexpensive operation. Applying this technique as mitigation measure MM allows for an optimal, parallel utilization of all needed execution units.
Applying the respective mitigation measure MM yields the following updated scheduling plan (Fig. 2):
#1 Copy RAW image to shared memory buffer; performance values: Uses DDR interface, DDR bandwidth=x, duration=y
MM: Remap input RAW image buffer to shared memory buffer

#2 Copy RAW image buffer to dedicated memory buffer; performance values: Uses shared and dedicated memory DDR interfaces and PCIe, DDR/PCIe bandwidth=x, duration=y
MM: Remap input buffer to crop task buffer
#3 Execute crop task on crop task buffer; performance values: Uses dedicated DDR interface, dedicated DDR bandwidth=x, duration=y
MM: Remap crop task buffer to output buffer
#4 Copy output buffer to encode task buffer; performance values: Uses shared and dedicated DDR interfaces and PCIe, DDR/PCIe bandwidth=x, duration=y
MM: Remap input buffer to encode task buffer
#5 Execute encode task on encode task buffer; performance values: Uses video encoder unit and shared DDR interfaces, DDR bandwidth=x, duration=y
MM: Remap encode task buffer to output task buffer
#6 Copy output buffer to NVMe; performance values: Uses shared DDR interfaces and PCIe, DDR/PCIe bandwidth=x, duration=y
Fig. 3 illustrates that the mitigation measure may avoid access collisions or hazards 35 that may be indicated by the analysis data 30 for the buffers 26. On the left-hand side, the initial scheduling plan 25 is applied; after an update 36, the updated scheduling plan 25' with the introduced mitigation measures MM is shown. As a mitigation measure, an additional remap task introducing, for example, a double-buffer logic can be applied such that the output data 27 may already be written to a buffer 26 while the next task 21 still reads in the input data 28 from the previous cycle.

Central processing units (CPUs) and additional processing units may handle, e.g., 3D graphics and video (multimedia in general) customer applications and, at the same time, the increased demand for mobile computing (laptops and smartphones) with limited battery capacity. It has become apparent to hardware and software developers that merely adding more central processing units (multiprocessor or multicore) with media extensions (MMX, SSE, AVX, ...) to computers still cannot compute such specialized tasks reasonably. Owing to the complex tradeoffs between performance, power consumption / heat dissipation and memory bandwidth/latency, graphics processing units (GPUs) and dedicated units for video encoding and decoding have proven to be vastly superior to CPUs for these highly specific tasks. Tensor processing units (TPUs) or deep learning accelerators (DLAs) are used increasingly after having proven to be more efficient than GPUs for the same task.
At the same time, the bandwidth needed to interconnect data storage, network, CPU and GPU components within, but also outside of, computers - with other computers or consumer devices (being computers themselves) - has increased dramatically. High-speed peripherals manage incoming and outgoing data on common busses such as PCIe, USB, Ethernet, etc. found on modern processors and SoCs (system on chip). On SoCs, peripherals, central processor cores and special-purpose units often share the needed working memory through a single memory controller to which the physical memory is attached.
Application developers manage the input, processing and output of their applications, which run in the so-called user-space of an operating system on the CPU; an environment with fewer privileges than the operating system's kernel and drivers.
Data that is written to, or read from, input/output peripherals or special-purpose processors is copied from one device's address space to the other - even when using the same physical memory. This may or may not take place efficiently, depending on whether the operating system's subsystems for the management of the peripherals make use of hardware-accelerated transfers. Hence, performance is heavily reliant on the system configuration at hand.
The latency, throughput and power consumption of memory transfers between devices within a computer (and even on a processor's die) are the biggest challenge for overall system performance (overshadowing the compute speed within processors). This has led to increasing caches within processors and to means to accelerate (and standardize) interconnects, even by adding compression.
The absence of standardized interconnects and of subsystems for peripheral memory management (drivers) may come at the price of very slow memory transfers between devices, routed through the user-space application on the CPU.
The need to deliver stream data reliably within given time limits applies to the utilization of processing units, memory bandwidth and even power consumption. Neither central (general-purpose) nor specialized processing units are inherently deterministic, and the duration (and power consumption) needed to process input data may vary strongly.
If programs are additionally added to and removed from such a system dynamically, developers must be aware of the conditions under which their programs will operate and deliver results.
Typically, a scheduler is needed to manage resources and resource usage and to mitigate problems, while still allowing for a simultaneous execution of multiple programs. The execution of a program's individual tasks in such a heterogeneous system is managed by the scheduler and takes place asynchronously, i.e. it does not need the scheduler's attention while executing.
In order to schedule programs that ingest, process and output streams of data in a heterogeneous asynchronous system, it is necessary to:

- Isolate all configuration (context switch) tasks, data transfer tasks, ingest tasks, processing tasks and output tasks applications may need to use, at the smallest granularity needed and/or possible, that is, inseparable.

- Provide an interface to concatenate tasks to a program / describe programs as a concatenation of tasks, and allocate/manage shared resources (such as shared memory) between tasks.

- Schedule the tasks of all programs to achieve an optimal utilization of processing units and memory bandwidth while considering the execution time of every task. This pipelining of programs must consider dependencies of tasks on one another and hazards/conflicts, i.e. simultaneous access to the same resource when requested by multiple programs' tasks at once.

- Dynamically determine if and where buffers (FIFOs) are needed to overcome variable execution durations of a task - for example, the encoding or compression of a picture may vary depending on the entropy of the picture data.

- Monitor the execution duration of individual tasks while adapting their average and maximum execution duration / the statistical data needed for improving the estimation of the overall program execution duration.
The statistics of the execution of tasks and programs, the schedule/pipelines derived, and the intermediate buffers needed shall be sent to a backend. Data exchanged (input and output) between tasks can be recorded to find edge cases if the processing of a task took exceptionally long, i.e. longer than expected.
In the backend, statistics for the specific system configuration are aggregated from multiple installations and used to provide a highly representative simulation of the hardware's capabilities to developers. This is possible if the execution duration of every task is well known, hence also making it possible to accurately identify potential for performance improvements through:

- Rewriting processing tasks to execute parallelized or on a specialized execution unit (GPU, TPU, ...) instead of the CPU

- Handling edge cases with appropriate/efficient solutions or workarounds

- Utilizing parallelization

- Avoiding context switching and memory transfers

- Accelerating memory transfers through shared memory / remapping memory between execution units
The results of optimizations can be verified precisely by deploying new software to a number of installations in the field.
Often, performance is not the primary target for optimization. Low power usage may be equally or even more important than performance, when the compute power is sufficient for the given programs being executed.
Statistics on the execution of tasks could also encompass power usage. With multiple implementations of the same tasks, it may be reasonable to consider this when scheduling tasks and to allow for a flexible configuration between optimal balance, power efficiency or high performance. Disabling peripherals/processors completely is also an option in this case.
By building programs from a predefined set of independent and interdependent tasks, a schedule can be derived in which multiple programs can schedule and execute tasks in parallel and asynchronously in a heterogeneous system whilst providing a high reliability with regard to execution time and power consumption (if needed). The precision is further increased by aggregating this data in a backend from multiple installations of the same system. System and application developers can utilize this information to further tweak individual tasks or programs.
Resources (memory) shared between tasks can be managed, i.e. decoupled by duplication automatically if necessary, to allow for an optimal utilization of execution units and memory bandwidth on shared resources.

A preferred implementation of the control framework comprises the following components:

- A software running on multiple/many identical heterogeneous systems and a backend.

- A scheduler as part of this software that provides a number of tasks to application/program developers in a heterogeneous system (multiprocessor, multicore, specialized processing units such as GPUs, TPUs/DLAs, video encoders/decoders, encryption/decryption accelerators, compression/decompression accelerators) to process data and also exchange data through shared memory managed by the scheduler.
- Tasks are managed by the scheduler and run asynchronously in said heterogeneous system.
- Applications are managed by the scheduler or by the operating system.
- The scheduler may be embedded in the operating system or be an independent user-space application (daemon), and may or may not be embedded into a framework providing further functionality to application developers.

- Furthermore, said scheduler provides means to measure the execution duration of tasks and the duration needed to exchange data between tasks, and also to measure attributes such as the power consumption / scaling of processors or processing units.
- Furthermore, the scheduler may insert buffers automatically to decouple the execution times of concatenated tasks when access conflicts (hazards) are detected.
- Furthermore, said scheduler sends the measured attributes to a backend.
- Within the backend and/or in the system, the attributes can be used to derive a schedule when all programs (concatenations of tasks) are known, or when programs are added. The derived schedule will provide the estimated execution time and power consumption of a number of programs/tasks running distributed in the heterogeneous system (in parallel and asynchronously) but managed through the scheduler.
- Within the system, tasks running outside of the confines of the estimated schedule (execution duration, power consumption) can be analyzed in the backend by transmitting the task's input data and the generated schedule to the backend. The estimations aggregated in the backend can then be adapted, or the task can be optimized to better handle certain conditions/exceptions.
- Within the backend, aggregated attribute-data is used to provide an accurate simulation of the execution within the system.
As previously described, individual applications will attempt to run multiple consecutive tasks through a common scheduler. Tasks have dependencies on each other and on the resources needed to execute.
- In the very first step, an application will allocate an instance of each task through the framework, with the desired configuration it needs. Needed inputs for the execution of a task are also stated.
- For every application, the scheduler maintains a list of consecutive tasks, i.e. the order of execution is determined by the precondition of input data. Any data that is the output of another task must be available before the task can be executed in every cycle.
Hereby, an order of execution of tasks is described as a chain, or even a directed acyclic graph.

- For every application, the scheduler will traverse the enqueued graph of tasks.
- Tasks carry information about the needed/reserved hardware resources (execution units, bandwidth). The scheduler will only trigger the execution of a task if the required resources are available, and will reserve these resources until the execution of the task is complete (see the sketch after this list).
- After completing a cycle, i.e. once all tasks from all applications have executed once, an initial schedule is available. While the executions of cycles may overlap and are run over and over again, the schedule is generated for every complete run.
- The schedule is described by the start point of execution as well as the execution time. Metrics such as CPU load, memory bandwidth and power consumption are also tracked.
- For every complete cycle of the schedule, the measured data becomes more representative of an individual task's execution time and contribution to resource usage. This is especially true if applications and their tasks are executed in different combinations with other applications.
- Resulting schedules may be uploaded to a backend. It shall be possible to narrow/filter data specifically for worst-case results measured or for an arbitrary deviation from average values. Input data used for the execution of an individual task, or of a chain of tasks, may also be uploaded to the backend to recreate the situation encountered in the field.
- Mitigation measures for updating a dynamic scheduling plan may include at least one of the following:

  o Scoreboarding,

  o the Tomasulo algorithm,

  o elimination of task processing/data transfers when task inputs were unchanged (propagation of unchanged outputs),

  o re-ordering of tasks / attempts to optimize a schedule by applying known patterns to it whilst considering predefined profiles/constraints (e.g. optimize the task order, or task variants, according to a "Low Power Usage" profile).
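As referenced above, the resource-gated dispatch (a task is only triggered when every hardware resource it declares is free, and the resources stay reserved until completion) can be sketched as follows; the resource and task names are hypothetical:

```python
tasks = [
    {"name": "crop",   "needs": {"gpu", "ded_ddr"}},
    {"name": "encode", "needs": {"video_enc", "shared_ddr"}},
    {"name": "copy",   "needs": {"dma", "shared_ddr"}},  # conflicts with encode
]
free_resources = {"gpu", "ded_ddr", "video_enc", "shared_ddr", "dma"}

def try_dispatch(task, free):
    if task["needs"] <= free:      # all required resources available?
        free -= task["needs"]      # reserve until the task completes
        print(f"started {task['name']}")
        return True
    print(f"deferred {task['name']} (resources busy)")
    return False

for t in tasks:
    try_dispatch(t, free_resources)
```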
Overall, the example shows how an asynchronous stream-data processing framework can be provided.

CLAIMS:
1. Method for scheduling software tasks (21) on at least one processing system (14) of a given system type, wherein each processing system (14) is situated in a respective vehicle (11) and the tasks (21) are executed in different driving situations (33) and a scheduler (23) determines a concatenation of the tasks (21) on the basis of task data, wherein - the task data at least describe respective input data (28) needed by some or all of the tasks (21) for being executed and respective output data (27) produced by some or all of the tasks (21) when executed, wherein the tasks (21) are concatenated in that the output data (27) of at least some of the tasks (21) correspond to the input data (28) of at least one respective other task (21), and the scheduler (23) determines a scheduling plan (25), wherein the scheduling plan (25) allocates the tasks (21) to processing units of the respective processing system (14) for executing the tasks (21), wherein the allocation corresponds to the concatenation of the tasks (21), characterized in that the scheduling plan (25) is provided to the at least one processing system (14) for executing a respective instance of the tasks (21) in the respective processing system (14) according to the scheduling plan (25) and from the at least one processing system (14) respective analysis data (30) are received wherein the analysis data (30) describe a respective performance value describing at least one predefined performance attribute (32) for some or each of the tasks (21) and/or for the processing system (14), and the scheduler (23) generates an update of the scheduling plan (25) by applying at least one mitigation measure that adapts the allocation of the tasks (21) to the processing units in dependence on the analysis data (30), wherein the respective mitigation measure (MM) is applied if the at least one performance attribute (32) is improved according to a predefined optimization criterion as compared to the received analysis data (30).

2. Method according to claim 1, wherein as one mitigation measure (MM) the scheduler (23) links at least two tasks (21) through a queue buffer (26) and/or a double-buffer (26) for providing the output data (27) of one of these tasks (21) as input data (28) to another one of these tasks (21), if an access conflict is detected.
3. Method according to any of the preceding claims, wherein the at least one processing attribute comprises a power consumption and/or temperature and as one mitigation measure (MM) the scheduler (23) transfers the execution of at least one of the tasks (21) from a first processing unit to another second processing unit for reducing a load on the first processing unit and/or for switching off the first processing unit.

4. Method according to any of the preceding claims, wherein the at least one processing attribute comprises a memory bandwidth for a data throughput between at least two tasks (21) and as one mitigation measure (MM) the scheduler (23) introduces memory remapping for avoiding copy operations of data between these tasks (21).

5. Method according to any of the preceding claims, wherein on the basis of the scheduling plan (25) confines for the analysis data (30) are estimated on the basis of a predefined system model for the given system type and input data (28) exchanged between tasks (21) and/or situation data describing the current driving situation (33) is recorded and on the basis of the received analysis data (30) at least one task (21) that is running outside the confines is identified and the input data (28) and/or situation data that caused this behavior of the task (21) are provided to the at least one mitigation measure (MM).

6. Method according to any of the preceding claims, wherein the scheduler (23) is a master scheduler (23) in a backend computing system (12) or a distributed scheduler (23) comprising the master scheduler (23) and a respective local scheduler (23) in each of the processing systems (14) and several processing systems (14) in different vehicles (11) are provided with the scheduling plan (25) such that the updated scheduling plan (25) is provided to all of the vehicles (11), wherein the scheduling plan (25) considers driving situations (33) that not all of the vehicles (11) have experienced.
7. Method according to any of the preceding claims, wherein the optimization criterion comprises that in comparison to the analysis data (30) a number of outliers and/or an average of the performance values of the at least one performance attribute (32) are reduced.
8. Method according to any of the preceding claims, wherein a process model of the executed tasks (21) is generated that describes the execution of the tasks (21) according to the analysis data (30) and/or a simulator for the system type is configured using the analysis data (30) and a simulation of the execution of the tasks (21) is performed and estimates for the analysis data (30) are generated for at least one predefined situation data set that describes a respective driving situation (33).
9. Method according to any of the preceding claims, wherein the scheduling plan (25) is generated as an acyclic graph from the task data.
10. Method according to any of the preceding claims, wherein the tasks (21) are executed asynchronously on the respective processing unit.

11. Method according to any of the preceding claims, wherein the at least one performance attribute (32) comprises: latency, throughput, power consumption, temperature, access collisions.
12. Method according to any of the preceding claims, wherein different operating modes, in particular a low power mode and/or a high performance mode, are pre-defined for the system type as respective processing constraints and at least one version of the scheduling plan (25) is evaluated in regard of its suitability for the different processing constraints and if this version fulfills a predefined suitability criterion with regard to the processing constraints, plan data describing that version of the scheduling plan (25) are stored and used as scheduling plan (25), when the corresponding operating mode is activated in a processing system (14).
13. Method according to any of the preceding claims, wherein a cyclic execution of the tasks (21) is performed and for some or each cycle of the cyclic execution, individual analysis data (30) are provided and the scheduling plan (25) is iteratively adapted to a respective current driving situation (33) by performing the update after each cycle or after each predefined number of cycles.
14. Method according to any of the preceding claims, wherein the tasks (21) are designed for processing stream data (18) of a data stream (15) and the tasks (21) process the data stream (15) in real-time, that is at the speed that new stream data (18) arrive and processed stream data (18) are passed on, wherein the processing is done on a subset of the overall stream data (18) at a time.

15. Control framework (10) for at least one vehicle (11), wherein for each vehicle (11) a processing system (14) comprising processing units is provided and the control framework comprises at least one processor circuitry that is designed to provide a local scheduler (23) in the respective processing system (14) of the vehicles (11) or a master scheduler (23) in a stationary backend computing system (12) or a distributed scheduler (23) comprising the respective local scheduler (23) and the master scheduler (23), wherein the at least one processor circuitry is designed to perform a method according to one of the preceding claims.
PCT/EP2021/059568 2021-04-13 2021-04-13 Method for scheduling software tasks on at least one heterogeneous processing system using a backend computer system, wherein each processing system is situated in a respective vehicle of a vehicle fleet, and control framework for at least one vehicle WO2022218510A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21719102.2A EP4298513A1 (en) 2021-04-13 2021-04-13 Method for scheduling software tasks on at least one heterogeneous processing system using a backend computer system, wherein each processing system is situated in a respective vehicle of a vehicle fleet, and control framework for at least one vehicle
PCT/EP2021/059568 WO2022218510A1 (en) 2021-04-13 2021-04-13 Method for scheduling software tasks on at least one heterogeneous processing system using a backend computer system, wherein each processing system is situated in a respective vehicle of a vehicle fleet, and control framework for at least one vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/059568 WO2022218510A1 (en) 2021-04-13 2021-04-13 Method for scheduling software tasks on at least one heterogeneous processing system using a backend computer system, wherein each processing system is situated in a respective vehicle of a vehicle fleet, and control framework for at least one vehicle

Publications (1)

Publication Number Publication Date
WO2022218510A1 true WO2022218510A1 (en) 2022-10-20

Family

ID=75539325

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/059568 WO2022218510A1 (en) 2021-04-13 2021-04-13 Method for scheduling software tasks on at least one heterogeneous processing system using a backend computer system, wherein each processing system is situated in a respective vehicle of a vehicle fleet, and control framework for at least one vehicle

Country Status (2)

Country Link
EP (1) EP4298513A1 (en)
WO (1) WO2022218510A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4972314A (en) 1985-05-20 1990-11-20 Hughes Aircraft Company Data flow signal processor method and apparatus
WO2009029549A2 (en) 2007-08-24 2009-03-05 Virtualmetrix, Inc. Method and apparatus for fine grain performance management of computer systems
WO2014207759A2 (en) 2013-06-20 2014-12-31 Tata Consultancy Services Limited System and method for distributed computation using heterogeneous computing nodes
WO2019239522A1 (en) * 2018-06-13 2019-12-19 株式会社日立製作所 Control controller and vehicle control system
US20210049757A1 (en) * 2019-08-14 2021-02-18 Nvidia Corporation Neural network for image registration and image segmentation trained using a registration simulator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Autonomic Communication", 3 August 2009, SPRINGER US, Boston, MA, ISBN: 978-0-387-09752-7, article ANTHONY RICHARD ET AL: "Autonomic Middleware for Automotive Embedded Systems", pages: 169 - 210, XP055863537, DOI: 10.1007/978-0-387-09753-4_7 *

Also Published As

Publication number Publication date
EP4298513A1 (en) 2024-01-03


Legal Events

Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21719102; Country of ref document: EP; Kind code of ref document: A1)

WWE Wipo information: entry into national phase (Ref document number: 2021719102; Country of ref document: EP)

ENP Entry into the national phase (Ref document number: 2021719102; Country of ref document: EP; Effective date: 20230928)

NENP Non-entry into the national phase (Ref country code: DE)