WO2012089579A1

WO2012089579A1 - Method and device for processing data elements having minimal latency

Info

Publication number: WO2012089579A1
Application number: PCT/EP2011/073556
Authority: WO
Inventors: Tobias SCHÜLE
Original assignee: Siemens Aktiengesellschaft
Priority date: 2010-12-30
Filing date: 2011-12-21
Publication date: 2012-07-05
Also published as: DE102011007603A1; WO2012089579A9

Abstract

The invention relates to a data processing device (1) for the data processing of data elements of a serial data stream (DS), wherein a received data element of the serial data stream (DS) is processed in parallel by a plurality of data processing units (2), which have access to a priority queue (3), in several stages according to the pipelining principle using tasks (T), wherein a prioritized execution of the tasks (T) is carried out in accordance with execution priorities (AP) of the tasks (T) indicated in the priority queue (3), the execution priorities being derived in each case from the position of the data element to be processed by the tasks (T) within the received data stream (DS).

Description

Method and apparatus for processing data elements with minimal latency

The invention relates to a method and a device for data processing of data elements with minimal latency. Data processing devices increasingly consist of several data processing units or processor cores operating in parallel. In order to exploit the performance of such multicore processors or multicore data processing units, the calculations to be carried out on them are broken down as far as possible into independent parts, so that these parts can be processed in parallel. For many applications, in particular in the field of digital signal processing, the data processing therefore takes place according to the so-called assembly line principle. In this case, a sequence of calculations is decomposed in such a way that different elements of a data stream can be processed in parallel in substeps. This is also called "pipelining".

FIG. 1 shows by way of example the execution of a processing pipeline consisting of three stages X, Y, and Z for data elements according to the prior art.

Pipelining is increasingly being used in software development. While the allocation of pipeline stages to the resources or functional units is fixed in hardware development, the problem of optimally mapping the calculation stages to the various processor cores or data processing units of the multicore processor arises in software development. For systems that are subject to real-time requirements, it must also be ensured that the result of a calculation is available in good time. Therefore, a parallel execution of a pipeline is not just about increasing data throughput, but also about minimizing or limiting the latency for the data processing of a single data element of a received data stream. Another problem is that the execution times of the pipeline stages typically vary widely. This can lead to unwanted effects or congestion, which increases the latency considerably.

It is therefore an object of the present invention to provide a method and a device for data processing of data elements of a serial data stream with low maximum latency.

This object is achieved by a data processing device having the features specified in claim 1.

The invention provides a data processing device for data processing of data elements of a serial data stream,

wherein a received data element of the serial data stream is processed in parallel by tasks, in a multi-stage manner according to the assembly line principle, by a plurality of data processing units having access to a priority queue,

wherein a prioritized execution of the tasks takes place in dependence on execution priorities of the tasks specified in the priority queue, which are in each case derived from the position of the data element to be processed by the task within the received data stream.

In the case of the data processing device according to the invention, there is thus a prioritized execution of the tasks as a function of the point in time at which a data element of the data stream is fed into the pipeline. In this way, it is avoided that data elements remain unnecessarily long in the processing pipeline and are overtaken by subsequent data elements. The method according to the invention thus reduces the maximum latency of a data element. The maximum latency of software-implemented pipelines in parallel systems is reduced in this way. In particular, the assignment of the sub-steps of a calculation to be carried out to the data processing units or processor cores can take place at runtime. The data processing device according to the invention is particularly suitable for applications with real-time requirements, in particular for processing audio, video or sensor data.

In one possible embodiment of the data processing device according to the invention, the data processing device has at least one data interface for receiving the serial data stream.

In one possible embodiment, the received data elements of the data stream receive a generated time stamp when they are received to identify their position within the data stream.

In a further possible embodiment of the data processing device according to the invention, the received data elements of the data stream receive a generated data element number when they are received to identify their position within the data stream.

In one possible embodiment of the data processing device according to the invention, the execution priority of a task is calculated by a calculation unit by means of a prioritization function as a function of the position of the data element to be processed by the task within the data stream.

In one possible embodiment of the data processing device according to the invention, the device has a global priority queue for all data processing units.

In an alternative embodiment of the data processing device according to the invention, each data processing unit has its own local priority queue. In one possible embodiment of the data processing device according to the invention, the priority queue contains the tasks to be executed sequentially by the data processing unit and their respective execution priorities.

In one possible embodiment of the data processing device according to the invention, the priority queue also contains, for the task to be executed, an address pointer to a data memory address of a data memory at which the data of the data element to be processed by the task are stored.

In one possible embodiment of the data processing device according to the invention, the device has a task scheduler, which inserts the tasks and their calculated execution priorities into at least one priority queue.

In a further possible embodiment of the data processing device according to the invention, the assignment of the tasks to different data processing units is performed by the task scheduler at runtime of the data processing.

In a further possible embodiment of the data processing device according to the invention, the serial data stream comprises data elements which contain audio data, video data or sensor data which are processed in real time by the data processing device.

In one possible embodiment of the data processing device according to the invention, if a local priority queue of a data processing unit is empty, a task with the highest execution priority is automatically transferred from another local priority queue to the empty priority queue.

In one possible embodiment of the data processing device according to the invention, a task generates at least one successor task after execution by a data processing unit and transfers it to the task scheduler.

In one possible embodiment of the data processing device according to the invention, the execution priority is derived from the reception order of the received data element within the data stream, wherein tasks for processing previously received data elements receive a higher execution priority.

In one possible embodiment of the data processing device according to the invention, the data processing units are processors which execute the tasks according to their respective execution priorities.

The invention further provides a method with the features specified in claim 15.

The invention provides a method for processing data elements with low maximum latency,

wherein the data elements are received in a serial data stream and are processed in parallel by tasks, in multiple stages according to the assembly line principle, by a plurality of data processing units having access to a priority queue,

wherein a prioritized execution of the tasks takes place in dependence on the execution priorities of the tasks specified in the priority queue, which are in each case derived from the position of the data element to be processed by the task within the received data stream.

In the following, possible embodiments of the inventive method and the inventive device for data processing of data elements with low maximum latency are described in more detail with reference to the accompanying figures.

FIG. 1: a diagram for explaining a multi-stage pipeline data processing according to the prior art;

FIG. 2: a block diagram illustrating an embodiment of the data processing device according to the invention;

FIG. 3: a diagram for explaining a possible data structure of a priority queue used in the device and the method according to the invention;

FIG. 4: a flow diagram illustrating an embodiment of the inventive method for processing data elements with low maximum latency;

FIG. 5: a diagram illustrating a multi-stage pipeline data processing by various tasks to explain the operation of the method according to the invention;

FIG. 6: an example of the execution of a data processing pipeline in a system with two data processing units, in which the execution is carried out in a conventional manner;

FIG. 7: a diagram illustrating a prioritized execution of a data processing pipeline in a system with two data processing units when using the method according to the invention for processing data elements with low maximum latency.

FIG. 2 shows schematically an embodiment of a data processing device 1 according to the invention for data processing of data elements of a serial data stream DS. In the illustrated embodiment, the data processing device includes a plurality of data processing units 2-1, 2-2, ... 2-N. These data processing units 2-i are, for example, processor cores. In the embodiment shown in FIG. 2, each data processing unit or each processor core 2-i has its own local priority queue 3-i. In the illustrated embodiment, the data processing device or multicore processor 1 according to the invention comprises N data processing units 2-i or processors.

The data processing device 1 further includes, in the illustrated embodiment, a task scheduler 4 which inserts a task into the associated priority queue. The data processing device 1 has at least one data interface for receiving a serial data stream DS. The serial data stream DS can consist of a multiplicity of successive data elements. By means of the data processing device 1, a received data element of the serial data stream DS is processed in parallel by the data processing units 2-i in a multi-stage manner according to the assembly line principle by means of tasks T. The data processing units 2-i each have access to a priority queue PWS. In the embodiment shown in FIG. 2, each data processing unit 2-i has its own local priority queue 3-i. In an alternative embodiment, a global priority queue PWS that can be accessed by all the data processing units 2-i can also be provided in the data processing device 1. In the case of the data processing device 1 according to the invention, a prioritized execution of the tasks T occurs as a function of execution priorities AP of the tasks T specified in the priority queue. These execution priorities AP are in each case derived from the position, within the received data stream DS, of the data element to be processed by the task T.
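The alternative with a single global priority queue can be illustrated with a minimal Python sketch. It is not part of the application: it assumes that the data processing units are modelled as threads, that Python's thread-safe `queue.PriorityQueue` stands in for the global priority queue PWS, and that the element number itself serves as the sort key. The names `submit`, `unit`, `STOP` and `log` are hypothetical, and the sketch does not enforce any ordering of the last stage.

```python
import queue
import threading

STAGES = "XYZ"                        # three pipeline stages, as in FIG. 5
STOP = float("inf")                   # hypothetical shutdown marker, sorts last

global_queue = queue.PriorityQueue()  # thread-safe, smallest entry first
log = []                              # (stage, element number) in execution order
log_lock = threading.Lock()


def submit(stage_index, number):
    # Sorting by element number means tasks of earlier elements are taken first.
    global_queue.put((number, stage_index))


def unit():
    """Loop of one data processing unit working on the shared queue."""
    while True:
        number, stage_index = global_queue.get()
        if number == STOP:
            global_queue.task_done()
            return
        with log_lock:
            log.append((STAGES[stage_index], number))   # stands in for real work
        if stage_index + 1 < len(STAGES):
            submit(stage_index + 1, number)              # successor task
        global_queue.task_done()


for n in range(1, 5):                 # four data elements enter stage X
    submit(0, n)

units = [threading.Thread(target=unit) for _ in range(2)]
for u in units:
    u.start()
global_queue.join()                   # wait until all 12 tasks have run
for _ in units:
    global_queue.put((STOP, 0))       # one shutdown marker per unit
for u in units:
    u.join()
```

Every access to the shared queue is synchronized internally by `queue.PriorityQueue`, which is the communication and synchronization cost that a global queue entails.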

In one possible embodiment, the data processing device 1 has a unit for identifying the position of a received data element within the received data stream DS. In one possible embodiment, the received data elements of the data stream DS are provided with a generated time stamp when they are received, which indicates the time of their reception by the data processing device 1. In one possible embodiment variant, the received data elements are given a generated data element number when they are received in order to identify their position within the data stream DS. The data elements can be any data elements, for example also data packets.
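As a minimal sketch of such position tagging, the following Python fragment attaches both a data element number and a reception time stamp to each element. The names `TaggedElement` and `tag_stream` are hypothetical, and the use of a monotonic clock and of numbering starting at 1 are assumptions made for illustration only.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass(frozen=True)
class TaggedElement:
    number: int        # position in the stream, starting at 1
    timestamp: float   # reception time taken from a monotonic clock
    payload: Any       # the data element itself, e.g. a data packet


def tag_stream(stream: Iterable[Any]) -> Iterator[TaggedElement]:
    """Attach a data element number and a time stamp to each received element."""
    counter = itertools.count(1)
    for payload in stream:
        yield TaggedElement(next(counter), time.monotonic(), payload)


tagged = list(tag_stream(["frame-a", "frame-b", "frame-c"]))
```

Either field identifies the position of an element within the stream; the element number has the practical advantage that two elements can never receive the same value.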

In a possible embodiment variant, the data processing device 1 according to the invention additionally contains a calculation unit which calculates the execution priority AP of a task T as a function of the position of the data element to be processed by the task T within the data stream DS by means of a prioritization function PF. In one possible embodiment, the prioritization function PF is configurable.

The priority queue contains the tasks T to be sequentially executed by the data processing unit 2-i and their respective execution priorities AP.

FIG. 3 shows, by way of example, a data structure of a priority queue, as can be used in the data processing device 1 according to the invention. In the exemplary embodiment illustrated in FIG. 3, the priority queue contains the tasks T_i to be executed sequentially by the respective data processing unit 2-i and their respective execution priorities AP_i. In the embodiment illustrated in FIG. 3, the priority queue furthermore contains, for each task T to be executed, an address pointer P to a data memory address of a data memory at which the data of the respective data element to be processed by the task T are stored. If the data element is, for example, a video image to be processed, the address pointer P indicates the memory address at which the video image to be processed by the task T is located.

In order to achieve a particularly high data processing speed, the data of the respective data element can also be stored directly in the priority queue instead of the address pointer P. In this embodiment variant, the priority queue contains the user data of the data element to be processed by the respective task T instead of an address pointer P. In this way, the time for accessing the data memory during data processing can be saved.

The task scheduler 4 illustrated in FIG. 2 inserts the task T and its calculated execution priority AP into the priority queue, as shown by way of example in FIG. 3. In one possible embodiment, the assignment of the tasks T to different data processing units 2-i by the task scheduler 4 can take place at runtime of the data processing. The serial data stream DS received by the data processing device 1 consists of data elements which each contain, for example, audio data, video data or sensor data. This data is preferably processed by the data processing device 1 in real time.
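The queue entries described for FIG. 3 can be sketched in Python as follows. This is an illustrative assumption rather than the structure of the application: the class name `PriorityQueue`, the dictionary `data_memory` that stands in for the data memory, the addresses and the priority values 5 and 9 are all hypothetical.

```python
import heapq
import itertools


class PriorityQueue:
    """Entries of (execution priority AP, task T, pointer P); highest AP first."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # keeps insertion order among equal APs

    def insert(self, priority, task, pointer):
        # heapq is a min-heap, so the priority is stored negated.
        heapq.heappush(self._heap, (-priority, next(self._tie), task, pointer))

    def extract(self):
        neg_priority, _, task, pointer = heapq.heappop(self._heap)
        return -neg_priority, task, pointer

    def __len__(self):
        return len(self._heap)


# Hypothetical data memory: address -> data of one data element.
data_memory = {0x100: "video image A", 0x200: "video image B"}

pws = PriorityQueue()
pws.insert(priority=5, task=str.upper, pointer=0x200)
pws.insert(priority=9, task=str.lower, pointer=0x100)

priority, task, pointer = pws.extract()   # entry with the highest AP
result = task(data_memory[pointer])       # the task reads its data via the pointer
```

Storing the user data in the entry instead of `pointer` would remove the `data_memory` lookup, at the cost of larger queue entries.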

In one possible embodiment, if a local priority queue 3-i of a data processing unit 2-i is empty, a task T having the highest execution priority is automatically transferred from another local priority queue to the empty priority queue. In the multicore system or data processing device 1 shown in FIG. 2, the tasks T are assigned with the aid of the task scheduler 4, which distributes executable tasks T to the priority queues. The order of execution of the tasks T does not depend on the time of arrival of the tasks, but on their execution priority AP. When a priority queue is empty, the task with the highest execution priority is, as it were, stolen from another priority queue. This is also called "work stealing".

In one possible embodiment, the execution priority AP can be the negated value of the data element number. The task for processing the first data element thus has the highest execution priority "-1", the task for processing the second data element the second-highest execution priority "-2", and so on. In this way it is ensured that data elements that are fed into the data processing pipeline early are also processed with priority and leave the pipeline early. This minimizes the maximum latency. The execution priority AP is derived in this embodiment from the reception order of the received data elements within the data stream DS. In this case, tasks T for processing data elements received earlier receive a higher execution priority, for example the highest execution priority "-1".
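The effect of using the negated data element number as execution priority can be shown with a short Python sketch. The function name `execution_priority` and the list of ready tasks are hypothetical; the sketch assumes a single queue and uses the standard `heapq` module, which is a min-heap, so the priority is negated again when it is stored.

```python
import heapq


def execution_priority(element_number: int) -> int:
    """AP as the negated data element number: element 1 -> -1, element 2 -> -2."""
    return -element_number


# Tasks that happen to be ready, in arbitrary arrival order: (stage, element number).
ready = [("Y", 3), ("X", 4), ("Z", 1), ("Y", 2)]

heap = []
for stage, number in ready:
    # Negate the AP so that the highest priority becomes the smallest heap key.
    heapq.heappush(heap, (-execution_priority(number), stage, number))

order = []
while heap:
    _, stage, number = heapq.heappop(heap)
    order.append(f"{stage}{number}")
# order is now ["Z1", "Y2", "Y3", "X4"]
```

Regardless of the order in which the tasks arrived, the task belonging to the earliest received data element is executed first.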

FIG. 4 shows a flowchart for illustrating an embodiment of the method according to the invention for processing data elements with a low maximum latency. The process illustrated in FIG. 4 is a process that runs on each of the data processing units or processor cores 2-i.

In a step S1, the respective data processing unit 2-i checks whether the associated priority queue is empty. If so, in step S2 a task T is transferred from another priority queue to its own priority queue, preferably the task T with the highest execution priority AP. Conversely, if it is determined that the unit's own priority queue is not empty, the data processing unit 2-i extracts the task T having the highest execution priority AP from its own priority queue in step S3.

Subsequently, the task T transferred in step S2 or the task T extracted in step S3 is executed by the data processing unit 2-i in step S4.

In a further step S5, at least one successor task is generated and transferred to the task scheduler 4. In this case, an entry for a priority queue PWS is also generated for the generated successor task.

In a step S6, the task scheduler 4 outputs the generated PWS entry of the successor task to the priority queue, and the process returns to step S1.
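Steps S1 to S6 can be sketched as a single-threaded Python simulation with two local queues. Several points are assumptions made for illustration and are not taken from the application: the units take turns instead of running concurrently, a stolen task is executed directly rather than first being placed in the unit's own queue, the hypothetical `Scheduler` places each task into the currently shortest queue, and the execution priority is the negated element number.

```python
import heapq

STAGES = "XYZ"  # three pipeline stages, as in FIG. 5


class LocalQueue:
    """Priority queue of (stage, element number) tasks; highest AP first."""

    def __init__(self):
        self._heap = []

    def insert(self, priority, stage, number):
        # heapq is a min-heap, so the priority is negated for storage.
        heapq.heappush(self._heap, (-priority, stage, number))

    def extract(self):
        _, stage, number = heapq.heappop(self._heap)
        return stage, number

    def top_priority(self):
        return -self._heap[0][0]

    def __len__(self):
        return len(self._heap)


class Scheduler:
    """Hypothetical scheduler: puts each task into the currently shortest queue."""

    def __init__(self, queues):
        self.queues = queues

    def submit(self, stage, number):
        target = min(self.queues, key=len)
        target.insert(-number, stage, number)   # AP = negated element number


def worker_step(own, scheduler, executed):
    """One pass through steps S1 to S6 for the unit that owns `own`."""
    if len(own) == 0:                                   # S1: own queue empty?
        victims = [q for q in scheduler.queues if q is not own and len(q) > 0]
        if not victims:
            return False                                # no task anywhere
        victim = max(victims, key=LocalQueue.top_priority)
        stage, number = victim.extract()                # S2: steal highest-AP task
    else:
        stage, number = own.extract()                   # S3: take own highest-AP task
    executed.append(f"{stage}{number}")                 # S4: "execute" the task
    successor = STAGES.index(stage) + 1
    if successor < len(STAGES):                         # S5: generate successor task
        scheduler.submit(STAGES[successor], number)     # S6: scheduler enqueues it
    return True


queues = [LocalQueue(), LocalQueue()]
scheduler = Scheduler(queues)
for n in (1, 2, 3):
    scheduler.submit("X", n)

executed = []
busy = True
while busy:
    # The units take turns here; real units would run concurrently.
    results = [worker_step(q, scheduler, executed) for q in queues]
    busy = any(results)
```

After the loop, `executed` lists all nine tasks, and for every data element the stages appear in the order X, Y, Z.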

FIG. 5 shows an example of a multi-stage execution of tasks in three stages X, Y, Z to illustrate the method according to the invention. For example, a first task T in stage X may consist in reading a video image from an image data memory. In a subsequent stage Y, the video image is then filtered by a further task T. Subsequently, in a further stage Z, the next task T displays the filtered video image on a screen.

FIG. 6 shows an example of a conventional execution of tasks with a processing pipeline in a system with two processor cores or data processing units. At each time t, the content of the conventional queues WS1, WS2 is indicated. In the illustrated example, the data processing pipeline consists of three stages X, Y, Z, as shown in FIG. 5. In the example shown, the first stage X requires two time units for execution, the second stage Y three time units, and the third stage Z only one time unit. The data elements of a data stream may be processed out of order, but must be brought back into the original order by the last stage Z, so that the semantics of the program are preserved. This means that, for example, Z_3 may not be performed before Z_2. As can be seen from the example according to FIG. 6, six time units are required for the data processing of the first data element, ranging from the execution of X_1 at time t = 0 to the execution of Z_1 at time t = 5. The latency for the processing of the second data element in the illustrated example is 11 time units, for the third data element 10 time units and for the fourth data element 9 time units. The maximum latency for the period under consideration in the illustrated example is thus 11 time units, namely the latency of the second data element. In that shown in Fig. 6

In practice, the parallel execution of a pipeline takes place with the aid of a task scheduler. As soon as a pipeline stage is ready for execution, i.e. new data is available, a task is generated and transferred to the task scheduler. The task generation takes place, for example, directly through the respective preceding pipeline stage. However, as can be seen in FIG. 1, if a processing stage is slower than the previous processing stage, congestion can arise because more data is being produced than consumed.

In comparison, FIG. 7 shows the prioritized execution of a data processing pipeline in a data processing device having two data processing units or processor cores, which uses the inventive method for processing data elements with a low maximum latency. At each instant t, the content of the priority queues PWS1 and PWS2 of the two data processing units or processor cores is indicated. As can be seen from FIG. 7, there is a latency of 6 time units for each of the first three data elements and of 9 time units for the fourth data element. The maximum latency for the period under consideration is thus only 9 time units when the method according to the invention is used, namely the latency of the fourth data element. As can be seen by comparing FIGS. 6 and 7, the method according to the invention reduces the maximum latency for the time period considered from 11 time units to 9 time units.

By means of the method according to the invention, a prioritized execution of the tasks T takes place as a function of the point in time at which a data element of the data stream DS is fed into the pipeline. This avoids data elements remaining in the pipeline for an unnecessarily long time and being overtaken by subsequent data elements. The inventive method thus reduces the maximum latency, which is particularly advantageous in real-time systems.

The method according to the invention is not limited to linear data processing pipelines. In some applications, it may be useful to use non-linear structures that allow duplication or splitting of the data streams. The method can be transferred to such non-linear structures.

In the implementation of the task scheduler 4, there are two possible variants.

One variant consists in providing a separate priority queue for each processor core or each data processing unit, as is also illustrated in FIG. 2. In this embodiment, the data processing units or processor cores can work largely independently of each other; the data processing units or processor cores involved are synchronized only during work stealing, i.e. when a task T is stolen from another priority queue for an empty priority queue.

In an alternative embodiment, a global priority queue is provided for all data processing units. An advantage of this embodiment variant is that the tasks T with the highest priority in the overall system are always processed first. However, this requires a certain amount of communication and synchronization. Furthermore, it is possible to provide combinations of local and global priority queues in the data processing device 1.

Further variants relate to the calculation of the execution priority AP. Instead of using a data element number, one can also use the position of the pipeline stage. This means that the stages at the end of the pipeline receive a higher execution priority than the stages at the beginning. Consequently, data elements that have already passed through several stages are given priority. Again, combinations of different approaches are possible.

The invention thus provides a method for reducing the maximum latency in parallel data stream processing devices or computer systems. In doing so, the maximum latency, i.e. the time between the entry of a data element into the pipeline and the exit of the processed data element from the pipeline, is reduced or minimized. The inventive method is particularly suitable for data processing of audio, video or sensor data in real time. The assignment of the tasks T is preferably carried out dynamically at runtime. The method according to the invention increases the data throughput. Furthermore, the fault tolerance against a failure of a data processing unit can be increased.

The prioritization function PF for calculating the execution priority AP of a task T is preferably configurable. This increases the flexibility of the data processing device 1 for various applications. In one possible embodiment, the execution priority AP of a task is calculated according to the position of the data element to be processed by the task T within the received serial data stream DS. In a variant embodiment, the calculation of the execution priority AP as a function of the position can be carried out with a configurable calculation function or prioritization function PF, which is located, for example, in a configuration memory of the data processing device 1. In a possible embodiment variant, the calculation function can be configured or adapted via a configuration interface.
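A configurable prioritization function PF and the variants mentioned above can be sketched in Python as follows. The dictionary `PRIORITIZATION_FUNCTIONS` merely stands in for a configuration memory, and the combined variant `stage_then_element` is one conceivable combination chosen for illustration; neither the names nor this particular combination are taken from the application.

```python
STAGE_INDEX = {"X": 0, "Y": 1, "Z": 2}   # position of each stage in the pipeline


def by_element_number(number, stage):
    return -number                        # earlier data elements first


def by_stage(number, stage):
    return STAGE_INDEX[stage]             # stages at the end of the pipeline first


def stage_then_element(number, stage):
    # One conceivable combination: later stages first, then earlier elements.
    return (STAGE_INDEX[stage], -number)


# Stands in for the configuration memory: the active PF is selected by name.
PRIORITIZATION_FUNCTIONS = {
    "element_number": by_element_number,
    "stage": by_stage,
    "stage_then_element": stage_then_element,
}


def execution_order(tasks, pf_name):
    """Return the (stage, element number) tasks with the highest AP first."""
    pf = PRIORITIZATION_FUNCTIONS[pf_name]
    return sorted(tasks, key=lambda t: pf(number=t[1], stage=t[0]), reverse=True)


tasks = [("X", 1), ("Z", 3), ("Y", 4), ("Y", 2)]
```

With these four tasks, `execution_order(tasks, "element_number")` starts with `("X", 1)`, whereas the two stage-based variants start with `("Z", 3)`; the combined variant additionally places `("Y", 2)` before `("Y", 4)`.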

The data processing device 1 according to the invention, as shown for example in FIG. 2, may be, for example, a network node of a network. A network node has at least one data interface for receiving a serial data stream DS. The data can be received via a wired or wireless interface of the data processing device 1 for data processing. In a possible embodiment variant, the data processed according to the pipeline principle are output by the data processing device 1, for example a network node, via a further data interface. The data processing device 1 shown in FIG. 2 may be a multi-core processor that can be used in any computer system, for example an embedded system. In one possible embodiment variant, the priority queues shown in FIG. 2 are integrated into the data processing units or processor cores 2-i.

Claims

1. Data processing device (1) for data processing of data elements of a serial data stream (DS), wherein a received data element of the serial data stream (DS) is processed in parallel in several stages by tasks (T),

wherein a prioritized execution of the tasks (T) takes place depending on the execution priorities (AP) of the tasks (T) specified in the priority queue (3), which are in each case derived from the position of the data element to be processed by the tasks (T) within the received data stream (DS).

2. Data processing device according to claim 1,

wherein the data processing device (1) has at least one data interface for receiving the serial data stream (DS),

whereby the received data elements of the data stream (DS) receive a generated time stamp or a generated data element number upon receipt to identify their position within the received data stream (DS).

3. Data processing device according to claim 1 or 2, wherein the execution priority (AP) of a task (T) is calculated by a calculation unit using a prioritization function (PF) depending on the position of the data element to be processed by the task (T) within the data stream (DS).

4. Data processing device according to one of the preceding claims 1 to 3,

wherein the data processing device (1) has a global priority queue (3) for all data processing units (2).

5. Data processing device according to one of the preceding claims 1 to 3,

wherein each data processing unit (2-i) has its own local priority queue (3-i).

6. Data processing device according to one of the preceding claims 1 to 5,

wherein the priority queue (3) contains the tasks (T) to be executed sequentially by the data processing unit (2) and their respective execution priorities (AP).

7. Data processing device according to claim 6,

wherein the priority queue (3) further contains an address pointer (P) for the task (T) to be executed to a data storage address of a data memory, at which the data of the data element to be processed by the task (T) is stored.

8. Data processing device according to one of the preceding claims 1 to 7,

wherein a task scheduler (4) is provided, which inserts the task (T) and its calculated execution priority (AP) into at least one priority queue (3).

9. Data processing device according to claim 8,

wherein the tasks (T) are assigned to different data processing units (2-i) by the task scheduler (4) at runtime of the data processing.

10. Data processing device according to one of the preceding claims 1 to 9,

wherein the serial data stream (DS) has data elements that contain audio data, video data or sensor data, which are processed in real time by the data processing device (1).

11. Data processing device according to one of the preceding claims 5 to 10,

wherein, if a local priority queue (3-i) of a data processing unit (2-i) is empty, a task (T) with the highest execution priority (AP) is automatically transferred from another local priority queue (3-i) into the empty priority queue (3-i).

12. Data processing device according to one of the preceding claims 8 to 11,

wherein a task (T) generates at least one follow-up task after its execution by a data processing unit (2) and hands it over to the task scheduler (4).

13. Data processing device according to one of the preceding claims 1 to 12,

whereby the execution priority (AP) is derived from the reception order of the received data element within the data stream (DS),

where tasks (T) for processing previously received data items are given a higher execution priority (AP).

14. Data processing device according to one of the preceding claims 1 to 13,

wherein the data processing units (2-i) are processors that execute the tasks (T) according to their execution priority (AP).

15. Method for processing data elements with low maximum latency,

wherein the data elements (DE) are received in a serial data stream (DS) and are processed in parallel by several data processing units (2-i), which have access to a priority queue (3), according to the assembly line principle by tasks (T), wherein a prioritized execution of the tasks (T) takes place depending on the execution priorities (AP) of the tasks (T) specified in the priority queue (3), which are in each case derived from the position of the data element to be processed by the task (T) within the received data stream (DS).