WO2017070900A1 - Method and apparatus for processing task in a multi-core digital signal processing system - Google Patents

Method and apparatus for processing task in a multi-core digital signal processing system

Info

Publication number
WO2017070900A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
ready
memory
data
determining
Application number
PCT/CN2015/093248
Other languages
French (fr)
Chinese (zh)
Inventor
范冰 (Fan Bing)
周卫荣 (Zhou Weirong)
李海龙 (Li Hailong)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2015/093248
Priority to CN201580083942.3A
Publication of WO2017070900A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • Embodiments of the present invention relate to the field of digital signal processors and, more particularly, to methods and apparatus for processing tasks in a multi-core digital signal processing system.
  • In the static task scheduling method of the related art, the software designer obtains the basic performance of each functional module from the software task graph and from performance simulation of each functional algorithm module, then matches this against the capabilities of the target hardware resources, deploying different software functions onto different hardware resources according to functional granularity and resource consumption. However, the static task scheduling method has limited application scenarios, high scheduling complexity, and low memory-resource utilization.
  • The dynamic task scheduling scheme in the related art adopts a resource pool with master-slave distributed scheduling: each processor carries a tailored operating system (OS) that can support creating tasks of different priorities and responding to external interrupts. The master core divides work into tasks of appropriate granularity and places them in a task cache pool; when a slave core is idle, it autonomously fetches a task from the master core and executes it.
  • However, each slave core must carry an operating system, and task switching and data loading occupy much of the slave-core load, so computing-resource and memory-resource utilization is low.
  • Embodiments of the present invention provide a method and apparatus for processing tasks in a multi-core digital signal processing system, which can determine the scheduling process at run time, dynamically allocate computing resources, improve the utilization of computing resources, and reduce system scheduling overhead.
  • According to a first aspect, a method for processing a task in a multi-core digital signal processing system is provided, including: determining a ready task in a task queue; determining a target computing unit that executes the ready task; and executing the ready task by the target computing unit while, at the same time, preparing data for a task to be executed through the target computing unit.
  • With this method, when a task is executed by one computing unit, data is simultaneously prepared for other tasks through that unit, so data loading and algorithm execution proceed in parallel, reducing the waiting cost of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • In one implementation of the first aspect, when the computing unit that executed the dependent task of the ready task is determined to be idle, that computing unit is determined as the target computing unit.
  • In this case, the ready task and its dependent task run on the same computing unit, so when the ready task executes there is no need to load its data again, which relieves congestion on the loading path.
  • In another implementation, before the ready task is executed by the target computing unit, the method further includes: determining a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, and moving that input data into the memory block.
  • In another implementation, determining the memory block in the near-end memory of the target computing unit includes: determining the memory block according to a fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to far-end memory.
  • In another implementation, the method further includes: when the target computing unit finishes executing the ready task, saving the output data of the ready task in the near-end memory.
  • In this way, both reads and writes during task execution hit the near-end memory, so execution does not stall waiting for data to arrive; moreover, allocating memory through a fixed resource pool reduces memory fragmentation, improves memory turnover efficiency, and saves memory.
  • In another possible implementation, determining the memory block according to the fixed-resource-pool algorithm includes: determining the number of memory blocks from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • By packing the memory data within a task in this way, memory-use efficiency can be further improved and memory waste reduced.
  • In another possible implementation, before determining a ready task in the task queue, the method further includes: abstracting the service that includes the ready task to obtain abstraction information, which includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • In another possible implementation, the task queue is a plurality of parallel task queues, and determining the ready task in the task queue includes: polling the parallel task queues in priority order to determine the ready task.
  • In another possible implementation, abstracting the service that includes the ready task includes: creating a buffer according to the needs of the service, and determining the data dependency information according to the buffer's identifier (ID).
  • According to a second aspect, an apparatus for processing a task in a multi-core digital signal processing system is provided, configured to perform the method of the first aspect or any possible implementation of the first aspect; specifically, the apparatus includes modules for performing that method.
  • According to a third aspect, a computer-readable medium is provided for storing a computer program, the computer program comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
  • According to a fourth aspect, a computer program product is provided, comprising computer program code which, when run by an apparatus for processing tasks in a multi-core digital signal processing system, causes the apparatus to perform the method of the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic structural diagram of an application system to which an embodiment of the present invention is applied;
  • FIG. 2 is a schematic diagram of the management modules included in the scheduler of the application system shown in FIG. 1 and their relationships;
  • FIG. 3 is a schematic diagram of data dependency relationships in an application system to which an embodiment of the present invention is applied;
  • FIG. 4 is a schematic diagram of the scheduling result when only one core is scheduled in an application system to which an embodiment of the present invention is applied;
  • FIG. 5 is a schematic diagram of scheduling results when three cores are scheduled in an application system to which an embodiment of the present invention is applied;
  • FIG. 6 is a schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention
  • FIG. 7 is another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention.
  • FIG. 8 is still another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention.
  • FIG. 9 is a schematic flowchart of a method for abstracting a service according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a method for implementing a processing task in a specific case according to an embodiment of the present invention.
  • FIG. 11 is a schematic flowchart of a method of determining a ready task and an idle operation unit according to an embodiment of the present invention
  • FIG. 12 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention
  • FIG. 13 is another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention
  • FIG. 14 is still another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention.
  • FIG. 15 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to another embodiment of the present invention.
  • The technical solutions of the embodiments of the present invention apply mainly to digital signal processing systems that require multi-core processing and involve large amounts of parallel computation, such as macro-base-station baseband chips and terminal chips.
  • The multi-core feature means that multiple computing modules are integrated on a single chip, including but not limited to multiple general-purpose processors, multiple IP cores, multiple dedicated processors, and the like; a system with more than one computing module is multi-core.
  • FIG. 1 is a schematic structural diagram of an application system (multi-core digital signal processing system) to which an embodiment of the present invention is applied.
  • the application system is composed of three parts: a main control layer, an execution layer, and an operation layer.
  • the main control layer carries user software, and completes high-level information interaction, process control, task decomposition, and task dependency definition.
  • the execution layer consists of three parts, the master core execution layer, the scheduler, and the slave core execution layer.
  • the main control core execution layer provides a software programming interface, submits commands to the scheduler and receives command feedback or callback notifications;
  • The scheduler is the hardware part and is responsible for task scheduling; its specific functions include dependency processing between tasks, memory management, task assignment, and data movement.
  • The scheduler is internally organized as multiple management modules: command management, command-queue management, event management, buffer descriptor management, shared-memory management, computation-memory management, computing-resource state management, and a scheduling master-control module.
  • The slave-core execution layer is the software part; it mainly receives task messages, calls the algorithm function library to perform the computation, and sends a task-end feedback message when the computation finishes.
  • the computing layer can be hardware or software, and is mainly responsible for processing tasks.
  • As an example, assume channel-processing segments with three priorities: Kernels 0 to 2 are high-priority processing, Kernels 3 to 5 are medium-priority processing, and Kernels 7 to 9 are low-priority processing.
  • As shown in FIG. 3, to mark the data flows being processed (the arrows indicate the direction of data flow), the data flows between the host (Host) and the device (Device) are labeled buffer input/output (Buff_In/Out) according to their direction, and the data flows between kernel processing stages are labeled intermediate buffers (Buff_M).
  • Data dependencies between different cores can be described as:
  • The inputs of Kernel_0/3/7 are Buff_In0/1/2 respectively, prepared by the Host (in a real application they may come from an external interface or from Hardware Accelerate Control (HAC) output).
  • After Kernel_2 finishes processing, the data is output to the Host (in real applications the Host usually needs to send the data processed by the digital signal processing (DSP) off-chip, or pass it to the HAC for further processing).
  • Kernel_2 depends on the output of Kernel_1 and also depends on the output of Kernel_5.
  • Kernel_4 depends on the output of Kernel_9 and also depends on the output of Kernel_3.
  • Kernel_8 depends on the output of Kernel_7 and also depends on the output of Kernel_3.
  • The output of Kernel_3 is used by Kernel_8 in addition to Kernel_4 (Kernel_8 may use only a part of it). A compact encoding of these dependencies is sketched below.
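As an illustration, the dependencies described above can be written down as a small table. The C sketch below only encodes what the text states; the struct and field names are hypothetical, not taken from the patent.

```c
/* Partial encoding of the FIG. 3 dependency graph described above.
 * producer == -1 means the input is prepared by the Host. */
typedef struct {
    int kernel;       /* consuming kernel */
    int producer_a;   /* first kernel it depends on, or -1 (Host) */
    int producer_b;   /* second kernel it depends on, or -1 (none) */
} KernelDep;

static const KernelDep deps[] = {
    { 0, -1, -1 },    /* Kernel_0 reads Buff_In0, prepared by the Host */
    { 3, -1, -1 },    /* Kernel_3 reads Buff_In1, prepared by the Host */
    { 7, -1, -1 },    /* Kernel_7 reads Buff_In2, prepared by the Host */
    { 2,  1,  5 },    /* Kernel_2 needs Kernel_1 and Kernel_5 output  */
    { 4,  9,  3 },    /* Kernel_4 needs Kernel_9 and Kernel_3 output  */
    { 8,  7,  3 },    /* Kernel_8 needs Kernel_7 and Kernel_3 output  */
};
```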
  • The scheduler schedules by priority according to the number of execution cores actually available, while ensuring that data dependencies are respected.
  • Assume the Host submits all kernels to the command queue (CommandQueue) and the input data Buff_In0/1/2 are all ready. If only one core can be scheduled, the scheduling result is shown in FIG. 4; if three cores can be scheduled, the scheduling result is shown in FIG. 5.
  • In FIG. 5, whether the dashed Kernel_2 and Kernel_4 are scheduled on DSP2 depends on the amount of input data: if the larger share of Kernel_4's input data resides on DSP2, Kernel_4 should be dispatched to DSP2; otherwise it should be dispatched to the core where the output of Kernel_3 resides.
  • a service refers to a program that processes data independently of hardware, and is a concept different from an operating system and a driver.
  • The service may be, for example, data channel estimation, a Fast Fourier Transformation (FFT), decoding, or another such operation.
  • A task is a software task: a piece of program that implements a function and usually needs to run on a processor core.
  • FIG. 6 is a schematic flow diagram of a method 100 of processing a task in accordance with an embodiment of the present invention.
  • the method 100 can be performed by the multi-core digital signal processing system shown in FIG. 1, as shown in FIG. 6, the method 100 includes:
  • S110: The multi-core digital signal processing system determines a ready task in a task queue. S120: The system determines a target computing unit capable of executing the ready task. S130: The ready task is executed by the determined target computing unit while, at the same time, data is prepared for a task to be executed through the target computing unit.
  • With this method, while a ready task is executed by the target computing unit, data is prepared for a task to be executed through the same unit, so data loading and computation proceed in parallel, reducing the waiting cost of data loading, improving parallelism between tasks, and reducing system scheduling overhead.
  • A ready task is a task whose preparation has been completed and which can start running.
  • the task to be executed can be understood as the task that needs to be executed after the ready task.
  • An arithmetic unit can be understood as a core.
  • Optionally, the task queue is a plurality of parallel task queues, and S110 is specifically: polling the parallel task queues in priority order to determine the ready task.
  • The multi-core digital signal processing system can create multiple parallel task queues; the queues are parallel to one another but have different priorities. After tasks are submitted to a queue, the tasks within each queue are executed serially, in first-in, first-out order.
  • Polling may proceed in descending order of priority: if a higher-priority queue contains no ready task, polling continues with the next priority level, and ends once a ready task is found or the lowest-priority queue has been polled, as sketched below.
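A minimal sketch of this polling order, assuming a hypothetical queue type and a simplified readiness check (not the patent's implementation):

```c
#include <stddef.h>

#define NUM_PRIORITIES 3          /* e.g. high, medium, low */
#define QUEUE_CAP      64

typedef struct Task Task;

typedef struct {
    Task  *items[QUEUE_CAP];      /* FIFO storage */
    size_t head, tail;            /* tasks run in first-in, first-out order */
} TaskQueue;

/* Simplified readiness check: treat the queue head as the candidate; a
 * real scheduler would also verify dependencies and input-data readiness. */
static Task *queue_peek_ready(TaskQueue *q) {
    return (q->head != q->tail) ? q->items[q->head % QUEUE_CAP] : NULL;
}

/* Poll the parallel queues from highest to lowest priority; return the
 * first ready task found, or NULL once the lowest-priority queue is polled. */
Task *find_ready_task(TaskQueue queues[NUM_PRIORITIES]) {
    for (int prio = 0; prio < NUM_PRIORITIES; prio++) {
        Task *t = queue_peek_ready(&queues[prio]);
        if (t != NULL)
            return t;
    }
    return NULL;
}
```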
  • Executing the ready task by the target computing unit while preparing data for the next task through the same unit can be understood as virtualizing one computing unit into two "ping-pong" logical resources: while one logical resource is running a task, the other can already be assigned a task and have its data prepared, reducing data waiting and improving computing-resource utilization.
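One way to picture this virtualization, as a sketch with hypothetical names:

```c
typedef struct Task Task;

typedef enum { SLOT_IDLE, SLOT_LOADING, SLOT_RUNNING } SlotState;

typedef struct {
    SlotState state;
    Task     *task;        /* task bound to this logical resource */
} LogicalSlot;

/* One physical computing unit seen as two "ping-pong" logical resources:
 * while slot[0] is RUNNING a task, slot[1] can be LOADING the next task's
 * input data into near-end memory, so computation never waits for data. */
typedef struct {
    LogicalSlot slot[2];
} ComputeUnit;
```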
  • Optionally, S120 is specifically: when the computing unit that executed the dependent task of the ready task is determined to be idle, determining that computing unit as the target computing unit.
  • A dependent task of the ready task is a task whose output data serves as input data of the ready task.
  • The multi-core digital signal processing system may select an idle resource according to where the data is located. Preferably, the system records which computing unit processed the dependent task of the ready task; after identifying that unit, it allocates the ready task to it, that is, the computing unit that executed the dependent task is determined as the target computing unit. Since the dependent task and the ready task then run on the same computing unit, the data does not need to be loaded again for the ready task, which relieves congestion on the loading path.
  • Otherwise, an idle computing unit may be selected at random from the other idle computing units as the target computing unit.
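The selection rule just described, prefer the unit that ran the dependent task and otherwise pick any idle unit, might look like the following sketch; the bookkeeping fields and the idle query are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Task Task;
struct Task {
    Task *dependent;    /* task whose output this task consumes, or NULL */
    int   executed_on;  /* unit that ran the dependent task, or -1 */
};

/* Hypothetical status query; stubbed here so the sketch is self-contained. */
static bool unit_is_idle(int unit_id) { (void)unit_id; return true; }

/* Prefer the unit whose near-end memory already holds this task's input
 * (the unit that executed the dependent task); otherwise pick any idle unit. */
int select_target_unit(const Task *ready, int num_units) {
    if (ready->dependent != NULL &&
        ready->dependent->executed_on >= 0 &&
        unit_is_idle(ready->dependent->executed_on))
        return ready->dependent->executed_on;

    for (int u = 0; u < num_units; u++)
        if (unit_is_idle(u))
            return u;       /* any idle unit will do */
    return -1;              /* no unit idle right now */
}
```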
  • Optionally, before the ready task is performed by the target computing unit, the method 100 further includes:
  • S140: Determine a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task; the input data corresponding to the ready task is then moved into that memory block.
  • The data required to execute the ready task may reside on another computing unit or in external memory (for example, Double Data Rate (DDR) memory, L3 cache, etc.).
  • Optionally, the memory block is determined according to a fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to remote memory.
  • The memory space can be ranked by its distance from the computing unit and then handled level by level.
  • The memory-allocation algorithm for the near-end memory uses a fixed resource pool, which reduces memory fragmentation and improves allocation and release efficiency.
  • Alternatively, the near-end memory of the target computing unit may be allocated with other algorithms, for example a linked-list allocation algorithm, a buddy algorithm, a memory-pool-based buddy algorithm, a working-set algorithm, and the like, but the invention is not limited thereto.
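A fixed resource pool hands out uniform-size blocks from a preallocated region, so allocation and release are constant-time and fragmentation-free. The bitmap-style sketch below is one possible reading of such an allocator, not the patent's algorithm; the sizes are illustrative.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096           /* size of one near-end memory block */
#define NUM_BLOCKS 64

typedef struct {
    uint8_t storage[NUM_BLOCKS][BLOCK_SIZE];
    uint8_t used[NUM_BLOCKS];     /* 1 = block handed out */
} FixedPool;

/* Allocate one fixed-size block; NULL means the pool is exhausted, at
 * which point resident data could be swapped out to remote memory. */
void *pool_alloc(FixedPool *p) {
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (!p->used[i]) { p->used[i] = 1; return p->storage[i]; }
    return NULL;
}

/* Release a block; data otherwise stays resident until freed by the user. */
void pool_free(FixedPool *p, void *block) {
    for (int i = 0; i < NUM_BLOCKS; i++)
        if ((void *)p->storage[i] == block) { p->used[i] = 0; return; }
}
```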
  • Optionally, S140 is specifically: determining the number of memory blocks to request in the near-end memory from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • A task is equivalent to a function with parameters, and each parameter may be a piece of data or a numeric value; after assembly, the parameters can be packed into the same memory block or a set of blocks. For example, suppose task A has 10 parameters, each of data-block type, and the total size of the corresponding data blocks is 31 KB; if a single memory block in the near-end memory is 4 KB, then 8 memory blocks need to be requested. This improves memory-use efficiency and reduces memory waste.
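The worked example (31 KB of parameter data over 4 KB blocks gives 8 blocks) is a ceiling division, sketched below:

```c
#include <stddef.h>

/* Number of fixed-size near-end memory blocks needed for all parameter
 * data of a task: ceil(total_bytes / block_bytes).
 * E.g. blocks_needed(31 * 1024, 4 * 1024) == 8, as in the example above. */
size_t blocks_needed(size_t total_bytes, size_t block_bytes) {
    return (total_bytes + block_bytes - 1) / block_bytes;
}
```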
  • the method 100 further includes:
  • S160: Abstract the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • A service can be split into multiple tasks in order to abstract it.
  • Optionally, a buffer may be created according to the needs of the service, and the data dependency information is determined according to the buffer's ID.
  • A buffer is a data storage space: data is loaded into it before the task starts, and it is destroyed once no task needs it any longer. Each buffer has an ID, and the data relationships between tasks are linked through this ID. For example, if the output data of task A is the input data of task B, then the output buffer of task A is buffer2 and the input buffer of task B is also buffer2.
  • The creation of buffers may be decided by the programmer according to the needs of the service, and the number of buffers actually created is determined dynamically by the actual task execution process; a sketch of such buffer descriptors follows.
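A sketch of how buffer IDs might tie tasks together; the descriptor fields are hypothetical, not the patent's layout:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t id;         /* the buffer ID that tasks refer to */
    void    *data;       /* storage, loaded before the consumer starts */
    size_t   size;
    int      refcount;   /* destroy the buffer once no task needs it */
} BufferDesc;

#define MAX_PARAMS 8

typedef struct {
    uint32_t in_buf_ids[MAX_PARAMS];   /* task B lists buffer2 as input  */
    uint32_t out_buf_ids[MAX_PARAMS];  /* task A lists buffer2 as output */
    int      num_in, num_out;
} TaskIo;

/* The data dependency "A produces what B consumes" is expressed purely by
 * the shared ID (e.g. 2 for buffer2), not by passing pointers around. */
```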
  • The task dependency information is used to indicate the dependencies between tasks, and these dependencies may be associated through events.
  • For example, task A may choose to publish an event ID; task B, which needs to wait for task A to complete, fills the event ID published by A into its waiting-event list, whose description includes the number of events waited on and the IDs of those events.
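The waiting-event description (a count plus the waited IDs) maps naturally onto a small structure; this is only an illustrative sketch:

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_WAIT_EVENTS 8

typedef struct {
    int      num_wait;                  /* number of events waited on */
    uint32_t wait_ids[MAX_WAIT_EVENTS]; /* e.g. the event ID task A publishes */
    int      num_arrived;               /* events published so far */
} WaitEventList;

/* Task B becomes runnable once every event in its list has been published,
 * e.g. once task A completes and publishes its event ID. */
bool wait_list_satisfied(const WaitEventList *w) {
    return w->num_arrived >= w->num_wait;
}
```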
  • task input and output data features may be described, and a limited number of input and output parameters are supported.
  • the parameters support different features: input buffer, output buffer, external input pointer, incoming value, global pointer, and the like.
  • Optionally, when the target computing unit finishes executing the ready task, the output data of the ready task may be saved in the near-end memory.
  • When the next task is scheduled onto the same computing unit, the data does not need to be loaded again, which relieves congestion on the loading path.
  • Optionally, S160 may be executed by the master-core execution layer in the architecture shown in FIG. 1; the master-core execution layer may be a software programming interface that generates commands and submits them to the scheduler for execution.
  • the process of abstracting the service including the ready task, and obtaining the abstract processing information may include the following steps:
  • registering functions: the execution function library resides in the slave-core execution layer, and the master-core execution layer registers each function pointer or index into a function list;
  • creating a function resource set: the resource set includes the resources for executing tasks, that is, which computing units run the tasks and which direct memory access (DMA) channels are used;
  • creating queues: the queues have different priorities and are parallel to one another; tasks sent into a queue are executed serially within it, in first-in, first-out order;
  • creating buffers: a buffer is a piece of data storage space, loaded with data before the task starts and destroyed when no task needs its data any longer; each buffer has an ID used to link the data relationships between tasks, so if the output data of task A is the input data of task B, the output buffer of task A and the input buffer of task B are the same buffer, for example buffer2;
  • defining task dependencies: the dependencies between tasks can be associated through events; task A can choose to publish an event ID, and task B, which needs to wait for task A to complete, fills that event ID into its waiting-event list, whose description includes the number of waited events and the IDs of the waited events;
  • describing parameters: the multi-core digital signal processing system of this embodiment supports a limited number of input and output parameters, and the parameters support different features: input buffer, output buffer, external input pointer, passed-in value, and global pointer.
  • a method 200 for processing a task includes:
  • The scheduler performs the response processing for creating the function resource set.
  • the response processing for creating a set of functional resources includes initialization of the arithmetic unit, initialization of the storage manager in the arithmetic unit, and initialization of the shared memory.
  • the scheduler waits for a command sent by the main core execution layer, and performs command processing.
  • The scheduler polls the parallel queues in order of priority from high to low and finds a ready task.
  • The scheduler selects an idle computing unit from the computing-unit set and prepares the data required for the task;
  • The idle computing unit can be virtualized into two "ping-pong" logical resources: while one logical resource is computing, the other can already be assigned a task and have its data prepared, reducing data waiting and improving computing-resource utilization.
  • Memory can be handled level by level. The allocation algorithm for the near-end memory uses a fixed resource pool, reducing memory fragmentation and improving allocation and release efficiency.
  • Data in the near-end memory is allowed to remain resident until the user releases it or, when memory runs short, the data is swapped out to remote memory (according to a user-set replacement level). Because allocations come in fixed sizes, memory waste is reduced and memory-use efficiency is improved.
  • A memory lock can be set to guarantee the consistency of data reads and writes, automatically solving the consistency problem of simultaneous reads and writes by multiple cores (computing units). Specifically, it may be arranged that data cannot be read by a task while that data is being rewritten, and that data cannot be overwritten while a task is reading it; multiple tasks are, however, allowed to read the same data simultaneously.
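The rules just described (no reads during a rewrite, no rewrite while reads are in progress, concurrent reads allowed) are readers-writer lock semantics. Below is a portable sketch using POSIX threads, as one possible realization rather than the patent's mechanism:

```c
#include <pthread.h>

typedef struct {
    pthread_rwlock_t lock;  /* guards one shared data region */
    /* ... the shared data itself ... */
} SharedRegion;

/* Any number of tasks may read at once; a writer excludes all others. */
void read_data(SharedRegion *r) {
    pthread_rwlock_rdlock(&r->lock);   /* blocks while a rewrite runs */
    /* ... read the data ... */
    pthread_rwlock_unlock(&r->lock);
}

void rewrite_data(SharedRegion *r) {
    pthread_rwlock_wrlock(&r->lock);   /* blocks while any task reads */
    /* ... overwrite the data ... */
    pthread_rwlock_unlock(&r->lock);
}
```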
  • Determining the ready task and the idle computing unit in steps S204 and S205 will now be described in detail with reference to FIG. 11.
  • the method of Figure 11 is performed by a scheduler in a digital signal processing system.
  • The parallel queues are polled, and it is determined whether the lowest-priority queue has been polled;
  • After a task is sent into a queue, it is executed serially, in first-in, first-out order.
  • S303: If there is no ready task in the queue of the current priority, the queue of the next priority is checked for a ready task; that is, S301 and its subsequent steps are re-executed.
  • An idle computing resource should be understood as a logical resource of an arithmetic unit.
  • In S307, the partner resource of the idle computing resource refers to the other logical resource of the same computing unit.
  • In S308, when a task has already been deployed on the acquired partner resource, data is prepared on the idle computing resource for a task that depends on the task deployed on the partner resource. After that data is prepared, S306 and its subsequent steps can continue, preparing data for other tasks.
  • In this way, the method prepares data for a task to be executed through the target computing unit while the ready task is executed by that unit, making data loading parallel to computation, reducing the waiting overhead of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • the apparatus 10 includes:
  • a determining module 11 for determining a ready task in the task queue
  • the determining module 11 is further configured to determine a target computing unit that performs the ready task
  • the task execution module 12 is configured to execute the ready task by the target operation unit, and simultaneously prepare data for the task to be executed by the target operation unit.
  • The apparatus for processing tasks in the multi-core digital signal processing system of this embodiment prepares data for the task to be executed through the target computing unit while the ready task is executed by that unit, so that data loading and computation proceed in parallel, reducing the waiting overhead of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • Optionally, the determining module 11 is specifically configured to: when the computing unit that executed the dependent task of the ready task is determined to be idle, determine that computing unit as the target computing unit.
  • the device further includes a memory application module 13;
  • The memory application module 13 is specifically configured to: before the task execution module 12 executes the ready task through the target computing unit, determine a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, and move the input data corresponding to the ready task into the memory block.
  • Optionally, the memory application module 13 is specifically configured to: determine the memory block according to the fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to far-end memory.
  • Optionally, the memory application module 13 is specifically configured to: determine the number of memory blocks to request in the near-end memory from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • the device further includes:
  • The service abstraction module 14 is configured to abstract the service that includes the ready task before the determining module 11 determines the ready task in the task queue, obtaining abstraction information that includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • Optionally, the task queue is a plurality of parallel task queues, and the determining module 11 is specifically configured to: poll the parallel task queues in priority order to determine the ready task.
  • Optionally, the service abstraction module 14 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer's ID.
  • Optionally, the task execution module 12 is further configured to: when the target computing unit finishes executing the ready task, save the output data of the ready task in the near-end memory.
  • The apparatus for processing tasks in the multi-core digital signal processing system of this embodiment prepares data for the task to be executed through the target computing unit while the ready task is executed by that unit, so that data loading and computation proceed in parallel, reducing the waiting overhead of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • Figure 15 shows a schematic block diagram of an apparatus 100 for processing tasks in a multi-core digital signal processing system in accordance with another embodiment of the present invention.
  • the hardware structure of the apparatus 100 for processing tasks in the multi-core digital signal processing system may include three parts: a transceiver device 101, a software device 102, and a hardware device 103.
  • the transceiver device 101 is a hardware circuit for completing packet transmission and reception;
  • The hardware device 103 can also be called a "hardware processing module" or, more simply, "hardware".
  • The hardware device 103 mainly includes hardware circuits based on FPGAs or ASICs (together with supporting devices such as memory). Hardware circuits implementing particular functions are often much faster than general-purpose processors, but once customized they are difficult to change, so they are inflexible and usually used for fixed functions. Note that in practical applications the hardware device 103 may also include an MCU (a microcontroller such as a single-chip microcomputer) or a CPU, but the main role of such processors is not to process large volumes of data; they are mainly used for control. In this application scenario, a system equipped with these devices is a hardware device.
  • The software device 102 (or simply "software") mainly includes a general-purpose processor (such as a CPU) and supporting devices (such as memory or a hard disk), and can be programmed with the corresponding processing functions. When implemented in software, the functions can be configured flexibly according to service needs, but execution is often slower than on a hardware device.
  • After processing, the hardware device 103 may send the processed data out through the transceiver device 101, or send the processed data to the transceiver device 101 through an interface connected to it.
  • The hardware device 103 is configured to: determine a ready task in the task queue; determine a target computing unit that executes the ready task; and execute the ready task by the target computing unit while simultaneously preparing data for the task to be executed through the target computing unit.
  • Optionally, in determining the target computing unit that executes the ready task, the hardware device 103 is specifically configured to: when the computing unit that executed the dependent task of the ready task is determined to be idle, determine that computing unit as the target computing unit.
  • Optionally, before the ready task is executed by the target computing unit, the hardware device 103 is specifically configured to: determine a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, and move the input data corresponding to the ready task into the memory block.
  • Optionally, in determining the memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, the hardware device 103 is specifically configured to: determine the memory block according to a fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to remote memory.
  • Optionally, in determining the memory block according to the fixed-resource-pool algorithm, the hardware device 103 is specifically configured to: determine the number of memory blocks from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • Optionally, the hardware device 103 is further configured to: before determining a ready task in the task queue, abstract the service that includes the ready task to obtain abstraction information, which includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • Optionally, the task queue is a plurality of parallel task queues, and the hardware device 103 is specifically configured to: poll the parallel task queues in priority order to determine the ready task.
  • Optionally, the hardware device 103 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer's ID.
  • Optionally, the hardware device 103 is further configured to: when the target computing unit finishes executing the ready task, save the output data of the ready task in the near-end memory.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function; in actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • The technical solution of the present invention, or the part of it that is essential or that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

A method and apparatus for processing a task in a multi-core digital signal processing system. In the process of processing a task, the waiting overhead for data loading is reduced, a degree of parallelism between tasks is improved, and the scheduling overhead of a system is reduced. The method comprises: determining a ready task in a task queue (S110); determining a target operation unit for executing the ready task (S120); and executing the ready task by means of the target operation unit, and at the same time, preparing, by means of the target operation unit, data for a task to be executed (S130).

Description

多核数字信号处理系统中处理任务的方法和装置Method and apparatus for processing tasks in a multi-core digital signal processing system 技术领域Technical field
本发明实施例涉及数字信号处理器领域,并且更具体地,涉及多核数字信号处理系统中处理任务的方法和装置。Embodiments of the present invention relate to the field of digital signal processors and, more particularly, to methods and apparatus for processing tasks in a multi-core digital signal processing system.
背景技术Background technique
随着移动互联网技术发展,数据处理量急速增加,数字信号处理器芯片正在朝着多核大数据量处理方向迈进。数字信号处理器在进行数字运算时通常用软件代码实现,核数目增多给软件开发以及硬件资源利用和调试都带来诸多困难。当需求发生变换时,软件架构都需要重新划分多个不同核的功能映射关系,对一些硬件资源如内存,数据通道,消息资源等使用不充分造成浪费。With the development of mobile Internet technology, the amount of data processing has increased rapidly, and digital signal processor chips are moving toward multi-core and large data processing. Digital signal processors are usually implemented in software code when performing digital operations. The increase in the number of cores brings many difficulties to software development and hardware resource utilization and debugging. When the requirements change, the software architecture needs to re-divide the functional mapping relationship of multiple different cores, which is wasteful for insufficient use of some hardware resources such as memory, data channels, and message resources.
相关技术中的静态任务调度的方法在静态任务调度过程中,软件设计人员根据软件任务图标和各功能算法模块性能仿真评估后获得各功能模块的基本性能,在根据目标映射硬件资源的能力进行匹配,按照功能粒度,资源消耗将不同的软件功能部署到不同的硬件资源上,但是静态任务调度的方法适用场景受限、调度的复杂度高、内存资源利用率低。Static Task Scheduling Method in Related Art In the static task scheduling process, the software designer obtains the basic performance of each functional module according to the software task icon and the performance simulation of each functional algorithm module, and performs matching according to the capability of mapping hardware resources according to the target. According to the functional granularity, resource consumption deploys different software functions to different hardware resources, but the static task scheduling method has limited application scenarios, high scheduling complexity, and low memory resource utilization.
相关技术中的动态任务调度方案中采用主从分布式调度的资源池方案,每个处理器上均承载一个裁剪的操作系统(Operating System,简称为“OS”),可支持创建不同优先级的任务,可响应外部中断等,由主核将任务划分为适当的粒度放入任务缓存池,当从核空闲时,自主从主核中获取任务并执行。但是该方案中每个从核上均需要承载一个操作系统,任务切换,数据装载均会占用很多的从核负载,计算资源和内存资源利用率较低。The dynamic task scheduling scheme in the related art adopts a resource pool scheme of master-slave distributed scheduling, and each processor carries a tailored operating system (OS), which can support different priorities. The task can respond to external interrupts, etc., and the main core divides the task into the appropriate granularity into the task cache pool. When the core is idle, the task is automatically acquired from the main core and executed. However, in this solution, each slave core needs to carry an operating system, task switching, data loading will occupy a lot of slave load, and computing resources and memory resource utilization are low.
发明内容Summary of the invention
本发明实施例提供一种多核数字信号处理系统中处理任务的方法和装置,能够在任务运行时决定运行的调度过程,动态分配运算资源,提高运算资源的利用率,减少系统调度开销。Embodiments of the present invention provide a method and apparatus for processing a task in a multi-core digital signal processing system, which can determine a running scheduling process when a task runs, dynamically allocate computing resources, improve utilization of computing resources, and reduce system scheduling overhead.
第一方面,提供了一种多核数字信号处理系统中处理任务的方法,包括:确定任务队列中的就绪任务;确定执行该就绪任务的目标运算单元;通过该目标运算单元执行该就绪任务,并同时通过该目标运算单元为待执行任务准 备数据。In a first aspect, a method for processing a task in a multi-core digital signal processing system includes: determining a ready task in a task queue; determining a target computing unit that executes the ready task; performing the ready task by the target computing unit, and At the same time, the target computing unit is used as the task to be executed. Prepare data.
根据本发明实施例的多核数字信号处理系统中处理任务的方法,在通过一个运算单元执行任务时,同时通过该运算单元为其他任务准备数据,由此能够实现数据装载与算法业务执行并行,减少数据装载的等待开销,提高任务间的并行度,减少系统调度开销。According to the method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention, when a task is executed by one operation unit, data is prepared for other tasks through the operation unit at the same time, thereby enabling data loading and algorithm service execution to be parallel, thereby reducing The waiting cost of data loading increases the degree of parallelism between tasks and reduces system scheduling overhead.
结合第一方面,在第一方面的一种实现方式中,在确定执行该就绪任务的依赖任务的运算单元空闲时,将该执行该就绪任务的运算单元确定为该目标运算单元。In conjunction with the first aspect, in an implementation of the first aspect, when it is determined that the task-dependent operation unit that executes the ready task is idle, the operation unit that executes the ready task is determined as the target operation unit.
此时,执行该就绪任务与该就绪任务的依赖任务的运算单元为同一个运算单元,在执行该就绪任务时,不需要再次装载数据,减轻装载路径上的拥堵情况。At this time, the operation unit that executes the ready task and the task dependent task of the ready task is the same operation unit, and when the ready task is executed, there is no need to load data again, and the congestion condition on the loading path is alleviated.
结合第一方面及其上述实现方式,在第一方面的另一种实现方式中,在通过该目标运算单元执行该就绪任务之前,该方法还包括:确定该目标运算单元的近端内存中用于存放与该就绪任务相对应的输入数据的内存块;将该与该就绪任务相对应的输入数据搬移到该内存块中。In conjunction with the first aspect and the foregoing implementation manner, in another implementation manner of the first aspect, before the performing the task is performed by the target computing unit, the method further includes: determining, in the near-end memory of the target computing unit, a memory block storing input data corresponding to the ready task; moving the input data corresponding to the ready task to the memory block.
结合第一方面及其上述实现方式,在第一方面的另一实现方式中,该确定该目标运算单元的近端内存中用于存放与该就绪任务相对应的输入数据的内存块,包括:根据固定资源池算法,确定该内存块,其中,该目标运算单元的近端内存中存储的数据支持驻留直到用户释放或在近端内存不够时将数据置换到远端内存。In conjunction with the first aspect and the foregoing implementation manner, in another implementation manner of the first aspect, the determining, by the operating unit, the memory block of the near-end memory of the target computing unit for storing the input data corresponding to the ready task includes: The memory block is determined according to a fixed resource pool algorithm, wherein data stored in the near-end memory of the target unit supports docking until the user releases or replaces the data to the far-end memory when the near-end memory is insufficient.
结合第一方面及其上述实现方式,在第一方面的另一种实现方式中,该方法还包括:在通过该目标运算单元执行完该就绪任务时,将该就绪任务的输出数据保存在该近端内存中。In conjunction with the first aspect and the foregoing implementation manner, in another implementation manner of the first aspect, the method further includes: when the ready task is executed by the target computing unit, saving the output data of the ready task in the Near-end memory.
这样,执行任务时的读写内存均为近端内存,执行任务时不会因等待数据到达消耗时延,并且申请内存时采用固定资源池算法能够减小内存碎片,提高内存周转效率,节省内存。In this way, the read and write memory when performing the task is the near-end memory, and the task does not wait for the data to reach the consumption delay when executing the task, and the fixed resource pool algorithm can reduce the memory fragmentation when applying for the memory, improve the memory turnover efficiency, and save the memory. .
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,该根据固定资源池算法,确定该内存块,包括:根据该就绪任务需要的所有参数对应的数据块的总和与该近端内存中的单个内存块的大小的比值,确定该内存块的数量。With reference to the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, the determining, by the fixed resource pool algorithm, the memory block includes: a data block corresponding to all parameters required by the ready task The ratio of the sum of the sum to the size of a single block of memory in the near-end memory determines the number of blocks.
由此,通过将任务内的内存数据进行拼装处理,可以进一步提高内存的 使用效率,减少内存浪费。Thus, by assembling the memory data in the task, the memory can be further improved. Use efficiency to reduce memory waste.
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,在确定任务队列中的就绪任务之前,该方法还包括:对包括该就绪任务的业务进行抽象处理,得到抽象处理信息,该抽象处理信息包括下列信息中的至少一种:任务依赖关系信息、数据依赖关系信息和任务执行的先后顺序信息。In conjunction with the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, before determining a ready task in the task queue, the method further includes: performing abstract processing on the service including the ready task, Obtaining abstract processing information, the abstract processing information including at least one of the following information: task dependency information, data dependency information, and sequence information of task execution.
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,该任务队列为多个并行的任务队列;其中,该确定任务队列中的就绪任务,包括:按照优先级顺序轮询该多个并行的任务队列,确定该就绪任务。In conjunction with the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, the task queue is a plurality of parallel task queues, wherein the determining the ready tasks in the task queue includes: prioritizing The level sequence polls the plurality of parallel task queues to determine the ready task.
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,该对包括该就绪任务的业务进行抽象处理,包括:根据该业务的需要创建缓存;根据该缓存的缓存标识ID,确定该数据依赖关系信息。In conjunction with the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, the abstract processing of the service including the ready task includes: creating a cache according to the requirement of the service; The cache ID is determined to determine the data dependency information.
第二方面,提供了一种多核数字信号处理系统中处理任务的装置,用于执行上述第一方面或第一方面的任一可能的实现方式中的方法,具体地,该装置包括用于执行上述第一方面或第一方面的任一可能的实现方式中的方法的模块。In a second aspect, there is provided apparatus for processing a task in a multi-core digital signal processing system, for performing the method of any of the first aspect or the first aspect of the first aspect, in particular A module of the method of the first aspect or any of the possible implementations of the first aspect.
第三方面,提供了一种计算机可读介质,用于存储计算机程序,该计算机程序包括用于执行第一方面或第一方面的任意可能的实现方式中的方法的指令。In a third aspect, a computer readable medium is provided for storing a computer program comprising instructions for performing the method of the first aspect or any of the possible implementations of the first aspect.
第四方面,提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,但该计算机程序代码被多核数字信号处理系统的处理任务的装置运行时,使得该装置执行上述第一方面或第一方面的任一可能的实现方式中的方法。In a fourth aspect, a computer program product is provided, the computer program product comprising: computer program code, wherein the computer program code is executed by a device of a processing task of a multi-core digital signal processing system, such that the device performs the first aspect or A method in any of the possible implementations of the first aspect.
附图说明DRAWINGS
为了更清楚地说明本发明实施例的技术方案,下面将对本发明实施例中所需要使用的附图作简单地介绍,显而易见地,下面所描述的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the present invention, Those skilled in the art can also obtain other drawings based on these drawings without paying any creative work.
图1是应用本发明实施例的应用系统的示意性架构图;1 is a schematic structural diagram of an application system to which an embodiment of the present invention is applied;
图2是图1所示的应用系统中的调度器包括的各个管理模块及其相互关 系的示意图;2 is a diagram of various management modules included in the scheduler in the application system shown in FIG. Schematic diagram of the system;
图3是应用本发明实施例的应用系统中的数据依赖关系的示意图;3 is a schematic diagram of data dependency relationships in an application system to which an embodiment of the present invention is applied;
图4是应用本发明实施例的应用系统中只有一个核被调度时的调度结果示意图;4 is a schematic diagram of scheduling results when only one core is scheduled in an application system to which an embodiment of the present invention is applied;
图5是应用本发明实施例的应用系统中有三个核被调度时的调度结果示意图;5 is a schematic diagram of scheduling results when three cores are scheduled in an application system to which an embodiment of the present invention is applied;
图6是根据本发明实施例的多核数字信号处理系统中处理任务的方法的示意性流程图;6 is a schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图7是根据本发明实施例的多核数字信号处理系统中处理任务的方法的另一示意性流程图;7 is another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图8是根据本发明实施例的多核数字信号处理系统中处理任务的方法的再一示意性流程图;FIG. 8 is still another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention; FIG.
图9是根据本发明实施例的对业务进行抽象处理的方法的示意性流程图;FIG. 9 is a schematic flowchart of a method for abstracting a service according to an embodiment of the present invention; FIG.
图10是根据本发明实施例的一种具体情况下实现处理任务的方法的示意性流程图;FIG. 10 is a schematic flowchart of a method for implementing a processing task in a specific case according to an embodiment of the present invention; FIG.
图11是根据本发明实施例的确定就绪任务和空闲运算单元的方法的示意性流程图;11 is a schematic flowchart of a method of determining a ready task and an idle operation unit according to an embodiment of the present invention;
图12是根据本发明实施例的多核数字信号处理系统中处理任务的装置的示意性框图;12 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图13是根据本发明实施例的多核数字信号处理系统中处理任务的装置的另一示意性框图;13 is another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图14是根据本发明实施例的多核数字信号处理系统中处理任务的装置的再一示意性框图;14 is still another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图15是根据本发明另一实施例的多核数字信号处理系统中处理任务的装置的示意性框图。15 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system in accordance with another embodiment of the present invention.
具体实施方式detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that the technical solutions in the embodiments of the present invention are mainly applied to digital signal processing systems that require multi-core processing and involve a large amount of parallel computation, such as macro base station baseband chips and terminal chips. The multi-core feature is embodied in the number of computing modules integrated on a single chip, including but not limited to multiple general-purpose processors, multiple IP cores, multiple dedicated processors, and the like. If the number of computing modules is greater than one, the chip is multi-core.
FIG. 1 shows a schematic architecture diagram of an application system (a multi-core digital signal processing system) to which an embodiment of the present invention is applied. The application system consists of three parts: a main control layer, an execution layer, and a computation layer. The main control layer carries the user software and completes functions such as high-level information exchange, flow control, task decomposition, and task dependency definition. The execution layer consists of three parts: a main control core execution layer, a scheduler, and a slave core execution layer. The main control core execution layer provides a software programming interface, submits commands to the scheduler, and receives command feedback or callback notifications. The scheduler is a hardware component responsible for task scheduling; its functions include inter-task dependency processing, memory management, task assignment, and data movement. As shown in FIG. 2, the scheduler is internally managed by multiple management modules: command management, command queue management, event management, buffer descriptor management, shared memory management, computation memory management, computing resource state management, and a main scheduling control module. The slave core execution layer is a software component mainly responsible for receiving task messages, invoking the algorithm function library for execution, and sending a task-completion feedback message after the computation ends. The computation layer may be hardware or software and is mainly responsible for processing tasks.
The following describes an example scenario in which the method of an embodiment of the present invention is applied. It is assumed that there are channel processing segments of three priorities: Kernel 0 to Kernel 2 are high-priority processing, Kernel 3 to Kernel 5 are medium-priority processing, and Kernel 7 to Kernel 9 are low-priority processing. As shown in FIG. 3, to mark the processed data flows (the direction of an arrow indicates the direction in which data flows), the data flows between the host (Host) and the device (Device) are marked as buffer input/output (Buff_In/Out) according to their direction, and the data flows between Kernel processing stages are marked as Buff_M. The data dependencies between different kernels can be described as follows:
The inputs of Kernel_0/3/7 are Buff_In0/1/2 respectively, prepared by the Host (in a practical application, they may come from an external interface or from the output of a Hardware Accelerate Control (HAC) unit).
After Kernel_2 finishes processing, the data is output to the Host (in a practical application, the Host usually needs to send the data processed by the digital signal processing (DSP) off-chip, or pass it to the HAC for further processing).
The input of Kernel_2 depends on the output of Kernel_1 and also on the output of Kernel_5.
The input of Kernel_4 depends on the output of Kernel_9 and also on the output of Kernel_3.
The input of Kernel_8 depends on the output of Kernel_7 and also on the output of Kernel_3.
The output of Kernel_3 is used not only by Kernel_4 but also by Kernel_8 (Kernel_8 may use only a part of it).
The scheduler schedules by priority according to the number of execution cores actually available, while ensuring that data dependencies are respected. For the Kernel dependencies described above, in the initial stage the Host submits all Kernels to the command queue (CommandQueue) and the input data Buff_In0/1/2 is all ready. When only one core can be scheduled, the scheduling result is shown in FIG. 4; when three cores are available, the scheduling result is shown in FIG. 5. In FIG. 5, whether Kernel_2 and Kernel_4 (shown with dashed lines) are scheduled on DSP2 depends on the sizes of their input data. For example, when the data Buff_M8 (the output of Kernel_9) input to Kernel_4 is greater than or equal to Buff_M2 (the output of Kernel_3), Kernel_4 should be dispatched to DSP2; otherwise it should be dispatched to the core holding the output of Kernel_3.
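The dashed-line placement decision above is, in effect, a data-locality heuristic: place the kernel on the core that already holds its largest input, so that only the smaller inputs have to be moved. The following C sketch illustrates the idea; the `input_t` type and `choose_core` helper are illustrative names and not part of the described system:

```c
#include <stddef.h>

/* Hypothetical descriptor for one input of a kernel: the core that
 * produced the buffer, and how many bytes would have to move if the
 * kernel were placed elsewhere. */
typedef struct {
    int    producer_core;
    size_t bytes;
} input_t;

/* Place the kernel on the core that already holds its largest input,
 * so that only the smaller inputs need to be copied. */
static int choose_core(const input_t *inputs, int n)
{
    int    best_core  = inputs[0].producer_core;
    size_t best_bytes = inputs[0].bytes;
    for (int i = 1; i < n; i++) {
        if (inputs[i].bytes > best_bytes) {
            best_bytes = inputs[i].bytes;
            best_core  = inputs[i].producer_core;
        }
    }
    return best_core;
}
```

Listing Buff_M8 first reproduces the "greater than or equal" rule in the text, since ties keep the first entry: `choose_core` then returns DSP2 (the core holding Buff_M8) when Buff_M8 is at least as large as Buff_M2, and the core holding Kernel_3's output otherwise.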
It should be understood that, in the embodiments of the present invention, a service refers to a program that processes data independently of the hardware and is a concept distinct from the operating system and drivers; a service may be, for example, passing data through channel estimation, Fast Fourier Transformation (FFT), decoding, and similar operations. A task refers to a software task, that is, a piece of program that implements a particular function and usually needs to run on a core processor.
FIG. 6 is a schematic flowchart of a method 100 for processing tasks according to an embodiment of the present invention. The method 100 may be performed by the multi-core digital signal processing system shown in FIG. 1. As shown in FIG. 6, the method 100 includes:
S110: Determine a ready task in a task queue.
S120: Determine a target operation unit to execute the ready task.
S130: Execute the ready task on the target operation unit and, at the same time, prepare data for a to-be-executed task on the target operation unit.
Specifically, after determining a ready task in the task queue, the multi-core digital signal processing system determines a target operation unit capable of executing the ready task, then executes the ready task on the determined target operation unit while simultaneously preparing data for the to-be-executed task on that unit.
Therefore, the task processing method of the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
It should be noted that a ready task is a task whose dependent tasks have completed and which can therefore start running; a to-be-executed task can be understood as a task that needs to be executed after the ready task. One operation unit can be understood as one core.
Optionally, in S110, the task queue is a plurality of parallel task queues.
Correspondingly, S110 is specifically: polling the plurality of parallel task queues in priority order to determine the ready task.
That is, the multi-core digital signal processing system may create parallel task queues. These queues are in a parallel relationship with each other but have different priorities. After tasks are delivered into a task queue, the tasks within each queue are executed serially, following the first-in-first-out ordering principle.
Specifically, when determining a ready task, the plurality of parallel task queues may be polled in descending order of priority. If a higher-priority task queue contains no ready task, polling continues with the next-priority queue until a ready task is found or the lowest-priority task queue has been polled, at which point polling ends.
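The polling loop described here can be pictured with a short C sketch. Everything below (the queue depth, field names, and the `find_ready_task` helper) is an illustrative assumption rather than the patent's actual data structures:

```c
#include <stddef.h>

#define NUM_PRIORITIES 3  /* illustrative: high, medium, low */
#define QUEUE_DEPTH    8

/* Minimal task record: a task is ready once its dependency count is 0. */
typedef struct {
    int id;
    int pending_deps;   /* 0 means all dependencies have completed */
} task_t;

/* One FIFO queue; tasks within a queue run serially, in submission order. */
typedef struct {
    task_t tasks[QUEUE_DEPTH];
    int    head, tail;
} task_queue_t;

/* Only the head of a queue may be dispatched (first-in first-out). */
static task_t *queue_peek_ready(task_queue_t *q)
{
    if (q->head == q->tail)
        return NULL;                 /* queue empty */
    task_t *t = &q->tasks[q->head];
    return (t->pending_deps == 0) ? t : NULL;
}

/* Poll queues from highest to lowest priority; stop at the first hit. */
static task_t *find_ready_task(task_queue_t queues[NUM_PRIORITIES])
{
    for (int prio = 0; prio < NUM_PRIORITIES; prio++) {
        task_t *t = queue_peek_ready(&queues[prio]);
        if (t != NULL)
            return t;
    }
    return NULL;  /* every queue polled, nothing ready */
}
```

Because tasks within one queue execute serially in FIFO order, only the head of each queue is ever a dispatch candidate, which is why the sketch inspects the head alone.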
Optionally, in S130, executing the ready task on the target operation unit while simultaneously preparing data for the to-be-executed task on the same unit can be understood as virtualizing one operation unit into two ping-pong logical resources. After one logical resource has been assigned a task, the other logical resource can also be assigned a task, ensuring that while one logical resource is computing, the data for the other logical resource is being prepared. This reduces data waiting and improves the utilization of computing resources.
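A minimal sketch of the ping-pong idea, assuming each core is tracked as two logical slots; all names below are illustrative:

```c
/* Each physical core is exposed as two ping-pong logical slots: while
 * one slot is computing, the other can already be loading the next
 * task's input data. */
typedef enum { SLOT_IDLE, SLOT_LOADING, SLOT_RUNNING } slot_state_t;

typedef struct {
    slot_state_t state[2];   /* ping = index 0, pong = index 1 */
    int          task_id[2];
} core_slots_t;

/* A core can accept a new task whenever either slot is idle. */
static int free_slot(const core_slots_t *c)
{
    if (c->state[0] == SLOT_IDLE) return 0;
    if (c->state[1] == SLOT_IDLE) return 1;
    return -1;   /* both slots busy: core cannot take more work */
}
```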
Optionally, S120 is specifically: when it is determined that the operation unit that executed the dependency task of the ready task is idle, determining that operation unit as the target operation unit.
Specifically, a dependency task of the ready task is a task whose output data is the input data of the ready task. The multi-core digital signal processing system may select an idle resource based on the data location to execute the ready task. Preferably, the system records the operation unit that processed the dependency task of the ready task; when the system determines that this operation unit is idle, it assigns the ready task to it, that is, it determines the unit that executed the dependency task as the target operation unit. Because the unit that processed the dependency task and the unit that will process the ready task are the same, the data for the ready task does not need to be loaded again, which relieves congestion on the loading path.
Optionally, when the multi-core digital signal processing system determines that the operation unit that executed the dependency task is not idle, it may randomly select one idle operation unit from the other idle operation units as the target operation unit.
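Put together, the selection rule is: prefer the core that produced the input (the data is already in its near-end memory), and otherwise fall back to any idle core chosen at random. A hedged C sketch, with `producer_core` and `core_idle` as assumed bookkeeping arrays:

```c
#include <stdbool.h>
#include <stdlib.h>

#define NUM_CORES 4   /* illustrative core count */
#define MAX_TASKS 64

/* Assumed bookkeeping: which core produced each task's output, and
 * which cores are currently idle. */
static int  producer_core[MAX_TASKS];
static bool core_idle[NUM_CORES];

/* Prefer the core that ran the dependency task (its output is already
 * in that core's near-end memory); otherwise pick a random idle core. */
static int pick_target_core(int dependency_task_id)
{
    int c = producer_core[dependency_task_id];
    if (core_idle[c])
        return c;

    int idle[NUM_CORES], n = 0;
    for (int i = 0; i < NUM_CORES; i++)
        if (core_idle[i])
            idle[n++] = i;
    return (n > 0) ? idle[rand() % n] : -1;   /* -1: nothing idle */
}
```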
Optionally, as shown in FIG. 7, before the ready task is executed on the target operation unit, the method 100 further includes:
S140: Determine a memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task.
S150: Move the input data corresponding to the ready task into the memory block.
Specifically, the data required to execute the ready task may reside on other operation units or in external memory (for example, double data rate synchronous dynamic random access memory (DDR), an L3 cache, or the like). Before the task is executed, the data in those memories needs to be moved into the near-end memory (for example, the L1 cache or L2 cache) of the operation unit that is about to run the task. Before the data is moved, the memory block used to store it must be determined, or memory for storing it must be requested; the data is then moved into the determined or allocated memory.
Optionally, in S140, the memory block is determined according to a fixed resource pool algorithm, where the data stored in the near-end memory of the target operation unit supports residency until the user releases it, or is swapped out to far-end memory when the near-end memory is insufficient.
That is, the memory space may be tiered by its distance from the core and handled according to its tier. Using a fixed resource pool algorithm for near-end memory allocation reduces memory fragmentation and improves allocation and release efficiency.
It should be understood that in the embodiments of the present invention, the near-end memory of the target operation unit may also be requested according to other algorithms, for example a linked-list memory allocation algorithm, a buddy algorithm, a memory-pool-based buddy algorithm, or a working-set algorithm, but the present invention is not limited thereto.
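For illustration, a fixed resource pool can be as simple as a free list over equally sized blocks: allocation is a pop and release is a push, both constant time, and fragmentation cannot accumulate because every block has the same size. This is a minimal sketch under assumed sizes (4 KB blocks, 16 of them), not the patent's allocator:

```c
#include <stddef.h>

#define BLOCK_SIZE 4096
#define NUM_BLOCKS 16

static unsigned char pool[NUM_BLOCKS][BLOCK_SIZE];
static int free_list[NUM_BLOCKS];
static int free_top;   /* stack of free block indices */

static void pool_init(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++)
        free_list[i] = i;
    free_top = NUM_BLOCKS;
}

static void *pool_alloc(void)
{
    return (free_top > 0) ? pool[free_list[--free_top]] : NULL;
}

static void pool_free(void *p)
{
    /* Recover the block index from the pointer's offset in the pool. */
    int idx = (int)((unsigned char (*)[BLOCK_SIZE])p - pool);
    free_list[free_top++] = idx;
}
```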
Optionally, S140 is specifically: determining the number of memory blocks that need to be requested in the near-end memory according to the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
Specifically, a task is equivalent to a function with parameters; each parameter may be a block of data or a scalar value, and the data can be packed and placed into one or more memory blocks. For example, suppose task A has 10 parameters, each of data-block type, the total size of the data blocks corresponding to these 10 parameters is 31 KB, and a single memory block in the near-end memory is 4 KB; then 8 memory blocks need to be requested. This improves memory utilization and reduces memory waste.
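The block count is a ceiling division, as the 31 KB / 4 KB example shows. A one-function sketch:

```c
#include <stddef.h>

/* Blocks needed for a task's packed parameters: ceiling division of
 * the total parameter size by the fixed block size. */
static size_t blocks_needed(size_t total_bytes, size_t block_bytes)
{
    return (total_bytes + block_bytes - 1) / block_bytes;
}
```

For the example above, `blocks_needed(31 * 1024, 4 * 1024)` evaluates to 8.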
Optionally, as shown in FIG. 8, before S110, the method 100 further includes:
S160: Perform abstraction processing on the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
Specifically, a service can be split into multiple tasks and abstracted. In the process of abstracting the service that includes the ready task, a buffer may be created according to the needs of the service, and the data dependency information is determined from the buffer identifier (ID) of that buffer. A buffer is a block of data storage space; data is loaded into it before the task starts, and it is destroyed when no task needs it anymore. Each buffer has an ID, through which the data relationships between tasks are associated. For instance, if the output data of task A is the input data of task B, then the output buffer of task A is buffer2 and the input buffer of task B is also buffer2.
It should be understood that the creation of buffers may be decided by the programmer according to the needs of the service; the number of buffers actually created is determined dynamically according to the actual task execution process.
In S160, the task dependency information is used to indicate dependencies between tasks, and the dependencies can be associated through events. For example, on completion, task A may choose to publish an event ID; task B, which needs to wait for task A to complete, fills the event ID of event A into its wait-event list. The description of the wait events includes the number of wait events and the list of wait-event IDs.
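The two dependency mechanisms, buffer IDs for data and event IDs for control, can be pictured as one per-task descriptor. The struct below is an illustrative sketch; the field names and limits are assumptions, not the patent's format:

```c
#define MAX_PARAMS 8   /* illustrative limits, not from the patent */
#define MAX_WAITS  4

/* Data dependencies travel through buffer IDs (task A writes buffer 2,
 * task B reads buffer 2); control dependencies travel through event
 * IDs (task A publishes an event, task B lists it as a wait event). */
typedef struct {
    int in_buf_ids[MAX_PARAMS];    /* input buffer IDs */
    int n_in;
    int out_buf_ids[MAX_PARAMS];   /* output buffer IDs */
    int n_out;
    int publish_event_id;          /* event published on completion, -1 if none */
    int wait_event_ids[MAX_WAITS]; /* events that must fire before this task */
    int n_wait;                    /* number of wait events */
} task_desc_t;
```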
In the embodiments of the present invention, optionally, the input and output data characteristics of a task can be described. A finite number of input and output parameters is supported, and the parameters support different characteristics: input buffer, output buffer, external input pointer, passed-in value, global pointer, and the like.
In the embodiments of the present invention, because the service can be abstracted, software for different multi-core chips does not need to be restructured, and software does not need to be redeployed when the specifications change; only a new resource set needs to be created. This simplifies the design complexity that software programmers would otherwise face in accommodating different computation, load, and hardware constraints.
In the embodiments of the present invention, optionally, when the ready task has finished executing on the target operation unit, the output data of the ready task may be kept in the near-end memory. Thus, when the next task is loaded onto the same operation unit, the data does not need to be loaded again, which relieves congestion on the loading path.
S160 is described in detail below with reference to FIG. 9. S160 may be performed by the main control core execution layer in the architecture shown in FIG. 1; the main control core execution layer may be a software programming interface, and these software interfaces generate commands that are submitted to the scheduler for execution. As shown in FIG. 9, in the embodiments of the present invention, optionally, the process of abstracting the service that includes the ready task to obtain the abstraction information may include the following steps:
S161: Create task execution functions. The execution function library is called by the slave core execution layer, and the main core execution layer registers the function pointers or indexes into a function list.
S162: Create the set of functional resources to be used.
The functional resource set includes the resources for executing tasks, that is, which operation units the tasks run on, which direct memory access (DMA) channels are used, and so on.
S163: Create parallel queues.
The queues have different priorities and are in a parallel relationship with each other. Tasks delivered into a queue are executed serially within the queue, following the first-in-first-out ordering principle.
S164: Create buffers.
A buffer is a block of data storage space; data is loaded before the task starts, and the buffer is destroyed when no task needs its data anymore. Each buffer has an ID through which inter-task data relationships are associated: if the output data of task A is the input data of task B, then the output buffer of task A is buffer2 and the input buffer of task B is also buffer2.
S165: Describe the dependencies between tasks.
The dependencies between tasks can be associated through events: on completion, task A may choose to publish an event ID; task B, which needs to wait for task A to complete, fills the event ID of event A into its wait-event list. The description of the wait events includes the number of wait events and the list of wait-event IDs.
S166: Describe the input and output data characteristics of the tasks.
The multi-core digital signal processing system of the embodiments of the present invention supports a finite number of input and output parameters, and the parameters support different characteristics: input buffer, output buffer, external input pointer, passed-in value, global pointer, and the like.
S167: Set the algorithm service parameters and fill the actual parameter values into the corresponding parameter table.
S168: Submit the algorithm service execution request.
S169: Wait for the execution callback, or receive an external task.
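To make the S161-S169 sequence concrete, the following toy program wires two tasks together through a shared buffer ID and an event, in the spirit of S164-S167. All types and values here are invented for illustration; the real interface is the command-generating programming interface described above:

```c
#include <stdio.h>

/* Illustrative types standing in for the real programming interface. */
typedef struct { int id; int size; } buffer_t;
typedef struct {
    const char *name;
    int in_buf, out_buf;        /* buffer IDs, -1 if unused (S164) */
    int wait_event, pub_event;  /* event IDs, -1 if none (S165) */
    int params[4], n_params;    /* parameter table (S166-S167) */
} task_t;

int main(void)
{
    buffer_t buf2 = { 2, 31 * 1024 };                 /* S164 */
    task_t a = { "A", -1, buf2.id, -1, 100, {0}, 0 }; /* A publishes event 100 */
    task_t b = { "B", buf2.id, -1, 100, -1, {0}, 0 }; /* B waits on event 100 */
    b.params[b.n_params++] = 42;                      /* S167 */
    /* S168-S169: submitting to a queue and waiting for the callback
     * would happen here through the real interface. */
    printf("task %s reads buffer %d after event %d fires\n",
           b.name, b.in_buf, b.wait_event);
    return 0;
}
```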
The following describes in detail, with reference to FIG. 10, a schematic flowchart of a method for processing tasks in a specific case. FIG. 10 is explained in conjunction with FIG. 11. As shown in FIG. 10, a method 200 for processing tasks according to an embodiment of the present invention includes:
S201: Initialize the scheduler.
The resources to be used (queues, events, buffers, commands) are created and all placed into shared queues, from which the main core execution layer requests them for use.
S202: The scheduler performs the response processing for creating the functional resource set.
The response processing for creating the functional resource set includes initializing the operation units, initializing the storage manager within each operation unit, and initializing the shared memory.
S203: The scheduler waits for commands delivered by the main core execution layer and processes them.
S204: The scheduler polls the parallel queues from high priority to low and finds a ready task.
S205: If there is a ready task, the scheduler selects an idle operation unit from the operation unit set and prepares the data required by the ready task.
An idle operation unit can be virtualized into two ping-pong logical resources. After one logical resource has been assigned a task, the other logical resource can also be assigned a task. This ensures that while one logical resource is computing, the data for the other logical resource is being prepared, which reduces data waiting and improves computing resource utilization.
Before the data is prepared, the near-end memory of the operation unit must first be requested; a DMA transfer is then set up to move the data into the near-end memory.
The memory can be handled according to its tier. The memory allocation algorithm for near-end memory uses fixed resource pool requests to reduce memory fragmentation and improve allocation and release efficiency. Data in near-end memory supports residency until the user releases it, or, when memory is insufficient, is swapped out to far-end memory (according to the swap level set by the user). Because the requests are for fixed-size memory blocks, memory utilization is improved and memory waste is reduced.
In S205, if there is no ready task, the feedback of the already delivered tasks is processed, the event list is updated, the ready-task table is updated, the corresponding memory is released, and the process then returns to S203.
In the embodiments of the present invention, optionally, a memory lock can be set to guarantee the consistency of data reads and writes, thereby automatically solving the consistency problem of multiple cores (operation units) reading and writing at the same time. Specifically, it may be stipulated that while a file's data is being rewritten, no task can read the data in that file, or that while a file's data is being read by a task, the file's data cannot be rewritten. Multiple tasks reading the data in the file at the same time, however, is permitted.
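The rule described, many simultaneous readers but exclusive writers, matches the semantics of a readers-writer lock. A minimal POSIX-threads sketch, with an illustrative shared buffer standing in for the file data:

```c
#include <pthread.h>

static pthread_rwlock_t buf_lock = PTHREAD_RWLOCK_INITIALIZER;
static unsigned char    buf_data[4096];

void task_read(unsigned char *dst, int n)
{
    pthread_rwlock_rdlock(&buf_lock);   /* shared: many readers allowed */
    for (int i = 0; i < n; i++)
        dst[i] = buf_data[i];
    pthread_rwlock_unlock(&buf_lock);
}

void producer_write(const unsigned char *src, int n)
{
    pthread_rwlock_wrlock(&buf_lock);   /* exclusive: no readers, no writers */
    for (int i = 0; i < n; i++)
        buf_data[i] = src[i];
    pthread_rwlock_unlock(&buf_lock);
}
```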
The method of determining a ready task and an idle operation unit in steps S204 and S205 is described in detail below with reference to FIG. 11. The method in FIG. 11 is performed by the scheduler in the digital signal processing system.
As shown in FIG. 11, in S301, the parallel queues are polled, and it is determined whether the lowest-priority queue has been reached.
S302: When it is determined that the lowest-priority queue has not yet been polled, obtain a ready queue.
S303: Determine whether there is a ready task in the ready queue and whether there is an idle operation unit in the system.
Within one queue, tasks delivered into the queue are executed serially, following the first-in-first-out ordering principle.
S304: When it is determined that there is a ready task and an idle operation unit, obtain the ready task and the idle operation unit.
S305: Prepare data for the obtained ready task on the obtained idle operation unit.
After S305, continue searching for ready tasks in the queue of the current priority and determining whether there is an idle operation unit in the system, that is, execute S303 and its subsequent steps.
Optionally, in S303, if it is determined that the ready queue of the current priority contains no ready task, query whether there is a ready queue at the next priority, that is, re-execute S301 and its subsequent steps.
Optionally, in S301, if the lowest-priority queue has already been polled (that is, all the queues have been polled), perform the following steps:
S306: Determine whether there is an idle computing resource.
S307: When there is an idle computing resource, locate it.
An idle computing resource should be understood as one logical resource of an operation unit.
S308: Obtain the partner resource of the idle computing resource.
The partner resource of the idle computing resource refers to the other logical resource of the operation unit found in S307.
S309: Prepare data for a task on the found idle computing resource.
Specifically, when a task has already been deployed on the partner resource obtained in S308, data is prepared on the found idle computing resource for a task that depends on the task deployed on the partner resource. After data has been prepared for that dependent task, S306 and its subsequent steps can continue, preparing data for further tasks.
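The S301-S309 flow can be condensed into one scheduling pass: scan the queues by priority, pair ready tasks with idle logical slots, and stop when either runs out. The sketch below is a simplification under assumed data structures (one pending entry per queue, two ping-pong slots per unit); it is not the hardware scheduler's actual algorithm:

```c
#include <stdbool.h>

#define NUM_PRIO  3
#define NUM_UNITS 2

typedef struct { int task_id; bool ready; } entry_t;
typedef struct { entry_t head; bool empty; } queue_t;  /* FIFO head only */
typedef struct { bool busy[2]; int task[2]; } unit_t;  /* ping-pong slots */

/* One pass over S301-S305: scan queues from high to low priority and
 * pair each ready queue head with an idle logical slot. */
static void schedule_pass(queue_t q[NUM_PRIO], unit_t u[NUM_UNITS])
{
    for (int p = 0; p < NUM_PRIO; p++) {              /* S301-S302 */
        if (q[p].empty || !q[p].head.ready)
            continue;                                 /* S303: next priority */
        for (int i = 0; i < NUM_UNITS; i++)
            for (int s = 0; s < 2; s++)
                if (!u[i].busy[s]) {                  /* S304: idle slot */
                    u[i].busy[s] = true;              /* S305: prepare data */
                    u[i].task[s] = q[p].head.task_id;
                    q[p].empty   = true;              /* pop the queue head */
                    goto next_queue;
                }
        return;                                       /* no idle slot left */
next_queue:;
    }
    /* All queues polled: S306-S309 would now pre-load data onto the
     * idle partner (pong) slots of units whose other slot is busy. */
}
```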
Therefore, the task processing method of the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
An apparatus for processing tasks in a multi-core digital signal processing system according to an embodiment of the present invention is described in detail below with reference to FIG. 12. As shown in FIG. 12, the apparatus 10 includes:
a determining module 11, configured to determine a ready task in a task queue,
where the determining module 11 is further configured to determine a target operation unit to execute the ready task; and
a task execution module 12, configured to execute the ready task on the target operation unit and, at the same time, prepare data for a to-be-executed task on the target operation unit.
Therefore, the apparatus for processing tasks in a multi-core digital signal processing system according to the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
In the embodiments of the present invention, optionally, in the aspect of determining the target operation unit to execute the ready task, the determining module 11 is specifically configured to: when determining that the operation unit that executed the dependency task of the ready task is idle, determine that operation unit as the target operation unit.
In the embodiments of the present invention, optionally, as shown in FIG. 13, the apparatus further includes a memory request module 13,
where, before the task execution module 12 executes the ready task on the target operation unit, the memory request module 13 is specifically configured to: determine a memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task; and move the input data corresponding to the ready task into the memory block.
In the embodiments of the present invention, optionally, in the aspect of determining the memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task, the memory request module 13 is specifically configured to: determine the memory block according to a fixed resource pool algorithm, where the data stored in the near-end memory of the target operation unit supports residency until the user releases it, or is swapped out to far-end memory when the near-end memory is insufficient.
In the embodiments of the present invention, optionally, in the aspect of determining the memory block according to the fixed resource pool algorithm, the memory request module 13 is specifically configured to: determine the number of memory blocks that need to be requested in the near-end memory according to the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
In the embodiments of the present invention, optionally, as shown in FIG. 14, the apparatus further includes:
a service abstraction module 14, configured to: before the determining module 11 determines the ready task in the task queue, perform abstraction processing on the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
In the embodiments of the present invention, optionally, the task queue is a plurality of parallel task queues,
where, in the aspect of determining the ready task in the task queue, the determining module 11 is specifically configured to:
poll the plurality of parallel task queues in priority order to determine the ready task.
In the embodiments of the present invention, optionally, in the aspect of performing abstraction processing on the service that includes the ready task, the service abstraction module 14 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer identifier (ID) of the buffer.
In the embodiments of the present invention, optionally, the task execution module 12 is further configured to: when the ready task has finished executing on the target operation unit, save the output data of the ready task in the near-end memory.
It should be understood that the foregoing and other operations and/or functions of the apparatus 10 according to the embodiments of the present invention are respectively intended to implement the methods in FIG. 6 to FIG. 8; for brevity, details are not repeated here.
Therefore, the apparatus for processing tasks in a multi-core digital signal processing system according to the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
FIG. 15 shows a schematic block diagram of an apparatus 100 for processing tasks in a multi-core digital signal processing system according to another embodiment of the present invention.
As shown in FIG. 15, the hardware structure of the apparatus 100 for processing tasks in the multi-core digital signal processing system may include three parts: a transceiver device 101, a software device 102, and a hardware device 103.
The transceiver device 101 is a hardware circuit used to complete packet transmission and reception.
The hardware device 103 may also be called a "hardware processing module" or, more simply, "hardware". The hardware device 103 mainly includes hardware circuits based on FPGAs, ASICs, and the like (together with supporting devices such as memories) that implement certain specific functions. Its processing speed is often much faster than that of a general-purpose processor, but once its functions are customized they are difficult to change, so implementation is inflexible and it is usually used to handle fixed functions. It should be noted that, in practical applications, the hardware device 103 may also include processors such as an MCU (a microprocessor, for example a microcontroller unit) or a CPU, but the main function of these processors is not to process large volumes of data; they are mainly used for control. In this application scenario, a system assembled from these devices is a hardware device.
The software device 102 (or simply "software") mainly includes a general-purpose processor (for example, a CPU) and its supporting devices (such as storage devices like memory and hard disks). The processor can be programmed to have the corresponding processing functions. When implemented in software, functions can be flexibly configured according to service requirements, but execution is often slower than on hardware devices. After the software has completed processing, the processed data may be sent through the transceiver device 101 via the hardware device 103, or may be sent to the transceiver device 101 through an interface connected to the transceiver device 101.
Optionally, in an embodiment, the hardware device 103 is configured to: determine a ready task in a task queue; determine a target operation unit to execute the ready task; and execute the ready task on the target operation unit while simultaneously preparing data for a to-be-executed task on the target operation unit.
Optionally, in an embodiment, in the aspect of determining the target operation unit to execute the ready task, the hardware device 103 is specifically configured to: when determining that the operation unit that executed the dependency task of the ready task is idle, determine that operation unit as the target operation unit.
Optionally, in an embodiment, before the ready task is executed on the target operation unit, the hardware device 103 is specifically configured to: determine a memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task; and move the input data corresponding to the ready task into the memory block.
Optionally, in an embodiment, in the aspect of determining the memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task, the hardware device 103 is specifically configured to: determine the memory block according to a fixed resource pool algorithm, where the data stored in the near-end memory of the target operation unit supports residency until the user releases it, or is swapped out to far-end memory when the near-end memory is insufficient.
Optionally, in an embodiment, in the aspect of determining the memory block according to the fixed resource pool algorithm, the hardware device 103 is specifically configured to: determine the number of memory blocks according to the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
Optionally, in an embodiment, the hardware device 103 is further configured to: before determining the ready task in the task queue, perform abstraction processing on the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
Optionally, in an embodiment, the task queue is a plurality of parallel task queues, where, in the aspect of performing abstraction processing on the service that includes the ready task, the hardware device 103 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer identifier (ID) of the buffer.
Optionally, in an embodiment, the hardware device 103 is further configured to: when the ready task has finished executing on the target operation unit, save the output data of the ready task in the near-end memory.
Through the software-hardware combination of this embodiment, data can be prepared for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered as going beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the displayed or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist physically alone, or two or more units may be integrated into one unit.
When the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (18)

  1. 一种多核数字信号处理系统中处理任务的方法,其特征在于,所述方法包括:A method for processing a task in a multi-core digital signal processing system, the method comprising:
    确定任务队列中的就绪任务;Determine the ready task in the task queue;
    确定执行所述就绪任务的目标运算单元;Determining a target arithmetic unit that performs the ready task;
    通过所述目标运算单元执行所述就绪任务,并同时通过所述目标运算单元为待执行任务准备数据。The ready task is executed by the target operation unit, and at the same time, data is prepared for the task to be executed by the target operation unit.
  2. 根据权利要求1所述的方法,其特征在于,所述确定执行所述就绪任务的目标运算单元,包括:The method of claim 1, wherein the determining the target computing unit that performs the ready task comprises:
    在确定执行所述就绪任务的依赖任务的运算单元空闲时,将所述执行所述就绪任务的依赖任务的运算单元确定为所述目标运算单元。When it is determined that the operation unit of the dependent task executing the ready task is idle, the operation unit that executes the dependent task of the ready task is determined as the target operation unit.
  3. 根据权利要求1或2所述的方法,其特征在于,在通过所述目标运算单元执行所述就绪任务之前,所述方法还包括:The method according to claim 1 or 2, wherein, before the performing the task is performed by the target computing unit, the method further comprises:
    确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块;Determining, in a near-end memory of the target arithmetic unit, a memory block for storing input data corresponding to the ready task;
    将所述与所述就绪任务相对应的输入数据搬移到所述内存块中。The input data corresponding to the ready task is moved into the memory block.
  4. 根据权利要求3所述的方法,其特征在于,所述确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块,包括:The method according to claim 3, wherein the determining a memory block in the near-end memory of the target operation unit for storing input data corresponding to the ready task comprises:
    根据固定资源池算法,确定所述内存块,其中,所述目标运算单元的近端内存中存储的数据支持驻留直到用户释放或在近端内存不够时将数据置换到远端内存。The memory block is determined according to a fixed resource pool algorithm, wherein data stored in the near-end memory of the target arithmetic unit supports camping until the user releases or replaces the data to the remote memory when the near-end memory is insufficient.
  5. 根据权利要求4所述的方法,其特征在于,所述根据固定资源池算法,确定所述内存块,包括:The method according to claim 4, wherein the determining the memory block according to a fixed resource pool algorithm comprises:
    根据所述就绪任务需要的所有参数对应的数据块的总和与所述近端内存中的单个内存块的大小的比值,确定所述内存块的数量。The number of the memory blocks is determined according to a ratio of a sum of data blocks corresponding to all parameters required by the ready task to a size of a single memory block in the near-end memory.
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,在确定任务队列中的就绪任务之前,所述方法还包括:The method according to any one of claims 1 to 5, wherein before determining the ready task in the task queue, the method further comprises:
    对包括所述就绪任务的业务进行抽象处理,得到抽象处理信息,所述抽象处理信息包括下列信息中的至少一种:任务依赖关系信息、数据依赖关系信息和任务执行的先后顺序信息。 The abstract processing is performed on the service including the ready task, and the abstract processing information includes at least one of the following information: task dependency information, data dependency information, and sequence information of task execution.
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述任务队列为多个并行的任务队列;The method according to any one of claims 1 to 6, wherein the task queue is a plurality of parallel task queues;
    其中,所述确定任务队列中的就绪任务,包括:The determining the ready task in the task queue includes:
    按照优先级顺序轮询所述多个并行的任务队列,确定所述就绪任务。The plurality of parallel task queues are polled in order of priority to determine the ready task.
  8. 根据权利要求6所述的方法,其特征在于,所述对包括所述就绪任务的业务进行抽象处理,包括:The method according to claim 6, wherein the abstract processing of the service including the ready task comprises:
    根据所述业务的需要创建缓存;Create a cache according to the needs of the business;
    根据所述缓存的缓存标识ID,确定所述数据依赖关系信息。Determining the data dependency information according to the cached cache identifier ID.
  9. 根据权利要求1至8中任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1 to 8, wherein the method further comprises:
    在通过所述目标运算单元执行完所述就绪任务时,将所述就绪任务的输出数据保存在所述近端内存中。The output data of the ready task is saved in the near-end memory when the ready task is executed by the target operation unit.
  10. 一种多核数字信号处理系统中处理任务的装置,其特征在于,所述装置包括:A device for processing a task in a multi-core digital signal processing system, the device comprising:
    确定模块,用于确定任务队列中的就绪任务;Determining a module for determining a ready task in the task queue;
    所述确定模块,还用于确定执行所述就绪任务的目标运算单元;The determining module is further configured to determine a target computing unit that executes the ready task;
    任务执行模块,用于通过所述目标运算单元执行所述就绪任务,并同时通过所述目标运算单元为待执行任务准备数据。a task execution module, configured to execute the ready task by the target operation unit, and simultaneously prepare data for the task to be executed by the target operation unit.
  11. 根据权利要求10所述的装置,其特征在于,在确定执行所述就绪任务的目标运算单元方面,所述确定模块具体用于:The apparatus according to claim 10, wherein in determining a target computing unit that performs the ready task, the determining module is specifically configured to:
    在确定执行所述就绪任务的依赖任务的运算单元空闲时,将所述执行所述就绪任务的依赖任务的运算单元确定为所述目标运算单元。When it is determined that the operation unit of the dependent task executing the ready task is idle, the operation unit that executes the dependent task of the ready task is determined as the target operation unit.
  12. 根据权利要求10或11所述的装置,其特征在于,所述装置还包括内存申请模块;The device according to claim 10 or 11, wherein the device further comprises a memory application module;
    其中,在所述任务执行模块通过所述目标运算单元执行所述就绪任务之前,所述内存申请模块具体用于:The memory application module is specifically configured to: before the task execution module executes the ready task by using the target operation unit:
    确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块;Determining, in a near-end memory of the target arithmetic unit, a memory block for storing input data corresponding to the ready task;
    将所述与所述就绪任务相对应的输入数据搬移到所述内存块中。The input data corresponding to the ready task is moved into the memory block.
  13. 根据权利要求12所述的装置,其特征在于,在确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块方 面,所述内存申请模块具体用于:The apparatus according to claim 12, wherein a memory block for storing input data corresponding to said ready task in a near-end memory of said target arithmetic unit is determined The memory application module is specifically configured to:
    根据固定资源池算法,确定所述内存块,其中,所述目标运算单元的近端内存中存储的数据支持驻留直到用户释放或在近端内存不够时将数据置换到远端内存。The memory block is determined according to a fixed resource pool algorithm, wherein data stored in the near-end memory of the target arithmetic unit supports camping until the user releases or replaces the data to the remote memory when the near-end memory is insufficient.
  14. 根据权利要求13所述的装置,其特征在于,在根据固定资源池算法,确定所述内存块方面,所述内存申请模块具体用于:The device according to claim 13, wherein in the determining the memory block according to a fixed resource pool algorithm, the memory application module is specifically configured to:
    根据所述就绪任务需要的所有参数对应的数据块的总和与所述近端内存中的单个内存块的大小的比值,确定所述内存块的数量。The number of the memory blocks is determined according to a ratio of a sum of data blocks corresponding to all parameters required by the ready task to a size of a single memory block in the near-end memory.
  15. 根据权利要求10至14中任一项所述的装置,其特征在于,所述装置还包括:The device according to any one of claims 10 to 14, wherein the device further comprises:
    业务抽象模块,用于在所述确定模块确定任务队列中的就绪任务之前,对包括所述就绪任务的业务进行抽象处理,得到抽象处理信息,所述抽象处理信息包括下列信息中的至少一种:任务依赖关系信息、数据依赖关系信息和任务执行的先后顺序信息。a service abstraction module, configured to perform abstract processing on the service including the ready task before the determining module determines the ready task in the task queue, to obtain abstract processing information, where the abstract processing information includes at least one of the following information : Task dependency information, data dependency information, and sequence of task execution.
  16. 根据权利要求10至15中任一项所述的装置,其特征在于,所述任务队列为多个并行的任务队列;The apparatus according to any one of claims 10 to 15, wherein the task queue is a plurality of parallel task queues;
    其中,在确定任务队列中的就绪任务方面,所述确定模块具体用于:Wherein, in determining the ready task in the task queue, the determining module is specifically configured to:
    按照优先级顺序轮询所述多个并行的任务队列,确定所述就绪任务。The plurality of parallel task queues are polled in order of priority to determine the ready task.
  17. 根据权利要求15所述的装置,其特征在于,在对包括所述就绪任务的业务进行抽象处理方面,所述业务抽象模块具体用于:The apparatus according to claim 15, wherein in the abstract processing of the service including the ready task, the service abstraction module is specifically configured to:
    根据所述业务的需要创建缓存;Create a cache according to the needs of the business;
    根据所述缓存的缓存标识ID,确定所述数据依赖关系信息。Determining the data dependency information according to the cached cache identifier ID.
  18. The apparatus according to any one of claims 10 to 17, wherein the task execution module is further configured to:
    when the ready task has been executed by the target operation unit, save output data of the ready task in the near-end memory.
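Keeping the output in near-end memory, per claim 18, pairs with the locality preference of claim 11: a successor task dispatched to the same unit reads the result without a second transfer. A minimal bookkeeping sketch, with hypothetical names throughout (C++):

    #include <cstddef>
    #include <unordered_map>
    #include <utility>

    struct MemBlock { void* base; size_t size; };

    // task id -> (unit id, near-end block still holding the task's output)
    std::unordered_map<int, std::pair<int, MemBlock>> output_location;

    // Record where a finished task left its result; the scheduler can then
    // steer dependent tasks toward that unit.
    void on_task_done(int task_id, int unit_id, MemBlock out) {
        output_location[task_id] = {unit_id, out};
    }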

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2015/093248 WO2017070900A1 (en) 2015-10-29 2015-10-29 Method and apparatus for processing task in a multi-core digital signal processing system
CN201580083942.3A CN108351783A (en) 2015-10-29 2015-10-29 Method and apparatus for processing task in a multi-core digital signal processing system

Publications (1)

Publication Number Publication Date
WO2017070900A1 (en)

Family

ID=58629684

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/093248 WO2017070900A1 (en) 2015-10-29 2015-10-29 Method and apparatus for processing task in a multi-core digital signal processing system

Country Status (2)

Country Link
CN (1) CN108351783A (en)
WO (1) WO2017070900A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825461B (en) * 2018-08-10 2024-01-05 北京百度网讯科技有限公司 Data processing method and device
CN111026539B (en) * 2018-10-10 2022-12-02 上海寒武纪信息科技有限公司 Communication task processing method, task cache device and storage medium
CN111324427B (en) * 2018-12-14 2023-07-28 深圳云天励飞技术有限公司 Task scheduling method and device based on DSP
CN111767121B (en) * 2019-04-02 2022-11-01 上海寒武纪信息科技有限公司 Operation method, device and related product
CN114900486B (en) * 2022-05-09 2023-08-08 江苏新质信息科技有限公司 Multi-algorithm core calling method and system based on FPGA

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955685A (en) * 2011-08-17 2013-03-06 上海贝尔股份有限公司 Multicore DSP (digital signal processor), system with multicore DSP and scheduler
CN103440173A (en) * 2013-08-23 2013-12-11 华为技术有限公司 Scheduling method and related devices of multi-core processors
US20140068624A1 (en) * 2012-09-04 2014-03-06 Microsoft Corporation Quota-based resource management
CN104598426A (en) * 2013-10-30 2015-05-06 联发科技股份有限公司 task scheduling method applied to a heterogeneous multi-core processor system
CN104714785A (en) * 2015-03-31 2015-06-17 中芯睿智(北京)微电子科技有限公司 Task scheduling device, task scheduling method and data parallel processing device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005284749A (en) * 2004-03-30 2005-10-13 Kyushu Univ Parallel computer
CN1329825C (en) * 2004-10-08 2007-08-01 华为技术有限公司 Multi-task processing method based on digital signal processor
CN100361081C (en) * 2005-01-18 2008-01-09 华为技术有限公司 Method for processing multi-thread, multi-task and multi-processor
WO2007104330A1 (en) * 2006-03-15 2007-09-20 Freescale Semiconductor, Inc. Task scheduling method and apparatus
CN101610399B (en) * 2009-07-22 2010-12-08 杭州华三通信技术有限公司 Planning business dispatching system and method for realizing dispatch of planning business
CN102542379B (en) * 2010-12-20 2015-03-11 中国移动通信集团公司 Method, system and device for processing scheduled tasks
CN102096857B (en) * 2010-12-27 2013-05-29 大唐软件技术股份有限公司 Collaboration method and device for data processing process

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697122B (en) * 2017-10-20 2024-03-15 华为技术有限公司 Task processing method, device and computer storage medium
CN109697122A (en) * 2017-10-20 2019-04-30 华为技术有限公司 Task processing method, equipment and computer storage medium
CN109725994A (en) * 2018-06-15 2019-05-07 中国平安人寿保险股份有限公司 Data pick-up task executing method, device, terminal and readable storage medium storing program for executing
CN109725994B (en) * 2018-06-15 2024-02-06 中国平安人寿保险股份有限公司 Method and device for executing data extraction task, terminal and readable storage medium
CN110968418A (en) * 2018-09-30 2020-04-07 北京忆恒创源科技有限公司 Signal-slot-based large-scale constrained concurrent task scheduling method and device
CN111104168B (en) * 2018-10-25 2023-05-12 上海嘉楠捷思信息技术有限公司 Calculation result submitting method and device
CN111104167A (en) * 2018-10-25 2020-05-05 杭州嘉楠耘智信息科技有限公司 Calculation result submitting method and device
CN111104168A (en) * 2018-10-25 2020-05-05 杭州嘉楠耘智信息科技有限公司 Calculation result submitting method and device
CN111104167B (en) * 2018-10-25 2023-07-21 上海嘉楠捷思信息技术有限公司 Calculation result submitting method and device
CN111309482B (en) * 2020-02-20 2023-08-15 浙江亿邦通信科技有限公司 Hash algorithm-based block chain task allocation system, device and storable medium
CN111309482A (en) * 2020-02-20 2020-06-19 浙江亿邦通信科技有限公司 Ore machine controller task distribution system, device and storable medium thereof
CN112823343A (en) * 2020-03-11 2021-05-18 深圳市大疆创新科技有限公司 Direct memory access unit, processor, device, processing method, and storage medium
CN112148454A (en) * 2020-09-29 2020-12-29 行星算力(深圳)科技有限公司 Edge computing method supporting serial and parallel and electronic equipment
CN112365002A (en) * 2020-11-11 2021-02-12 深圳力维智联技术有限公司 Spark-based model construction method, device and system and storage medium
CN112667386A (en) * 2021-01-18 2021-04-16 青岛海尔科技有限公司 Task management method and device, storage medium and electronic equipment
CN113138812A (en) * 2021-04-23 2021-07-20 中国人民解放军63920部队 Spacecraft task scheduling method and device
CN115658325A (en) * 2022-11-18 2023-01-31 北京市大数据中心 Data processing method, data processing device, multi-core processor, electronic device, and medium
CN115658325B (en) * 2022-11-18 2024-01-23 北京市大数据中心 Data processing method, device, multi-core processor, electronic equipment and medium
CN116107724A (en) * 2023-04-04 2023-05-12 山东浪潮科学研究院有限公司 AI (advanced technology attachment) acceleration core scheduling management method, device, equipment and storage medium
CN117633914A (en) * 2024-01-25 2024-03-01 深圳市纽创信安科技开发有限公司 Chip-based password resource scheduling method, device and storage medium
CN117633914B (en) * 2024-01-25 2024-05-10 深圳市纽创信安科技开发有限公司 Chip-based password resource scheduling method, device and storage medium

Also Published As

Publication number Publication date
CN108351783A (en) 2018-07-31

Similar Documents

Publication Publication Date Title
WO2017070900A1 (en) Method and apparatus for processing task in a multi-core digital signal processing system
US10467725B2 (en) Managing access to a resource pool of graphics processing units under fine grain control
US10891158B2 (en) Task scheduling method and apparatus
US10275558B2 (en) Technologies for providing FPGA infrastructure-as-a-service computing capabilities
US9973512B2 (en) Determining variable wait time in an asynchronous call-back system based on calculated average sub-queue wait time
US10013264B2 (en) Affinity of virtual processor dispatching
JP2009265963A (en) Information processing system and task execution control method
US9471387B2 (en) Scheduling in job execution
US9218201B2 (en) Multicore system and activating method
US11347546B2 (en) Task scheduling method and device, and computer storage medium
CN111240813A (en) DMA scheduling method, device and computer readable storage medium
CN114168271B (en) Task scheduling method, electronic device and storage medium
Abeni et al. EDF scheduling of real-time tasks on multiple cores: Adaptive partitioning vs. global scheduling
US10261817B2 (en) System on a chip and method for a controller supported virtual machine monitor
US11494228B2 (en) Calculator and job scheduling between jobs within a job switching group
Shih et al. Virtual cloud core: Opencl workload sharing framework for connected devices
US9176910B2 (en) Sending a next request to a resource before a completion interrupt for a previous request
US20150363227A1 (en) Data processing unit and method for operating a data processing unit
CN113439260A (en) I/O completion polling for low latency storage devices
US11915041B1 (en) Method and system for sequencing artificial intelligence (AI) jobs for execution at AI accelerators
US20150293780A1 (en) Method and System for Reconfigurable Virtual Single Processor Programming Model
CN116795490A (en) vCPU scheduling method, device, equipment and storage medium
US8656375B2 (en) Cross-logical entity accelerators
CN114567520A (en) Method, computer equipment and communication system for realizing collective communication
Lin et al. Global Scheduling for the Embedded Virtualization System in the Multi-core Platform.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15906963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15906963

Country of ref document: EP

Kind code of ref document: A1