WO2017070900A1 - Method and apparatus for processing task in a multi-core digital signal processing system - Google Patents

Method and apparatus for processing task in a multi-core digital signal processing system

Info

Publication number
WO2017070900A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
ready
memory
data
determining
Application number
PCT/CN2015/093248
Other languages
French (fr)
Chinese (zh)
Inventor
范冰 (Fan Bing)
周卫荣 (Zhou Weirong)
李海龙 (Li Hailong)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2015/093248
Priority to CN201580083942.3A
Publication of WO2017070900A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • Embodiments of the present invention relate to the field of digital signal processors and, more particularly, to methods and apparatus for processing tasks in a multi-core digital signal processing system.
  • In the static task scheduling method of the related art, the software designer obtains the basic performance of each functional module from the software task graph and from performance simulation of each functional algorithm module, then matches this against the capabilities of the target hardware resources, deploying different software functions onto different hardware resources according to functional granularity and resource consumption. However, the static task scheduling method has limited application scenarios, high scheduling complexity, and low memory-resource utilization.
  • The dynamic task scheduling scheme in the related art adopts a resource pool with master-slave distributed scheduling: each processor carries a tailored operating system (OS) that can support creating tasks of different priorities and responding to external interrupts. The master core divides work into tasks of appropriate granularity and places them in a task cache pool; when a slave core is idle, it autonomously fetches a task from the master core and executes it.
  • However, each slave core must carry an operating system, and task switching and data loading occupy much of the slave-core load, so computing-resource and memory-resource utilization is low.
  • Embodiments of the present invention provide a method and apparatus for processing tasks in a multi-core digital signal processing system, which can determine the scheduling process at run time, dynamically allocate computing resources, improve the utilization of computing resources, and reduce system scheduling overhead.
  • According to a first aspect, a method for processing a task in a multi-core digital signal processing system is provided, including: determining a ready task in a task queue; determining a target computing unit that executes the ready task; and executing the ready task by the target computing unit while, at the same time, preparing data for a task to be executed through the target computing unit.
  • With this method, when a task is executed by one computing unit, data is simultaneously prepared for other tasks through that unit, so data loading and algorithm execution proceed in parallel, reducing the waiting cost of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • In one implementation of the first aspect, when the computing unit that executed the dependent task of the ready task is determined to be idle, that computing unit is determined as the target computing unit.
  • In this case, the ready task and its dependent task run on the same computing unit, so when the ready task executes there is no need to load its data again, which relieves congestion on the loading path.
  • In another implementation, before the ready task is executed by the target computing unit, the method further includes: determining a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, and moving that input data into the memory block.
  • In another implementation, determining the memory block in the near-end memory of the target computing unit includes: determining the memory block according to a fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to far-end memory.
  • In another implementation, the method further includes: when the target computing unit finishes executing the ready task, saving the output data of the ready task in the near-end memory.
  • In this way, both reads and writes during task execution hit the near-end memory, so execution does not stall waiting for data to arrive; moreover, allocating memory through a fixed resource pool reduces memory fragmentation, improves memory turnover efficiency, and saves memory.
  • In another possible implementation, determining the memory block according to the fixed-resource-pool algorithm includes: determining the number of memory blocks from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • By packing the memory data within a task in this way, memory-use efficiency can be further improved and memory waste reduced.
  • In another possible implementation, before determining a ready task in the task queue, the method further includes: abstracting the service that includes the ready task to obtain abstraction information, which includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • In another possible implementation, the task queue is a plurality of parallel task queues, and determining the ready task in the task queue includes: polling the parallel task queues in priority order to determine the ready task.
  • In another possible implementation, abstracting the service that includes the ready task includes: creating a buffer according to the needs of the service, and determining the data dependency information according to the buffer's identifier (ID).
  • According to a second aspect, an apparatus for processing a task in a multi-core digital signal processing system is provided, configured to perform the method of the first aspect or any possible implementation of the first aspect; specifically, the apparatus includes modules for performing that method.
  • According to a third aspect, a computer-readable medium is provided for storing a computer program, the computer program comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
  • According to a fourth aspect, a computer program product is provided, comprising computer program code which, when run by an apparatus for processing tasks in a multi-core digital signal processing system, causes the apparatus to perform the method of the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic structural diagram of an application system to which an embodiment of the present invention is applied;
  • FIG. 2 is a schematic diagram of the management modules included in the scheduler of the application system shown in FIG. 1 and their relationships;
  • FIG. 3 is a schematic diagram of data dependency relationships in an application system to which an embodiment of the present invention is applied;
  • FIG. 4 is a schematic diagram of the scheduling result when only one core is scheduled in an application system to which an embodiment of the present invention is applied;
  • FIG. 5 is a schematic diagram of scheduling results when three cores are scheduled in an application system to which an embodiment of the present invention is applied;
  • FIG. 6 is a schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention
  • FIG. 7 is another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention.
  • FIG. 8 is still another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention.
  • FIG. 9 is a schematic flowchart of a method for abstracting a service according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a method for implementing a processing task in a specific case according to an embodiment of the present invention.
  • FIG. 11 is a schematic flowchart of a method of determining a ready task and an idle operation unit according to an embodiment of the present invention
  • FIG. 12 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention
  • FIG. 13 is another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention
  • FIG. 14 is still another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention.
  • FIG. 15 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to another embodiment of the present invention.
  • The technical solutions of the embodiments of the present invention apply mainly to digital signal processing systems that require multi-core processing and involve large amounts of parallel computation, such as macro-base-station baseband chips and terminal chips.
  • The multi-core feature means that multiple computing modules are integrated on a single chip, including but not limited to multiple general-purpose processors, multiple IP cores, multiple dedicated processors, and the like; a system with more than one computing module is multi-core.
  • FIG. 1 is a schematic structural diagram of an application system (multi-core digital signal processing system) to which an embodiment of the present invention is applied.
  • the application system is composed of three parts: a main control layer, an execution layer, and an operation layer.
  • the main control layer carries user software, and completes high-level information interaction, process control, task decomposition, and task dependency definition.
  • the execution layer consists of three parts, the master core execution layer, the scheduler, and the slave core execution layer.
  • the main control core execution layer provides a software programming interface, submits commands to the scheduler and receives command feedback or callback notifications;
  • The scheduler is the hardware part and is responsible for task scheduling; its specific functions include dependency processing between tasks, memory management, task assignment, and data movement.
  • The scheduler is internally organized as multiple management modules: command management, command-queue management, event management, buffer descriptor management, shared-memory management, computation-memory management, computing-resource state management, and a scheduling master-control module.
  • The slave-core execution layer is the software part; it mainly receives task messages, calls the algorithm function library to perform the computation, and sends a task-end feedback message when the computation finishes.
  • the computing layer can be hardware or software, and is mainly responsible for processing tasks.
  • As an example, assume channel-processing segments with three priorities: Kernels 0 to 2 are high-priority processing, Kernels 3 to 5 are medium-priority processing, and Kernels 7 to 9 are low-priority processing.
  • As shown in FIG. 3, to mark the data flows being processed (the arrows indicate the direction of data flow), the data flows between the host (Host) and the device (Device) are labeled buffer input/output (Buff_In/Out) according to their direction, and the data flows between kernel processing stages are labeled intermediate buffers (Buff_M).
  • Data dependencies between different cores can be described as:
  • The inputs of Kernel_0/3/7 are Buff_In0/1/2 respectively, prepared by the Host (in a real application they may come from an external interface or from Hardware Accelerate Control (HAC) output).
  • After Kernel_2 finishes processing, the data is output to the Host (in real applications the Host usually needs to send the data processed by the digital signal processing (DSP) off-chip, or pass it to the HAC for further processing).
  • Kernel_2 depends on the output of Kernel_1 and also depends on the output of Kernel_5.
  • Kernel_4 depends on the output of Kernel_9 and also depends on the output of Kernel_3.
  • Kernel_8 depends on the output of Kernel_7 and also depends on the output of Kernel_3.
  • The output of Kernel_3 is used by Kernel_8 in addition to Kernel_4 (Kernel_8 may use only a part of it). A compact encoding of these dependencies is sketched below.
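As an illustration, the dependencies described above can be written down as a small table. The C sketch below only encodes what the text states; the struct and field names are hypothetical, not taken from the patent.

```c
/* Partial encoding of the FIG. 3 dependency graph described above.
 * producer == -1 means the input is prepared by the Host. */
typedef struct {
    int kernel;       /* consuming kernel */
    int producer_a;   /* first kernel it depends on, or -1 (Host) */
    int producer_b;   /* second kernel it depends on, or -1 (none) */
} KernelDep;

static const KernelDep deps[] = {
    { 0, -1, -1 },    /* Kernel_0 reads Buff_In0, prepared by the Host */
    { 3, -1, -1 },    /* Kernel_3 reads Buff_In1, prepared by the Host */
    { 7, -1, -1 },    /* Kernel_7 reads Buff_In2, prepared by the Host */
    { 2,  1,  5 },    /* Kernel_2 needs Kernel_1 and Kernel_5 output  */
    { 4,  9,  3 },    /* Kernel_4 needs Kernel_9 and Kernel_3 output  */
    { 8,  7,  3 },    /* Kernel_8 needs Kernel_7 and Kernel_3 output  */
};
```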
  • The scheduler schedules by priority according to the number of execution cores actually available, while ensuring that data dependencies are respected.
  • Assume the Host submits all kernels to the command queue (CommandQueue) and the input data Buff_In0/1/2 are all ready. If only one core can be scheduled, the scheduling result is shown in FIG. 4; if three cores can be scheduled, the scheduling result is shown in FIG. 5.
  • In FIG. 5, whether the dashed Kernel_2 and Kernel_4 are scheduled on DSP2 depends on the amount of input data: if the larger share of Kernel_4's input data resides on DSP2, Kernel_4 should be dispatched to DSP2; otherwise it should be dispatched to the core where the output of Kernel_3 resides.
  • a service refers to a program that processes data independently of hardware, and is a concept different from an operating system and a driver.
  • The service may be, for example, data channel estimation, a Fast Fourier Transformation (FFT), decoding, or another such operation.
  • A task is a software task: a piece of program that implements a function and usually needs to run on a processor core.
  • FIG. 6 is a schematic flow diagram of a method 100 of processing a task in accordance with an embodiment of the present invention.
  • the method 100 can be performed by the multi-core digital signal processing system shown in FIG. 1, as shown in FIG. 6, the method 100 includes:
  • S110: The multi-core digital signal processing system determines a ready task in a task queue. S120: The system determines a target computing unit capable of executing the ready task. S130: The ready task is executed by the determined target computing unit while, at the same time, data is prepared for a task to be executed through the target computing unit.
  • With this method, while a ready task is executed by the target computing unit, data is prepared for a task to be executed through the same unit, so data loading and computation proceed in parallel, reducing the waiting cost of data loading, improving parallelism between tasks, and reducing system scheduling overhead.
  • A ready task is a task whose preparation has been completed and which can start running.
  • the task to be executed can be understood as the task that needs to be executed after the ready task.
  • An arithmetic unit can be understood as a core.
  • Optionally, the task queue is a plurality of parallel task queues, and S110 is specifically: polling the parallel task queues in priority order to determine the ready task.
  • The multi-core digital signal processing system can create multiple parallel task queues; the queues are parallel to one another but have different priorities. After tasks are submitted to a queue, the tasks within each queue are executed serially, in first-in, first-out order.
  • Polling may proceed in descending order of priority: if a higher-priority queue contains no ready task, polling continues with the next priority level, and ends once a ready task is found or the lowest-priority queue has been polled, as sketched below.
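A minimal sketch of this polling order, assuming a hypothetical queue type and a simplified readiness check (not the patent's implementation):

```c
#include <stddef.h>

#define NUM_PRIORITIES 3          /* e.g. high, medium, low */
#define QUEUE_CAP      64

typedef struct Task Task;

typedef struct {
    Task  *items[QUEUE_CAP];      /* FIFO storage */
    size_t head, tail;            /* tasks run in first-in, first-out order */
} TaskQueue;

/* Simplified readiness check: treat the queue head as the candidate; a
 * real scheduler would also verify dependencies and input-data readiness. */
static Task *queue_peek_ready(TaskQueue *q) {
    return (q->head != q->tail) ? q->items[q->head % QUEUE_CAP] : NULL;
}

/* Poll the parallel queues from highest to lowest priority; return the
 * first ready task found, or NULL once the lowest-priority queue is polled. */
Task *find_ready_task(TaskQueue queues[NUM_PRIORITIES]) {
    for (int prio = 0; prio < NUM_PRIORITIES; prio++) {
        Task *t = queue_peek_ready(&queues[prio]);
        if (t != NULL)
            return t;
    }
    return NULL;
}
```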
  • Executing the ready task by the target computing unit while preparing data for the next task through the same unit can be understood as virtualizing one computing unit into two "ping-pong" logical resources: while one logical resource is running a task, the other can already be assigned a task and have its data prepared, reducing data waiting and improving computing-resource utilization.
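One way to picture this virtualization, as a sketch with hypothetical names:

```c
typedef struct Task Task;

typedef enum { SLOT_IDLE, SLOT_LOADING, SLOT_RUNNING } SlotState;

typedef struct {
    SlotState state;
    Task     *task;        /* task bound to this logical resource */
} LogicalSlot;

/* One physical computing unit seen as two "ping-pong" logical resources:
 * while slot[0] is RUNNING a task, slot[1] can be LOADING the next task's
 * input data into near-end memory, so computation never waits for data. */
typedef struct {
    LogicalSlot slot[2];
} ComputeUnit;
```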
  • Optionally, S120 is specifically: when the computing unit that executed the dependent task of the ready task is determined to be idle, determining that computing unit as the target computing unit.
  • A dependent task of the ready task is a task whose output data serves as input data of the ready task.
  • The multi-core digital signal processing system may select an idle resource according to where the data is located. Preferably, the system records which computing unit processed the dependent task of the ready task; after identifying that unit, it allocates the ready task to it, that is, the computing unit that executed the dependent task is determined as the target computing unit. Since the dependent task and the ready task then run on the same computing unit, the data does not need to be loaded again for the ready task, which relieves congestion on the loading path.
  • Otherwise, an idle computing unit may be selected at random from the other idle computing units as the target computing unit.
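The selection rule just described, prefer the unit that ran the dependent task and otherwise pick any idle unit, might look like the following sketch; the bookkeeping fields and the idle query are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Task Task;
struct Task {
    Task *dependent;    /* task whose output this task consumes, or NULL */
    int   executed_on;  /* unit that ran the dependent task, or -1 */
};

/* Hypothetical status query; stubbed here so the sketch is self-contained. */
static bool unit_is_idle(int unit_id) { (void)unit_id; return true; }

/* Prefer the unit whose near-end memory already holds this task's input
 * (the unit that executed the dependent task); otherwise pick any idle unit. */
int select_target_unit(const Task *ready, int num_units) {
    if (ready->dependent != NULL &&
        ready->dependent->executed_on >= 0 &&
        unit_is_idle(ready->dependent->executed_on))
        return ready->dependent->executed_on;

    for (int u = 0; u < num_units; u++)
        if (unit_is_idle(u))
            return u;       /* any idle unit will do */
    return -1;              /* no unit idle right now */
}
```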
  • Optionally, before the ready task is performed by the target computing unit, the method 100 further includes:
  • S140: Determine a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task; the input data corresponding to the ready task is then moved into that memory block.
  • The data required to execute the ready task may reside on another computing unit or in external memory (for example, Double Data Rate (DDR) memory, L3 cache, etc.).
  • Optionally, the memory block is determined according to a fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to remote memory.
  • The memory space can be ranked by its distance from the computing unit and then handled level by level.
  • The memory-allocation algorithm for the near-end memory uses a fixed resource pool, which reduces memory fragmentation and improves allocation and release efficiency.
  • Alternatively, the near-end memory of the target computing unit may be allocated with other algorithms, for example a linked-list allocation algorithm, a buddy algorithm, a memory-pool-based buddy algorithm, a working-set algorithm, and the like, but the invention is not limited thereto.
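A fixed resource pool hands out uniform-size blocks from a preallocated region, so allocation and release are constant-time and fragmentation-free. The bitmap-style sketch below is one possible reading of such an allocator, not the patent's algorithm; the sizes are illustrative.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096           /* size of one near-end memory block */
#define NUM_BLOCKS 64

typedef struct {
    uint8_t storage[NUM_BLOCKS][BLOCK_SIZE];
    uint8_t used[NUM_BLOCKS];     /* 1 = block handed out */
} FixedPool;

/* Allocate one fixed-size block; NULL means the pool is exhausted, at
 * which point resident data could be swapped out to remote memory. */
void *pool_alloc(FixedPool *p) {
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (!p->used[i]) { p->used[i] = 1; return p->storage[i]; }
    return NULL;
}

/* Release a block; data otherwise stays resident until freed by the user. */
void pool_free(FixedPool *p, void *block) {
    for (int i = 0; i < NUM_BLOCKS; i++)
        if ((void *)p->storage[i] == block) { p->used[i] = 0; return; }
}
```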
  • Optionally, S140 is specifically: determining the number of memory blocks to request in the near-end memory from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • A task is equivalent to a function with parameters, and each parameter may be a piece of data or a numeric value; after assembly, the parameters can be packed into the same memory block or a set of blocks. For example, suppose task A has 10 parameters, each of data-block type, and the total size of the corresponding data blocks is 31 KB; if a single memory block in the near-end memory is 4 KB, then 8 memory blocks need to be requested. This improves memory-use efficiency and reduces memory waste.
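The worked example (31 KB of parameter data over 4 KB blocks gives 8 blocks) is a ceiling division, sketched below:

```c
#include <stddef.h>

/* Number of fixed-size near-end memory blocks needed for all parameter
 * data of a task: ceil(total_bytes / block_bytes).
 * E.g. blocks_needed(31 * 1024, 4 * 1024) == 8, as in the example above. */
size_t blocks_needed(size_t total_bytes, size_t block_bytes) {
    return (total_bytes + block_bytes - 1) / block_bytes;
}
```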
  • the method 100 further includes:
  • S160: Abstract the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • A service can be split into multiple tasks in order to abstract it.
  • Optionally, a buffer may be created according to the needs of the service, and the data dependency information is determined according to the buffer's ID.
  • A buffer is a data storage space: data is loaded into it before the task starts, and it is destroyed once no task needs it any longer. Each buffer has an ID, and the data relationships between tasks are linked through this ID. For example, if the output data of task A is the input data of task B, then the output buffer of task A is buffer2 and the input buffer of task B is also buffer2.
  • The creation of buffers may be decided by the programmer according to the needs of the service, and the number of buffers actually created is determined dynamically by the actual task execution process; a sketch of such buffer descriptors follows.
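A sketch of how buffer IDs might tie tasks together; the descriptor fields are hypothetical, not the patent's layout:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t id;         /* the buffer ID that tasks refer to */
    void    *data;       /* storage, loaded before the consumer starts */
    size_t   size;
    int      refcount;   /* destroy the buffer once no task needs it */
} BufferDesc;

#define MAX_PARAMS 8

typedef struct {
    uint32_t in_buf_ids[MAX_PARAMS];   /* task B lists buffer2 as input  */
    uint32_t out_buf_ids[MAX_PARAMS];  /* task A lists buffer2 as output */
    int      num_in, num_out;
} TaskIo;

/* The data dependency "A produces what B consumes" is expressed purely by
 * the shared ID (e.g. 2 for buffer2), not by passing pointers around. */
```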
  • The task dependency information is used to indicate the dependencies between tasks, and these dependencies may be associated through events.
  • For example, task A may choose to publish an event ID; task B, which needs to wait for task A to complete, fills the event ID published by A into its waiting-event list, whose description includes the number of events waited on and the IDs of those events.
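The waiting-event description (a count plus the waited IDs) maps naturally onto a small structure; this is only an illustrative sketch:

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_WAIT_EVENTS 8

typedef struct {
    int      num_wait;                  /* number of events waited on */
    uint32_t wait_ids[MAX_WAIT_EVENTS]; /* e.g. the event ID task A publishes */
    int      num_arrived;               /* events published so far */
} WaitEventList;

/* Task B becomes runnable once every event in its list has been published,
 * e.g. once task A completes and publishes its event ID. */
bool wait_list_satisfied(const WaitEventList *w) {
    return w->num_arrived >= w->num_wait;
}
```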
  • task input and output data features may be described, and a limited number of input and output parameters are supported.
  • the parameters support different features: input buffer, output buffer, external input pointer, incoming value, global pointer, and the like.
  • Optionally, when the target computing unit finishes executing the ready task, the output data of the ready task may be saved in the near-end memory.
  • When the next task is scheduled onto the same computing unit, the data does not need to be loaded again, which relieves congestion on the loading path.
  • Optionally, S160 may be executed by the master-core execution layer in the architecture shown in FIG. 1; the master-core execution layer may be a software programming interface that generates commands and submits them to the scheduler for execution.
  • the process of abstracting the service including the ready task, and obtaining the abstract processing information may include the following steps:
  • registering functions: the execution function library resides in the slave-core execution layer, and the master-core execution layer registers each function pointer or index into a function list;
  • creating a function resource set: the resource set includes the resources for executing tasks, that is, which computing units run the tasks and which direct memory access (DMA) channels are used;
  • creating queues: the queues have different priorities and are parallel to one another; tasks sent into a queue are executed serially within it, in first-in, first-out order;
  • creating buffers: a buffer is a piece of data storage space, loaded with data before the task starts and destroyed when no task needs its data any longer; each buffer has an ID used to link the data relationships between tasks, so if the output data of task A is the input data of task B, the output buffer of task A and the input buffer of task B are the same buffer, for example buffer2;
  • defining task dependencies: the dependencies between tasks can be associated through events; task A can choose to publish an event ID, and task B, which needs to wait for task A to complete, fills that event ID into its waiting-event list, whose description includes the number of waited events and the IDs of the waited events;
  • describing parameters: the multi-core digital signal processing system of this embodiment supports a limited number of input and output parameters, and the parameters support different features: input buffer, output buffer, external input pointer, passed-in value, and global pointer.
  • a method 200 for processing a task includes:
  • The scheduler performs the response processing for creating the function resource set.
  • the response processing for creating a set of functional resources includes initialization of the arithmetic unit, initialization of the storage manager in the arithmetic unit, and initialization of the shared memory.
  • the scheduler waits for a command sent by the main core execution layer, and performs command processing.
  • The scheduler polls the parallel queues in order of priority from high to low and finds a ready task.
  • The scheduler selects an idle computing unit from the computing-unit set and prepares the data required for the task;
  • The idle computing unit can be virtualized into two "ping-pong" logical resources: while one logical resource is computing, the other can already be assigned a task and have its data prepared, reducing data waiting and improving computing-resource utilization.
  • Memory can be handled level by level. The allocation algorithm for the near-end memory uses a fixed resource pool, reducing memory fragmentation and improving allocation and release efficiency.
  • Data in the near-end memory is allowed to remain resident until the user releases it or, when memory runs short, the data is swapped out to remote memory (according to a user-set replacement level). Because allocations come in fixed sizes, memory waste is reduced and memory-use efficiency is improved.
  • A memory lock can be set to guarantee the consistency of data reads and writes, automatically solving the consistency problem of simultaneous reads and writes by multiple cores (computing units). Specifically, it may be arranged that data cannot be read by a task while that data is being rewritten, and that data cannot be overwritten while a task is reading it; multiple tasks are, however, allowed to read the same data simultaneously.
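The rules just described (no reads during a rewrite, no rewrite while reads are in progress, concurrent reads allowed) are readers-writer lock semantics. Below is a portable sketch using POSIX threads, as one possible realization rather than the patent's mechanism:

```c
#include <pthread.h>

typedef struct {
    pthread_rwlock_t lock;  /* guards one shared data region */
    /* ... the shared data itself ... */
} SharedRegion;

/* Any number of tasks may read at once; a writer excludes all others. */
void read_data(SharedRegion *r) {
    pthread_rwlock_rdlock(&r->lock);   /* blocks while a rewrite runs */
    /* ... read the data ... */
    pthread_rwlock_unlock(&r->lock);
}

void rewrite_data(SharedRegion *r) {
    pthread_rwlock_wrlock(&r->lock);   /* blocks while any task reads */
    /* ... overwrite the data ... */
    pthread_rwlock_unlock(&r->lock);
}
```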
  • Determining the ready task and the idle computing unit in steps S204 and S205 will now be described in detail with reference to FIG. 11.
  • the method of Figure 11 is performed by a scheduler in a digital signal processing system.
  • The parallel queues are polled, and it is determined whether the lowest-priority queue has been polled;
  • After a task is sent into a queue, it is executed serially, in first-in, first-out order.
  • S303: If there is no ready task in the queue of the current priority, the queue of the next priority is checked for a ready task; that is, S301 and its subsequent steps are re-executed.
  • An idle computing resource should be understood as a logical resource of an arithmetic unit.
  • In S307, the partner resource of the idle computing resource refers to the other logical resource of the same computing unit.
  • In S308, when a task has already been deployed on the acquired partner resource, data is prepared on the idle computing resource for a task that depends on the task deployed on the partner resource. After that data is prepared, S306 and its subsequent steps can continue, preparing data for other tasks.
  • In this way, the method prepares data for a task to be executed through the target computing unit while the ready task is executed by that unit, making data loading parallel to computation, reducing the waiting overhead of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • the apparatus 10 includes:
  • a determining module 11 for determining a ready task in the task queue
  • the determining module 11 is further configured to determine a target computing unit that performs the ready task
  • the task execution module 12 is configured to execute the ready task by the target operation unit, and simultaneously prepare data for the task to be executed by the target operation unit.
  • The apparatus for processing tasks in the multi-core digital signal processing system of this embodiment prepares data for the task to be executed through the target computing unit while the ready task is executed by that unit, so that data loading and computation proceed in parallel, reducing the waiting overhead of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • Optionally, the determining module 11 is specifically configured to: when the computing unit that executed the dependent task of the ready task is determined to be idle, determine that computing unit as the target computing unit.
  • the device further includes a memory application module 13;
  • The memory application module 13 is specifically configured to: before the task execution module 12 executes the ready task through the target computing unit, determine a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, and move the input data corresponding to the ready task into the memory block.
  • Optionally, the memory application module 13 is specifically configured to: determine the memory block according to the fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to far-end memory.
  • Optionally, the memory application module 13 is specifically configured to: determine the number of memory blocks to request in the near-end memory from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • the device further includes:
  • The service abstraction module 14 is configured to abstract the service that includes the ready task before the determining module 11 determines the ready task in the task queue, obtaining abstraction information that includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • Optionally, the task queue is a plurality of parallel task queues, and the determining module 11 is specifically configured to: poll the parallel task queues in priority order to determine the ready task.
  • Optionally, the service abstraction module 14 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer's ID.
  • Optionally, the task execution module 12 is further configured to: when the target computing unit finishes executing the ready task, save the output data of the ready task in the near-end memory.
  • The apparatus for processing tasks in the multi-core digital signal processing system of this embodiment prepares data for the task to be executed through the target computing unit while the ready task is executed by that unit, so that data loading and computation proceed in parallel, reducing the waiting overhead of data loading, increasing the degree of parallelism between tasks, and reducing system scheduling overhead.
  • Figure 15 shows a schematic block diagram of an apparatus 100 for processing tasks in a multi-core digital signal processing system in accordance with another embodiment of the present invention.
  • the hardware structure of the apparatus 100 for processing tasks in the multi-core digital signal processing system may include three parts: a transceiver device 101, a software device 102, and a hardware device 103.
  • the transceiver device 101 is a hardware circuit for completing packet transmission and reception;
  • The hardware device 103 can also be called a "hardware processing module" or, more simply, "hardware".
  • The hardware device 103 mainly includes hardware circuits based on FPGAs or ASICs (together with supporting devices such as memory). Hardware circuits implementing particular functions are often much faster than general-purpose processors, but once customized they are difficult to change, so they are inflexible and usually used for fixed functions. Note that in practical applications the hardware device 103 may also include an MCU (a microcontroller such as a single-chip microcomputer) or a CPU, but the main role of such processors is not to process large volumes of data; they are mainly used for control. In this application scenario, a system equipped with these devices is a hardware device.
  • The software device 102 (or simply "software") mainly includes a general-purpose processor (such as a CPU) and supporting devices (such as memory or a hard disk), and can be programmed with the corresponding processing functions. When implemented in software, the functions can be configured flexibly according to service needs, but execution is often slower than on a hardware device.
  • After processing, the hardware device 103 may send the processed data out through the transceiver device 101, or send the processed data to the transceiver device 101 through an interface connected to it.
  • The hardware device 103 is configured to: determine a ready task in the task queue; determine a target computing unit that executes the ready task; and execute the ready task by the target computing unit while simultaneously preparing data for the task to be executed through the target computing unit.
  • Optionally, in determining the target computing unit that executes the ready task, the hardware device 103 is specifically configured to: when the computing unit that executed the dependent task of the ready task is determined to be idle, determine that computing unit as the target computing unit.
  • Optionally, before the ready task is executed by the target computing unit, the hardware device 103 is specifically configured to: determine a memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, and move the input data corresponding to the ready task into the memory block.
  • Optionally, in determining the memory block in the near-end memory of the target computing unit for storing the input data corresponding to the ready task, the hardware device 103 is specifically configured to: determine the memory block according to a fixed-resource-pool algorithm, where data stored in the near-end memory of the target computing unit is allowed to remain resident until the user releases it or, when near-end memory is insufficient, the data is swapped out to remote memory.
  • Optionally, in determining the memory block according to the fixed-resource-pool algorithm, the hardware device 103 is specifically configured to: determine the number of memory blocks from the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
  • Optionally, the hardware device 103 is further configured to: before determining a ready task in the task queue, abstract the service that includes the ready task to obtain abstraction information, which includes at least one of the following: task dependency information, data dependency information, and task execution order information.
  • Optionally, the task queue is a plurality of parallel task queues, and the hardware device 103 is specifically configured to: poll the parallel task queues in priority order to determine the ready task.
  • Optionally, the hardware device 103 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer's ID.
  • Optionally, the hardware device 103 is further configured to: when the target computing unit finishes executing the ready task, save the output data of the ready task in the near-end memory.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function; in actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • The technical solution of the present invention, or the part of it that is essential or that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

A method and apparatus for processing a task in a multi-core digital signal processing system. In the process of processing a task, the waiting overhead for data loading is reduced, a degree of parallelism between tasks is improved, and the scheduling overhead of a system is reduced. The method comprises: determining a ready task in a task queue (S110); determining a target operation unit for executing the ready task (S120); and executing the ready task by means of the target operation unit, and at the same time, preparing, by means of the target operation unit, data for a task to be executed (S130).

Description

多核数字信号处理系统中处理任务的方法和装置Method and apparatus for processing tasks in a multi-core digital signal processing system 技术领域Technical field
本发明实施例涉及数字信号处理器领域,并且更具体地,涉及多核数字信号处理系统中处理任务的方法和装置。Embodiments of the present invention relate to the field of digital signal processors and, more particularly, to methods and apparatus for processing tasks in a multi-core digital signal processing system.
背景技术Background technique
随着移动互联网技术发展,数据处理量急速增加,数字信号处理器芯片正在朝着多核大数据量处理方向迈进。数字信号处理器在进行数字运算时通常用软件代码实现,核数目增多给软件开发以及硬件资源利用和调试都带来诸多困难。当需求发生变换时,软件架构都需要重新划分多个不同核的功能映射关系,对一些硬件资源如内存,数据通道,消息资源等使用不充分造成浪费。With the development of mobile Internet technology, the amount of data processing has increased rapidly, and digital signal processor chips are moving toward multi-core and large data processing. Digital signal processors are usually implemented in software code when performing digital operations. The increase in the number of cores brings many difficulties to software development and hardware resource utilization and debugging. When the requirements change, the software architecture needs to re-divide the functional mapping relationship of multiple different cores, which is wasteful for insufficient use of some hardware resources such as memory, data channels, and message resources.
相关技术中的静态任务调度的方法在静态任务调度过程中,软件设计人员根据软件任务图标和各功能算法模块性能仿真评估后获得各功能模块的基本性能,在根据目标映射硬件资源的能力进行匹配,按照功能粒度,资源消耗将不同的软件功能部署到不同的硬件资源上,但是静态任务调度的方法适用场景受限、调度的复杂度高、内存资源利用率低。Static Task Scheduling Method in Related Art In the static task scheduling process, the software designer obtains the basic performance of each functional module according to the software task icon and the performance simulation of each functional algorithm module, and performs matching according to the capability of mapping hardware resources according to the target. According to the functional granularity, resource consumption deploys different software functions to different hardware resources, but the static task scheduling method has limited application scenarios, high scheduling complexity, and low memory resource utilization.
相关技术中的动态任务调度方案中采用主从分布式调度的资源池方案,每个处理器上均承载一个裁剪的操作系统(Operating System,简称为“OS”),可支持创建不同优先级的任务,可响应外部中断等,由主核将任务划分为适当的粒度放入任务缓存池,当从核空闲时,自主从主核中获取任务并执行。但是该方案中每个从核上均需要承载一个操作系统,任务切换,数据装载均会占用很多的从核负载,计算资源和内存资源利用率较低。The dynamic task scheduling scheme in the related art adopts a resource pool scheme of master-slave distributed scheduling, and each processor carries a tailored operating system (OS), which can support different priorities. The task can respond to external interrupts, etc., and the main core divides the task into the appropriate granularity into the task cache pool. When the core is idle, the task is automatically acquired from the main core and executed. However, in this solution, each slave core needs to carry an operating system, task switching, data loading will occupy a lot of slave load, and computing resources and memory resource utilization are low.
发明内容Summary of the invention
本发明实施例提供一种多核数字信号处理系统中处理任务的方法和装置,能够在任务运行时决定运行的调度过程,动态分配运算资源,提高运算资源的利用率,减少系统调度开销。Embodiments of the present invention provide a method and apparatus for processing a task in a multi-core digital signal processing system, which can determine a running scheduling process when a task runs, dynamically allocate computing resources, improve utilization of computing resources, and reduce system scheduling overhead.
第一方面,提供了一种多核数字信号处理系统中处理任务的方法,包括:确定任务队列中的就绪任务;确定执行该就绪任务的目标运算单元;通过该目标运算单元执行该就绪任务,并同时通过该目标运算单元为待执行任务准 备数据。In a first aspect, a method for processing a task in a multi-core digital signal processing system includes: determining a ready task in a task queue; determining a target computing unit that executes the ready task; performing the ready task by the target computing unit, and At the same time, the target computing unit is used as the task to be executed. Prepare data.
根据本发明实施例的多核数字信号处理系统中处理任务的方法,在通过一个运算单元执行任务时,同时通过该运算单元为其他任务准备数据,由此能够实现数据装载与算法业务执行并行,减少数据装载的等待开销,提高任务间的并行度,减少系统调度开销。According to the method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention, when a task is executed by one operation unit, data is prepared for other tasks through the operation unit at the same time, thereby enabling data loading and algorithm service execution to be parallel, thereby reducing The waiting cost of data loading increases the degree of parallelism between tasks and reduces system scheduling overhead.
结合第一方面,在第一方面的一种实现方式中,在确定执行该就绪任务的依赖任务的运算单元空闲时,将该执行该就绪任务的运算单元确定为该目标运算单元。In conjunction with the first aspect, in an implementation of the first aspect, when it is determined that the task-dependent operation unit that executes the ready task is idle, the operation unit that executes the ready task is determined as the target operation unit.
此时,执行该就绪任务与该就绪任务的依赖任务的运算单元为同一个运算单元,在执行该就绪任务时,不需要再次装载数据,减轻装载路径上的拥堵情况。At this time, the operation unit that executes the ready task and the task dependent task of the ready task is the same operation unit, and when the ready task is executed, there is no need to load data again, and the congestion condition on the loading path is alleviated.
结合第一方面及其上述实现方式,在第一方面的另一种实现方式中,在通过该目标运算单元执行该就绪任务之前,该方法还包括:确定该目标运算单元的近端内存中用于存放与该就绪任务相对应的输入数据的内存块;将该与该就绪任务相对应的输入数据搬移到该内存块中。In conjunction with the first aspect and the foregoing implementation manner, in another implementation manner of the first aspect, before the performing the task is performed by the target computing unit, the method further includes: determining, in the near-end memory of the target computing unit, a memory block storing input data corresponding to the ready task; moving the input data corresponding to the ready task to the memory block.
结合第一方面及其上述实现方式,在第一方面的另一实现方式中,该确定该目标运算单元的近端内存中用于存放与该就绪任务相对应的输入数据的内存块,包括:根据固定资源池算法,确定该内存块,其中,该目标运算单元的近端内存中存储的数据支持驻留直到用户释放或在近端内存不够时将数据置换到远端内存。In conjunction with the first aspect and the foregoing implementation manner, in another implementation manner of the first aspect, the determining, by the operating unit, the memory block of the near-end memory of the target computing unit for storing the input data corresponding to the ready task includes: The memory block is determined according to a fixed resource pool algorithm, wherein data stored in the near-end memory of the target unit supports docking until the user releases or replaces the data to the far-end memory when the near-end memory is insufficient.
结合第一方面及其上述实现方式,在第一方面的另一种实现方式中,该方法还包括:在通过该目标运算单元执行完该就绪任务时,将该就绪任务的输出数据保存在该近端内存中。In conjunction with the first aspect and the foregoing implementation manner, in another implementation manner of the first aspect, the method further includes: when the ready task is executed by the target computing unit, saving the output data of the ready task in the Near-end memory.
这样,执行任务时的读写内存均为近端内存,执行任务时不会因等待数据到达消耗时延,并且申请内存时采用固定资源池算法能够减小内存碎片,提高内存周转效率,节省内存。In this way, the read and write memory when performing the task is the near-end memory, and the task does not wait for the data to reach the consumption delay when executing the task, and the fixed resource pool algorithm can reduce the memory fragmentation when applying for the memory, improve the memory turnover efficiency, and save the memory. .
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,该根据固定资源池算法,确定该内存块,包括:根据该就绪任务需要的所有参数对应的数据块的总和与该近端内存中的单个内存块的大小的比值,确定该内存块的数量。With reference to the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, the determining, by the fixed resource pool algorithm, the memory block includes: a data block corresponding to all parameters required by the ready task The ratio of the sum of the sum to the size of a single block of memory in the near-end memory determines the number of blocks.
由此,通过将任务内的内存数据进行拼装处理,可以进一步提高内存的 使用效率,减少内存浪费。Thus, by assembling the memory data in the task, the memory can be further improved. Use efficiency to reduce memory waste.
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,在确定任务队列中的就绪任务之前,该方法还包括:对包括该就绪任务的业务进行抽象处理,得到抽象处理信息,该抽象处理信息包括下列信息中的至少一种:任务依赖关系信息、数据依赖关系信息和任务执行的先后顺序信息。In conjunction with the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, before determining a ready task in the task queue, the method further includes: performing abstract processing on the service including the ready task, Obtaining abstract processing information, the abstract processing information including at least one of the following information: task dependency information, data dependency information, and sequence information of task execution.
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,该任务队列为多个并行的任务队列;其中,该确定任务队列中的就绪任务,包括:按照优先级顺序轮询该多个并行的任务队列,确定该就绪任务。In conjunction with the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, the task queue is a plurality of parallel task queues, wherein the determining the ready tasks in the task queue includes: prioritizing The level sequence polls the plurality of parallel task queues to determine the ready task.
结合第一方面及其上述实现方式,在第一方面的另一种可能的实现方式中,该对包括该就绪任务的业务进行抽象处理,包括:根据该业务的需要创建缓存;根据该缓存的缓存标识ID,确定该数据依赖关系信息。In conjunction with the first aspect and the foregoing implementation manner, in another possible implementation manner of the first aspect, the abstract processing of the service including the ready task includes: creating a cache according to the requirement of the service; The cache ID is determined to determine the data dependency information.
第二方面,提供了一种多核数字信号处理系统中处理任务的装置,用于执行上述第一方面或第一方面的任一可能的实现方式中的方法,具体地,该装置包括用于执行上述第一方面或第一方面的任一可能的实现方式中的方法的模块。In a second aspect, there is provided apparatus for processing a task in a multi-core digital signal processing system, for performing the method of any of the first aspect or the first aspect of the first aspect, in particular A module of the method of the first aspect or any of the possible implementations of the first aspect.
第三方面,提供了一种计算机可读介质,用于存储计算机程序,该计算机程序包括用于执行第一方面或第一方面的任意可能的实现方式中的方法的指令。In a third aspect, a computer readable medium is provided for storing a computer program comprising instructions for performing the method of the first aspect or any of the possible implementations of the first aspect.
第四方面,提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,但该计算机程序代码被多核数字信号处理系统的处理任务的装置运行时,使得该装置执行上述第一方面或第一方面的任一可能的实现方式中的方法。In a fourth aspect, a computer program product is provided, the computer program product comprising: computer program code, wherein the computer program code is executed by a device of a processing task of a multi-core digital signal processing system, such that the device performs the first aspect or A method in any of the possible implementations of the first aspect.
附图说明DRAWINGS
为了更清楚地说明本发明实施例的技术方案,下面将对本发明实施例中所需要使用的附图作简单地介绍,显而易见地,下面所描述的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the present invention, Those skilled in the art can also obtain other drawings based on these drawings without paying any creative work.
图1是应用本发明实施例的应用系统的示意性架构图;1 is a schematic structural diagram of an application system to which an embodiment of the present invention is applied;
图2是图1所示的应用系统中的调度器包括的各个管理模块及其相互关 系的示意图;2 is a diagram of various management modules included in the scheduler in the application system shown in FIG. Schematic diagram of the system;
图3是应用本发明实施例的应用系统中的数据依赖关系的示意图;3 is a schematic diagram of data dependency relationships in an application system to which an embodiment of the present invention is applied;
图4是应用本发明实施例的应用系统中只有一个核被调度时的调度结果示意图;4 is a schematic diagram of scheduling results when only one core is scheduled in an application system to which an embodiment of the present invention is applied;
图5是应用本发明实施例的应用系统中有三个核被调度时的调度结果示意图;5 is a schematic diagram of scheduling results when three cores are scheduled in an application system to which an embodiment of the present invention is applied;
图6是根据本发明实施例的多核数字信号处理系统中处理任务的方法的示意性流程图;6 is a schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图7是根据本发明实施例的多核数字信号处理系统中处理任务的方法的另一示意性流程图;7 is another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图8是根据本发明实施例的多核数字信号处理系统中处理任务的方法的再一示意性流程图;FIG. 8 is still another schematic flowchart of a method for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention; FIG.
图9是根据本发明实施例的对业务进行抽象处理的方法的示意性流程图;FIG. 9 is a schematic flowchart of a method for abstracting a service according to an embodiment of the present invention; FIG.
图10是根据本发明实施例的一种具体情况下实现处理任务的方法的示意性流程图;FIG. 10 is a schematic flowchart of a method for implementing a processing task in a specific case according to an embodiment of the present invention; FIG.
图11是根据本发明实施例的确定就绪任务和空闲运算单元的方法的示意性流程图;11 is a schematic flowchart of a method of determining a ready task and an idle operation unit according to an embodiment of the present invention;
图12是根据本发明实施例的多核数字信号处理系统中处理任务的装置的示意性框图;12 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图13是根据本发明实施例的多核数字信号处理系统中处理任务的装置的另一示意性框图;13 is another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图14是根据本发明实施例的多核数字信号处理系统中处理任务的装置的再一示意性框图;14 is still another schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system according to an embodiment of the present invention;
图15是根据本发明另一实施例的多核数字信号处理系统中处理任务的装置的示意性框图。15 is a schematic block diagram of an apparatus for processing a task in a multi-core digital signal processing system in accordance with another embodiment of the present invention.
具体实施方式detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that the technical solutions in the embodiments of the present invention are mainly applied to digital signal processing systems that require multi-core processing and involve a large amount of parallel computation, such as macro base station baseband chips and terminal chips. The multi-core feature is embodied in the number of computing modules integrated on a single chip, including but not limited to multiple general-purpose processors, multiple IP cores, multiple dedicated processors, and the like. If the number of computing modules is greater than one, the chip is multi-core.
FIG. 1 shows a schematic architecture diagram of an application system (a multi-core digital signal processing system) to which an embodiment of the present invention is applied. The application system consists of three parts: a main control layer, an execution layer, and a computation layer. The main control layer carries the user software and completes functions such as high-level information exchange, flow control, task decomposition, and task dependency definition. The execution layer consists of three parts: a main control core execution layer, a scheduler, and a slave core execution layer. The main control core execution layer provides a software programming interface, submits commands to the scheduler, and receives command feedback or callback notifications. The scheduler is a hardware component responsible for task scheduling; its functions include inter-task dependency processing, memory management, task assignment, and data movement. As shown in FIG. 2, the scheduler is internally managed by multiple management modules: command management, command queue management, event management, buffer descriptor management, shared memory management, computation memory management, computing resource state management, and a main scheduling control module. The slave core execution layer is a software component mainly responsible for receiving task messages, invoking the algorithm function library for execution, and sending a task-completion feedback message after the computation ends. The computation layer may be hardware or software and is mainly responsible for processing tasks.
The following describes an example scenario in which the method of an embodiment of the present invention is applied. It is assumed that there are channel processing segments of three priorities: Kernel 0 to Kernel 2 are high-priority processing, Kernel 3 to Kernel 5 are medium-priority processing, and Kernel 7 to Kernel 9 are low-priority processing. As shown in FIG. 3, to mark the processed data flows (the direction of an arrow indicates the direction in which data flows), the data flows between the host (Host) and the device (Device) are marked as buffer input/output (Buff_In/Out) according to their direction, and the data flows between Kernel processing stages are marked as Buff_M. The data dependencies between different kernels can be described as follows:
The inputs of Kernel_0/3/7 are Buff_In0/1/2 respectively, prepared by the Host (in a practical application, they may come from an external interface or from the output of a Hardware Accelerate Control (HAC) unit).
After Kernel_2 finishes processing, the data is output to the Host (in a practical application, the Host usually needs to send the data processed by the digital signal processing (DSP) off-chip, or pass it to the HAC for further processing).
The input of Kernel_2 depends on the output of Kernel_1 and also on the output of Kernel_5.
The input of Kernel_4 depends on the output of Kernel_9 and also on the output of Kernel_3.
The input of Kernel_8 depends on the output of Kernel_7 and also on the output of Kernel_3.
The output of Kernel_3 is used not only by Kernel_4 but also by Kernel_8 (Kernel_8 may use only a part of it).
The scheduler schedules by priority according to the number of execution cores actually available, while ensuring that data dependencies are respected. For the Kernel dependencies described above, in the initial stage the Host submits all Kernels to the command queue (CommandQueue) and the input data Buff_In0/1/2 is all ready. When only one core can be scheduled, the scheduling result is shown in FIG. 4; when three cores are available, the scheduling result is shown in FIG. 5. In FIG. 5, whether Kernel_2 and Kernel_4 (shown with dashed lines) are scheduled on DSP2 depends on the sizes of their input data. For example, when the data Buff_M8 (the output of Kernel_9) input to Kernel_4 is greater than or equal to Buff_M2 (the output of Kernel_3), Kernel_4 should be dispatched to DSP2; otherwise it should be dispatched to the core holding the output of Kernel_3.
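The dashed-line placement decision above is, in effect, a data-locality heuristic: place the kernel on the core that already holds its largest input, so that only the smaller inputs have to be moved. The following C sketch illustrates the idea; the `input_t` type and `choose_core` helper are illustrative names and not part of the described system:

```c
#include <stddef.h>

/* Hypothetical descriptor for one input of a kernel: the core that
 * produced the buffer, and how many bytes would have to move if the
 * kernel were placed elsewhere. */
typedef struct {
    int    producer_core;
    size_t bytes;
} input_t;

/* Place the kernel on the core that already holds its largest input,
 * so that only the smaller inputs need to be copied. */
static int choose_core(const input_t *inputs, int n)
{
    int    best_core  = inputs[0].producer_core;
    size_t best_bytes = inputs[0].bytes;
    for (int i = 1; i < n; i++) {
        if (inputs[i].bytes > best_bytes) {
            best_bytes = inputs[i].bytes;
            best_core  = inputs[i].producer_core;
        }
    }
    return best_core;
}
```

Listing Buff_M8 first reproduces the "greater than or equal" rule in the text, since ties keep the first entry: `choose_core` then returns DSP2 (the core holding Buff_M8) when Buff_M8 is at least as large as Buff_M2, and the core holding Kernel_3's output otherwise.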
It should be understood that, in the embodiments of the present invention, a service refers to a program that processes data independently of the hardware and is a concept distinct from the operating system and drivers; a service may be, for example, passing data through channel estimation, Fast Fourier Transformation (FFT), decoding, and similar operations. A task refers to a software task, that is, a piece of program that implements a particular function and usually needs to run on a core processor.
FIG. 6 is a schematic flowchart of a method 100 for processing tasks according to an embodiment of the present invention. The method 100 may be performed by the multi-core digital signal processing system shown in FIG. 1. As shown in FIG. 6, the method 100 includes:
S110: Determine a ready task in a task queue.
S120: Determine a target operation unit to execute the ready task.
S130: Execute the ready task on the target operation unit and, at the same time, prepare data for a to-be-executed task on the target operation unit.
Specifically, after determining a ready task in the task queue, the multi-core digital signal processing system determines a target operation unit capable of executing the ready task, then executes the ready task on the determined target operation unit while simultaneously preparing data for the to-be-executed task on that unit.
Therefore, the task processing method of the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
It should be noted that a ready task is a task whose dependent tasks have completed and which can therefore start running; a to-be-executed task can be understood as a task that needs to be executed after the ready task. One operation unit can be understood as one core.
Optionally, in S110, the task queue is a plurality of parallel task queues.
Correspondingly, S110 is specifically: polling the plurality of parallel task queues in priority order to determine the ready task.
That is, the multi-core digital signal processing system may create parallel task queues. These queues are in a parallel relationship with each other but have different priorities. After tasks are delivered into a task queue, the tasks within each queue are executed serially, following the first-in-first-out ordering principle.
Specifically, when determining a ready task, the plurality of parallel task queues may be polled in descending order of priority. If a higher-priority task queue contains no ready task, polling continues with the next-priority queue until a ready task is found or the lowest-priority task queue has been polled, at which point polling ends.
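The polling loop described here can be pictured with a short C sketch. Everything below (the queue depth, field names, and the `find_ready_task` helper) is an illustrative assumption rather than the patent's actual data structures:

```c
#include <stddef.h>

#define NUM_PRIORITIES 3  /* illustrative: high, medium, low */
#define QUEUE_DEPTH    8

/* Minimal task record: a task is ready once its dependency count is 0. */
typedef struct {
    int id;
    int pending_deps;   /* 0 means all dependencies have completed */
} task_t;

/* One FIFO queue; tasks within a queue run serially, in submission order. */
typedef struct {
    task_t tasks[QUEUE_DEPTH];
    int    head, tail;
} task_queue_t;

/* Only the head of a queue may be dispatched (first-in first-out). */
static task_t *queue_peek_ready(task_queue_t *q)
{
    if (q->head == q->tail)
        return NULL;                 /* queue empty */
    task_t *t = &q->tasks[q->head];
    return (t->pending_deps == 0) ? t : NULL;
}

/* Poll queues from highest to lowest priority; stop at the first hit. */
static task_t *find_ready_task(task_queue_t queues[NUM_PRIORITIES])
{
    for (int prio = 0; prio < NUM_PRIORITIES; prio++) {
        task_t *t = queue_peek_ready(&queues[prio]);
        if (t != NULL)
            return t;
    }
    return NULL;  /* every queue polled, nothing ready */
}
```

Because tasks within one queue execute serially in FIFO order, only the head of each queue is ever a dispatch candidate, which is why the sketch inspects the head alone.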
Optionally, in S130, executing the ready task on the target operation unit while simultaneously preparing data for the to-be-executed task on the same unit can be understood as virtualizing one operation unit into two ping-pong logical resources. After one logical resource has been assigned a task, the other logical resource can also be assigned a task, ensuring that while one logical resource is computing, the data for the other logical resource is being prepared. This reduces data waiting and improves the utilization of computing resources.
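A minimal sketch of the ping-pong idea, assuming each core is tracked as two logical slots; all names below are illustrative:

```c
/* Each physical core is exposed as two ping-pong logical slots: while
 * one slot is computing, the other can already be loading the next
 * task's input data. */
typedef enum { SLOT_IDLE, SLOT_LOADING, SLOT_RUNNING } slot_state_t;

typedef struct {
    slot_state_t state[2];   /* ping = index 0, pong = index 1 */
    int          task_id[2];
} core_slots_t;

/* A core can accept a new task whenever either slot is idle. */
static int free_slot(const core_slots_t *c)
{
    if (c->state[0] == SLOT_IDLE) return 0;
    if (c->state[1] == SLOT_IDLE) return 1;
    return -1;   /* both slots busy: core cannot take more work */
}
```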
Optionally, S120 is specifically: when it is determined that the operation unit that executed the dependency task of the ready task is idle, determining that operation unit as the target operation unit.
Specifically, a dependency task of the ready task is a task whose output data is the input data of the ready task. The multi-core digital signal processing system may select an idle resource based on the data location to execute the ready task. Preferably, the system records the operation unit that processed the dependency task of the ready task; when the system determines that this operation unit is idle, it assigns the ready task to it, that is, it determines the unit that executed the dependency task as the target operation unit. Because the unit that processed the dependency task and the unit that will process the ready task are the same, the data for the ready task does not need to be loaded again, which relieves congestion on the loading path.
Optionally, when the multi-core digital signal processing system determines that the operation unit that executed the dependency task is not idle, it may randomly select one idle operation unit from the other idle operation units as the target operation unit.
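Put together, the selection rule is: prefer the core that produced the input (the data is already in its near-end memory), and otherwise fall back to any idle core chosen at random. A hedged C sketch, with `producer_core` and `core_idle` as assumed bookkeeping arrays:

```c
#include <stdbool.h>
#include <stdlib.h>

#define NUM_CORES 4   /* illustrative core count */
#define MAX_TASKS 64

/* Assumed bookkeeping: which core produced each task's output, and
 * which cores are currently idle. */
static int  producer_core[MAX_TASKS];
static bool core_idle[NUM_CORES];

/* Prefer the core that ran the dependency task (its output is already
 * in that core's near-end memory); otherwise pick a random idle core. */
static int pick_target_core(int dependency_task_id)
{
    int c = producer_core[dependency_task_id];
    if (core_idle[c])
        return c;

    int idle[NUM_CORES], n = 0;
    for (int i = 0; i < NUM_CORES; i++)
        if (core_idle[i])
            idle[n++] = i;
    return (n > 0) ? idle[rand() % n] : -1;   /* -1: nothing idle */
}
```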
Optionally, as shown in FIG. 7, before the ready task is executed on the target operation unit, the method 100 further includes:
S140: Determine a memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task.
S150: Move the input data corresponding to the ready task into the memory block.
Specifically, the data required to execute the ready task may reside on other operation units or in external memory (for example, double data rate synchronous dynamic random access memory (DDR), an L3 cache, or the like). Before the task is executed, the data in those memories needs to be moved into the near-end memory (for example, the L1 cache or L2 cache) of the operation unit that is about to run the task. Before the data is moved, the memory block used to store it must be determined, or memory for storing it must be requested; the data is then moved into the determined or allocated memory.
Optionally, in S140, the memory block is determined according to a fixed resource pool algorithm, where the data stored in the near-end memory of the target operation unit supports residency until the user releases it, or is swapped out to far-end memory when the near-end memory is insufficient.
That is, the memory space may be tiered by its distance from the core and handled according to its tier. Using a fixed resource pool algorithm for near-end memory allocation reduces memory fragmentation and improves allocation and release efficiency.
It should be understood that in the embodiments of the present invention, the near-end memory of the target operation unit may also be requested according to other algorithms, for example a linked-list memory allocation algorithm, a buddy algorithm, a memory-pool-based buddy algorithm, or a working-set algorithm, but the present invention is not limited thereto.
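For illustration, a fixed resource pool can be as simple as a free list over equally sized blocks: allocation is a pop and release is a push, both constant time, and fragmentation cannot accumulate because every block has the same size. This is a minimal sketch under assumed sizes (4 KB blocks, 16 of them), not the patent's allocator:

```c
#include <stddef.h>

#define BLOCK_SIZE 4096
#define NUM_BLOCKS 16

static unsigned char pool[NUM_BLOCKS][BLOCK_SIZE];
static int free_list[NUM_BLOCKS];
static int free_top;   /* stack of free block indices */

static void pool_init(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++)
        free_list[i] = i;
    free_top = NUM_BLOCKS;
}

static void *pool_alloc(void)
{
    return (free_top > 0) ? pool[free_list[--free_top]] : NULL;
}

static void pool_free(void *p)
{
    /* Recover the block index from the pointer's offset in the pool. */
    int idx = (int)((unsigned char (*)[BLOCK_SIZE])p - pool);
    free_list[free_top++] = idx;
}
```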
Optionally, S140 is specifically: determining the number of memory blocks that need to be requested in the near-end memory according to the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
Specifically, a task is equivalent to a function with parameters; each parameter may be a block of data or a scalar value, and the data can be packed and placed into one or more memory blocks. For example, suppose task A has 10 parameters, each of data-block type, the total size of the data blocks corresponding to these 10 parameters is 31 KB, and a single memory block in the near-end memory is 4 KB; then 8 memory blocks need to be requested. This improves memory utilization and reduces memory waste.
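The block count is a ceiling division, as the 31 KB / 4 KB example shows. A one-function sketch:

```c
#include <stddef.h>

/* Blocks needed for a task's packed parameters: ceiling division of
 * the total parameter size by the fixed block size. */
static size_t blocks_needed(size_t total_bytes, size_t block_bytes)
{
    return (total_bytes + block_bytes - 1) / block_bytes;
}
```

For the example above, `blocks_needed(31 * 1024, 4 * 1024)` evaluates to 8.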
Optionally, as shown in FIG. 8, before S110, the method 100 further includes:
S160: Perform abstraction processing on the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
Specifically, a service can be split into multiple tasks and abstracted. In the process of abstracting the service that includes the ready task, a buffer may be created according to the needs of the service, and the data dependency information is determined from the buffer identifier (ID) of that buffer. A buffer is a block of data storage space; data is loaded into it before the task starts, and it is destroyed when no task needs it anymore. Each buffer has an ID, through which the data relationships between tasks are associated. For instance, if the output data of task A is the input data of task B, then the output buffer of task A is buffer2 and the input buffer of task B is also buffer2.
It should be understood that the creation of buffers may be decided by the programmer according to the needs of the service; the number of buffers actually created is determined dynamically according to the actual task execution process.
In S160, the task dependency information is used to indicate dependencies between tasks, and the dependencies can be associated through events. For example, on completion, task A may choose to publish an event ID; task B, which needs to wait for task A to complete, fills the event ID of event A into its wait-event list. The description of the wait events includes the number of wait events and the list of wait-event IDs.
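The two dependency mechanisms, buffer IDs for data and event IDs for control, can be pictured as one per-task descriptor. The struct below is an illustrative sketch; the field names and limits are assumptions, not the patent's format:

```c
#define MAX_PARAMS 8   /* illustrative limits, not from the patent */
#define MAX_WAITS  4

/* Data dependencies travel through buffer IDs (task A writes buffer 2,
 * task B reads buffer 2); control dependencies travel through event
 * IDs (task A publishes an event, task B lists it as a wait event). */
typedef struct {
    int in_buf_ids[MAX_PARAMS];    /* input buffer IDs */
    int n_in;
    int out_buf_ids[MAX_PARAMS];   /* output buffer IDs */
    int n_out;
    int publish_event_id;          /* event published on completion, -1 if none */
    int wait_event_ids[MAX_WAITS]; /* events that must fire before this task */
    int n_wait;                    /* number of wait events */
} task_desc_t;
```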
In the embodiments of the present invention, optionally, the input and output data characteristics of a task can be described. A finite number of input and output parameters is supported, and the parameters support different characteristics: input buffer, output buffer, external input pointer, passed-in value, global pointer, and the like.
In the embodiments of the present invention, because the service can be abstracted, software for different multi-core chips does not need to be restructured, and software does not need to be redeployed when the specifications change; only a new resource set needs to be created. This simplifies the design complexity that software programmers would otherwise face in accommodating different computation, load, and hardware constraints.
In the embodiments of the present invention, optionally, when the ready task has finished executing on the target operation unit, the output data of the ready task may be kept in the near-end memory. Thus, when the next task is loaded onto the same operation unit, the data does not need to be loaded again, which relieves congestion on the loading path.
S160 is described in detail below with reference to FIG. 9. S160 may be performed by the main control core execution layer in the architecture shown in FIG. 1; the main control core execution layer may be a software programming interface, and these software interfaces generate commands that are submitted to the scheduler for execution. As shown in FIG. 9, in the embodiments of the present invention, optionally, the process of abstracting the service that includes the ready task to obtain the abstraction information may include the following steps:
S161: Create task execution functions. The execution function library is called by the slave core execution layer, and the main core execution layer registers the function pointers or indexes into a function list.
S162: Create the set of functional resources to be used.
The functional resource set includes the resources for executing tasks, that is, which operation units the tasks run on, which direct memory access (DMA) channels are used, and so on.
S163: Create parallel queues.
The queues have different priorities and are in a parallel relationship with each other. Tasks delivered into a queue are executed serially within the queue, following the first-in-first-out ordering principle.
S164: Create buffers.
A buffer is a block of data storage space; data is loaded before the task starts, and the buffer is destroyed when no task needs its data anymore. Each buffer has an ID through which inter-task data relationships are associated: if the output data of task A is the input data of task B, then the output buffer of task A is buffer2 and the input buffer of task B is also buffer2.
S165: Describe the dependencies between tasks.
The dependencies between tasks can be associated through events: on completion, task A may choose to publish an event ID; task B, which needs to wait for task A to complete, fills the event ID of event A into its wait-event list. The description of the wait events includes the number of wait events and the list of wait-event IDs.
S166: Describe the input and output data characteristics of the tasks.
The multi-core digital signal processing system of the embodiments of the present invention supports a finite number of input and output parameters, and the parameters support different characteristics: input buffer, output buffer, external input pointer, passed-in value, global pointer, and the like.
S167: Set the algorithm service parameters and fill the actual parameter values into the corresponding parameter table.
S168: Submit the algorithm service execution request.
S169: Wait for the execution callback, or receive an external task.
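To make the S161-S169 sequence concrete, the following toy program wires two tasks together through a shared buffer ID and an event, in the spirit of S164-S167. All types and values here are invented for illustration; the real interface is the command-generating programming interface described above:

```c
#include <stdio.h>

/* Illustrative types standing in for the real programming interface. */
typedef struct { int id; int size; } buffer_t;
typedef struct {
    const char *name;
    int in_buf, out_buf;        /* buffer IDs, -1 if unused (S164) */
    int wait_event, pub_event;  /* event IDs, -1 if none (S165) */
    int params[4], n_params;    /* parameter table (S166-S167) */
} task_t;

int main(void)
{
    buffer_t buf2 = { 2, 31 * 1024 };                 /* S164 */
    task_t a = { "A", -1, buf2.id, -1, 100, {0}, 0 }; /* A publishes event 100 */
    task_t b = { "B", buf2.id, -1, 100, -1, {0}, 0 }; /* B waits on event 100 */
    b.params[b.n_params++] = 42;                      /* S167 */
    /* S168-S169: submitting to a queue and waiting for the callback
     * would happen here through the real interface. */
    printf("task %s reads buffer %d after event %d fires\n",
           b.name, b.in_buf, b.wait_event);
    return 0;
}
```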
The following describes in detail, with reference to FIG. 10, a schematic flowchart of a method for processing tasks in a specific case. FIG. 10 is explained in conjunction with FIG. 11. As shown in FIG. 10, a method 200 for processing tasks according to an embodiment of the present invention includes:
S201: Initialize the scheduler.
The resources to be used (queues, events, buffers, commands) are created and all placed into shared queues, from which the main core execution layer requests them for use.
S202: The scheduler performs the response processing for creating the functional resource set.
The response processing for creating the functional resource set includes initializing the operation units, initializing the storage manager within each operation unit, and initializing the shared memory.
S203: The scheduler waits for commands delivered by the main core execution layer and processes them.
S204: The scheduler polls the parallel queues from high priority to low and finds a ready task.
S205: If there is a ready task, the scheduler selects an idle operation unit from the operation unit set and prepares the data required by the ready task.
An idle operation unit can be virtualized into two ping-pong logical resources. After one logical resource has been assigned a task, the other logical resource can also be assigned a task. This ensures that while one logical resource is computing, the data for the other logical resource is being prepared, which reduces data waiting and improves computing resource utilization.
Before the data is prepared, the near-end memory of the operation unit must first be requested; a DMA transfer is then set up to move the data into the near-end memory.
The memory can be handled according to its tier. The memory allocation algorithm for near-end memory uses fixed resource pool requests to reduce memory fragmentation and improve allocation and release efficiency. Data in near-end memory supports residency until the user releases it, or, when memory is insufficient, is swapped out to far-end memory (according to the swap level set by the user). Because the requests are for fixed-size memory blocks, memory utilization is improved and memory waste is reduced.
In S205, if there is no ready task, the feedback of the already delivered tasks is processed, the event list is updated, the ready-task table is updated, the corresponding memory is released, and the process then returns to S203.
In the embodiments of the present invention, optionally, a memory lock can be set to guarantee the consistency of data reads and writes, thereby automatically solving the consistency problem of multiple cores (operation units) reading and writing at the same time. Specifically, it may be stipulated that while a file's data is being rewritten, no task can read the data in that file, or that while a file's data is being read by a task, the file's data cannot be rewritten. Multiple tasks reading the data in the file at the same time, however, is permitted.
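The rule described, many simultaneous readers but exclusive writers, matches the semantics of a readers-writer lock. A minimal POSIX-threads sketch, with an illustrative shared buffer standing in for the file data:

```c
#include <pthread.h>

static pthread_rwlock_t buf_lock = PTHREAD_RWLOCK_INITIALIZER;
static unsigned char    buf_data[4096];

void task_read(unsigned char *dst, int n)
{
    pthread_rwlock_rdlock(&buf_lock);   /* shared: many readers allowed */
    for (int i = 0; i < n; i++)
        dst[i] = buf_data[i];
    pthread_rwlock_unlock(&buf_lock);
}

void producer_write(const unsigned char *src, int n)
{
    pthread_rwlock_wrlock(&buf_lock);   /* exclusive: no readers, no writers */
    for (int i = 0; i < n; i++)
        buf_data[i] = src[i];
    pthread_rwlock_unlock(&buf_lock);
}
```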
The method of determining a ready task and an idle operation unit in steps S204 and S205 is described in detail below with reference to FIG. 11. The method in FIG. 11 is performed by the scheduler in the digital signal processing system.
As shown in FIG. 11, in S301, the parallel queues are polled, and it is determined whether the lowest-priority queue has been reached.
S302: When it is determined that the lowest-priority queue has not yet been polled, obtain a ready queue.
S303: Determine whether there is a ready task in the ready queue and whether there is an idle operation unit in the system.
Within one queue, tasks delivered into the queue are executed serially, following the first-in-first-out ordering principle.
S304: When it is determined that there is a ready task and an idle operation unit, obtain the ready task and the idle operation unit.
S305: Prepare data for the obtained ready task on the obtained idle operation unit.
After S305, continue searching for ready tasks in the queue of the current priority and determining whether there is an idle operation unit in the system, that is, execute S303 and its subsequent steps.
Optionally, in S303, if it is determined that the ready queue of the current priority contains no ready task, query whether there is a ready queue at the next priority, that is, re-execute S301 and its subsequent steps.
Optionally, in S301, if the lowest-priority queue has already been polled (that is, all the queues have been polled), perform the following steps:
S306: Determine whether there is an idle computing resource.
S307: When there is an idle computing resource, locate it.
An idle computing resource should be understood as one logical resource of an operation unit.
S308: Obtain the partner resource of the idle computing resource.
The partner resource of the idle computing resource refers to the other logical resource of the operation unit found in S307.
S309: Prepare data for a task on the found idle computing resource.
Specifically, when a task has already been deployed on the partner resource obtained in S308, data is prepared on the found idle computing resource for a task that depends on the task deployed on the partner resource. After data has been prepared for that dependent task, S306 and its subsequent steps can continue, preparing data for further tasks.
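The S301-S309 flow can be condensed into one scheduling pass: scan the queues by priority, pair ready tasks with idle logical slots, and stop when either runs out. The sketch below is a simplification under assumed data structures (one pending entry per queue, two ping-pong slots per unit); it is not the hardware scheduler's actual algorithm:

```c
#include <stdbool.h>

#define NUM_PRIO  3
#define NUM_UNITS 2

typedef struct { int task_id; bool ready; } entry_t;
typedef struct { entry_t head; bool empty; } queue_t;  /* FIFO head only */
typedef struct { bool busy[2]; int task[2]; } unit_t;  /* ping-pong slots */

/* One pass over S301-S305: scan queues from high to low priority and
 * pair each ready queue head with an idle logical slot. */
static void schedule_pass(queue_t q[NUM_PRIO], unit_t u[NUM_UNITS])
{
    for (int p = 0; p < NUM_PRIO; p++) {              /* S301-S302 */
        if (q[p].empty || !q[p].head.ready)
            continue;                                 /* S303: next priority */
        for (int i = 0; i < NUM_UNITS; i++)
            for (int s = 0; s < 2; s++)
                if (!u[i].busy[s]) {                  /* S304: idle slot */
                    u[i].busy[s] = true;              /* S305: prepare data */
                    u[i].task[s] = q[p].head.task_id;
                    q[p].empty   = true;              /* pop the queue head */
                    goto next_queue;
                }
        return;                                       /* no idle slot left */
next_queue:;
    }
    /* All queues polled: S306-S309 would now pre-load data onto the
     * idle partner (pong) slots of units whose other slot is busy. */
}
```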
Therefore, the task processing method of the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
An apparatus for processing tasks in a multi-core digital signal processing system according to an embodiment of the present invention is described in detail below with reference to FIG. 12. As shown in FIG. 12, the apparatus 10 includes:
a determining module 11, configured to determine a ready task in a task queue,
where the determining module 11 is further configured to determine a target operation unit to execute the ready task; and
a task execution module 12, configured to execute the ready task on the target operation unit and, at the same time, prepare data for a to-be-executed task on the target operation unit.
Therefore, the apparatus for processing tasks in a multi-core digital signal processing system according to the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
In the embodiments of the present invention, optionally, in the aspect of determining the target operation unit to execute the ready task, the determining module 11 is specifically configured to: when determining that the operation unit that executed the dependency task of the ready task is idle, determine that operation unit as the target operation unit.
In the embodiments of the present invention, optionally, as shown in FIG. 13, the apparatus further includes a memory request module 13,
where, before the task execution module 12 executes the ready task on the target operation unit, the memory request module 13 is specifically configured to: determine a memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task; and move the input data corresponding to the ready task into the memory block.
In the embodiments of the present invention, optionally, in the aspect of determining the memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task, the memory request module 13 is specifically configured to: determine the memory block according to a fixed resource pool algorithm, where the data stored in the near-end memory of the target operation unit supports residency until the user releases it, or is swapped out to far-end memory when the near-end memory is insufficient.
In the embodiments of the present invention, optionally, in the aspect of determining the memory block according to the fixed resource pool algorithm, the memory request module 13 is specifically configured to: determine the number of memory blocks that need to be requested in the near-end memory according to the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
In the embodiments of the present invention, optionally, as shown in FIG. 14, the apparatus further includes:
a service abstraction module 14, configured to: before the determining module 11 determines the ready task in the task queue, perform abstraction processing on the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
In the embodiments of the present invention, optionally, the task queue is a plurality of parallel task queues,
where, in the aspect of determining the ready task in the task queue, the determining module 11 is specifically configured to:
poll the plurality of parallel task queues in priority order to determine the ready task.
In the embodiments of the present invention, optionally, in the aspect of performing abstraction processing on the service that includes the ready task, the service abstraction module 14 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer identifier (ID) of the buffer.
In the embodiments of the present invention, optionally, the task execution module 12 is further configured to: when the ready task has finished executing on the target operation unit, save the output data of the ready task in the near-end memory.
It should be understood that the foregoing and other operations and/or functions of the apparatus 10 according to the embodiments of the present invention are respectively intended to implement the methods in FIG. 6 to FIG. 8; for brevity, details are not repeated here.
Therefore, the apparatus for processing tasks in a multi-core digital signal processing system according to the embodiments of the present invention prepares data for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
FIG. 15 shows a schematic block diagram of an apparatus 100 for processing tasks in a multi-core digital signal processing system according to another embodiment of the present invention.
As shown in FIG. 15, the hardware structure of the apparatus 100 for processing tasks in the multi-core digital signal processing system may include three parts: a transceiver device 101, a software device 102, and a hardware device 103.
The transceiver device 101 is a hardware circuit used to complete packet transmission and reception.
The hardware device 103 may also be called a "hardware processing module" or, more simply, "hardware". The hardware device 103 mainly includes hardware circuits based on FPGAs, ASICs, and the like (together with supporting devices such as memories) that implement certain specific functions. Its processing speed is often much faster than that of a general-purpose processor, but once its functions are customized they are difficult to change, so implementation is inflexible and it is usually used to handle fixed functions. It should be noted that, in practical applications, the hardware device 103 may also include processors such as an MCU (a microprocessor, for example a microcontroller unit) or a CPU, but the main function of these processors is not to process large volumes of data; they are mainly used for control. In this application scenario, a system assembled from these devices is a hardware device.
The software device 102 (or simply "software") mainly includes a general-purpose processor (for example, a CPU) and its supporting devices (such as storage devices like memory and hard disks). The processor can be programmed to have the corresponding processing functions. When implemented in software, functions can be flexibly configured according to service requirements, but execution is often slower than on hardware devices. After the software has completed processing, the processed data may be sent through the transceiver device 101 via the hardware device 103, or may be sent to the transceiver device 101 through an interface connected to the transceiver device 101.
Optionally, in an embodiment, the hardware device 103 is configured to: determine a ready task in a task queue; determine a target operation unit to execute the ready task; and execute the ready task on the target operation unit while simultaneously preparing data for a to-be-executed task on the target operation unit.
Optionally, in an embodiment, in the aspect of determining the target operation unit to execute the ready task, the hardware device 103 is specifically configured to: when determining that the operation unit that executed the dependency task of the ready task is idle, determine that operation unit as the target operation unit.
Optionally, in an embodiment, before the ready task is executed on the target operation unit, the hardware device 103 is specifically configured to: determine a memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task; and move the input data corresponding to the ready task into the memory block.
Optionally, in an embodiment, in the aspect of determining the memory block in the near-end memory of the target operation unit for storing the input data corresponding to the ready task, the hardware device 103 is specifically configured to: determine the memory block according to a fixed resource pool algorithm, where the data stored in the near-end memory of the target operation unit supports residency until the user releases it, or is swapped out to far-end memory when the near-end memory is insufficient.
Optionally, in an embodiment, in the aspect of determining the memory block according to the fixed resource pool algorithm, the hardware device 103 is specifically configured to: determine the number of memory blocks according to the ratio of the total size of the data blocks corresponding to all parameters required by the ready task to the size of a single memory block in the near-end memory.
Optionally, in an embodiment, the hardware device 103 is further configured to: before determining the ready task in the task queue, perform abstraction processing on the service that includes the ready task to obtain abstraction information, where the abstraction information includes at least one of the following: task dependency information, data dependency information, and task execution order information.
Optionally, in an embodiment, the task queue is a plurality of parallel task queues, where, in the aspect of performing abstraction processing on the service that includes the ready task, the hardware device 103 is specifically configured to: create a buffer according to the needs of the service, and determine the data dependency information according to the buffer identifier (ID) of the buffer.
Optionally, in an embodiment, the hardware device 103 is further configured to: when the ready task has finished executing on the target operation unit, save the output data of the ready task in the near-end memory.
Through the software-hardware combination of this embodiment, data can be prepared for the to-be-executed task on the target operation unit while the target operation unit is executing the ready task. Data loading can thus proceed in parallel with computation, which reduces the waiting overhead of data loading, increases the degree of parallelism between tasks, and reduces system scheduling overhead.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered as going beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the displayed or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist physically alone, or two or more units may be integrated into one unit.
When the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (18)

  1. 一种多核数字信号处理系统中处理任务的方法,其特征在于,所述方法包括:A method for processing a task in a multi-core digital signal processing system, the method comprising:
    确定任务队列中的就绪任务;Determine the ready task in the task queue;
    确定执行所述就绪任务的目标运算单元;Determining a target arithmetic unit that performs the ready task;
    通过所述目标运算单元执行所述就绪任务,并同时通过所述目标运算单元为待执行任务准备数据。The ready task is executed by the target operation unit, and at the same time, data is prepared for the task to be executed by the target operation unit.
  2. 根据权利要求1所述的方法,其特征在于,所述确定执行所述就绪任务的目标运算单元,包括:The method of claim 1, wherein the determining the target computing unit that performs the ready task comprises:
    在确定执行所述就绪任务的依赖任务的运算单元空闲时,将所述执行所述就绪任务的依赖任务的运算单元确定为所述目标运算单元。When it is determined that the operation unit of the dependent task executing the ready task is idle, the operation unit that executes the dependent task of the ready task is determined as the target operation unit.
  3. 根据权利要求1或2所述的方法,其特征在于,在通过所述目标运算单元执行所述就绪任务之前,所述方法还包括:The method according to claim 1 or 2, wherein, before the performing the task is performed by the target computing unit, the method further comprises:
    确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块;Determining, in a near-end memory of the target arithmetic unit, a memory block for storing input data corresponding to the ready task;
    将所述与所述就绪任务相对应的输入数据搬移到所述内存块中。The input data corresponding to the ready task is moved into the memory block.
  4. 根据权利要求3所述的方法,其特征在于,所述确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块,包括:The method according to claim 3, wherein the determining a memory block in the near-end memory of the target operation unit for storing input data corresponding to the ready task comprises:
    根据固定资源池算法,确定所述内存块,其中,所述目标运算单元的近端内存中存储的数据支持驻留直到用户释放或在近端内存不够时将数据置换到远端内存。The memory block is determined according to a fixed resource pool algorithm, wherein data stored in the near-end memory of the target arithmetic unit supports camping until the user releases or replaces the data to the remote memory when the near-end memory is insufficient.
  5. 根据权利要求4所述的方法,其特征在于,所述根据固定资源池算法,确定所述内存块,包括:The method according to claim 4, wherein the determining the memory block according to a fixed resource pool algorithm comprises:
    根据所述就绪任务需要的所有参数对应的数据块的总和与所述近端内存中的单个内存块的大小的比值,确定所述内存块的数量。The number of the memory blocks is determined according to a ratio of a sum of data blocks corresponding to all parameters required by the ready task to a size of a single memory block in the near-end memory.
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,在确定任务队列中的就绪任务之前,所述方法还包括:The method according to any one of claims 1 to 5, wherein before determining the ready task in the task queue, the method further comprises:
    对包括所述就绪任务的业务进行抽象处理,得到抽象处理信息,所述抽象处理信息包括下列信息中的至少一种:任务依赖关系信息、数据依赖关系信息和任务执行的先后顺序信息。 The abstract processing is performed on the service including the ready task, and the abstract processing information includes at least one of the following information: task dependency information, data dependency information, and sequence information of task execution.
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述任务队列为多个并行的任务队列;The method according to any one of claims 1 to 6, wherein the task queue is a plurality of parallel task queues;
    其中,所述确定任务队列中的就绪任务,包括:The determining the ready task in the task queue includes:
    按照优先级顺序轮询所述多个并行的任务队列,确定所述就绪任务。The plurality of parallel task queues are polled in order of priority to determine the ready task.
  8. 根据权利要求6所述的方法,其特征在于,所述对包括所述就绪任务的业务进行抽象处理,包括:The method according to claim 6, wherein the abstract processing of the service including the ready task comprises:
    根据所述业务的需要创建缓存;Create a cache according to the needs of the business;
    根据所述缓存的缓存标识ID,确定所述数据依赖关系信息。Determining the data dependency information according to the cached cache identifier ID.
  9. 根据权利要求1至8中任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1 to 8, wherein the method further comprises:
    在通过所述目标运算单元执行完所述就绪任务时,将所述就绪任务的输出数据保存在所述近端内存中。The output data of the ready task is saved in the near-end memory when the ready task is executed by the target operation unit.
  10. 一种多核数字信号处理系统中处理任务的装置,其特征在于,所述装置包括:A device for processing a task in a multi-core digital signal processing system, the device comprising:
    确定模块,用于确定任务队列中的就绪任务;Determining a module for determining a ready task in the task queue;
    所述确定模块,还用于确定执行所述就绪任务的目标运算单元;The determining module is further configured to determine a target computing unit that executes the ready task;
    任务执行模块,用于通过所述目标运算单元执行所述就绪任务,并同时通过所述目标运算单元为待执行任务准备数据。a task execution module, configured to execute the ready task by the target operation unit, and simultaneously prepare data for the task to be executed by the target operation unit.
  11. 根据权利要求10所述的装置,其特征在于,在确定执行所述就绪任务的目标运算单元方面,所述确定模块具体用于:The apparatus according to claim 10, wherein in determining a target computing unit that performs the ready task, the determining module is specifically configured to:
    在确定执行所述就绪任务的依赖任务的运算单元空闲时,将所述执行所述就绪任务的依赖任务的运算单元确定为所述目标运算单元。When it is determined that the operation unit of the dependent task executing the ready task is idle, the operation unit that executes the dependent task of the ready task is determined as the target operation unit.
  12. 根据权利要求10或11所述的装置,其特征在于,所述装置还包括内存申请模块;The device according to claim 10 or 11, wherein the device further comprises a memory application module;
    其中,在所述任务执行模块通过所述目标运算单元执行所述就绪任务之前,所述内存申请模块具体用于:The memory application module is specifically configured to: before the task execution module executes the ready task by using the target operation unit:
    确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块;Determining, in a near-end memory of the target arithmetic unit, a memory block for storing input data corresponding to the ready task;
    将所述与所述就绪任务相对应的输入数据搬移到所述内存块中。The input data corresponding to the ready task is moved into the memory block.
  13. 根据权利要求12所述的装置,其特征在于,在确定所述目标运算单元的近端内存中用于存放与所述就绪任务相对应的输入数据的内存块方 面,所述内存申请模块具体用于:The apparatus according to claim 12, wherein a memory block for storing input data corresponding to said ready task in a near-end memory of said target arithmetic unit is determined The memory application module is specifically configured to:
    根据固定资源池算法,确定所述内存块,其中,所述目标运算单元的近端内存中存储的数据支持驻留直到用户释放或在近端内存不够时将数据置换到远端内存。The memory block is determined according to a fixed resource pool algorithm, wherein data stored in the near-end memory of the target arithmetic unit supports camping until the user releases or replaces the data to the remote memory when the near-end memory is insufficient.
  14. 根据权利要求13所述的装置,其特征在于,在根据固定资源池算法,确定所述内存块方面,所述内存申请模块具体用于:The device according to claim 13, wherein in the determining the memory block according to a fixed resource pool algorithm, the memory application module is specifically configured to:
    根据所述就绪任务需要的所有参数对应的数据块的总和与所述近端内存中的单个内存块的大小的比值,确定所述内存块的数量。The number of the memory blocks is determined according to a ratio of a sum of data blocks corresponding to all parameters required by the ready task to a size of a single memory block in the near-end memory.
  15. 根据权利要求10至14中任一项所述的装置,其特征在于,所述装置还包括:The device according to any one of claims 10 to 14, wherein the device further comprises:
    业务抽象模块,用于在所述确定模块确定任务队列中的就绪任务之前,对包括所述就绪任务的业务进行抽象处理,得到抽象处理信息,所述抽象处理信息包括下列信息中的至少一种:任务依赖关系信息、数据依赖关系信息和任务执行的先后顺序信息。a service abstraction module, configured to perform abstract processing on the service including the ready task before the determining module determines the ready task in the task queue, to obtain abstract processing information, where the abstract processing information includes at least one of the following information : Task dependency information, data dependency information, and sequence of task execution.
  16. 根据权利要求10至15中任一项所述的装置,其特征在于,所述任务队列为多个并行的任务队列;The apparatus according to any one of claims 10 to 15, wherein the task queue is a plurality of parallel task queues;
    其中,在确定任务队列中的就绪任务方面,所述确定模块具体用于:Wherein, in determining the ready task in the task queue, the determining module is specifically configured to:
    按照优先级顺序轮询所述多个并行的任务队列,确定所述就绪任务。The plurality of parallel task queues are polled in order of priority to determine the ready task.
  17. 根据权利要求15所述的装置,其特征在于,在对包括所述就绪任务的业务进行抽象处理方面,所述业务抽象模块具体用于:The apparatus according to claim 15, wherein in the abstract processing of the service including the ready task, the service abstraction module is specifically configured to:
    根据所述业务的需要创建缓存;Create a cache according to the needs of the business;
    根据所述缓存的缓存标识ID,确定所述数据依赖关系信息。Determining the data dependency information according to the cached cache identifier ID.
  18. The apparatus according to any one of claims 10 to 17, wherein the task execution module is further configured to:
    when the ready task has been executed by the target operation unit, save output data of the ready task in the near-end memory.
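Keeping the output in near-end memory, per claim 18, pairs with the locality preference of claim 11: a successor task dispatched to the same unit reads the result without a second transfer. A minimal bookkeeping sketch, with hypothetical names throughout (C++):

    #include <cstddef>
    #include <unordered_map>
    #include <utility>

    struct MemBlock { void* base; size_t size; };

    // task id -> (unit id, near-end block still holding the task's output)
    std::unordered_map<int, std::pair<int, MemBlock>> output_location;

    // Record where a finished task left its result; the scheduler can then
    // steer dependent tasks toward that unit.
    void on_task_done(int task_id, int unit_id, MemBlock out) {
        output_location[task_id] = {unit_id, out};
    }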

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2015/093248 WO2017070900A1 (en) 2015-10-29 2015-10-29 Method and apparatus for processing task in a multi-core digital signal processing system
CN201580083942.3A CN108351783A (en) 2015-10-29 2015-10-29 Method and apparatus for processing task in a multi-core digital signal processing system

Publications (1)

Publication Number Publication Date
WO2017070900A1 (en)

Family

ID=58629684

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/093248 WO2017070900A1 (en) 2015-10-29 2015-10-29 Method and apparatus for processing task in a multi-core digital signal processing system

Country Status (2)

Country Link
CN (1) CN108351783A (en)
WO (1) WO2017070900A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825461B (en) * 2018-08-10 2024-01-05 北京百度网讯科技有限公司 Data processing method and device
CN111026539B (en) * 2018-10-10 2022-12-02 上海寒武纪信息科技有限公司 Communication task processing method, task cache device and storage medium
CN111324427B (en) * 2018-12-14 2023-07-28 深圳云天励飞技术有限公司 Task scheduling method and device based on DSP
CN111767121B (en) * 2019-04-02 2022-11-01 上海寒武纪信息科技有限公司 Operation method, device and related product
CN114900486B (en) * 2022-05-09 2023-08-08 江苏新质信息科技有限公司 Multi-algorithm core calling method and system based on FPGA

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955685A (en) * 2011-08-17 2013-03-06 上海贝尔股份有限公司 Multicore DSP (digital signal processor), system with multicore DSP and scheduler
CN103440173A (en) * 2013-08-23 2013-12-11 华为技术有限公司 Scheduling method and related devices of multi-core processors
US20140068624A1 (en) * 2012-09-04 2014-03-06 Microsoft Corporation Quota-based resource management
CN104598426A (en) * 2013-10-30 2015-05-06 联发科技股份有限公司 task scheduling method applied to a heterogeneous multi-core processor system
CN104714785A (en) * 2015-03-31 2015-06-17 中芯睿智(北京)微电子科技有限公司 Task scheduling device, task scheduling method and data parallel processing device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005284749A (en) * 2004-03-30 2005-10-13 Kyushu Univ Parallel computer
CN1329825C (en) * 2004-10-08 2007-08-01 华为技术有限公司 Multi-task processing method based on digital signal processor
CN100361081C (en) * 2005-01-18 2008-01-09 华为技术有限公司 Method for processing multi-thread, multi-task and multi-processor
WO2007104330A1 (en) * 2006-03-15 2007-09-20 Freescale Semiconductor, Inc. Task scheduling method and apparatus
CN101610399B (en) * 2009-07-22 2010-12-08 杭州华三通信技术有限公司 Planning business dispatching system and method for realizing dispatch of planning business
CN102542379B (en) * 2010-12-20 2015-03-11 中国移动通信集团公司 Method, system and device for processing scheduled tasks
CN102096857B (en) * 2010-12-27 2013-05-29 大唐软件技术股份有限公司 Collaboration method and device for data processing process

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697122B (en) * 2017-10-20 2024-03-15 华为技术有限公司 Task processing method, device and computer storage medium
CN109697122A (en) * 2017-10-20 2019-04-30 华为技术有限公司 Task processing method, equipment and computer storage medium
CN109725994A (en) * 2018-06-15 2019-05-07 中国平安人寿保险股份有限公司 Data pick-up task executing method, device, terminal and readable storage medium storing program for executing
CN109725994B (en) * 2018-06-15 2024-02-06 中国平安人寿保险股份有限公司 Method and device for executing data extraction task, terminal and readable storage medium
CN110968418A (en) * 2018-09-30 2020-04-07 北京忆恒创源科技有限公司 Signal-slot-based large-scale constrained concurrent task scheduling method and device
CN111104168B (en) * 2018-10-25 2023-05-12 上海嘉楠捷思信息技术有限公司 Calculation result submitting method and device
CN111104167A (en) * 2018-10-25 2020-05-05 杭州嘉楠耘智信息科技有限公司 Calculation result submitting method and device
CN111104168A (en) * 2018-10-25 2020-05-05 杭州嘉楠耘智信息科技有限公司 Calculation result submitting method and device
CN111104167B (en) * 2018-10-25 2023-07-21 上海嘉楠捷思信息技术有限公司 Calculation result submitting method and device
CN111309482B (en) * 2020-02-20 2023-08-15 浙江亿邦通信科技有限公司 Hash algorithm-based block chain task allocation system, device and storable medium
CN111309482A (en) * 2020-02-20 2020-06-19 浙江亿邦通信科技有限公司 Ore machine controller task distribution system, device and storable medium thereof
CN112823343A (en) * 2020-03-11 2021-05-18 深圳市大疆创新科技有限公司 Direct memory access unit, processor, device, processing method, and storage medium
CN112148454A (en) * 2020-09-29 2020-12-29 行星算力(深圳)科技有限公司 Edge computing method supporting serial and parallel and electronic equipment
CN112365002A (en) * 2020-11-11 2021-02-12 深圳力维智联技术有限公司 Spark-based model construction method, device and system and storage medium
CN112667386A (en) * 2021-01-18 2021-04-16 青岛海尔科技有限公司 Task management method and device, storage medium and electronic equipment
CN113138812A (en) * 2021-04-23 2021-07-20 中国人民解放军63920部队 Spacecraft task scheduling method and device
CN115658325A (en) * 2022-11-18 2023-01-31 北京市大数据中心 Data processing method, data processing device, multi-core processor, electronic device, and medium
CN115658325B (en) * 2022-11-18 2024-01-23 北京市大数据中心 Data processing method, device, multi-core processor, electronic equipment and medium
CN116107724A (en) * 2023-04-04 2023-05-12 山东浪潮科学研究院有限公司 AI (advanced technology attachment) acceleration core scheduling management method, device, equipment and storage medium
CN117633914A (en) * 2024-01-25 2024-03-01 深圳市纽创信安科技开发有限公司 Chip-based password resource scheduling method, device and storage medium
CN117633914B (en) * 2024-01-25 2024-05-10 深圳市纽创信安科技开发有限公司 Chip-based password resource scheduling method, device and storage medium

Also Published As

Publication number Publication date
CN108351783A (en) 2018-07-31

Similar Documents

Publication Publication Date Title
WO2017070900A1 (en) Method and apparatus for processing task in a multi-core digital signal processing system
US10467725B2 (en) Managing access to a resource pool of graphics processing units under fine grain control
US10891158B2 (en) Task scheduling method and apparatus
US10275558B2 (en) Technologies for providing FPGA infrastructure-as-a-service computing capabilities
US9973512B2 (en) Determining variable wait time in an asynchronous call-back system based on calculated average sub-queue wait time
US10013264B2 (en) Affinity of virtual processor dispatching
JP2009265963A (en) Information processing system and task execution control method
US9471387B2 (en) Scheduling in job execution
US9218201B2 (en) Multicore system and activating method
US11347546B2 (en) Task scheduling method and device, and computer storage medium
CN111240813A (en) DMA scheduling method, device and computer readable storage medium
CN114168271B (en) Task scheduling method, electronic device and storage medium
Abeni et al. EDF scheduling of real-time tasks on multiple cores: Adaptive partitioning vs. global scheduling
US10261817B2 (en) System on a chip and method for a controller supported virtual machine monitor
US11494228B2 (en) Calculator and job scheduling between jobs within a job switching group
Shih et al. Virtual cloud core: Opencl workload sharing framework for connected devices
US9176910B2 (en) Sending a next request to a resource before a completion interrupt for a previous request
US20150363227A1 (en) Data processing unit and method for operating a data processing unit
CN113439260A (en) I/O completion polling for low latency storage devices
US11915041B1 (en) Method and system for sequencing artificial intelligence (AI) jobs for execution at AI accelerators
US20150293780A1 (en) Method and System for Reconfigurable Virtual Single Processor Programming Model
CN116795490A (en) vCPU scheduling method, device, equipment and storage medium
US8656375B2 (en) Cross-logical entity accelerators
CN114567520A (en) Method, computer equipment and communication system for realizing collective communication
Lin et al. Global Scheduling for the Embedded Virtualization System in the Multi-core Platform.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15906963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15906963

Country of ref document: EP

Kind code of ref document: A1