CN115981833A - Task processing method and device

Info

Publication number
CN115981833A
CN115981833A
Authority
CN
China
Prior art keywords
coprocessor
address
thread
information
task
Prior art date
Legal status
Pending
Application number
CN202111205427.1A
Other languages
Chinese (zh)
Inventor
朱琦
刘昊程
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202111205427.1A
Publication of CN115981833A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a task processing method and device. The method can be applied to a computing device that includes a main processor and a coprocessor. The main processor sends a request command to the coprocessor to schedule a task to the coprocessor for processing; the coprocessor executes the task and, after the task is executed, acquires the information of the thread on the main processor used for executing the task according to a first address indicated by the request command, and stores the information of the thread into a run queue maintained by the main processor. In this way, the operations of fetching the thread's information and updating the run queue are offloaded to the coprocessor, and the main processor can acquire the information of the thread from the run queue and wake up the thread, so the main processor does not need to query the coprocessor and the coprocessor does not need to notify the main processor of task completion through an interrupt request. Therefore, the resource overhead of the CPU can be reduced, and the CPU utilization rate and system performance can be improved.

Description

Task processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a task processing method and apparatus.
Background
In the era of diverse computing power, there may be multiple types of processors in the same computing device, such as a central processing unit (CPU), a graphics processing unit (GPU), and a data processing unit (DPU). The CPU may act as the main processor and schedule user tasks to a coprocessor such as a GPU for processing, and the GPU may act as a coprocessor and perform the tasks assigned by the CPU.
After the task is executed on the coprocessor, in a first common implementation, the coprocessor may send an interrupt request to the CPU to indicate that the task has been executed; after receiving the interrupt request, the CPU may obtain the data processing result produced by the coprocessor and then use the result for the next stage of processing. It should be understood that there may be multiple coprocessors; in this implementation the CPU has to receive interrupt requests from each of them, the frequent interaction occupies a large amount of CPU resources, and responding to the interrupt requests and handling the interrupts introduces significant latency.
In a second common implementation, the coprocessor does not send an interrupt request to announce that the task is completed; instead, the CPU determines whether each coprocessor has completed its task by polling. Because there is no interrupt-related processing, the latency is better than in the first implementation, but polling occupies CPU resources for long periods, so the CPU occupancy rate is high.
Disclosure of Invention
The application provides a task processing method and a task processing device, which are used to reduce the CPU resource overhead of waking up a thread.
In a first aspect, an embodiment of the present application provides a task processing method, which may be applied to a computing device (e.g., a server, a host, and the like) that includes at least a main processor (e.g., a CPU) and a coprocessor (e.g., a GPU, a DPU, and the like). In the method, the main processor may send a request command to the coprocessor to request the coprocessor to process a task, where the main processor maintains a run queue and the request command is used to indicate a first address and a second address: the first address is an address of the memory space where the information of a thread running on the main processor (related to the task) is located, and the second address is an address of the memory space where the run queue maintained by the main processor is located. The coprocessor executes the task indicated in the request command and, after the task is executed, stores the thread information acquired from the memory space corresponding to the first address into the run queue indicated by the second address.
In the above manner, the main processor sends a request command to the coprocessor to request the coprocessor to process a task, and the request command indicates the first address of the information of the thread (related to the task) on the main processor and the second address of the run queue maintained by the main processor. After the coprocessor completes the task, it can acquire the information of the thread from the first address and store the acquired information of the thread into the run queue maintained by the main processor according to the second address. Thus, the coprocessor does not need to inform the main processor of task completion through an interrupt request, and the main processor does not need to inquire of the coprocessor whether the task is completed, so that the resource overhead of the CPU can be reduced and the CPU utilization rate and system performance can be improved.
In one possible implementation, the request command includes the first address and the second address. Alternatively, the request command includes the first address, and the information of the thread stored in the memory space indicated by the first address includes the second address.
The above manner provides flexibility in how the request command indicates the first address and the second address.
In one possible implementation, the main processor obtains information of the thread from the run queue and wakes up the thread based on the obtained information of the thread.
In the above manner, after the coprocessor completes the task, the information of the thread related to the task on the main processor is stored into the run queue of the main processor, and the main processor acquires the information of the thread from the run queue and wakes up the thread based on the acquired information. Therefore, the main processor does not need to query the coprocessor about whether the task has been executed, and the coprocessor does not need to send an interrupt request to the main processor, so the resource overhead of the CPU can be reduced and the CPU utilization rate and system performance can be improved.
In one possible implementation, the first address and the second address are both mapped to the coprocessor by a Cache Coherence (CC) protocol.
In the above manner, the main processor maps the first address and the second address to the coprocessor through the CC protocol, so that the coprocessor can access the memory spaces corresponding to the first address and the second address, and data consistency between the main processor and the coprocessor about the memory spaces corresponding to the first address and the second address can be ensured.
In one possible embodiment, the information of the thread is the thread control block (TCB) information of the thread.
In this manner, the thread control block information can uniquely identify the thread and includes the information that the main processor uses to manage the thread, and the main processor can wake up the thread based on the thread control block information.
In one possible implementation, the task is used to instruct the coprocessor to process data to be processed;
the main processor sends information indicating a third address to the coprocessor, where the third address is the address of the memory space storing the data to be processed; the coprocessor receives the information indicating the third address; and the coprocessor executing the task includes: obtaining the data to be processed from the memory space corresponding to the third address.
In a possible embodiment, the third address is mapped to the coprocessor by the CC protocol.
In the above manner, the main processor maps the third address to the coprocessor through the CC protocol, so that the coprocessor can access the memory space corresponding to the third address, and thus, the main processor is not required to move the data to be processed from the memory space corresponding to the third address to the memory space accessible by the coprocessor, thereby reducing the CPU resource overhead, and ensuring the data consistency of the main processor and the coprocessor with respect to the memory space corresponding to the third address.
In one possible implementation, the main processor sends lower-half interrupt processing information to the coprocessor; the lower-half interrupt processing information includes information indicating a fourth address, where the fourth address is the address of a memory space used for storing the processing result of the task; and the coprocessor receives the lower-half interrupt processing information and stores the obtained processing result of the task into the memory space corresponding to the fourth address.
In a possible embodiment, the fourth address is mapped to the coprocessor by the CC protocol.
In this manner, the main processor offloads, to the coprocessor through the CC protocol, the operation of returning the data processing result in the lower-half interrupt flow, which reduces the dependence on the processing capability of the CPU, improves the resource utilization of the CPU, and improves the scale-out capability of the coprocessor.
In a second aspect, the present application further provides a computing device, where the computing device includes a main processor, a coprocessor, and a memory, and the memory is used to store computer program instructions;
the main processor executes the program instructions in the memory to perform the operations performed by the main processor in the first aspect or any possible implementation of the first aspect; the coprocessor performs the operations performed by the coprocessor in the first aspect or any possible implementation of the first aspect. The computing device may be a server or the like.
In one possible embodiment, the main processor comprises a CPU;
the coprocessor includes a GPU, a DPU, an Application Specific Integrated Circuit (ASIC), a System On Chip (SOC), a programmable gate array (FPGA), an embedded neural Network Processor (NPU), a hardware computing engine (HW AE), a Hardware Acceleration Controller (HAC), and a CPU.
In a third aspect, the present application further provides a computer chip, where the chip is connected to a memory, and the chip is configured to read and execute a software program stored in the memory, so as to perform the operations in the first aspect and each possible implementation manner of the first aspect, or the operations in the second aspect and each possible implementation manner of the second aspect.
For the beneficial effects achieved by the second aspect to the third aspect, please refer to the description of the beneficial effects of the first aspect, which is not repeated herein.
Drawings
Fig. 1 is a schematic hardware architecture diagram of a computing device 120 according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a software architecture of a computing device 120 according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a task processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating an implementation of a task processing method according to an embodiment of the present application.
Detailed Description
First, some technical terms in the embodiments of the present application will be explained.
1, process (process);
the process may refer to a running activity of a program with a certain independent function, and is a carrier for running the application program, or may be understood as a running instance of the application program, and is a dynamic execution of the application program. For example, when a user runs the Notepad program (Notepad), the user creates a process to accommodate the code that makes up Notepad. The process is an independent unit for resource allocation and scheduling of the system, each process has a memory space of the process, and different processes are independent of each other.
2, thread (thread);
threads are created in a process, and a process usually includes multiple threads, where the multiple threads can share resources (such as CPU resources and memory resources) of the process, and a thread is a minimum unit for performing operations in the process, or a basic unit for allocating CPU time resources by the system.
Threads are concurrent, i.e., multiple concurrent threads are supported within a single process, each thread concurrently performing a different task. For example, one thread may be used to write a file to a disk, and another thread may be used to receive a key operation of a user and react in time, and the threads do not interfere with each other. The concurrency means that each thread executes in turn, and runs for a fixed time period in turn, the time period (fixed time period) for each thread to execute can be called a time slice, and since the execution efficiency of a Central Processing Unit (CPU) is very high, the time slice can be made very short, and the CPU can rapidly switch among the threads, thereby exhibiting an effect that a plurality of tasks are performed simultaneously, which is called concurrency of threads.
Specifically, the state of the thread includes, but is not limited to: a runnable/ready state, a running state, and a waiting state. The thread in the operable state is qualified to obtain the time slice, the state of the thread is the operating state after the time slice is obtained, and when the time slice is used up, the thread can be switched back to the operable state again to wait for the next scheduled execution. While a thread in a wait state is not assigned a time slice until awakened.
3, processing Control Block (PCB)/Thread Control Block (TCB)
The PCB is a data structure of the kernel of the operating system, and is used for recording relevant information of a process. PCBs include, but are not limited to, one or more of the following:
(1) Identification information for uniquely identifying a process.
(2) Program counter: the address of the next instruction to be executed.
(3) Process state: such as runnable, running, or waiting, as described above.
(4) CPU registers: such as the stack pointer and general-purpose registers.
(5) Scheduling information: including but not limited to one or more of the process state, the priority of the process, information about the run queue to which the process control block belongs, and the like.
The TCB is similar to the PCB; for example, the TCB includes identification information of the thread, the thread state, and so on. It should be understood that a data structure used to record the related information of a process is called a PCB, while a data structure used to record the related information of a thread is called a TCB; a difference is that the information of a thread may be less than that of a process.
It should be noted that, the above description does not limit the PCB or the TCB, information contained in the PCB or the TCB may be different or have different names in different systems, and the information contained in the PCB or the TCB is not limited in the embodiments of the present application.
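As a rough illustration of the fields listed above, a thread control block might be sketched in C as follows; the structure, the field names, and the state values are assumptions made only for this example and are not taken from any particular operating system's definitions.

/* A minimal, hypothetical sketch of a thread control block (TCB); the field
 * names and layout are illustrative only. */
typedef enum {
    THREAD_READY,    /* runnable: queued, waiting for a time slice */
    THREAD_RUNNING,  /* currently holding a time slice             */
    THREAD_WAITING   /* sleeping until woken up                    */
} thread_state_t;

struct tcb {
    unsigned long  tid;        /* identification information (unique thread id)   */
    thread_state_t state;      /* thread state, as described above                 */
    void          *pc;         /* program counter: next instruction to execute     */
    unsigned long  regs[16];   /* saved CPU registers, including the stack pointer */
    int            priority;   /* scheduling information                           */
    void          *run_queue;  /* run queue the thread belongs to; in the
                                  embodiments below the second address may be
                                  stored in the TCB in a similar way               */
};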
4, run queue (run queue);
the CPU may be configured to provide resources such as time slices for the threads and the threads waiting to use the resources (i.e., the threads in a runnable state) need to be queued. If queuing is needed, unified queuing at a given place is needed, and each CPU core is provided with a running queue for indicating the queued threads for the convenience of management. In the Linux kernel, this run queue is a memory variable defined for each CPU core, such as runqueue variable of struct _ rq structure, and the value of the variable may be used to indicate information of a thread, such as TCB of the thread.
The CPU may select a thread from the running queue to execute, for example, the selection policy is to select according to a queued order, or select according to a priority of the thread, and the like, which is not limited in this embodiment of the present application. The CPU selects a thread to be executed from the running queue, and acquires a TCB of the thread according to the indication of the running queue, wherein the TCB comprises information (or called context) for executing the thread, such as a program counter, a CPU register value and the like, and the CPU is switched to the context of the thread to execute the thread.
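For illustration only, a per-core run queue and the selection of the next thread could be sketched as follows; the types and the first-in-first-out policy are assumptions for this sketch and do not reproduce the actual Linux struct rq definition.

#include <stddef.h>

struct tcb;                     /* thread control block, as sketched above */

/* Hypothetical per-CPU run queue: a linked list of nodes, each indicating a
 * queued thread's TCB. */
struct rq_node {
    struct tcb     *tcb;        /* or an index value locating the TCB */
    struct rq_node *next;
};

struct run_queue {
    struct rq_node *head;       /* threads queued in the runnable state */
};

/* Pick the next thread in queued (FIFO) order; a priority-based policy could
 * be substituted here. The caller then loads the TCB's context (program
 * counter, register values) and switches to that thread. */
static struct tcb *pick_next_thread(struct run_queue *rq)
{
    struct rq_node *node = rq->head;
    if (node == NULL)
        return NULL;
    rq->head = node->next;
    return node->tcb;
}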
It should be noted that the description is provided herein for threads, and the scheduling and execution method for processes is similar to that for threads, and will not be described herein again.
5, heterogeneous computing (Heterogeneous computing);
heterogeneous computing refers to a computing manner of forming a system by using different types of computing units, where the different types may refer to computing units using different types of instruction sets or different architectures, and the computing units include processors or components with computing functions, such as network cards (NICs) configured with the processors, and the following description will be made in detail.
6, diversity computing;
On the basis of heterogeneous computing, multiple kinds of computing power cooperate in the computing process according to the application scenario, so that the overall computing power of the system is fully utilized and the performance of the system is improved.
Application scenarios of the embodiments of the present application include scenarios that require a computing device to use diverse computing power to complete a task, such as databases, big data, artificial intelligence (AI), high performance computing (HPC), storage, and web applications. The task processing method provided by the embodiments of the present application can be applied to a computing device with heterogeneous computing capability, and the computing device may be, but is not limited to, a desktop computer, a server, a notebook computer, or a mobile device.
Fig. 1 is a schematic architecture diagram of a computing device according to an embodiment of the present application. At the hardware level, the computing device 120 includes one or more main processors 123 (two main processors 123 are shown in fig. 1, but the present application is not limited to two main processors 123), one or more coprocessors 124 (three coprocessors 124 are shown in fig. 1, but the present application is not limited to three coprocessors 124), and memory 125. The host processor 123, the coprocessor 124 and the memory 125 are connected by a bus 126.
The main processor 123 may be a general-purpose processor, such as a Central Processing Unit (CPU), and may be configured to manage or schedule a task, such as scheduling the task to different coprocessors 124 for processing, where the task may be triggered by an application program by a user, or may be generated inside the computing device 120, which is not limited in this embodiment of the present application. In addition, one CPU123 may have one or more CPU cores. The number of CPUs and the number of CPU cores are not limited in this embodiment.
The coprocessor 124 has an arithmetic function, and may be configured to perform computation or processing on data, such as data de-duplication, data compression, image processing, and the like, and of course, the CPU123 also has an arithmetic function, and may be configured to perform computation or processing on data, but performance may be lower than that of the coprocessor 124, or computing resources of the CPU123 may easily become a bottleneck, so that the CPU123 may schedule tasks to the coprocessor 124 for processing, so as to improve the overall performance of the computing device 120. Illustratively, the coprocessor 124 receives a task scheduled by the CPU123, such as a task of compressing data to be compressed, and the coprocessor 124 performs the following tasks: the data to be compressed is acquired and compressed to obtain compressed data, and then the compressed data is returned to the CPU123.
The coprocessor 124, which may be a general purpose processor or a special purpose processor, such as the coprocessor 124 including, but not limited to: a Graphics Processing Unit (GPU), a data computing unit (DPU), an Application Specific Integrated Circuit (ASIC), a System On Chip (SOC), a Field Programmable Gate Array (FPGA), an embedded neural Network Processor (NPU), a hardware computing engine (HW AE), a Hardware Acceleration Controller (HAC), a CPU, and so on.
It is noted that coprocessor 124 may be a stand-alone component or may be combined with other components of computing device 120, such as a network card configured with coprocessor 124, a Solid-state drive/Solid-state disk (SSD) configured with coprocessor 124, and so on. The embodiment of the present application does not limit the form of the coprocessor, and any processor in the components having the arithmetic function may be used as the coprocessor.
The memory 125 is internal memory that directly exchanges data with the CPU123 (or with both the CPU123 and the coprocessor 124); it can be read and written at any time, is fast, and serves as temporary data storage for the operating system or other running programs.
The memory 125 includes at least two types of memory, for example, the memory 125 may be a random access memory (RAM) or a read-only memory (ROM). For example, the random access memory is a dynamic random access memory (DRAM) or a storage class memory (SCM). DRAM is a semiconductor memory and, like most random access memories (RAMs), is a volatile memory device. SCM is a hybrid storage technology that combines the features of both conventional storage devices and memory; memory-class memory provides faster read and write speeds than a hard disk, but is slower in access speed and cheaper in cost than DRAM. However, the DRAM and the SCM are only examples in this embodiment, and the memory 125 may further include other random access memories, such as a static random access memory (SRAM). The ROM may be, for example, a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), and the like. In addition, the memory 125 may also be a dual in-line memory module (DIMM), i.e., a module composed of dynamic random access memory (DRAM), or a solid state disk (SSD). In practice, multiple memories 125, as well as different types of memories 125, may be configured in the computing device 120. The number and type of the memories 125 are not limited in this embodiment. In addition, the memory 125 may be configured to have a power conservation function. The power conservation function means that when the system is powered off and powered on again, the data stored in the memory 125 will not be lost. A memory having a power conservation function is called a nonvolatile memory.
Specifically, the CPU123 and the coprocessor 124 may share the memory 125. Alternatively, the CPU123 and the coprocessor 124 may each have their own dedicated memory. As another example, the CPU123 may have its own dedicated memory (e.g., the memory 125 is the dedicated memory of the CPU123) while the coprocessor 124 has no dedicated memory of its own; this is not limited in this embodiment.
Those skilled in the art will appreciate that a cache (not shown in fig. 1) is also provided in the CPU123 and the coprocessor 124. The memory type of the cache may be SRAM, DRAM, or the like, which is not limited in this embodiment of the present application. The operation speed of the cache is generally faster than that of the memory, so data in the memory may be read into the cache; when a component such as the CPU123 needs to read data, it first searches its own cache, and only if the data is not hit in the cache does it then search the memory 125. If data in the memory 125 is shared by multiple components, such as the CPU123 and the coprocessor 124, copies of the shared data may exist in multiple caches, for example in the cache of the CPU123 and in the cache of the coprocessor 124. If one of the components modifies the data in its own cache, the data becomes inconsistent. Cache coherence (CC) is used to solve this problem; in other words, cache coherence is used to guarantee data consistency.
Specifically, the cache coherence protocol requires that a data unit in a cache (such as a cache line) has at least 4 states, namely modified, exclusive, shared, and invalid. Modified indicates that the cache line data has been modified but not yet written back to memory; data in this state exists in the cache of only one component. Exclusive indicates that the cache line data is exclusive, i.e., no copy exists in any other cache, and the data has not been modified, i.e., it is consistent with the data in memory. Shared indicates that the data in the cache is shared, that is, data in this state exists in the caches of multiple components and is consistent with the memory. Invalid indicates that the cache line data is invalid.
For a cache line in the exclusive state, any operation by another component attempting to read the memory address corresponding to the cache line needs to be monitored (snooped) at all times, and if such an operation is snooped, the cache line state is set to shared. For a cache line in the shared state, requests to invalidate the cache line or to take it exclusively need to be constantly snooped, and if such a request is snooped, the state of the cache line is set to invalid. For a cache line in the modified state, all operations attempting to read the memory address corresponding to the cache line need to be snooped at all times, and if one is snooped, the data in the cache line must be written back to memory before the operation is executed, so that data consistency is guaranteed.
It should be noted that the above manner is only an example, and the CC protocol is not limited in this embodiment.
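For readers who prefer code, the four cache-line states and one of the snooping rules described above can be pictured as follows; this is purely an illustration of the rules as summarized here, not an implementation of any specific interconnect or CC protocol.

/* The four cache-line states described above. */
typedef enum {
    LINE_MODIFIED,   /* modified locally, not yet written back to memory */
    LINE_EXCLUSIVE,  /* only cached copy, consistent with memory         */
    LINE_SHARED,     /* copies may exist in the caches of several parts  */
    LINE_INVALID     /* contents are stale and must not be used          */
} cc_line_state_t;

/* State transition when another component is snooped reading the memory
 * address backing this cache line. */
static cc_line_state_t on_snooped_read(cc_line_state_t state, int *writeback_needed)
{
    *writeback_needed = 0;
    switch (state) {
    case LINE_MODIFIED:
        *writeback_needed = 1;   /* write the line back before the read proceeds */
        return LINE_SHARED;
    case LINE_EXCLUSIVE:
        return LINE_SHARED;      /* another reader now shares the line */
    default:
        return state;
    }
}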
In this embodiment, the CPU123 may map part or all of the memory 125 that can be accessed by the CPU123 to the coprocessor 124 through cache coherency, so that the CPU123 and the coprocessor 124 may share the same internal memory space, and the coprocessor 124 maintains a mapping relationship between the mapped memory address and one or more cache lines (for storing data read from the memory space corresponding to the mapped memory address), and ensures data coherency in this manner.
Bus 126 includes, but is not limited to: an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), Gen-Z (GENZ), an open coherent accelerator processor interface (OpenCAPI), NVLink, and interconnect buses supporting multiple protocols. The embodiment of the present application does not limit the type of the bus 126, and any bus supporting the CC protocol is suitable for the embodiment of the present application.
It should be noted that the structure shown in fig. 1 does not specifically limit the computing device 120. In other embodiments of the present application, the computing device 120 may include more or fewer components than those shown; for example, the computing device 120 may also include a hard disk, a network card, etc. Some components may also be combined or split, or the components may be arranged differently.
Referring to fig. 2, at the software level, the computing device 120 runs an operating system (OS) 101 and an application (application) 102. The application 102 is a generic term for the various application programs presented to a user. The operating system 101 includes at least an OS kernel 330 and a device driver layer 331; the application 102 runs in the user mode of the operating system 101, and the OS kernel 330 and the device driver layer 331 run in the kernel mode of the operating system 101. Drivers such as the coprocessor driver 334 are installed in the device driver layer 331. The OS kernel 330 is used to manage the processes/threads of the system, the device drivers, and the like. The coprocessor driver 334 serves as the hardware interface of the coprocessor 124; for example, in fig. 2 the GPU driver 334 is the hardware interface of the GPU 124 and the DPU driver 334 is the hardware interface of the DPU 124. The operating system 101 communicates with the coprocessor 124 through the coprocessor driver 334 to control the operation of the coprocessor 124. It should be noted that the operating system 101 and the application 102 run on the CPU123; the CPU123 may call a program of the operating system 101 stored in the memory 125 to run the operating system 101.
In the task processing method provided in the embodiment of the present application, the CPU123 uses a thread to schedule a task to the coprocessor 124, after which the thread enters a waiting state; after the coprocessor 124 completes the task, the CPU123 wakes up the thread based on the CC protocol. With this design, the coprocessor 124 does not need to notify the CPU123 through an interrupt request in order to wake up the thread, and the CPU123 does not need to poll the coprocessor 124 in order to wake up the thread, so CPU overhead can be reduced, the CPU utilization rate is improved, and it is beneficial to reducing latency and improving system performance.
Next, a task processing method provided in an embodiment of the present application is described by taking the computing device 120 shown in fig. 1 and fig. 2 as an example. For convenience of explanation, the following embodiments of the present application will be described by taking a CPU123, a coprocessor 124, and threads as examples.
Referring to fig. 3, fig. 3 is a schematic flowchart of a task processing method provided in an embodiment of the present application, and as shown in fig. 3, the method includes the following steps:
In step 300, the OS kernel 330 maps the memory space used for storing the information of a thread of the application 102 (hereinafter referred to as the application thread) to the coprocessor 124 through the CC protocol.
The information of the thread includes, but is not limited to, the thread control block of the thread. In one embodiment, when the OS kernel 330 creates the application thread, a kernel stack is created in the memory 125, and the kernel stack is used for storing the thread control block of the application thread; for example, the thread control block may be located at the top of the kernel stack, and the OS kernel 330 may map the top position of the kernel stack to the coprocessor 124 through the CC protocol. For convenience of description, the address of this memory space after CC mapping is referred to below as the first address. It is understood that, after the mapping through the CC protocol, the coprocessor 124 may snoop on and access the memory space corresponding to the first address. Notably, the application thread is used to perform a task triggered by the application 102; for example, the application 102 invokes the coprocessor driver 334 and sends a task request to the coprocessor driver 334.
In one embodiment, the OS kernel 330 may map the memory space used to store the information of the application thread globally, i.e., to each coprocessor. For example, if the coprocessors of the computing device 120 include a GPU and a DPU, the OS kernel 330 may map the information of the application thread to both the GPU and the DPU, so that when the application thread subsequently calls the GPU or the DPU, the information of the application thread does not need to be mapped separately each time, which is beneficial to reducing latency and improving system performance.
In another embodiment, the OS kernel 330 may map the memory space used for storing the information of the application thread to a designated coprocessor. For example, when the application thread calls the GPU, the OS kernel 330 maps the memory space used for storing the information of the application thread to the GPU, and when the application thread later calls the DPU, the OS kernel 330 maps the memory space used for storing the information of the application thread to the DPU. It should be noted that, in this embodiment, the mapping operation may be performed after step 302.
In step 301, the OS kernel 330 maps the memory space where the run queue to which the application thread belongs is located to the coprocessor 124 through the CC protocol.
In one embodiment, the run queue may be a shared run queue on CPU123, or may be a dedicated run queue generated by CPU123 for coprocessor 124, and the dedicated run queue may be used only for indicating information of threads related to coprocessor 124. The embodiment of the present application does not limit the run queue.
For the way that the OS kernel 330 maps the memory space where the run queue is located to the coprocessor 124, see the above description, for example, the OS kernel 330 may map the memory space where the run queue is located globally, that is, to each coprocessor, and details are not described here again. For convenience of description, an address of the memory space where the run queue is located after being mapped by the CC protocol is referred to as a second address. In one embodiment, the second address is stored in a thread control block of the application thread.
It should be noted that there is no strict timing limitation between step 300 and step 301, and the steps may be executed simultaneously, or step 300 may be executed first and then step 301 is executed, or step 301 is executed first and then step 300 is executed, which is not limited in this embodiment of the present application.
At step 302, application 102 sends a task request to coprocessor driver 334.
The task request is used to request the coprocessor 124 to perform a task, such as indicating that data to be compressed is to be compressed. Illustratively, the task request may also be used to indicate a memory space for storing data to be processed (hereinafter referred to as a first memory space) and a memory space for storing data processing results (hereinafter referred to as a second memory space).
In step 303, coprocessor driver 334 requests OS kernel 330 to obtain address information for the thread control block for the application thread.
In an alternative approach, the coprocessor driver 334 may also request the OS kernel 330 to map the first memory space and/or the lower-half interrupt information (e.g., the second memory space) to the coprocessor 124 through the CC protocol. In the following, the address of the first memory space after CC mapping is referred to as the third address, and the address of the second memory space after CC mapping is referred to as the fourth address. Correspondingly, in step 304, the OS kernel 330 sends the first address to the coprocessor driver 334, and optionally sends one or more of the second address, the third address, and the fourth address to the coprocessor driver 334. One or more of the first address, the second address, the third address, and the fourth address may be sent together or independently, which is not limited in this application.
For example, the OS kernel 330 may send the second address to the coprocessor driver 334 together with the first address, or may send the second address to the coprocessor driver 334 separately, that is, the first address and the second address are sent to the coprocessor driver 334 through different messages. It is noted that, if the thread control block indicated by the first address has the second address stored in it, then when the OS kernel 330 sends the first address to the coprocessor driver 334 it may no longer need to send the second address; the coprocessor driver 334 may retrieve the second address from the thread control block.
It should be noted that, in addition to the address information of the memory space (i.e., the second memory space) for storing the data processing result, the lower-half-interrupt information may also include other information, which is not limited in this embodiment of the present application. Other information in the lower half-interrupt information may also be mapped to the coprocessor 124 through the CC protocol, and the mapping manner refers to the above description, which is not described herein again.
In step 305, the coprocessor driver 334 sends a request command to the coprocessor 124. Correspondingly, the coprocessor 124 receives the request command.
The request command is used to request the coprocessor 124 to perform the task. Illustratively, the request command includes the first address and optionally includes, but is not limited to, one or more of the following: the second address, the third address, the fourth address, and the task request in step 302.
In addition, the coprocessor driver 334 calls the OS kernel to make the application thread sleep, i.e., switch to the waiting state; optionally, the OS kernel may then select a new thread from the run queue and switch to that thread for execution.
For convenience of description, it is assumed that the request command includes a first address, a second address, a third address, and a fourth address.
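To make this assumption concrete, the request command could be laid out roughly as follows; the structure, the field names and widths, and the opcode/length fields are hypothetical additions for illustration only; the embodiments merely require that the command indicate these addresses in some form.

#include <stdint.h>

/* Hypothetical layout of the request command assumed in this example. */
struct request_command {
    uint64_t first_addr;    /* CC-mapped address of the application thread's TCB */
    uint64_t second_addr;   /* CC-mapped address of the run queue (may instead be
                               read out of the TCB when it is stored there)       */
    uint64_t third_addr;    /* CC-mapped address of the data to be processed      */
    uint64_t fourth_addr;   /* CC-mapped address to backfill with the result      */
    uint32_t opcode;        /* what to do with the data, e.g. compress            */
    uint32_t data_len;      /* length of the data to be processed                 */
};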
In step 306, the coprocessor 124 obtains the data to be processed from the memory space corresponding to the third address, and processes the data to be processed, so as to obtain a data processing result.
In step 307, the coprocessor 124 stores the data processing result to the memory space corresponding to the fourth address.
It should be noted that step 306 and step 307 are optional steps; if the request command does not include the third address and the fourth address, step 306 and step 307 are not executed. The coprocessor 124 may acquire the data to be processed and return the data processing result in other manners, and the present application does not limit the manner in which the coprocessor 124 acquires the data to be processed and the lower-half interrupt information.
Based on the above design, the operation of returning the data processing result in the lower-half interrupt flow is offloaded to the coprocessor 124 through the CC protocol, so that the dependence on the processing capability of the CPU123 is reduced, the resource utilization rate of the CPU123 is improved, and the scale-out capability of the coprocessor 124 is improved.
In step 308, the coprocessor 124 obtains the information of the thread control block from the memory space corresponding to the first address.
In step 309, the coprocessor 124 stores the information of the thread control block into the memory space corresponding to the second address, i.e. the running queue.
It should be noted that storing here may refer to storing the thread control block itself in the run queue, or storing information indicating the thread control block in the run queue. For example, the run queue may be a linked list composed of nodes, where a node may include information indicating the thread control block of a thread, such as an index value of the thread control block; the index value can be used to locate the thread control block.
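To make steps 308 and 309 concrete, the coprocessor-side behavior might be sketched as below; the function and type names are assumptions, the structures mirror the run-queue sketch given earlier, and a real device would additionally need whatever locking or atomic operations the bus and the host scheduler require.

#include <stdint.h>

struct tcb;                            /* thread control block, as sketched earlier */

struct rq_node {                       /* node and queue shapes as in the earlier   */
    struct tcb     *tcb;               /* run-queue sketch                           */
    struct rq_node *next;
};

struct run_queue {
    struct rq_node *head;
};

/* Illustrative coprocessor-side handling of steps 308 and 309: after the task
 * finishes, read the TCB from the CC-mapped first address and link it into the
 * run queue found at the CC-mapped second address. */
static void offload_wakeup(uint64_t first_addr, uint64_t second_addr,
                           struct rq_node *node /* storage for the new entry */)
{
    struct tcb       *tcb = (struct tcb *)(uintptr_t)first_addr;
    struct run_queue *rq  = (struct run_queue *)(uintptr_t)second_addr;

    node->tcb  = tcb;                  /* store (a reference to) the thread's TCB   */
    node->next = rq->head;             /* at the head of the host-maintained queue; */
    rq->head   = node;                 /* the CC protocol keeps the host's view of  */
}                                      /* this memory coherent                      */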
In step 310, the OS kernel 330 retrieves the thread control block for the application thread from the run queue and wakes the application thread based on the thread control block.
Specifically, waking up the application thread may refer to the CPU123 switching to the context of the application thread based on the information recorded in the thread control block, so as to continue running the application thread.
In step 311, the coprocessor driver 334 obtains the data processing result from the memory space corresponding to the fourth address, and returns the data processing result to the application 102.
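A hypothetical host-side counterpart of step 310, building on the earlier sketches, is shown below; pick_next_thread() refers to the earlier run-queue sketch, and switch_to_context() stands in for the architecture-specific context restore and is not a real API.

struct tcb;
struct run_queue;

extern struct tcb *pick_next_thread(struct run_queue *rq);  /* earlier sketch */
extern void        switch_to_context(struct tcb *t);        /* placeholder    */

void resume_woken_thread(struct run_queue *rq)
{
    struct tcb *next = pick_next_thread(rq);   /* finds the TCB stored in step 309 */
    if (next != NULL)
        switch_to_context(next);               /* continue running the application */
}                                              /* thread from its saved context    */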
It should be noted that the embodiment shown in fig. 3 is described by taking a thread as an example; the thread in the above embodiment may be replaced by a process, and in actual applications a process may also be managed and run by this method, which is not described here again.
In the above manner, the CPU123 sends a request command to the coprocessor 124 to request the coprocessor 124 to execute a task, and the request command indicates the first address of the information of the thread (related to the task) on the CPU123 and the second address of the run queue maintained by the CPU123. After the coprocessor 124 completes the task, it can obtain the information of the thread from the first address and store the obtained information of the thread into the run queue maintained by the CPU123 according to the second address, and the CPU123 can then obtain the information of the thread based on the contents of the run queue and wake up the thread. Thus, the coprocessor 124 is not required to notify the CPU123 through an interrupt request, and the CPU123 is not required to poll the coprocessor 124 about whether the task is completed, thereby reducing the resource overhead of the CPU123 and helping to improve the utilization rate of the CPU123 and the system performance.
Fig. 4 is a diagram illustrating an application example of a task processing method according to an embodiment of the present application. In fig. 4, the task processing method includes the following steps:
(0) When the OS kernel on the CPU side is started, a dedicated run queue of a coprocessor (e.g., an HAC) may be created on each CPU core, and the address mapping of the run queue is completed using the CC protocol (hereinafter, the address of the run queue after CC-protocol mapping is referred to as the RQ address for short), so as to map the address of the memory space of the run queue to the HAC. When the application is started, the OS kernel also performs, for the thread of the application, address mapping of the thread control block of the application using the CC protocol (the address of the thread control block after CC-protocol mapping is referred to below as the TCB address), so that the thread control block data in the top space of the stack can be accessed by the hardware acceleration engine.
(1) When the application calls the HAC driver, the application stores the data to be processed in a certain memory space in the memory 125, and applies for a memory space for backfilling the data processing result. Illustratively, the memory space may be a memory space available for CC protocol mapping, and the address of the memory space mapped by the CC protocol is assumed to be ADD1. Similarly, the memory space for storing the data processing result may also be a memory space that can be used for CC protocol mapping, and it is assumed that the address of the memory space after CC protocol mapping is ADD2.
(2) The HAC driver encapsulates the RQ address, ADD1, ADD2, and TCB address as a request command and sends it to the HAC (e.g., HAC 1) over bus 126, after which the HAC driver enters a wait state to wait for the request command to complete. It should be noted that if the thread control block stores the address mapped by the dedicated run queue of the HAC, i.e. the RQ address, the request command may not carry the RQ address.
(3) HAC1 receives the request command, acquires the data to be processed from the memory space corresponding to ADD1, and processes the data to be processed to obtain processed data; for example, if the request command requests compression of the data to be processed, HAC1 compresses the data to be processed to obtain compressed data.
(4) HAC1 stores the processed data into the memory space corresponding to ADD2.
(5) HAC1 acquires the data of the thread control block from the memory space corresponding to the TCB address and stores the thread control block into the dedicated run queue indicated by the RQ address. As shown in FIG. 4, step 5 indicates storing the thread control block into the dedicated run queue on CPU 0.
The OS kernel then wakes up the application thread from the run queue, and the HAC driver returns the data processing result in the memory space corresponding to ADD2 to the application.
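As a usage-level sketch of step (2) above, the HAC driver might pack the CC-mapped addresses into a request command, hand it to HAC1 over the bus, and put the calling application thread to sleep until the HAC wakes it through the run queue; hac_send() and thread_sleep_current() are placeholders rather than real driver interfaces, and the command layout repeats the hypothetical structure sketched for step 305.

#include <stdint.h>

struct request_command {               /* as sketched for step 305 */
    uint64_t first_addr;               /* TCB address               */
    uint64_t second_addr;              /* RQ address                */
    uint64_t third_addr;               /* ADD1: data to process     */
    uint64_t fourth_addr;              /* ADD2: result backfill     */
    uint32_t opcode;
    uint32_t data_len;
};

extern int  hac_send(int hac_id, const struct request_command *cmd);  /* placeholder */
extern void thread_sleep_current(void);                               /* placeholder */

int hac_driver_submit(uint64_t tcb_addr, uint64_t rq_addr,
                      uint64_t add1, uint64_t add2, uint32_t len)
{
    struct request_command cmd = {
        .first_addr  = tcb_addr,
        .second_addr = rq_addr,        /* may be omitted if stored in the TCB */
        .third_addr  = add1,
        .fourth_addr = add2,
        .opcode      = 1,              /* e.g. "compress", per the example above */
        .data_len    = len,
    };

    int ret = hac_send(1, &cmd);       /* send to HAC1 over the bus */
    if (ret == 0)
        thread_sleep_current();        /* wait until step (5) puts the TCB back on
                                          the run queue and the OS kernel wakes us */
    return ret;
}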
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
The various illustrative logical units and circuits described in this application may be implemented or operated upon by design of a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in the embodiments herein may be embodied directly in hardware, in a software element executed by a processor, or in a combination of the two. The software element may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include such modifications and variations.

Claims (10)

1. A task processing method applied to a computing device including at least a main processor and a coprocessor, the method comprising:
the main processor sends a request command to the coprocessor, the request command is used for requesting the coprocessor to process tasks, and the main processor maintains a running queue; the request command is used for indicating a first address and a second address, wherein the first address is an address of a memory space where information of a thread of the task executed by the main processor is located; the second address is an address of a memory space where the running queue is located;
and the coprocessor executes the task, acquires the information of the thread according to the memory space corresponding to the first address after the task is executed, and stores the information of the thread into the running queue indicated by the second address.
2. The method of claim 1, wherein the host processor retrieves information about the thread from the run queue and wakes the thread.
3. The method of claim 1 or 2, wherein the first address and the second address are both mapped to the coprocessor by a cache coherent CC protocol.
4. The method of claim 1 or 2, wherein the information of the thread is Thread Control Block (TCB) information of the thread.
5. The method of any of claims 1-4, wherein the task is to instruct processing of data to be processed;
the method further comprises the following steps:
the main processor sends information used for indicating a third address to the coprocessor, wherein the third address is an address of a memory space used for storing the data to be processed;
the coprocessor receives the information indicating the third address;
the coprocessor performs the tasks, including: and acquiring the data to be processed from the memory space corresponding to the third address.
6. The method of claim 5, wherein the third address is mapped to the coprocessor via a CC protocol.
7. The method of any one of claims 1-6, further comprising:
the main processor sends the lower half interrupt processing information to the coprocessor; the lower half interrupt processing information includes information indicating a fourth address, where the fourth address is an address of a memory space for storing a processing result of the task;
and the coprocessor receives the processing information of the lower half part of the interrupt and stores the obtained processing result of the task to the memory space corresponding to the fourth address.
8. The method of claim 7, wherein the fourth address is mapped to the coprocessor via a CC protocol.
9. A computing device, comprising a main processor, a coprocessor, and a memory;
the memory is used for storing computer program instructions;
the main processor executing instructions that invoke the computer program in the memory to perform the method performed by the main processor according to any one of claims 1 to 8;
the coprocessor is configured to perform a method as claimed in any one of claims 1 to 8.
10. The computing device of claim 9, wherein the main processor comprises a Central Processing Unit (CPU);
the coprocessor comprises a Graphic Processing Unit (GPU), a data computing unit (DPU), a special application integrated circuit (ASIC), a system-on-chip (SOC), a programmable gate array (FPGA), an embedded neural Network Processor (NPU), a hardware computing engine (HW AE), a Hardware Acceleration Controller (HAC) and a Central Processing Unit (CPU).
CN202111205427.1A 2021-10-15 2021-10-15 Task processing method and device Pending CN115981833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111205427.1A CN115981833A (en) 2021-10-15 2021-10-15 Task processing method and device

Publications (1)

Publication Number Publication Date
CN115981833A true CN115981833A (en) 2023-04-18

Family

ID=85968714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111205427.1A Pending CN115981833A (en) 2021-10-15 2021-10-15 Task processing method and device

Country Status (1)

Country Link
CN (1) CN115981833A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116301663A (en) * 2023-05-12 2023-06-23 新华三技术有限公司 Data storage method, device and host
CN116521606A (en) * 2023-06-27 2023-08-01 太初(无锡)电子科技有限公司 Task processing method, device, computing equipment and storage medium
CN116521606B (en) * 2023-06-27 2023-09-05 太初(无锡)电子科技有限公司 Task processing method, device, computing equipment and storage medium
CN117171075A (en) * 2023-10-27 2023-12-05 上海芯联芯智能科技有限公司 Electronic equipment and task processing method
CN117171075B (en) * 2023-10-27 2024-02-06 上海芯联芯智能科技有限公司 Electronic equipment and task processing method

Legal Events

Date Code Title Description
PB01 Publication