CN114124589A - SOC intelligent network card and task scheduling method - Google Patents


Info

Publication number
CN114124589A
CN114124589A (application number CN202111281620.3A)
Authority
CN
China
Prior art keywords
scheduling
task
core unit
fcfs
dwrr
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202111281620.3A
Other languages
Chinese (zh)
Inventor
温强
Current Assignee (may be inaccurate)
Beijing Weilang Technology Co ltd
Original Assignee
Beijing Weilang Technology Co ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Weilang Technology Co ltd filed Critical Beijing Weilang Technology Co ltd
Priority to CN202111281620.3A priority Critical patent/CN114124589A/en
Publication of CN114124589A publication Critical patent/CN114124589A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00: Data switching networks
    • H04L12/02: Details


Abstract

The invention discloses an SOC intelligent network card and a task scheduling method. The SOC intelligent network card comprises a host communication module and a hybrid scheduling module. The host communication module communicates with a host server to obtain task requests. The hybrid scheduling module comprises an FCFS scheduling core unit, a DWRR scheduling core unit, a monitoring unit and a scheduling core selection unit. The FCFS scheduling core unit receives and executes task requests. The DWRR scheduling core unit receives a task from the FCFS scheduling core unit when the tail delay of that task on the FCFS scheduling core unit exceeds a first tail delay threshold. The monitoring unit detects whether a task's tail delay exceeds the first tail delay threshold or falls below a second tail delay threshold. The scheduling core selection unit assigns tasks whose tail delay exceeds the first tail delay threshold to the DWRR scheduling core unit, and tasks whose tail delay falls below the second tail delay threshold to the FCFS scheduling core unit.

Description

SOC intelligent network card and task scheduling method
Technical Field
The invention relates to the technical field of intelligent network card application, in particular to an SOC intelligent network card and a task scheduling method.
Background
Data center servers (host servers) now typically host a wide variety of applications, especially distributed applications and competing multi-tenant applications. These applications offload different kinds of work and have different computational requirements. More importantly, their execution behavior on the computing modules also differs: the execution time of different offloaded tasks can vary by an order of magnitude, as can the number of cycles the host server spends on them. Disordered resource sharing in the data center, interference among tenants and bursty load make long tail delay severe, which seriously degrades user experience.
In addition, the tail latency of Remote Procedure Calls (RPCs) differs across applications. High tail latency has two main causes: first, the memory and cache hierarchy lies on the critical path, so memory accesses of different programs interfere with each other and compete for resources; second, sub-optimal scheduling.
In the prior art, network functions are offloaded to FPGA-based intelligent network cards, as in ClickNP and Amazon's cloud. These solutions mainly apply traditional domain-specific acceleration to offload some host-server applications to the FPGA for execution. Although such applications exhibit enough parallelism and determinism to benefit from customized logic on the FPGA, applications with complex data structures and algorithms cannot be realized on an FPGA-based intelligent network card, and implementing them is slow and difficult.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art.
In order to solve the above technical problem, the technical solution adopted by the present invention is to provide an SOC intelligent network card, including:
the host communication module is used for communicating with the host server to acquire the task request;
the hybrid scheduling module comprises an FCFS scheduling core unit, a DWRR scheduling core unit, a monitoring unit and a scheduling core selection unit; the FCFS scheduling core unit is used for receiving and executing the task request; the DWRR scheduling core unit is used for receiving a task from the FCFS scheduling core unit and generating a DWRR scheduling queue when the tail delay of the task executed on the FCFS scheduling core unit exceeds a first tail delay threshold; the monitoring unit is configured to detect whether the tail delay of a task executing on the FCFS scheduling core unit exceeds the first tail delay threshold and whether the tail delay of a task executing on the DWRR scheduling core unit falls below a second tail delay threshold; the scheduling core selection unit is configured to move tasks whose tail delay on the FCFS scheduling core unit exceeds the first tail delay threshold to the DWRR scheduling core unit, and tasks whose tail delay on the DWRR scheduling core unit falls below the second tail delay threshold to the FCFS scheduling core unit.
In the above apparatus, the first tail delay threshold is preset on the FCFS scheduling core unit. It is a statistic of the tail delays of tasks executed by an existing intelligent network card under the FCFS scheduling algorithm; modeling those delays as Gaussian, the threshold is taken at μ + 3σ.
In the above apparatus, the second tail delay threshold is preset on the DWRR scheduling core unit. It is the corresponding μ + 3σ statistic of the tail delays of tasks executed by an existing intelligent network card under the DWRR scheduling algorithm.
In the above apparatus, an average request delay threshold is preset on the FCFS scheduling core unit. When the monitoring unit detects that the average request delay of the FCFS scheduling core unit's work exceeds this threshold, the host communication module migrates the task with the highest load ratio on the intelligent network card to the host server; when the average request delay falls below the threshold, some load is pulled from the host server to the intelligent network card through the host communication module.
In the above apparatus, the average request delay threshold refers to an average value of request delays when all the computation cores on the smart network card process different tasks.
In the above apparatus, the DWRR scheduling core unit is preset with a deficit delay threshold, and when the monitoring unit detects that a deficit counter of a task in the DWRR scheduling queue is greater than the deficit delay threshold, the task is preferentially run.
In the above apparatus, the deficit delay threshold is the value of the deficit counter when the task delay reaches (1 - α) × the second tail delay threshold, where α is a hysteresis factor.
The invention also provides a task scheduling method, which is applied to the SOC intelligent network card, wherein the SOC intelligent network card comprises a host communication module and a hybrid scheduling module, and the hybrid scheduling module comprises an FCFS scheduling core unit, a DWRR scheduling core unit, a monitoring unit and a scheduling core selection unit;
the host communication module acquires a task request from a host server;
the host communication module sends the task request to the FCFS dispatching core unit, and a first tail delay threshold value is preset on the FCFS dispatching core unit;
when the monitoring unit detects that the tail delay of the task executed on the FCFS scheduling core unit exceeds the first tail delay threshold, the scheduling core selecting unit allocates the task to the DWRR scheduling core unit for execution, and generates a DWRR scheduling queue on the DWRR scheduling core unit, wherein a second tail delay threshold is preset on the DWRR scheduling core unit;
when the monitoring unit detects that the tail delay of the task executed on the DWRR scheduling core unit is lower than the second tail delay threshold value, the scheduling core selecting unit allocates the task to the FCFS scheduling core unit for execution.
In the method, an average request delay threshold is preset on the FCFS scheduling core unit. When the monitoring unit detects that the average request delay of the FCFS scheduling core unit's work exceeds this threshold, the host communication module migrates the task with the highest load ratio on the intelligent network card to the host server; when the average request delay falls below the threshold, some load is pulled from the host server to the intelligent network card through the host communication module.
In the method, a deficit delay threshold is preset on the DWRR scheduling core unit, and when the monitoring unit detects that a deficit counter of a task in the DWRR scheduling queue is greater than the deficit delay threshold, the task is preferentially run.
According to the technical solution provided by the application, at least the following beneficial effects are obtained. With the SOC intelligent network card of the present application, the monitoring unit in the hybrid scheduling module tracks task execution and the related delay information in real time; based on that information, the scheduling core selection unit assigns each task to the FCFS scheduling core unit or the DWRR scheduling core unit for execution. This exploits the strengths of both units and saves task scheduling time. Especially for tasks with variable execution cost, it maximizes the utilization of the intelligent network card's computing resources and preserves computing efficiency, without increasing tail delay or impairing the card's traffic-forwarding capacity.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a block diagram of a structure of an SOC intelligent network card provided in the embodiment of the present application;
fig. 2 is a flowchart of a task scheduling method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The terms appearing in the embodiments of the present application are explained below:
the high variance task comprises the following steps: meaning that the response times of the application services are more diverse.
Low variance tasks: the application service has low response time dispersion and small difference.
FCFS scheduling core unit: performing FCFS algorithm (First Come First serve algorithm)
DWRR dispatching core unit: performing DWRR algorithm (differential Weighted Round Robin, differential weight polling algorithm)
A deficit counter: i.e., the down counter, is given an initial value, the count is gradually decremented.
Tail delay: the delay of a small number of responses in the system is higher than the delay of the mean, i.e. a high delay that is rarely seen by the client.
The application provides an SOC intelligent network card and a task scheduling method, aiming to solve the low computing-resource utilization and high tail delay of existing intelligent network cards.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the first aspect of the present application provides an SOC intelligent network card, where a hybrid scheduling module is added to an existing SOC intelligent network card architecture, as shown in fig. 1, the SOC intelligent network card architecture in the present application includes a computing module 10, a board-level storage module 20, a traffic scheduling and control module 30, a host communication module 40, and a hybrid scheduling module 50.
The computing module 10 includes a general-purpose ARM or MIPS multi-core processor, hardware accelerators (e.g., cryptography, pattern-matching engines, neural network accelerators), and dedicated functions (e.g., encryption/decryption, hashing, pattern matching, compression) for packet processing (e.g., deep packet inspection, packet buffer management).
The board-level storage module 20 mainly includes fast self-managed memory, slower L2/DRAM, the processor's L1/L2 caches, the private caches of the various accelerators, and the SSD (solid-state drive) and DRAM (dynamic random access memory) external to the intelligent network card.
The traffic scheduling and control module 30 includes a traffic control module for transmitting data packets between the TX/RX ports and the data packet buffers, and an internal traffic manager for transmitting the data packets to the core of the intelligent network card or a switch of the intelligent network card.
The host communication module 40 includes high bandwidth ports, RDMA, link layer interfaces, PCIe and DMA engines, and the like. The host communication module 40 is used to communicate with a host server to obtain task requests.
Hybrid scheduling module 50 includes FCFS scheduling core unit 51, DWRR scheduling core unit 52, scheduling core selection unit 53, and monitoring unit 54. The FCFS scheduling core unit 51 receives and executes task requests. The DWRR scheduling core unit 52 receives a task from the FCFS scheduling core unit 51 and generates a DWRR scheduling queue when the tail delay of that task on the FCFS scheduling core unit 51 exceeds a first tail delay threshold. Scheduling core selection unit 53 moves tasks whose tail delay on FCFS scheduling core unit 51 exceeds the first tail delay threshold to DWRR scheduling core unit 52, and tasks whose tail delay on DWRR scheduling core unit 52 falls below a second tail delay threshold back to FCFS scheduling core unit 51. Monitoring unit 54 detects whether the tail delay of a task executing on FCFS scheduling core unit 51 exceeds the first tail delay threshold, and whether the tail delay of a task executing on DWRR scheduling core unit 52 falls below the second tail delay threshold. To move tasks between FCFS scheduling core unit 51 and DWRR scheduling core unit 52 efficiently and accurately, monitoring unit 54 collects the following information: (1) the request execution delay distribution of each task; (2) the execution cost and variance of each offloaded task; (3) the utilization of each processor core on the SOC intelligent network card, including the CPU cores and the various hardware accelerators.
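The core-selection policy described above can be sketched in Python; the function name and string labels are illustrative, not the patent's:

```python
def select_core(current_core, task_tail_delay, first_threshold, second_threshold):
    """Hybrid policy sketch: promote high-variance tasks from the FCFS core
    to the DWRR core, and demote low-variance tasks back again."""
    if current_core == "FCFS" and task_tail_delay > first_threshold:
        return "DWRR"  # tail delay exceeded the first threshold: high variance
    if current_core == "DWRR" and task_tail_delay < second_threshold:
        return "FCFS"  # tail delay fell below the second threshold: low variance
    return current_core  # otherwise the task stays where it is

print(select_core("FCFS", 12.0, first_threshold=10.0, second_threshold=5.0))
```

The gap between the two thresholds gives the policy hysteresis, so a task near one boundary does not bounce between the two scheduling cores on every measurement.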
It should be noted that the modules and units of the intelligent network card are interconnected through a high-bandwidth memory bus. These computing resources allow hosts to offload the data center's general-purpose computing burden (including complex algorithms and data structures) without sacrificing performance or program generality.
In some embodiments of the present application, the FCFS scheduling core unit 51 is preset with a first tail delay threshold: a statistic of the tail delays of tasks executed by an existing intelligent network card under the FCFS scheduling algorithm, taken at μ + 3σ of their Gaussian distribution.
In some embodiments of the present application, the DWRR scheduling core unit 52 is preset with a second tail delay threshold: the corresponding μ + 3σ statistic of the tail delays of tasks executed by an existing intelligent network card under the DWRR scheduling algorithm.
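The μ + 3σ rule for both thresholds can be sketched as follows; the sample delay values are illustrative:

```python
import statistics

def tail_delay_threshold(observed_tail_delays):
    """mu + 3*sigma over measured per-task tail delays, per the rule above."""
    mu = statistics.mean(observed_tail_delays)
    sigma = statistics.pstdev(observed_tail_delays)
    return mu + 3 * sigma

# Hypothetical tail-delay measurements gathered under the FCFS scheduling algorithm.
fcfs_samples = [2.0, 2.2, 1.9, 2.1, 2.3]
first_threshold = tail_delay_threshold(fcfs_samples)
print(first_threshold)
```

Under a Gaussian model, μ + 3σ covers about 99.87% of samples, so only genuinely anomalous tail delays trip the threshold.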
In some specific embodiments of the present application, the FCFS scheduling core unit 51 is further preset with an average request delay threshold. When the monitoring unit 54 detects that the average request delay of the FCFS scheduling core unit 51's work exceeds this threshold, the host communication module 40 migrates the task with the highest load ratio on the intelligent network card to the host server; when the average request delay falls below the threshold, some load is pulled from the host server to the intelligent network card through the host communication module 40.
In some embodiments of the present application, the average request delay threshold is the average request delay over all the computing cores on the intelligent network card while they process different tasks.
In some embodiments of the present application, the DWRR scheduling core unit 52 is further preset with a deficit delay threshold, and when the monitoring unit 54 detects that the deficit counter of the task in the DWRR scheduling queue is greater than the deficit delay threshold, the task is preferentially executed.
In some embodiments of the present application, the deficit delay threshold refers to a value of a deficit counter when the task delay reaches (1- α) × the second tail delay threshold, α representing a hysteresis factor.
With the SOC intelligent network card of the present application, the monitoring unit in the hybrid scheduling module tracks task execution and the related delay information in real time; based on that information, the scheduling core selection unit assigns each task to the FCFS scheduling core unit or the DWRR scheduling core unit for execution. This exploits the strengths of both units and saves task scheduling time. Especially for tasks with variable execution cost, it maximizes the utilization of the intelligent network card's computing resources and preserves computing efficiency, without increasing tail delay or impairing the card's traffic-forwarding capacity.
The embodiment of the second aspect of the present application provides a task scheduling method, based on the SOC intelligent network card provided in the embodiment of the first aspect, where the SOC intelligent network card includes a host communication module 40 and a hybrid scheduling module 50, and the hybrid scheduling module 50 includes an FCFS scheduling core unit 51, a DWRR scheduling core unit 52, a scheduling core selection unit 53, and a monitoring unit 54. As shown in fig. 2, the task scheduling method includes the following steps:
step S110: the host communication module obtains a task request from a host server.
In the present application, the task request includes, but is not limited to, an application offload task; the intelligent network card receives the corresponding task request from the host server.
Specifically, the host server first creates a control command for DMA (Direct Memory Access) or another interface, including a command header, a packet-buffer address and other information, and writes it into a command ring. The DMA engine in the host communication module fetches the command from the command ring, reads the data from the host server's memory and writes it into the board-level storage module of the intelligent network card. The data packet is then processed according to its processor type, and the processed data is sent to the transceiver interface through the DMA engine.
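The command-ring handoff described above can be sketched as a producer/consumer queue; the class and field names are illustrative, not the patent's:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Command:
    header: str          # command type
    buffer_address: int  # host packet-buffer address

class CommandRing:
    """Toy model of the host-to-NIC command ring."""
    def __init__(self):
        self.ring = deque()

    def host_write(self, cmd):
        # Host server writes a control command into the command ring.
        self.ring.append(cmd)

    def dma_fetch(self):
        # The NIC's DMA engine takes the next command out of the ring.
        return self.ring.popleft() if self.ring else None

ring = CommandRing()
ring.host_write(Command("offload", 0x1000))
cmd = ring.dma_fetch()
print(cmd.header, hex(cmd.buffer_address))
```

A real command ring is a fixed-size circular buffer in shared memory with head/tail indices; the deque here only models the FIFO ordering of the handoff.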
Step S120: the host communication module sends the task request to an FCFS (FCFS) scheduling core unit, and a first tail delay threshold value is preset on the FCFS scheduling core unit.
In the application, after receiving a task request, the intelligent network card first sends it to the FCFS scheduling core unit in the hybrid scheduling module, which executes tasks in its scheduling core according to the first-come, first-served rule.
It should be noted that the first tail delay threshold is a statistic of the tail delays of tasks executed by the existing intelligent network card under the FCFS scheduling algorithm, taken at μ + 3σ of their Gaussian distribution.
It should be noted that the number of the FCFS scheduling core units set on the smart network card is not limited to one, and all the FCFS scheduling core units share one task running queue to fully utilize the parallelism of task execution.
Step S130: when the monitoring unit detects that the tail delay of the task executed on the FCFS scheduling core unit exceeds a first tail delay threshold value, the scheduling core selecting unit allocates the task to the DWRR scheduling core unit to be executed, and generates a DWRR scheduling queue on the DWRR scheduling core unit, and a second tail delay threshold value is preset on the DWRR scheduling core unit.
In the application, when the monitoring unit detects that the tail delay of a task executed on the FCFS scheduling core unit exceeds the first tail delay threshold, the task is a high-variance task. The scheduling core selection unit evicts it to the DWRR scheduling core unit, which receives the task, places it in a DWRR scheduling queue generated on the DWRR scheduling core unit, and executes the queued tasks according to the deficit weighted round robin scheduling algorithm.
Furthermore, a deficit delay threshold is preset on the DWRR scheduling core unit. After the DWRR scheduling queue is generated, the DWRR scheduling core unit scans the tasks in the queue in a polling manner; when a task's deficit counter exceeds the deficit delay threshold, that task is executed preferentially.
It should be noted that the second tail delay threshold is a statistic of the tail delays of tasks executed by the existing intelligent network card under the DWRR scheduling algorithm, taken at μ + 3σ of their Gaussian distribution.
The deficit delay threshold is the value of the deficit counter when the task delay reaches (1 - α) × the second tail delay threshold, where α is a hysteresis factor.
It should be noted that the number of the DWRR scheduling core units arranged on the intelligent network card is not limited to one, and all the DWRR scheduling core units share one task running queue to fully utilize the parallelism of task execution.
Step S140: when the monitoring unit detects that the tail delay of the task executed on the DWRR scheduling core unit is lower than a second tail delay threshold value, the scheduling core selecting unit allocates the task to the FCFS scheduling core unit for execution.
In this application, when the monitoring unit detects that the tail delay of a task executed on the DWRR scheduling core unit falls below the second tail delay threshold, the task is a low-variance task; the scheduling core selection unit pops it from the DWRR scheduling queue and sends it back to the FCFS scheduling core unit for execution.
Furthermore, an average request delay threshold is preset on the FCFS scheduling core unit. When the monitoring unit detects that the average request delay of the FCFS scheduling core unit's work exceeds this threshold, requests are queuing on the intelligent network card and some task requests are not reaching the FCFS scheduling core unit; the task request with the highest load ratio on the intelligent network card can then be migrated to the host server for execution through the host communication module. When the average request delay falls below the threshold, the FCFS scheduling core unit is not saturated, and the intelligent network card can pull some task requests from the host server through the host communication module and execute them locally. These steps balance load between the host server and the intelligent network card; to minimize the synchronization cost between the two, a dedicated scheduling core can be set aside in the FCFS scheduling core unit to run this scheduling task.
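The host/NIC load-balancing rule above can be sketched as follows; the return values and the `load_ratio` field are illustrative:

```python
def balance_load(avg_request_delay, avg_delay_threshold, nic_tasks):
    """Sketch of the migration rule: push the heaviest task back to the host
    when the NIC is queuing, pull work in when the NIC has slack."""
    if avg_request_delay > avg_delay_threshold and nic_tasks:
        # Queuing on the NIC: migrate the task with the highest load ratio.
        heaviest = max(nic_tasks, key=lambda t: t["load_ratio"])
        return ("migrate_to_host", heaviest["name"])
    if avg_request_delay < avg_delay_threshold:
        return "pull_from_host"  # FCFS core not saturated: take on more work
    return "steady"

decision = balance_load(8.0, 5.0,
                        [{"name": "x", "load_ratio": 0.7},
                         {"name": "y", "load_ratio": 0.2}])
print(decision)
```

Evicting only the single heaviest task per decision keeps migrations cheap and avoids oscillating the whole workload between host and NIC.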
It should be noted that the average request delay threshold is the average request delay over all the computing cores on the intelligent network card while they process different tasks.
With the task scheduling method based on the SOC intelligent network card, after the card obtains a task request from the host server it sends the request to the FCFS scheduling core unit, which efficiently schedules tasks with low service-time dispersion on a first-come, first-served basis. When a request is detected to belong to a task with widely varying service time, the card evicts it from the FCFS scheduling core unit and assigns it to the DWRR scheduling core unit, which executes the tasks in the DWRR scheduling queue efficiently according to the deficit weighted round robin scheduling algorithm. This hybrid scheduling mechanism combining the FCFS and DWRR algorithms provides efficient computing support for an intelligent network card shared by multiple tenants and applications: it schedules diverse tasks so that applications are offloaded dynamically and adaptively, maximizes the card's resource utilization, reduces tail response delay, and eases the interaction between the host server and the intelligent network card.
While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. An SOC intelligent network card, comprising:
the host communication module is used for communicating with the host server to acquire the task request;
the hybrid scheduling module comprises an FCFS scheduling core unit, a DWRR scheduling core unit, a monitoring unit and a scheduling core selection unit; the FCFS scheduling core unit is used for receiving and executing the task request; the DWRR scheduling core unit is used for receiving a task from the FCFS scheduling core unit and generating a DWRR scheduling queue when the tail delay of the task executed on the FCFS scheduling core unit exceeds a first tail delay threshold; the monitoring unit is configured to detect whether the tail delay of a task executing on the FCFS scheduling core unit exceeds the first tail delay threshold and whether the tail delay of a task executing on the DWRR scheduling core unit falls below a second tail delay threshold; the scheduling core selection unit is configured to move tasks whose tail delay on the FCFS scheduling core unit exceeds the first tail delay threshold to the DWRR scheduling core unit, and tasks whose tail delay on the DWRR scheduling core unit falls below the second tail delay threshold to the FCFS scheduling core unit.
2. The SOC intelligent network card according to claim 1, wherein an average request delay threshold is preset on the FCFS scheduling core unit, and when the monitoring unit detects that the average request delay of the FCFS scheduling core unit exceeds the average request delay threshold, the host communication module migrates the task with the highest load ratio on the intelligent network card to the host server; when the monitoring unit detects that the average request delay of the FCFS scheduling core unit falls below the average request delay threshold, some load is pulled from the host server to the intelligent network card through the host communication module.
3. The SOC intelligent network card according to claim 1, wherein a deficit delay threshold is preset on the DWRR scheduling core unit, and when the monitoring unit detects that the deficit counter of a task in the DWRR scheduling queue is greater than the deficit delay threshold, that task is run preferentially.
4. A task scheduling method applied to the SOC intelligent network card, the SOC intelligent network card comprising a host communication module and a hybrid scheduling module, the hybrid scheduling module comprising an FCFS scheduling core unit, a DWRR scheduling core unit, a monitoring unit and a scheduling core selection unit, the task scheduling method comprising the following steps:
the host communication module acquires a task request from a host server;
the host communication module sends the task request to the FCFS scheduling core unit, wherein a first tail delay threshold is preset on the FCFS scheduling core unit;
when the monitoring unit detects that the tail delay of a task executed on the FCFS scheduling core unit exceeds the first tail delay threshold, the scheduling core selection unit allocates the task to the DWRR scheduling core unit for execution and generates a DWRR scheduling queue on the DWRR scheduling core unit, wherein a second tail delay threshold is preset on the DWRR scheduling core unit;
when the monitoring unit detects that the tail delay of a task executed on the DWRR scheduling core unit falls below the second tail delay threshold, the scheduling core selection unit allocates the task back to the FCFS scheduling core unit for execution.
5. The task scheduling method according to claim 4, wherein an average request delay threshold is preset on the FCFS scheduling core unit; when the monitoring unit detects that the average request delay of the FCFS scheduling core unit is greater than the average request delay threshold, the host communication module migrates the load with the highest load ratio on the intelligent network card to the host server; when the monitoring unit detects that the average request delay of the FCFS scheduling core unit is less than the average request delay threshold, part of the load is pulled from the host server onto the intelligent network card through the host communication module.
6. The task scheduling method according to claim 4, wherein a deficit delay threshold is preset on the DWRR scheduling core unit, and when the monitoring unit detects that the deficit counter of a task in the DWRR scheduling queue is greater than the deficit delay threshold, that task is run preferentially.
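The deficit-counter priority rule of claims 3 and 6 can be illustrated with a minimal DWRR queue. This is a sketch under stated assumptions: the class name, the quantum/cost units, and the simplification of serving a whole task once its accumulated credit covers its cost are all illustrative, not the patent's implementation.

```python
from collections import deque

class DwrrQueue:
    """Minimal deficit weighted round robin queue with a priority rule:
    a task whose deficit counter exceeds the deficit threshold runs first."""

    def __init__(self, quantum, deficit_threshold):
        self.quantum = quantum                    # credit granted per round
        self.deficit_threshold = deficit_threshold
        self.tasks = deque()                      # [name, cost, deficit_counter]

    def add(self, name, cost):
        self.tasks.append([name, cost, 0])

    def pick(self):
        """Return the next task name to run, or None if the queue is empty."""
        if not self.tasks:
            return None
        # Priority rule from the claims: a task whose deficit counter has
        # grown past the threshold is run preferentially, ahead of the
        # normal round-robin order.
        for task in self.tasks:
            if task[2] > self.deficit_threshold:
                self.tasks.remove(task)
                return task[0]
        # Otherwise perform a normal DWRR round: grant each task a quantum
        # of credit and run it once its credit covers its cost.
        while True:
            task = self.tasks[0]
            task[2] += self.quantum
            if task[2] >= task[1]:
                self.tasks.popleft()
                return task[0]
            self.tasks.rotate(-1)   # not enough credit yet; try the next task
```

The deficit threshold acts as a starvation guard: a task that keeps accumulating credit without being served eventually crosses the threshold and jumps the round-robin order.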
CN202111281620.3A 2021-11-01 2021-11-01 SOC intelligent network card and task scheduling method Pending CN114124589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111281620.3A CN114124589A (en) 2021-11-01 2021-11-01 SOC intelligent network card and task scheduling method

Publications (1)

Publication Number Publication Date
CN114124589A true CN114124589A (en) 2022-03-01

Family

ID=80380261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111281620.3A Pending CN114124589A (en) 2021-11-01 2021-11-01 SOC intelligent network card and task scheduling method

Country Status (1)

Country Link
CN (1) CN114124589A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180246766A1 (en) * 2016-09-02 2018-08-30 Telefonaktiebolaget Lm Ericsson (Publ) Systems and Methods of Managing Computational Resources
CN113079107A (en) * 2021-04-02 2021-07-06 无锡职业技术学院 Dynamic priority congestion control method for programmable switching network
CN113157447A (en) * 2021-04-13 2021-07-23 中南大学 RPC load balancing method based on intelligent network card

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ming Liu, "Offloading Distributed Applications onto SmartNICs using iPipe", Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM '19), pages 1-8 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20220301