WO2018018611A1 - A task processing method and network card - Google Patents

A task processing method and network card

Info

Publication number
WO2018018611A1
Authority
WO
WIPO (PCT)
Prior art keywords
thread
qth
jth
task processing
message
Application number
PCT/CN2016/092316
Other languages
English (en)
French (fr)
Inventor
吉辛•维克多
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to PCT/CN2016/092316 (WO2018018611A1)
Priority to CN201680002876.7A (CN107077390B)
Priority to CN202110713436.5A (CN113504985B)
Priority to CN202110711393.7A (CN113504984A)
Publication of WO2018018611A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5044 - Allocation of resources to service a request, the resource being a machine, considering hardware capabilities
    • G06F 9/54 - Interprogram communication
    • G06F 9/546 - Message passing systems or structures, e.g. queues
    • G06F 2209/00 - Indexing scheme relating to G06F 9/00
    • G06F 2209/50 - Indexing scheme relating to G06F 9/50
    • G06F 2209/5018 - Thread allocation

Definitions

  • the present application relates to the field of communications, and in particular, to a task processing method and a network card.
  • a service in an Ethernet network can include multiple tasks.
  • the TCP offload engine (TOE) service includes receiving tasks and sending tasks.
  • the task processing of the existing Ethernet (English: ethernet) network is generally implemented by the server, and the server accesses the Ethernet network through a switch; please refer to FIG. 1.
  • a network card is inserted into the server to process data exchange between the server and the switch.
  • the network device triggers network I/O interrupts during the process of sending and receiving data packets, so the server responds to a large number of I/O interrupt signals while it is working.
  • for example, even if a network task sends transmission control protocol (English: transmission control protocol, abbreviation: TCP) data at a rate of only 64 Kbps, simply encapsulating the data into Ethernet packets and responding to the network's acknowledgment signals triggers more than 60 I/O interrupts per second between the server and the network card. A large amount of interrupt processing takes up considerable computing resources of the server and lowers the overall performance of the network.
  • therefore, the related processing of the protocol stack is offloaded from the server side to the network card to relieve the server's computing resources and reduce the data interaction between the server and the network card, thereby improving the performance of the network.
  • some NICs at this stage can already support RDMA over converged Ethernet (English: RDMA over converged ethernet, abbreviation: RoCE), Fibre Channel over Ethernet (English: fibre channel over ethernet, abbreviation: FCoE), and the like.
  • the application provides a task processing method and a network card for improving the task processing performance of the network card.
  • the first aspect of the present application provides a task processing method, which is applicable to a network card for task processing.
  • the task processing is divided into N stages according to the execution order, which are the first stage, the second stage, ... the Nth stage.
  • the network card includes the processor and network card memory. Multiple threads run in the processor, and these threads logically constitute the resource pool of the processor.
  • the NIC obtains the P packets to be processed, and determines the thread corresponding to each of the P packets from the resource pool of the processor.
  • the NIC sequentially performs N stages of task processing for each message through the thread corresponding to each message, and obtains the task processing result of the Nth stage of each message.
  • the NIC memory includes context information of task processing, where the context information includes N information blocks corresponding to the N stages, which are, in order, a first information block, a second information block, ... an Nth information block.
  • the i-th information block includes context information required to perform the task processing of the i-th stage, 1 ≤ i ≤ N.
  • the Qth message of the P messages corresponds to the Qth thread, where Q is any positive integer not greater than P. For example, the first message corresponds to the first thread, and the second message corresponds to the second thread.
  • to perform the jth-stage task processing on the Qth message, the network card loads the jth information block for the Qth thread, and then, through the Qth thread, performs the jth-stage task processing on the Qth message according to the jth information block and the processing result of the (j-1)th stage of the Qth message, obtaining the processing result of the jth stage of the Qth message, where j sequentially traverses the integers in [1, N].
  • the processing result of the 0th stage of the Qth message is the Qth message itself (a code sketch of this per-stage loop follows below).
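The loop described in the preceding bullets can be pictured with a minimal C sketch. This is an illustration only, not code from the application: the types `msg_t` and `ctx_block_t`, the handler `run_stage()`, and the choice of N = 3 are all assumptions made here for readability.

```c
#include <stddef.h>

#define N_STAGES 3 /* assumed stage count for illustration */

typedef struct { unsigned char data[2048]; size_t len; } msg_t; /* hypothetical packet */
typedef struct { const void *addr; size_t len; } ctx_block_t;   /* hypothetical info block */

/* Placeholder stage handler: consumes the result of stage j-1 plus the
 * j-th information block and leaves the result of stage j in *result. */
static void run_stage(int j, const ctx_block_t *blk, msg_t *result) {
    (void)j; (void)blk; (void)result; /* real per-stage logic would go here */
}

/* One thread carries one message through all N stages in order;
 * the "processing result of the 0th stage" is the message itself. */
void process_message(msg_t *msg, const ctx_block_t blocks[N_STAGES]) {
    for (int j = 1; j <= N_STAGES; j++)
        run_stage(j, &blocks[j - 1], msg); /* stage j reads the stage j-1 result in place */
    /* msg now holds the task processing result of the N-th stage */
}
```

Because the same thread keeps the intermediate result, nothing has to be copied between threads between stages; that is the central point of this first aspect.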
  • for example, the P packets include a first packet and a second packet, where the first packet corresponds to the first thread and the second packet corresponds to the second thread.
  • after the first thread finishes the jth-stage task processing of the first message, the network card loads the jth information block for the second thread and performs the jth-stage task processing of the second message through the second thread. This enables multiple threads to be scheduled in stages, avoiding read and write conflicts when different threads access the context information.
  • when the Qth thread performs the jth-stage task processing on the Qth message, the network card can lock the jth information block so that the jth information block cannot be accessed by other threads, preventing other threads from accessing the jth information block at the same time as the Qth thread and causing read/write conflicts.
  • after the Qth thread finishes the jth-stage task processing of the Qth message, the network card unlocks the jth information block so that the jth information block can be accessed by other threads.
  • if j < N at this point, the network card then continues to lock the (j+1)th information block for the Qth thread.
  • after the Qth thread finishes the jth-stage task processing of the Qth message, the Qth thread may be suspended to save power.
  • after the network card loads the (j+1)th information block for the Qth thread, the network card wakes up the Qth thread to perform the (j+1)th-stage task processing on the Qth message.
  • after obtaining the P packets to be processed, the network card may further accelerate the P packets to obtain the accelerated P packets.
  • after determining the threads corresponding to the P packets, the network card sends the accelerated P packets to the corresponding threads.
  • the NIC memory may further include a global configuration table, where the address information of the N information blocks is recorded.
  • the network card can obtain the jth information block according to the record of the global configuration table.
  • when the task processing is updated from the original N stages to M new stages, the context information is correspondingly re-divided from the N information blocks into M new information blocks. The network card may receive a modification instruction, where the modification instruction is used to modify the address information of the N information blocks recorded in the global configuration table into the address information of the M new information blocks, and the kth new information block includes the context information required to perform the task processing of the kth new stage, 1 ≤ k ≤ M.
  • the task program for the task processing is saved as an executable file in the network card memory, and the executable file includes N program segments corresponding to the N stages of the task processing, which are respectively the first program segment, the second program segment, ... the Nth program segment.
  • the i-th program segment includes program instructions for performing the task processing of the i-th stage.
  • if the network card is to perform the jth-stage task processing on the Qth message through the Qth thread, the network card loads the jth program segment for the Qth thread and adjusts the pointer of the Qth thread to point to the jth program segment. Then, through the Qth thread, the network card executes the jth program segment according to the jth information block and the processing result of the (j-1)th stage of the Qth message, that is, implements the jth-stage task processing (modeled in the sketch below).
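One way to picture "adjusting the thread's pointer to point to the jth program segment" is an array of entry points into the executable, one per segment. The sketch below makes that assumption; the function names are invented here, not taken from the application.

```c
/* N program segments of one executable file, modeled as entry points. */
typedef void (*stage_fn)(const void *ctx_block, void *result);

static void segment1(const void *c, void *r) { (void)c; (void)r; /* stage-1 instructions */ }
static void segment2(const void *c, void *r) { (void)c; (void)r; /* stage-2 instructions */ }
static void segment3(const void *c, void *r) { (void)c; (void)r; /* stage-3 instructions */ }

static const stage_fn segments[] = { segment1, segment2, segment3 }; /* N = 3 assumed */

void execute_stage(int j, const void *ctx_block, void *result) {
    stage_fn pc = segments[j - 1]; /* "point the thread" at segment j */
    pc(ctx_block, result);         /* run segment j in the same thread */
}
```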
  • the second aspect of the present application provides a network card for performing task processing on a packet in a network.
  • the network card includes a processor and a network card memory.
  • the task processing is divided into N stages according to the execution order, which are the first stage, the second stage, ... the Nth stage.
  • a plurality of threads are running in the processor, and the multiple threads logically constitute a resource pool of the processor.
  • the processor executes the program saved in the NIC memory by running an internal thread to implement the task processing method provided by the first aspect of the present application.
  • the third aspect of the present application provides a task processing method, which is applicable to a network card for task processing.
  • the task processing is divided into N stages according to the execution order, which are the first stage, the second stage, ... the Nth stage.
  • the network card includes the processor, network card memory, scheduler, task interface, and bus. Multiple threads run in the processor, and these threads logically constitute the resource pool of the processor.
  • the task interface receives the P packets to be processed, and the scheduler determines the thread corresponding to the P packets from the resource pool of the processor, and loads the P packets into the corresponding thread.
  • the processor sequentially performs N stages of task processing for each message through the thread corresponding to each message, and obtains the task processing result of the Nth stage of each message.
  • the NIC memory includes context information of task processing, where the context information includes N information blocks corresponding to the N stages, which are, in order, a first information block, a second information block, ... an Nth information block.
  • the i-th information block includes context information required to perform the task processing of the i-th stage, 1 ≤ i ≤ N.
  • the Qth message of the P messages corresponds to the Qth thread, where Q is any positive integer not greater than P. For example, the first message corresponds to the first thread, and the second message corresponds to the second thread.
  • before the Qth thread performs the jth-stage task processing, the scheduler loads the jth information block for the Qth thread, and the processor, through the Qth thread, performs the jth-stage task processing on the Qth message according to the jth information block and the processing result of the (j-1)th stage of the Qth message, obtaining the processing result of the jth stage of the Qth message, where j sequentially traverses the integers in [1, N].
  • the processing result of the 0th stage of the Qth message is the Qth message.
  • for example, the P packets include a first packet and a second packet, where the first packet corresponds to the first thread and the second packet corresponds to the second thread.
  • after the first thread executes the jth-stage task processing of the first message, the scheduler loads the jth information block for the second thread.
  • the processor waits until the first thread has finished the jth-stage task processing of the first message and then performs the jth-stage task processing of the second message through the second thread. This enables multiple threads to be scheduled in stages, avoiding read and write conflicts when different threads access the context information.
  • when the processor performs the jth-stage task processing on the Qth message through the Qth thread, the scheduler may lock the jth information block for the Qth thread so that the jth information block cannot be accessed by other threads, preventing other threads from accessing the jth information block at the same time as the Qth thread and causing read/write conflicts. After the processor finishes the jth-stage task processing of the Qth message through the Qth thread, the scheduler unlocks the jth information block so that the jth information block can be accessed by other threads.
  • after the scheduler unlocks the jth information block for the Qth thread, if the current j < N, the scheduler does not need to wait for the Qth thread to issue an instruction to lock the (j+1)th information block, and automatically locks the (j+1)th information block for the Qth thread, reducing instruction interaction between the thread and the scheduler.
  • after the Qth thread finishes the jth-stage task processing, the scheduler may temporarily suspend the Qth thread to save power. After the scheduler loads the (j+1)th information block for the Qth thread, the scheduler wakes up the Qth thread to continue the (j+1)th-stage task processing (see the condition-variable sketch below).
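A plausible shape for this suspend/wake hand-off, sketched with POSIX threads. The scheduler in the application is a hardware circuit, so this is only a software analogy, and all names here are invented.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  block_ready = PTHREAD_COND_INITIALIZER;
static bool next_block_loaded = false;

/* Thread side: after finishing stage j, sleep until block j+1 is in place. */
void thread_wait_for_next_block(void) {
    pthread_mutex_lock(&m);
    while (!next_block_loaded)               /* suspended: consumes no cycles */
        pthread_cond_wait(&block_ready, &m);
    next_block_loaded = false;
    pthread_mutex_unlock(&m);
}

/* Scheduler side: after loading information block j+1, wake the thread. */
void scheduler_wake_thread(void) {
    pthread_mutex_lock(&m);
    next_block_loaded = true;
    pthread_cond_signal(&block_ready);
    pthread_mutex_unlock(&m);
}
```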
  • an accelerator is also included in the network card. After receiving the P packets to be processed, the accelerator accelerates the P packets to obtain the accelerated P packets.
  • the scheduler mentioned above loads the Qth message for the Qth thread, which means that the scheduler loads the accelerated Qth message for the Qth thread.
  • correspondingly, the processing result of the 0th stage of the Qth packet mentioned above is, specifically, the accelerated Qth packet.
  • the acceleration operations on the message are transferred to the accelerator for processing, so that the processor does not need to perform acceleration operations on the message; this simplifies the function of the processor, so that the processor does not need an additionally customized acceleration engine, thereby reducing the cost of the network card.
  • the acceleration operations performed by the accelerator include cyclic redundancy check (English: cyclic redundancy check, abbreviation: CRC), IP checksum (English: checksum), packet parsing (English: packet parse), packet editing (English: packet edit), table lookup, and the like; as one concrete example, an IP checksum computation is sketched below.
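As one concrete example of such an acceleration operation, here is the standard Internet checksum (RFC 1071) that a checksum unit would compute over an IP header. This is the well-known public algorithm, shown in C for reference; it is not circuitry from the application.

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 Internet checksum over an arbitrary buffer. */
uint16_t ip_checksum(const void *buf, size_t len) {
    const uint8_t *p = buf;
    uint32_t sum = 0;
    while (len > 1) {                        /* sum 16-bit big-endian words */
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                            /* pad a trailing odd byte */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)                        /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;                   /* one's complement of the sum */
}
```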
  • the NIC memory may further include a global configuration table, where the address information of the N information blocks is recorded.
  • the scheduler can load the jth information block for the Qth thread according to the record of the global configuration table.
  • when the task processing is updated from the original N stages to M new stages, the context information is correspondingly re-divided from the N information blocks into M new information blocks. The task interface may receive a modification instruction, where the modification instruction is used to modify the address information of the N information blocks recorded in the global configuration table into the address information of the M new information blocks, and the kth new information block includes the context information required to perform the task processing of the kth new stage, 1 ≤ k ≤ M.
  • the task program for the task processing is saved as an executable file in the network card memory, and the executable file includes N program segments corresponding to the N stages of the task processing, which are respectively the first program segment, the second program segment, ... the Nth program segment.
  • the i-th program segment includes the program instructions for performing the task processing of the i-th stage. If the processor currently needs to perform the jth-stage task processing on the Qth message through the Qth thread, the processor loads the jth program segment for the Qth thread and adjusts the pointer of the Qth thread to point to the jth program segment. The processor then runs the Qth thread to execute the jth program segment according to the jth information block and the processing result of the (j-1)th stage of the Qth message, performing the jth-stage task processing on the Qth message.
  • a fourth aspect of the present application provides a network card for performing task processing on a packet in a network.
  • the network card includes a processor, a network card memory, a scheduler, a task interface, and a bus, and the task processing is divided into N stages according to the execution order, which are the first stage, the second stage, ... the Nth stage.
  • a plurality of threads are running in the processor, and the multiple threads logically constitute a resource pool of the processor.
  • the task interface is configured to receive the P packets to be processed, and the scheduler is configured to determine, from the resource pool of the processor, the threads corresponding to the P packets and to load the P packets into the corresponding threads.
  • the processor is configured to sequentially perform N stages of task processing for each message by using a thread corresponding to each message, and obtain a task processing result of the Nth stage of each message.
  • in this way, the network card uses only one thread to perform the complete task processing of a message, so the staged task processing results do not need to be copied between multiple threads, and the entire task program needs to provide only one complete set of function functions. Therefore, the network card provided by the present application has lower task processing overhead, its program occupies less storage space, and it performs better than the prior art.
  • the network card memory is used to store the context information of the task processing, where the context information includes N information blocks corresponding to the N stages, which are, in order, the first information block, the second information block, ... the Nth information block.
  • the i-th information block includes context information required to perform the task processing of the i-th stage, 1 ≤ i ≤ N.
  • the Qth message of the P messages corresponds to the Qth thread, where Q is any positive integer not greater than P.
  • the first message corresponds to the first thread
  • the second message corresponds to the second thread.
  • the scheduler is further configured to load the jth information block for the Qth thread before the Qth thread performs the jth stage task processing on the Qth message.
  • the processor is specifically configured to: perform, through the Qth thread, the jth-stage task processing on the Qth message according to the jth information block and the processing result of the (j-1)th stage of the Qth message, to obtain the processing result of the jth stage of the Qth message.
  • the processing result of the 0th stage of the Qth message is the Qth message.
  • the P packets include a first packet and a second packet, where the first packet corresponds to the first thread.
  • the second message corresponds to the second thread.
  • the scheduler is further configured to: after the first thread executes the jth-stage task processing of the first message, load the jth information block for the second thread. The processor waits until the first thread has finished the jth-stage task processing of the first message and then performs the jth-stage task processing of the second message through the second thread. This enables multiple threads to be scheduled in stages, avoiding read and write conflicts when different threads access the context information.
  • the scheduler is further configured to: when the processor performs the jth-stage task processing of the Qth message through the Qth thread, lock the jth information block for the Qth thread so that the jth information block cannot be accessed by other threads, preventing other threads from accessing the jth information block at the same time as the Qth thread and causing read/write conflicts. After the processor finishes the jth-stage task processing of the Qth message through the Qth thread, the scheduler unlocks the jth information block so that it can be accessed by other threads.
  • after the scheduler unlocks the jth information block for the Qth thread, if the current j < N, the scheduler is further configured to automatically lock the (j+1)th information block for the Qth thread, without waiting for the Qth thread to issue an instruction to lock the (j+1)th information block, reducing instruction interaction between the thread and the scheduler.
  • the scheduler is further configured to temporarily suspend the Qth thread to save power after the Qth thread finishes the jth-stage task processing of the Qth message, and, after loading the (j+1)th information block for the Qth thread, to wake up the Qth thread to continue the (j+1)th-stage task processing.
  • an accelerator is also included in the network card.
  • the accelerator accelerates the P packets after the task interface receives the P packets to be processed, obtaining the accelerated P packets.
  • the scheduler mentioned above being used to load the Qth message for the Qth thread may specifically be the scheduler loading the accelerated Qth message for the Qth thread.
  • correspondingly, the processing result of the 0th stage of the Qth message mentioned above is, specifically, the accelerated Qth message.
  • the network card provided by the application transfers the acceleration operations of the message to the accelerator for processing, so that the processor does not need to perform acceleration operations on the message; this simplifies the function of the processor, so that the processor does not need an additionally customized acceleration engine, thereby reducing the cost of the network card.
  • the accelerator may specifically include one or more of a CRC unit, a checksum unit, a packet parser (English: packet parser, abbreviation: parser), a packet editor (English: packet editor, abbreviation: PE), and a table lookup unit.
  • the CRC unit is configured to perform a CRC check on the first packet, the checksum unit is configured to perform a checksum check on the first packet, the parser is configured to parse the first packet, the PE is configured to edit the first packet, and the table lookup unit is configured to search for a matching entry of the first packet.
  • the NIC memory is further configured to save a global configuration table, where the address information of the N information blocks is recorded.
  • the scheduler is specifically configured to load the jth information block for the Qth thread according to the record of the global configuration table.
  • the task interface is further configured to receive, when the task processing is updated from the original N stages to M new stages, a modification instruction, where the modification instruction is used to modify the address information of the N information blocks recorded in the global configuration table into the address information of the M new information blocks, and the kth new information block includes the context information required to perform the task processing of the kth new stage, 1 ≤ k ≤ M.
  • the network card memory is further used to save the executable file for the task processing, and the executable file includes N program segments corresponding to the N stages of the task processing, which are respectively the first program segment, the second program segment, ... the Nth program segment.
  • the i-th program segment includes program instructions for performing the task processing of the i-th stage.
  • the scheduler is further configured to, when the processor is to perform the jth-stage task processing on the Qth message through the Qth thread, load the jth program segment for the Qth thread and adjust the pointer of the Qth thread to point to the jth program segment, so that the Qth thread can directly start executing the jth program segment.
  • the processor is specifically configured to: execute, through the Qth thread, the jth program segment according to the jth information block and the processing result of the (j-1)th stage of the Qth message, to perform the jth-stage task processing on the Qth message.
  • the network card may further include a direct memory access (English: direct memory access, abbreviation: DMA) module, configured to obtain the context information from the memory of the host connected to the network card and save the context information to the network card memory.
  • the network card may further include a context management module, configured to manage the context information.
  • FIG. 1 is a schematic diagram of a connection relationship between a server, a switch, and an Ethernet;
  • FIG. 2 is a structural diagram of a network card in the prior art;
  • FIG. 3(a) is a schematic diagram of the principle of a task processing method in the prior art;
  • FIG. 3(b) is another schematic diagram of the principle of a task processing method in the prior art;
  • FIG. 4(a) is a flowchart of an embodiment of a task processing method provided by the present application.
  • FIG. 4(b) is a schematic diagram showing the principle of another embodiment of the task processing method provided by the present application.
  • FIG. 5(a) is a structural diagram of an embodiment of a network card provided by the present application.
  • FIG. 5(b) is a structural diagram of another embodiment of a network card provided by the present application.
  • FIG. 6 is a flowchart of another embodiment of a task processing method provided by the present application.
  • the application provides a task processing method, which can improve the task processing performance of the network card.
  • the present application also proposes corresponding network cards, which will be separately described below.
  • at this stage, the Ethernet network generally offloads the relevant task processing of the protocol stack from the server side to the network card to relieve the server's computing resources and improve the performance of the network.
  • the tasks that are offloaded to the network card can be roughly divided into stateful tasks and stateless tasks. This application describes the processing methods of stateful tasks.
  • a stateful task is one in which the packets or data frames of a network task are ordered: subsequent messages or data frames depend on previous ones, and this dependency is usually managed through context (English: context) information. Context information can be used to identify and manage a specific task flow. For example, an internet small computer system interface (English: internet small computer system interface, abbreviation: iSCSI) connection, remote direct memory access queue pairs (English: remote direct memory access queue pairs, abbreviation: RDMA QPs), and other services have ordering requirements for messages during network transmission; therefore, each task in these services uses independent context information to maintain its own state (a hypothetical layout of such per-flow context is sketched below).
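To make "context information" concrete, here is one possible per-flow layout, for example for a TOE TCP flow, sketched in C. Every field name here is an assumption for illustration; the application does not specify the contents of the context.

```c
#include <stdint.h>

/* Illustrative only: a possible per-flow context for a stateful task,
 * grouped the way the stages might consume it. Names are invented. */
struct flow_context {
    /* fields a parsing/validation stage might need */
    uint32_t local_ip, remote_ip;
    uint16_t local_port, remote_port;
    /* fields a protocol-processing stage might need */
    uint32_t rcv_nxt;        /* next expected sequence number */
    uint32_t snd_una;        /* oldest unacknowledged sequence number */
    /* fields a forwarding/completion stage might need */
    uint64_t host_buf_addr;  /* DMA target in host memory */
    uint32_t bytes_pending;
};
```

Dividing such a structure into N information blocks, one group of fields per stage, is exactly the segmentation the following sections describe.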
  • the context information of the task is generally stored in the server.
  • the NIC obtains the context information of the task from the server to the network card memory through DMA.
  • the network card in FIG. 2 mainly includes task interfaces such as a host interface 201 and a network interface 202, a DMA module 203, a network card memory 205, and a processor 206, and the modules are connected by a bus (English: bus).
  • the host interface 201 is the communication interface between the network card and the server host, used for transmitting data or packets between the network card and the server. It is generally a peripheral component interconnect express (English: peripheral component interconnect express, abbreviation: PCIe) interface, but can also be another type of interface, which is not limited here.
  • the network interface 202 is a communication interface between the network card and the Ethernet network, and is generally used for transmitting and receiving Ethernet network packets at the second layer (ie, the data link layer).
  • the DMA module 203 is used by the network card to directly acquire data in the memory of the server host.
  • the DMA module 203 is an optional module, and may be implemented by a hardware circuit as shown in FIG. 2 or integrated in the processor 206.
  • the processor 206 implements the function of the DMA module.
  • when the DMA module is implemented by hardware as shown in FIG. 2, it can be set as a separate module in the network card or in the host interface 201.
  • the DMA module 203 can also be omitted when the network card does not need to acquire data in the server host memory.
  • the network card memory 205 is used for storing data information that the network card needs to use.
  • the network card memory 205 includes at least two memory areas: (1) a program memory area for storing a task program used by the network card; and (2) a data memory area. It is used to store various tables such as hash tables, linear tables, global configuration tables, etc. used by the network card, as well as context information or other data information that the network card needs to use.
  • the NIC memory 205 can be implemented by a volatile storage medium (English: volatile memory), such as a random access memory (English: random-access memory, abbreviation: RAM), or by a non-volatile storage medium (English: non-volatile memory, abbreviation: NVM), such as a read-only memory (English: read-only memory, abbreviation: ROM) or a flash memory (English: flash). The network card memory can also be composed of a combination of the above types of memory, which is not limited here.
  • the processor 206 may be composed of one or more CPUs, each CPU may include one or more cores (English: core), and each core may run one or more threads (English: thread).
  • the processor 206 runs a plurality of threads, which logically constitute a resource pool of the processor 206. This application focuses on the scheduling of each thread in the resource pool.
  • processor 206 also includes a processor cache that is allocated for use by the threads. Specifically, each thread in the resource pool is allocated a part of the processor cache as an instruction cache space (English: instruction cache, abbreviation: ICache), used to temporarily store the program instructions to be executed by the thread, and another part of the processor cache as a data cache space (English: data cache, abbreviation: DCache), used to temporarily store the data to be used by the thread (a struct-level sketch of this per-thread split is given below).
  • the ICache and DCache of each thread are not shown in Figure 2.
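A struct-level sketch of the per-thread split just described; the sizes and field names are assumptions for illustration, not values from the application.

```c
/* Per-thread state in the resource pool; sizes are invented. */
#define ICACHE_BYTES 4096   /* holds the program segment the thread executes */
#define DCACHE_BYTES 8192   /* holds the message, info block, and stage results */

struct nic_thread {
    unsigned char icache[ICACHE_BYTES]; /* instruction cache space (ICache) */
    unsigned char dcache[DCACHE_BYTES]; /* data cache space (DCache) */
    int loaded_segment;                 /* which program segment is resident */
    int busy;                           /* 0 = idle, available in the pool */
};
```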
  • the network card can also include a context management module.
  • the context management module is configured to manage the context information of the task, including one or more of: driving the DMA module 203 to obtain the context information in the host memory, segmenting the context information, and determining the context to be loaded by searching the global configuration table.
  • the context management module is an optional module, and may be implemented by a hardware circuit or integrated into the processor 206 to implement context information management by the processor 206. When the context management module is implemented by hardware, it can be set up as a separate module in the network card or in the processor 206.
  • the context management module can also be omitted if there is no need to manage the context of the task.
  • the network card may further include one or more of: a management processor for controlling the basic management configuration information of the network card; a design for X (English: design for X, abbreviation: DFX) module for the product life cycle; a queue management module for managing the data transmission and reception queues and the command queues of the processor; a phase-locked loop (English: phase locked loop, abbreviation: PLL) for clock phase synchronization; and a timer (English: timer) for task flows.
  • Task processing can often be split into N mutually independent execution stages that can be performed separately (for ease of description, hereinafter referred to as stages). The stages in this application may have other similar names in the art: descriptions of the "segments", "parts", or "subtasks" of a task in the literature of this field may be equivalent to the stages of a task in the present application, and descriptions such as the "section", "stage", "part", "phase", or "period" of a task in English-language documents may likewise be equivalent to the stages in this application.
  • the task program is divided into N program segments in advance according to the different stages, which are sequentially the first program segment, the second program segment, ... the Nth program segment, where N is an integer not less than 2 and i is a positive integer not greater than N.
  • Each block is stored as an executable file in the program memory area of the NIC memory.
  • the NIC obtains the context information of the task from the server through DMA and saves it in the data memory area of the NIC memory.
  • the context information is also divided into N information blocks corresponding to the N stages, which are sequentially a first information block, a second information block, ... an Nth information block.
  • the i-th information block includes context information to be used for performing the task processing of the i-th stage, that is, context information to be used by the i-th program segment. Since some context information may be used by multiple program segments, the N information blocks may have overlapping portions.
  • the processor runs the thread in the resource pool for task processing.
  • the principle is shown in Figure 3(a): specifically, the processor selects a thread in the resource pool as the main thread to schedule the other threads in the resource pool. After determining the to-be-processed packet (which can be an uplink packet or a downlink packet), the main thread allocates an idle thread for each stage of the task processing of the packet. Taking N = 3 as an example: the main thread selects the first thread from the idle threads of the resource pool; through the first thread, the processor loads the to-be-processed message and the first information block into the DCache of the first thread and the first program segment into the ICache of the first thread, and then executes the program in the first thread's ICache according to the message and the first information block to perform the first-stage processing on the message. The main thread then selects an idle second thread; the second thread copies the first-stage processing result from the first thread into its DCache together with the second information block, loads the second program segment into its ICache, and performs the second-stage processing on the packet; the third stage is handled in the same way. In this way, the network card completes the complete task processing flow for the message.
  • the prior art also uses a pipeline (English: pipeline) mechanism to make full use of the computing resources of the network card; the specific principle is shown in Figure 3(b): the next message does not need to wait until the current message has been processed in all stages. After the ith thread finishes processing the ith stage of the current message, the ith thread can directly process the ith stage of the next message. This allows the network card to process multiple packets in parallel, which helps improve task processing efficiency.
  • the processor runs different threads to perform different phases of the task, so the threads need to copy the phased processing results between each other.
  • the second thread needs to copy the processing result of the first phase of the packet to the DCache of the second thread to perform the processing of the second phase of the packet.
  • the third thread needs to copy the processing result of the second stage of the second thread to the DCache of the third thread, so as to perform the processing of the third stage of the message.
  • copying the staged task processing results between threads occupies a large amount of computing resources, causes serious delays, and increases the overhead of task processing.
  • in addition, since each program segment is run by a different thread, each program segment needs to provide a complete set of function functions. This makes the overall task program larger, and it occupies too much storage space in the program memory area.
  • the present application provides a new task processing method and a network card based on the prior art, which will be described in detail below.
  • the task program is also divided into N program segments corresponding to the N stages of the task processing, which are the first program segment, the second program segment, ... the Nth program segment.
  • N is an integer not less than 2
  • i is a positive integer not greater than N.
  • the processor adjusts the pointer of a thread to point to the ith program segment, and the thread can then perform the ith-stage task processing.
  • each program segment is sequentially executed.
  • the context information is also divided into N information blocks corresponding to the N stages, and is sequentially a first information block, a second information block, ... an Nth information block.
  • the i-th information block includes context information to be used for performing the task processing of the i-th stage, that is, context information to be used by the i-th program segment. Since some context information may be used by multiple program segments, the N information blocks may have overlapping portions.
  • the division of the stage may change at any time.
  • the old version may divide the task processing into N stages in the order of execution
  • the new version may divide the task processing into M new stages in the order of execution.
  • the context information is also re-divided correspondingly, that is, divided into M new information blocks, wherein the kth new information block includes the context information required to perform the task processing of the kth new stage, 1 ≤ k ≤ M.
  • the address information of the N information blocks obtained by dividing the context information may be recorded in a global configuration table, and the network card accesses the corresponding ith information block according to the global configuration table when executing the ith program segment; the global configuration table is saved in the data memory area of the NIC memory.
  • the address information of the information block may include an offset and a length of the information block with respect to the context information, and may be other forms, which are not limited herein.
  • the network card may receive a modification instruction sent by the host, where the modification instruction is used to modify address information of the N pieces of information recorded in the global configuration table into address information of the M new information blocks.
  • Table 1 is an example of a global configuration table, where a service number is used to identify a service type of a task, such as a TOE service, a RoCE service, and the like.
  • the task number is used to identify multiple tasks included in a service, such as receiving tasks, sending tasks, and so on.
  • the phase number is used to identify each phase of the task.
  • the offset is used to indicate the offset of the information block corresponding to each phase with respect to the context information
  • the length is used to indicate the length of the information block corresponding to each phase.
  • the network card can determine, according to the service number, task number, and phase number of the current task, the offset and length of the corresponding information block, and use them to obtain the information block (see the lookup sketch below).
  • Table 1 is only used to visually display the logical structure of the global configuration table.
  • the global configuration table may also be other structures or other parameters in the actual application, which is not limited herein.
  • the network card may also determine the information block according to one or two parameters of the service number, the task number, and the phase number, or determine the information block according to other parameters, which is not limited herein.
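Under the structure Table 1 describes, a C sketch of a table entry and its lookup might look as follows. The field and function names are invented here, and a real network card might index or hash rather than scan linearly.

```c
#include <stddef.h>
#include <stdint.h>

/* One global-configuration-table entry: maps (service, task, phase) to the
 * offset and length of the matching information block within the context. */
struct gct_entry {
    uint16_t service_id;  /* service type, e.g. TOE, RoCE */
    uint16_t task_id;     /* task within the service, e.g. receive, send */
    uint16_t phase_id;    /* stage number j */
    uint32_t offset;      /* info block offset relative to the context */
    uint32_t length;      /* info block length */
};

/* Linear scan for illustration only. */
const struct gct_entry *gct_lookup(const struct gct_entry *tbl, size_t n,
                                   uint16_t service, uint16_t task,
                                   uint16_t phase) {
    for (size_t i = 0; i < n; i++)
        if (tbl[i].service_id == service && tbl[i].task_id == task &&
            tbl[i].phase_id == phase)
            return &tbl[i];
    return NULL; /* no information block recorded for this stage */
}
```

Updating from N stages to M new stages then reduces to rewriting these entries, which matches the modification instruction described above.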
  • the network card is used to perform task processing on the received P packets to be processed, and P is a positive integer.
  • the P packets may be received by the network card in batches, or may be received by the network card one by one, which is not limited in this application.
  • the NIC may process the previously received packets first and the later received packets afterwards, or it may not give priority to the previously received packets; this application does not limit the processing order.
  • the network card may process all the P packets in parallel, or may process the remaining unprocessed packets of the P packets after processing one or more of them; this application does not limit this either.
  • the network card performs task processing on each of the P messages by the task processing method described in the embodiment shown in FIG. 4(a), FIG. 4(b) or FIG. 6.
  • the embodiment of the present application introduces the task processing method provided by the present application by taking the first packet received first and the second packet received later as an example.
  • the processing method of the other packets in the P packets is similar to the processing method of the first packet and the second packet, and is not described in the embodiment of the present application.
  • each of the P messages corresponds to one thread in the processor.
  • the thread corresponding to the Qth message in the P messages is represented by the Qth thread, and Q is a positive integer not greater than P.
  • the first packet corresponds to the first thread
  • the second packet corresponds to the second thread.
  • after a target thread finishes its task processing, the network card may again specify that the target thread corresponds to a new packet. Therefore, among the P packets of the present application, the threads corresponding to different packets may be the same or different. That is, if Q takes the values Q1 and Q2 respectively, the Q1th thread and the Q2th thread may be the same thread or different threads, where Q1 and Q2 are positive integers not greater than P and not equal to each other.
  • please refer to FIG. 4(a) for the basic flow of the task processing method provided by this application.
  • the network card shown in FIG. 1 and FIG. 2 may perform this method in operation.
  • the processing of the first packet by the network card is taken as an example for description.
  • the network card obtains the first packet to be processed.
  • the first packet may be an uplink packet or a downlink packet.
  • the first packet can be obtained from the Ethernet interface of the network card, and can be obtained from the server by the host interface of the network card, which is not limited herein.
  • the network card searches the resource pool of the processor for an idle first thread and allocates it to the first packet, and the first thread is responsible for performing the complete task processing procedure for the first packet.
  • the processor of the network card may include multiple CPUs, and one of the CPUs performs the operations of this step 402 as the main CPU.
  • the network card processor resource pool includes multiple threads, and one of the threads performs the operations of this step 402 as a main thread.
  • the NIC can obtain the context information of the task from the server through the DMA module, and save the context information in the NIC memory.
  • step 403 may also be located before step 402 or even step 401.
  • step 403 may be omitted.
  • the task flow preparation is completed.
  • the processor then runs the first thread to perform the N stages of the task processing in sequence for the first message. Specifically, the processor runs the first thread to perform the jth-stage task processing on the first message according to the jth information block and the processing result of the (j-1)th stage of the first packet, obtaining the processing result of the jth stage of the first message, where j is a positive integer not greater than N.
  • when j = N, the first thread completes the task processing of the first packet and obtains the processing result of the Nth stage of the first packet, that is, the final task processing result of the first packet.
  • when j = 1, the first thread needs to use the processing result of the 0th stage of the first packet; the 0th stage can be understood as the first packet not yet having been processed, so the processing result of the 0th stage of the first packet is the first packet itself.
  • specifically, when j = 1, the first thread loads the first packet and the first information block into the DCache of the first thread and the first program segment into the ICache of the first thread, then executes the first program segment according to the first packet and the first information block to perform the first-stage task processing on the first packet; the processing result of the first stage of the first packet is temporarily stored in the DCache.
  • when j > 1, the first thread loads the jth information block into the DCache of the first thread and the jth program segment into the ICache of the first thread, then executes the jth program segment according to the processing result of the (j-1)th stage of the first packet and the jth information block to perform the jth-stage task processing on the first message; the processing result of the jth stage of the first message is temporarily stored in the DCache. Then, if j < N, j is increased by 1 and the steps described in this paragraph are performed again.
  • the first thread can directly use the processing result of the j-1th phase of the first packet in the DCache of the first thread when performing the task processing of the jth phase, without copying from other threads.
  • the first thread can be released as an idle thread to the resource pool, so that the first thread can process the subsequent received packet of the network card.
  • the NIC may forward the processing result of the first packet to the Ethernet through the network interface according to a predetermined forwarding path, or forward the packet to the server through the host interface.
  • in the prior art, the task program is divided into N program segments, each of which is run by a separate thread, so each program segment is stored as a separate executable file in the network card memory.
  • for example, if the task flow is originally divided into three stages, the task program is originally divided into three executable files stored in the network card memory. If the user wants to refine the task flow into four stages to increase the throughput of the task, the original three executable files need to be re-divided into four executable files, which involves modifying three executable files. The workload is large and the flexibility is poor, which is not conducive to the development of the task program.
  • in the present application, one thread performs the complete task processing flow, so the entire task program can be saved as a single executable file in the program memory area of the network card memory. Since the task program is a single executable file, only one executable file needs to be modified when the task processing flow is improved; the executable file data involved is small, the modification workload is small, and the flexibility is high, which is beneficial to the development of the task program.
  • the first thread may also load multiple or even all the program segments into the ICache at one time, and then execute the program segments step by step through the pointer.
  • the processor allocates the first thread to process the first packet. If the network card acquires the second packet to be processed, the processor allocates the idle second thread to process the second packet. If the network card acquires the third message to be processed again, the processor allocates an idle third thread for processing, and so on.
  • the specific processing method of a single thread is similar to the embodiment shown in FIG. 4(a) and is not described here. The processor needs to use the jth information block when performing the jth-stage task processing through the first thread, and the jth information block may be rewritten in the process; to avoid data read/write conflicts, other threads should be prevented from accessing the jth information block at this time. For example, if the second thread is also about to perform jth-stage task processing, the processor may temporarily suspend the second thread and wait for the first thread to finish the jth-stage task processing of the first message; the second thread then reloads the jth information block and performs the jth-stage task processing on the second message according to the jth information block and the processing result of the (j-1)th stage of the second message.
  • the rest of the threads can also be scheduled in a similar way, so I won't go into details here.
  • the network card may lock the jth information block to ensure that the jth information block cannot be accessed by other threads.
  • the specific locking mode may be that the flag bit of the jth information block is inverted, or may be other locking modes, which is not limited herein.
  • the network card locks the jth information block for the first thread.
  • the second thread is to perform the task processing of the jth stage for the second message, but since the jth information block has been locked, the second thread cannot obtain the jth information block, and the network card temporarily suspends the second thread.
  • the network card unlocks the jth information block. Then the network card loads the jth information block for the second thread, and wakes up the second thread to perform the task processing of the jth stage for the second message.
  • in addition, after unlocking the jth information block, the network card may automatically lock the (j+1)th information block for the first thread (a lock hand-off sketch follows below).
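A POSIX-threads sketch of this per-block locking. One deliberate simplification: here block j+1 is taken before block j is released (hand-over-hand), so a later message can never overtake an earlier one into the next stage, whereas the application describes unlocking block j and then locking block j+1. All names are invented; the real mechanism is in the network card, not pthreads.

```c
#include <pthread.h>

#define N_STAGES 3
static pthread_mutex_t block_lock[N_STAGES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static void run_stage(int j, void *msg) { (void)j; (void)msg; /* placeholder stage body */ }

/* Each worker carries one message through all stages. Locks are always
 * taken in increasing index order, so there is no deadlock, and two
 * threads can never hold (and thus never rewrite) the same information
 * block at the same time. */
void *worker(void *msg) {
    pthread_mutex_lock(&block_lock[0]);           /* lock information block 1 */
    for (int j = 1; j <= N_STAGES; j++) {
        run_stage(j, msg);                        /* stage j, under lock j */
        if (j < N_STAGES)
            pthread_mutex_lock(&block_lock[j]);   /* take block j+1 first ... */
        pthread_mutex_unlock(&block_lock[j - 1]); /* ... then release block j */
    }
    return msg;
}
```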
  • the application can use the network card 200 shown in FIG. 2 to implement the task processing method shown in FIG. 4(a) and FIG. 4(b).
  • the task program is saved in the program memory area of the network card memory 205, and the context information and the global configuration table are stored in the data memory area of the network card memory 205.
  • the steps described in FIG. 4(a) and FIG. 4(b) are performed by the processor 206.
  • for the specific operation mode of the network card, reference may be made to the related descriptions of the method embodiments shown in FIG. 4(a) and FIG. 4(b), and details are not described here.
  • the task processing methods shown in FIG. 4(a) and FIG. 4(b) are mainly performed by a processor in the network card at the software level. Because of the programmability of the processor, using a processor to handle tasks gives a high degree of flexibility. But a processor is expensive and consumes a lot of energy, so the performance it achieves relative to its cost is not quite satisfactory. In contrast, hardware circuits tend to be faster, consume less power, and cost less, so they have a higher price/performance ratio than processors.
  • the present application improves the existing network card to combine the advantages of software and hardware, and enhances the performance of the network card while retaining the flexibility of the network card.
  • please refer to FIG. 5(a): the network card provided by the present application includes task interfaces such as a host interface 501 and a network interface 502, a network card memory 505, and a processor 506, as in the existing network card, and a scheduler (English: scheduler) 508 has been added.
  • the functions of the host interface 501, the network interface 502, and the network card memory 505 are basically the same as those of the existing network card. For details, refer to the description of the network card shown in FIG. 2.
  • the processor 506 and the scheduler 508 are mainly described below.
  • the present application sets a scheduler 508 in the network card.
  • the scheduler 508 is constructed from hardware circuitry and is used to coordinate the interaction between the accelerator 507, the processor 506, and the other modules of the network card. Specifically, the scheduler 508 is configured to, after the task interface (such as the host interface 501 or the network interface 502) receives the first packet, determine the first thread for processing the first packet and load the first packet for the first thread.
  • the processor 506 sequentially performs N stages of task processing on the first message through the first thread.
  • the scheduler 508 is further configured to load the jth information block for the first thread before the processor 506 runs the first thread to perform the jth-stage task processing on the first message.
  • the scheduler 508 is further configured to: after receiving the second packet, the task interface determines a second thread for processing the second packet, and loads the second packet for the second thread.
  • the jth information block is loaded for the second thread before the processor runs the second thread to perform the jth-stage task processing on the second message.
  • the scheduler 508 waits for the first thread to execute the task processing of the jth phase of the first packet, and then loads the jth information block for the second thread.
  • the scheduler 508 locks the jth information block for the first thread when the processor runs the first thread to perform the jth-stage task processing on the first message, so that the jth information block cannot be accessed by threads other than the first thread. After the first thread finishes the jth-stage task processing of the first message, the scheduler 508 unlocks the jth information block so that it can be accessed by any thread.
  • further, after unlocking the jth information block, the scheduler 508 can automatically lock the (j+1)th information block for the first thread without waiting for the first thread to issue an instruction to lock the (j+1)th information block.
  • after the processor finishes executing the jth-stage task processing of the first message through the first thread, the first thread may be temporarily suspended; after the scheduler 508 loads the (j+1)th information block for the first thread, the first thread is woken up.
  • the network card memory 505 further includes a global configuration table, configured to record address information of the N information blocks.
  • the scheduler 508 loads the jth information block for the first thread according to the address information of the jth information block in the global configuration table.
  • Optionally, the program instructions for task processing are stored in the network card memory 505 as a single complete executable file, where the executable file includes N program segments, and the i-th program segment includes the program instructions for performing the i-th stage of task processing. Before the processor performs the jth-stage task processing on the first packet through the first thread, the scheduler 508 is further configured to load the jth program segment for the first thread and adjust the pointer of the first thread to point to the jth program segment, enabling the first thread to execute the jth program segment.
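  • In software terms, repointing a thread at the jth program segment behaves like indexing into a jump table. The sketch below is a hypothetical analogy only; `PROGRAM_SEGMENTS` and the toy stage bodies are invented, and the real mechanism operates on the thread's ICache and instruction pointer:

```python
# The single executable's N program segments, modeled as a jump table.
# "Adjusting the thread's pointer to the jth program segment" corresponds
# to selecting the jth entry before running the thread.
def stage_1(block: bytes, prev: bytes) -> bytes: return prev + b"|s1"
def stage_2(block: bytes, prev: bytes) -> bytes: return prev + b"|s2"
def stage_3(block: bytes, prev: bytes) -> bytes: return prev + b"|s3"

PROGRAM_SEGMENTS = [stage_1, stage_2, stage_3]   # N = 3

def run_stage(j: int, info_block: bytes, prev_result: bytes) -> bytes:
    # j is 1-based, as in the text; segment j consumes the jth information
    # block and the stage j-1 result, producing the stage j result.
    return PROGRAM_SEGMENTS[j - 1](info_block, prev_result)

print(run_stage(2, b"", b"pkt"))   # -> b'pkt|s2'
```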
  • The processor 506 in the present application still includes a resource pool composed of multiple threads; for details, refer to the related description of the embodiment shown in FIG. 2, which is not repeated here.
  • The processor 506 is mainly configured to run the first thread to sequentially perform the N stages of task processing on the first packet. Specifically, the processor 506 runs the first thread to execute the following step in a loop, so that j traverses the integers in [1, N], finally obtaining the Nth-stage task processing result of the first packet: perform the jth-stage task processing on the first packet according to the jth information block and the processing result of the j-1th stage of the first packet, obtaining the processing result of the jth stage of the first packet. The processing result of the 0th stage of the first packet is the first packet itself.
  • Optionally, if the task interface receives the second packet, the processor 506 is further configured to execute the following step in a loop, so that j traverses the integers in [1, N], finally obtaining the Nth-stage task processing result of the second packet: perform the jth-stage task processing on the second packet according to the jth information block and the processing result of the j-1th stage of the second packet, obtaining the processing result of the jth stage of the second packet. The processing result of the 0th stage of the second packet is the second packet itself.
  • Optionally, the processor 506 may wait until the first thread has finished the jth-stage task processing of the first packet before performing task processing on the second packet through the second thread.
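  • Putting the above together, the per-thread loop can be sketched as follows, assuming N=3 stages, Table 1-style block addresses, and placeholder stage bodies; the stage-0 result is the packet itself, and no intermediate result ever leaves the thread:

```python
def process_packet(packet: bytes, context: bytes) -> bytes:
    """One thread performs all N stages on one packet; nothing is ever
    copied to another thread."""
    # (offset, length) of each stage's information block, in the style of
    # Table 1 of this document; the stage bodies below are placeholders.
    blocks = {1: (100, 50), 2: (150, 100), 3: (250, 50)}

    def run_stage(j: int, info_block: bytes, prev: bytes) -> bytes:
        # Stand-in for "execute the jth program segment".
        return prev + b"|s" + str(j).encode()

    result = packet                       # processing result of stage 0
    for j in (1, 2, 3):                   # j traverses [1, N] with N = 3
        off, ln = blocks[j]
        result = run_stage(j, context[off:off + ln], result)
    return result                         # stage-N task processing result

print(process_packet(b"pkt", bytes(300)))   # -> b'pkt|s1|s2|s3'
```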
  • When the processor performs task processing on a packet, it generally needs to perform several acceleration operations first. For example, the processor needs to check the packet's data integrity field (English: data integrity field, abbreviation: DIF) in advance to ensure the packet is complete, including CRC, IP checksum, and so on. DIF checks such as CRC and checksum can be regarded as acceleration operations on the packet. In addition, operations such as packet parse, packet edit, and table lookup (that is, looking up the packet's matching entries) can also be regarded as acceleration operations on the packet. In the prior art, these acceleration operations are performed by the processor itself; generally, acceleration engines need to be built on the processor's CPU according to the acceleration functions required by the task, yielding a customized CPU. Customized CPUs are expensive, and once built, their hardware structure is difficult to change.
  • But unlike complex task processing operations, acceleration operations tend to be logically simple, highly repetitive, and single-function, and can be implemented with simple hardware circuits. Therefore, optionally, the present application sets an independent accelerator 507 in the network card and concentrates the acceleration processing of packets in the accelerator 507 for execution; please refer to FIG. 5(b).
  • The accelerator 507 is a pure hardware circuit; it may be a circuit that integrates multiple acceleration functions into one, or a collection of multiple acceleration unit circuits. For example, the accelerator 507 may include one or more of the following acceleration units: a CRC unit 5071 for performing CRC checks, a checksum unit 5072 for performing checksum checks, a packet parser (English: packet parser, abbreviation: parser) 5073, a packet editor (English: packet editor, abbreviation: PE) 5074, and a table lookup unit 5075 for performing table lookup operations. The accelerator 507 may further include other acceleration units, or a combination circuit of several of the above units; this is not limited here.
  • After the task interface receives the first packet, the accelerator 507 performs acceleration operations on the first packet to obtain the accelerated first packet. Where this document says that the scheduler 508 loads the first packet for the first thread, this may specifically mean loading the accelerated first packet for the first thread; and the 0th-stage processing result of the first packet mentioned above may specifically be the accelerated first packet.
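  • A toy model of this division of labor is sketched below: `zlib.crc32` stands in for the CRC unit 5071, a trivial sum for the checksum unit 5072, and a fixed header split for the parser 5073. The real units are hardware circuits; the dictionary returned here merely plays the role of the metadata handed to the scheduler as the stage-0 result:

```python
import zlib

def accelerate(packet: bytes) -> dict:
    """Model of accelerator 507: run fixed-function units over the raw
    packet and return the result as metadata for the scheduler."""
    return {
        "crc32": zlib.crc32(packet),        # CRC unit 5071
        "checksum": sum(packet) & 0xFFFF,   # checksum unit 5072 (toy)
        "header": packet[:14],              # parser 5073: e.g. an L2 header
        "payload": packet[14:],
    }

meta = accelerate(b"\x00" * 14 + b"hello")
print(meta["crc32"], len(meta["payload"]))
```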
  • In the present application, the accelerator 507 is responsible for one or more kinds of acceleration processing of packets, reducing the kinds of acceleration processing the processor 506 must perform. When the accelerator 507 is responsible for all acceleration processing of packets, the processor 506 does not need to perform any acceleration operations at all. Therefore, the processor 506 in the present application can be a common general-purpose CPU; there is no need to specially customize a CPU with multiple acceleration engines, which further reduces the cost of the network card.
  • In addition, the network card provided by the present application may further include an optional DMA module 503, which is basically the same as the existing DMA module 203 and is not described again here. The network card provided by the present application may further include one or more of a context management module, a management processor, a DFX module, a queue management module, a PPL, a Timer, and the like; for details, refer to the related description of the embodiment shown in FIG. 2, which is not repeated here.
  • Based on the network card structure shown in FIG. 5, the present application further provides a task processing method that requires coordination between software and hardware. For its flow, please refer to FIG. 6, which includes:
  • 601. Obtain the first packet to be processed.
  • This embodiment takes the network card's processing of the first packet as an example. First, the network card obtains the first packet to be processed. The first packet may be an uplink packet or a downlink packet, and may be obtained from the Ethernet through the network interface of the network card or from the server through the host interface of the network card; this is not limited here.
  • 602. The accelerator performs acceleration processing on the first packet and sends the accelerated first packet to the scheduler.
  • First, the accelerator performs acceleration processing on the first packet. The acceleration processing includes one or more acceleration operations such as CRC check, checksum check, packet edit, packet parse, and table lookup.
  • After the acceleration processing, the first packet takes the form of metadata, and the accelerator sends the first packet in metadata form to the scheduler.
  • 603. The scheduler determines the first thread used to process the first packet and loads the accelerated first packet for the first thread.
  • Unlike the embodiment shown in FIG. 4, in this embodiment the scheduling of the threads in the resource pool is performed by the scheduler, no longer by a main CPU or a main thread. Therefore, in this step, the scheduler searches the processor's resource pool for an idle first thread and allocates it to the first packet.
  • After determining the first thread, the scheduler loads the first packet in metadata form into the DCache of the first thread.
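  • The allocation in this step, together with the later release of the thread back into the pool, can be sketched as a simple queue of idle threads; the names `dispatch` and `release`, and the dictionary standing in for a per-thread DCache, are invented for illustration:

```python
from collections import deque

idle_threads = deque(["thread0", "thread1", "thread2"])   # resource pool
dcache = {}                                               # toy per-thread DCache

def dispatch(packet_metadata: dict) -> str:
    # Step 603: pick an idle thread and load the metadata-form packet
    # into that thread's DCache.
    thread = idle_threads.popleft()
    dcache[thread] = packet_metadata
    return thread

def release(thread: str) -> None:
    # After the Nth stage, the thread returns to the pool as idle.
    dcache.pop(thread, None)
    idle_threads.append(thread)

t = dispatch({"payload": b"hello"})
print(t, dcache)
release(t)
```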
  • 604. The DMA module obtains the context information of the task and saves it in the network card memory.
  • The DMA module obtains the context information of the task from the server and saves it in the network card memory.
  • The present application does not limit the order of steps 601 to 604: step 604 may be performed before any of steps 601 to 603, and the order of steps 602 and 603 may also be reversed, as long as steps 602 and 603 come after step 601.
  • Similar to Embodiment 4, the task processing in this embodiment is also divided into N stages; the task program is likewise divided into the first program segment, the second program segment, ..., the Nth program segment; and the context information is likewise divided into the first information block, the second information block, ..., the Nth information block. For the specific division method, refer to the description in Embodiment 4; it is not repeated here. The division of the information blocks is recorded in the global configuration table shown in Table 1 and stored in the data memory area of the network card memory. When the i-th stage of task processing is to be executed, the scheduler accesses the corresponding i-th information block according to the global configuration table.
  • After the network card has executed steps 601 to 604, the preparation for the task flow is complete. Then, starting from j=1, the network card executes steps 605 and 606 in a loop, so that the first thread sequentially performs the N stages of task processing on the first packet:
  • 605. The scheduler loads the jth program segment and the jth information block for the first thread.
  • The scheduler loads the jth information block from the network card memory into the DCache of the first thread, and loads the jth program segment from the network card memory into the ICache of the first thread.
  • The scheduler can also modify the pointer of the first thread to point to the jth program segment, and then wake up the first thread so that the first thread can execute the jth program segment.
  • 606. Through the first thread, the processor performs the jth-stage task processing on the first packet according to the jth information block and the processing result of the j-1th stage of the first packet, obtaining the processing result of the jth stage of the first packet.
  • Through the first thread, the processor executes the jth program segment according to the jth information block and the j-1th-stage processing result of the first packet, thereby performing the jth-stage task processing on the first packet; the resulting jth-stage processing result is temporarily stored in the DCache of the first thread. The j-1th-stage processing result of the first packet was also obtained by the first thread, so there is no need to copy it from another thread. In particular, when j=1, the 0th-stage processing result of the first packet is the first packet itself.
  • If j < N after step 606 is performed, j is incremented by 1 and step 605 is executed again. After steps 605 and 606 have been executed for j = 1, 2, ..., N, the network card has completed the task processing of the first packet; the Nth-stage task processing result is the final task processing result of the first packet.
  • After completing the task processing of the first packet, the scheduler can release the first thread back into the resource pool as an idle thread. The network card may then forward the first packet along a predetermined forwarding path, either to the Ethernet through the network interface or to the server through the host interface.
  • Similar to the embodiment shown in FIG. 4(a), this embodiment uses one thread to perform all stages of task processing, which reduces the task processing overhead, shrinks the program, and improves program flexibility, thereby comprehensively improving the network card's task processing performance relative to the prior art. In addition, this embodiment still leaves task processing operations with complex logic, large computational overhead, and high evolution requirements to the processor's threads, while handing acceleration operations with simple logic, low computational overhead, and high repetitiveness to the hardware accelerator. This combines the flexibility of software with the high performance of hardware, improving the task processing performance of the network card.
  • Step 604 is optional: when the context information is already saved in the network card memory, or no DMA module is provided in the network card, step 604 may be omitted. Step 602 is also optional: when no accelerator is provided in the network card, or the acceleration operations on packets are performed by the processor, step 602 may be omitted, and the processor runs a thread to perform the acceleration operations on the packet.
  • Optionally, while the scheduler loads the jth program segment and the jth information block, the first thread may be temporarily suspended (English: suspend); a suspended thread stops its task processing operations, which saves power. After the jth information block has been loaded, the scheduler wakes the first thread to continue the task processing operations.
  • Optionally, while the scheduler loads the jth program segment and the jth information block, the first thread may also first perform those jth-stage operations that do not need the context, to save task processing time.
  • Optionally, if the ICache space of the first thread is sufficient, the first thread may also load multiple or even all of the program segments into the ICache at once, and then execute the program segments stage by stage through the pointer.
  • Optionally, if acceleration operations are still needed during the task processing flow, for example internal table lookups, the scheduler again schedules an accelerator such as the table lookup unit to perform the acceleration operation.
  • The embodiment shown in FIG. 6 introduces the task processing flow of the network card only from the perspective of the first packet. If a second packet, a third packet, or more packets are still to be processed, the network card has the processor allocate a second thread, a third thread, or other threads to process them according to the method shown in FIG. 6; this is not repeated here.
  • The network card can also process multiple packets in parallel using the pipeline mechanism shown in FIG. 4(b): if the first thread has not yet finished the jth stage when the processor finishes the j-1th stage through the second thread, the scheduler temporarily suspends the second thread; after the first thread has finished the jth stage, the scheduler wakes the second thread and loads the jth information block for it so that it can execute the jth stage. The scheduler can schedule the remaining threads in a similar way, which is not repeated here. In this way, multiple threads are scheduled staggered by stage, so that they can process multiple packets in parallel without read/write conflicts, improving the throughput and efficiency of the task.
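  • This staggered pipeline can be modeled with one completion event per (thread, stage): thread q may enter stage j only after thread q-1 has left it. The runnable sketch below uses Python threads and events purely as stand-ins for the hardware threads and the scheduler's suspend/wake mechanism:

```python
import threading

N, P = 3, 3                          # N stages, P packets/threads
# done[q][j] is set once thread q has finished stage j of packet q.
done = [[threading.Event() for _ in range(N + 1)] for _ in range(P)]

def worker(q: int) -> None:
    result = f"pkt{q}"                   # stage-0 processing result
    for j in range(1, N + 1):
        if q > 0:
            done[q - 1][j].wait()        # stay suspended until the thread
                                         # ahead has left stage j
        result += f".s{j}"               # stage j, using information block j
        done[q][j].set()                 # report stage j complete
    print(f"thread {q}: {result}")

threads = [threading.Thread(target=worker, args=(q,)) for q in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```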
  • Optionally, the locking and unlocking of information blocks mentioned in the embodiment shown in FIG. 4(b) may also be performed by the scheduler. For example, while the first thread executes the jth stage, the scheduler locks the jth information block to ensure that it cannot be accessed by other threads. If the second thread is due to perform the jth-stage task processing at this time, then since the second thread cannot acquire the jth information block, the scheduler may temporarily suspend the second thread. After finishing the jth stage, the first thread sends first indication information to the scheduler to report that the task processing operation of the current stage is complete. The scheduler unlocks the jth information block according to the first indication information, loads the jth information block for the second thread, and then wakes the second thread to perform the jth-stage task processing. In this way, even when packets are processed in parallel in pipeline fashion, access conflicts caused by multiple threads rewriting one information block at the same time are avoided. Also optionally, since the stages of a task are executed in order, the scheduler can, after unlocking the jth information block locked for the first thread, proactively lock the j+1th information block for the first thread, without waiting for the first thread to send second indication information requesting that the j+1th information block be locked. This reduces the instruction interaction between threads and the scheduler and further improves the performance of the network card.
  • It is worth pointing out that some stages of task processing may be empty operations. If an empty-operation stage is skipped, the locking, loading, and unlocking of the information block corresponding to that stage should be skipped as well. In this embodiment, however, the locations of the information blocks of each stage are obtained by the scheduler searching the global configuration table sequentially. Skipping the locking, loading, and unlocking of certain information blocks would require the scheduler to search the global configuration table non-sequentially, which places high demands on the scheduler's intelligence. Since the scheduler is built from pure hardware circuits, making it more intelligent inevitably makes its circuit design more complicated, greatly increasing the power consumption, cost, and circuit area of the hardware.
  • To solve this problem, in some embodiments of the present application, a processor thread may perform no task processing operation in an empty-operation stage, but still send indication information to the scheduler indicating that the task processing operation of the current stage is complete. The scheduler then simply processes the information blocks of each stage in sequence according to the records of the global configuration table. Taking the case where the jth stage is an empty operation as an example: after the first thread finishes the j-1th-stage task processing, the scheduler unlocks the j-1th information block and proactively locks and loads the jth information block for the first thread. The first thread determines that the jth stage is an empty operation, so it performs no task processing but still sends the indication information to the scheduler. According to the indication information, the scheduler unlocks the jth information block and proactively locks and loads the j+1th information block for the first thread. In this way, the scheduler can, according to the global configuration table and in order from the first stage to the Nth stage, lock, load, and unlock the information block of every stage for the threads, without having to skip the information blocks of empty-operation stages. This lowers the demands on the scheduler's intelligence and simplifies hardware costs.
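  • Under this convention, the thread's side of an empty-operation stage is trivial: skip the computation but still report completion, so the scheduler can walk the global configuration table strictly in order. A sketch follows, with the invented helper `signal_stage_done` standing in for the indication information:

```python
def signal_stage_done(j: int) -> None:
    # Stand-in for the indication information sent to the scheduler.
    print(f"stage {j} done")

def run_all_stages(packet, info_blocks, stage_fns):
    """stage_fns[j-1] is the jth stage's function, or None for an empty
    (no-op) stage. The thread always signals completion, so the scheduler
    can lock/load/unlock the blocks strictly in table order."""
    result = packet
    for j, fn in enumerate(stage_fns, start=1):
        if fn is not None:                 # real stage: use block j
            result = fn(info_blocks[j - 1], result)
        # Empty-operation stage: no task processing, but completion is
        # still reported exactly as for a real stage.
        signal_stage_done(j)
    return result

print(run_all_stages(b"pkt", [b"ctx1", b"", b"ctx3"],
                     [lambda b, r: r + b"|1", None, lambda b, r: r + b"|3"]))
```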
  • Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, modules, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; they are not repeated here.
  • Descriptions such as "first" and "second" in the present application are only used to distinguish different technical features and are not used to further limit the technical features. For example, the "first thread" in this application may also serve as the "second thread" in practical applications, and the "first packet" may also serve as the "second packet".
  • In the several embodiments provided in this application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, modules, or units, and may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiments.
  • In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a program functional unit.
  • If the integrated unit is implemented in the form of a program functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application essentially, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a program product; the computer program product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present application provides a task processing method for improving the task processing performance of a network card. The task processing method provided by the present application includes: obtaining P packets to be processed, determining the threads corresponding to the P packets, and loading the P packets into the corresponding threads; and performing N stages of task processing on each packet through the thread corresponding to that packet, to obtain the Nth-stage task processing result of each packet. The present application also provides a related network card.


Claims (20)

  1. A task processing method, used by a network card to perform task processing on packets in a network, wherein the task processing is divided into N stages in order of execution, a processor of the network card runs multiple threads, and N is a positive integer, the method comprising:
    obtaining P packets to be processed, where P is a positive integer;
    determining, for each of the P packets, the thread corresponding to that packet, and loading the P packets into the corresponding threads, wherein each packet corresponds to one thread; and
    performing the N stages of task processing in sequence on each packet through the thread corresponding to that packet, to obtain the Nth-stage task processing result of each packet.
  2. The task processing method according to claim 1, wherein the network card memory of the network card includes context information of the task processing, the context information including N information blocks, wherein the i-th information block includes the context information required to perform the i-th stage of task processing;
    the Qth packet of the P packets corresponds to the Qth thread, where Q is any positive integer not greater than P; and
    when the jth-stage task processing is performed on the Qth packet, the jth information block is loaded for the Qth thread, and through the Qth thread, the jth-stage task processing is performed on the Qth packet according to the jth information block and the processing result of the j-1th stage of the Qth packet, to obtain the processing result of the jth stage of the Qth packet, where 1≤j≤N, and the processing result of the 0th stage of the Qth packet is the Qth packet.
  3. The task processing method according to claim 1 or 2, wherein the P packets include a first packet and a second packet, the first packet corresponding to a first thread and the second packet corresponding to a second thread; and
    the method further comprises: after the first thread has finished the jth-stage task processing of the first packet, loading the jth information block for the second thread.
  4. The task processing method according to claim 2 or 3, further comprising:
    locking the jth information block for the Qth thread while the jth-stage task processing is performed on the Qth packet through the Qth thread; and
    unlocking the jth information block after the jth-stage task processing of the Qth packet has been finished through the Qth thread.
  5. The task processing method according to claim 4, further comprising, after the unlocking of the jth information block:
    if currently j<N, locking the j+1th information block for the Qth thread.
  6. The task processing method according to any one of claims 1 to 5, further comprising:
    after the jth-stage task processing of the Qth packet has been finished through the Qth thread, suspending the Qth thread, and waking the Qth thread after the j+1th information block has been loaded for the Qth thread.
  7. The task processing method according to any one of claims 1 to 6, further comprising, after the obtaining of the P packets to be processed: performing acceleration processing on the P packets to obtain P accelerated packets; and
    wherein the determining of the thread corresponding to each of the P packets and the sending of the P packets to the corresponding threads comprises: determining the thread corresponding to each of the P packets, and sending the P accelerated packets respectively to the threads corresponding to the P packets.
  8. The task processing method according to any one of claims 2 to 7, wherein the network card memory further includes a global configuration table, the global configuration table being used to record address information of the N information blocks; and
    the loading of the jth information block for the Qth thread comprises: loading the jth information block for the Qth thread according to the address information of the jth information block in the global configuration table.
  9. The task processing method according to claim 8, further comprising:
    if the task processing is updated from the N stages to M new stages, receiving a modification instruction, the modification instruction being used to modify the address information of the N information blocks recorded in the global configuration table into address information of M new information blocks, wherein among the M new information blocks, the kth new information block includes the context information required to perform the kth new stage of task processing, 1≤k≤M.
  10. The task processing method according to any one of claims 1 to 9, wherein the network card memory further stores an executable file of the task processing, the executable file including N program segments, wherein the i-th program segment includes program instructions for performing the i-th stage of task processing;
    the method further comprises: before the jth-stage task processing is performed on the Qth packet through the Qth thread, loading the jth program segment for the Qth thread, and adjusting the pointer of the Qth thread to point to the jth program segment; and
    the performing of the jth-stage task processing on the Qth packet according to the jth information block and the processing result of the j-1th stage of the Qth packet comprises: executing the jth program segment according to the jth information block and the processing result of the j-1th stage of the Qth packet, to perform the jth-stage task processing on the Qth packet.
  11. A network card for performing task processing on packets in a network, wherein the task processing is divided into N stages in order of execution, and the network card comprises a processor, a network card memory, a scheduler, a task interface, and a bus, the processor running multiple threads, where N is a positive integer;
    the task interface is configured to obtain P packets to be processed, where P is a positive integer;
    the scheduler is configured to determine, for each of the P packets, the thread corresponding to that packet, and send the P packets to the corresponding threads, wherein each packet corresponds to one thread; and
    the processor is configured to perform the N stages of task processing in sequence on the Qth packet through the Qth thread, to obtain the Nth-stage task processing result of the Qth packet.
  12. The network card according to claim 11, wherein the network card memory includes context information of the task processing, the context information including N information blocks, wherein the i-th information block includes the context information required to perform the i-th stage of task processing;
    the Qth packet of the P packets corresponds to the Qth thread, where Q is any positive integer not greater than P;
    the scheduler is further configured to load the jth information block for the Qth thread before the processor performs the jth-stage task processing on the Qth packet through the Qth thread; and
    the processor is specifically configured to: when the jth-stage task processing is performed on the Qth packet, perform, through the Qth thread, the jth-stage task processing on the Qth packet according to the jth information block and the processing result of the j-1th stage of the Qth packet, to obtain the processing result of the jth stage of the Qth packet, where 1≤j≤N, and the processing result of the 0th stage of the Qth packet is the Qth packet.
  13. The network card according to claim 11 or 12, wherein the P packets include a first packet and a second packet, the first packet corresponding to a first thread and the second packet corresponding to a second thread; and
    the scheduler is specifically configured to load the jth information block for the second thread after the first thread has finished the jth-stage task processing of the first packet.
  14. The network card according to claim 12 or 13, wherein the scheduler is further configured to:
    lock the jth information block for the Qth thread while the processor performs the jth-stage task processing on the Qth packet through the Qth thread; and
    unlock the jth information block after the processor has finished the jth-stage task processing of the Qth packet through the Qth thread.
  15. The network card according to claim 14, wherein the scheduler is further configured to:
    after the unlocking of the jth information block, if currently j<N, lock the j+1th information block for the Qth thread.
  16. The network card according to any one of claims 11 to 15, wherein the scheduler is further configured to:
    after the processor has finished the jth-stage task processing of the Qth packet through the Qth thread, suspend the Qth thread, and wake the Qth thread after the j+1th information block has been loaded for the Qth thread.
  17. The network card according to any one of claims 11 to 16, further comprising an accelerator configured to perform acceleration processing on the P packets after the task interface receives the P packets to be processed, to obtain the P accelerated packets; and
    the scheduler is specifically configured to determine that the Qth packet corresponds to the Qth thread, and load the accelerated Qth packet for the Qth thread.
  18. The network card according to any one of claims 12 to 17, wherein the network card memory further includes a global configuration table, the global configuration table being used to record address information of the N information blocks; and
    the scheduler is specifically configured to load the jth information block for the Qth thread according to the address information of the jth information block in the global configuration table.
  19. The network card according to claim 18, wherein the task interface is further configured to:
    if the task processing is updated from the N stages to M new stages, receive a modification instruction, the modification instruction being used to modify the address information of the N information blocks recorded in the global configuration table into address information of M new information blocks, wherein among the M new information blocks, the kth new information block includes the context information required to perform the kth new stage of task processing, 1≤k≤M.
  20. The network card according to any one of claims 11 to 19, wherein the network card memory further stores an executable file of the task processing, the executable file including N program segments, wherein the i-th program segment includes program instructions for performing the i-th stage of task processing;
    the scheduler is further configured to: before the processor performs the jth-stage task processing on the Qth packet through the Qth thread, load the jth program segment for the Qth thread, and adjust the pointer of the Qth thread to point to the jth program segment; and
    the processor is specifically configured to: through the Qth thread, execute the jth program segment according to the jth information block and the processing result of the j-1th stage of the Qth packet, to perform the jth-stage task processing on the Qth packet.
PCT/CN2016/092316 2016-07-29 2016-07-29 一种任务处理方法以及网卡 WO2018018611A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2016/092316 WO2018018611A1 (zh) 2016-07-29 2016-07-29 一种任务处理方法以及网卡
CN201680002876.7A CN107077390B (zh) 2016-07-29 2016-07-29 一种任务处理方法以及网卡
CN202110713436.5A CN113504985B (zh) 2016-07-29 2016-07-29 一种任务处理方法以及网络设备
CN202110711393.7A CN113504984A (zh) 2016-07-29 2016-07-29 一种任务处理方法以及网络设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/092316 WO2018018611A1 (zh) 2016-07-29 2016-07-29 一种任务处理方法以及网卡

Publications (1)

Publication Number Publication Date
WO2018018611A1 true WO2018018611A1 (zh) 2018-02-01

Family

ID=59624647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/092316 WO2018018611A1 (zh) 2016-07-29 2016-07-29 一种任务处理方法以及网卡

Country Status (2)

Country Link
CN (3) CN113504984A (zh)
WO (1) WO2018018611A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112313625A (zh) * 2018-06-19 2021-02-02 微软技术许可有限责任公司 动态混合计算环境
CN113612837A (zh) * 2021-07-30 2021-11-05 杭州朗和科技有限公司 数据处理方法、装置、介质和计算设备
CN113821174A (zh) * 2021-09-26 2021-12-21 迈普通信技术股份有限公司 存储处理方法、装置、网卡设备及存储介质
CN115473861A (zh) * 2022-08-18 2022-12-13 珠海高凌信息科技股份有限公司 基于通信与计算分离的高性能处理系统和方法、存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818016A (zh) * 2017-11-22 2018-03-20 苏州麦迪斯顿医疗科技股份有限公司 服务器应用程序设计方法、请求事件处理方法及装置
CN109831394B (zh) * 2017-11-23 2021-07-09 华为技术有限公司 数据处理方法、终端以及计算机存储介质
CN110262884B (zh) * 2019-06-20 2023-03-24 山东省计算中心(国家超级计算济南中心) 一种基于申威众核处理器的核组内多程序多数据流分区并行的运行方法
CN111031011B (zh) * 2019-11-26 2020-12-25 中科驭数(北京)科技有限公司 Tcp/ip加速器的交互方法和装置
CN113383531B (zh) * 2019-12-25 2022-10-11 华为技术有限公司 转发设备、网卡及报文转发方法
CN111245794B (zh) * 2019-12-31 2021-01-22 中科驭数(北京)科技有限公司 数据传输方法和装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7248585B2 (en) * 2001-10-22 2007-07-24 Sun Microsystems, Inc. Method and apparatus for a packet classifier
CN101739242A (zh) * 2009-11-27 2010-06-16 宇盛通信科技(深圳)有限公司 一种流数据处理方法及流处理器
CN103019806A (zh) * 2011-09-22 2013-04-03 北京新媒传信科技有限公司 一种异步任务处理方法和装置
CN105075204A (zh) * 2013-03-12 2015-11-18 高通股份有限公司 可配置的多核网络处理器

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101436989B (zh) * 2008-12-26 2010-10-27 福建星网锐捷网络有限公司 一种转发报文的方法及装置
CN101540727B (zh) * 2009-05-05 2012-05-09 曙光信息产业(北京)有限公司 一种ip报文的硬件分流方法
CN101968748B (zh) * 2010-09-17 2014-04-02 北京星网锐捷网络技术有限公司 多线程数据调度方法、装置及网络设备
CN101964749A (zh) * 2010-09-21 2011-02-02 北京网康科技有限公司 一种基于多核构架的报文转发方法及系统
WO2012106905A1 (zh) * 2011-07-20 2012-08-16 华为技术有限公司 报文处理方法及装置
CN102331923B (zh) * 2011-10-13 2015-04-22 西安电子科技大学 一种基于多核多线程处理器的功能宏流水线实现方法
US20130283280A1 (en) * 2012-04-20 2013-10-24 Qualcomm Incorporated Method to reduce multi-threaded processor power consumption
CN102710497A (zh) * 2012-04-24 2012-10-03 汉柏科技有限公司 多核多线程网络设备的报文处理方法及系统
CN102752198B (zh) * 2012-06-21 2014-10-29 北京星网锐捷网络技术有限公司 多核报文转发方法、多核处理器及网络设备
US9588782B2 (en) * 2013-03-18 2017-03-07 Tencent Technology (Shenzhen) Company Limited Method and device for processing a window task
CN105700937A (zh) * 2016-01-04 2016-06-22 北京百度网讯科技有限公司 多线程任务处理方法和装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7248585B2 (en) * 2001-10-22 2007-07-24 Sun Microsystems, Inc. Method and apparatus for a packet classifier
CN101739242A (zh) * 2009-11-27 2010-06-16 宇盛通信科技(深圳)有限公司 一种流数据处理方法及流处理器
CN103019806A (zh) * 2011-09-22 2013-04-03 北京新媒传信科技有限公司 一种异步任务处理方法和装置
CN105075204A (zh) * 2013-03-12 2015-11-18 高通股份有限公司 可配置的多核网络处理器

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112313625A (zh) * 2018-06-19 2021-02-02 微软技术许可有限责任公司 动态混合计算环境
CN113612837A (zh) * 2021-07-30 2021-11-05 杭州朗和科技有限公司 数据处理方法、装置、介质和计算设备
CN113612837B (zh) * 2021-07-30 2023-08-08 杭州朗和科技有限公司 数据处理方法、装置、介质和计算设备
CN113821174A (zh) * 2021-09-26 2021-12-21 迈普通信技术股份有限公司 存储处理方法、装置、网卡设备及存储介质
CN113821174B (zh) * 2021-09-26 2024-03-22 迈普通信技术股份有限公司 存储处理方法、装置、网卡设备及存储介质
CN115473861A (zh) * 2022-08-18 2022-12-13 珠海高凌信息科技股份有限公司 基于通信与计算分离的高性能处理系统和方法、存储介质
CN115473861B (zh) * 2022-08-18 2023-11-03 珠海高凌信息科技股份有限公司 基于通信与计算分离的高性能处理系统和方法、存储介质

Also Published As

Publication number Publication date
CN107077390B (zh) 2021-06-29
CN113504985A (zh) 2021-10-15
CN107077390A (zh) 2017-08-18
CN113504984A (zh) 2021-10-15
CN113504985B (zh) 2022-10-11

Similar Documents

Publication Publication Date Title
WO2018018611A1 (zh) 一种任务处理方法以及网卡
US11042501B2 (en) Group-based data replication in multi-tenant storage systems
US8914805B2 (en) Rescheduling workload in a hybrid computing environment
US10706009B2 (en) Techniques to parallelize CPU and IO work of log writes
US10733019B2 (en) Apparatus and method for data processing
US20160283282A1 (en) Optimization of map-reduce shuffle performance through shuffler i/o pipeline actions and planning
US8739171B2 (en) High-throughput-computing in a hybrid computing environment
US8046758B2 (en) Adaptive spin-then-block mutual exclusion in multi-threaded processing
KR101686010B1 (ko) 실시간 멀티코어 시스템의 동기화 스케쥴링 장치 및 방법
US10402223B1 (en) Scheduling hardware resources for offloading functions in a heterogeneous computing system
US20120297216A1 (en) Dynamically selecting active polling or timed waits
TW200540705A (en) Methods and apapratus for processor task migration in a multi-processor system
JP2005235228A (ja) マルチプロセッサシステムにおけるタスク管理方法および装置
WO2023103296A1 (zh) 一种写数据高速缓存的方法、系统、设备和存储介质
US20040055002A1 (en) Application connector parallelism in enterprise application integration systems
US20210303375A1 (en) Multithreaded lossy queue protocol
US10289306B1 (en) Data storage system with core-affined thread processing of data movement requests
Liu et al. Optimizing shuffle in wide-area data analytics
US10095627B2 (en) Method and system for efficient communication and command system for deferred operation
US10776012B2 (en) Lock-free datapath design for efficient parallel processing storage array implementation
CN112306652A (zh) 带有上下文提示的功能的唤醒和调度
Geyer et al. Pipeline Group Optimization on Disaggregated Systems.
TWI823655B (zh) 適用於智慧處理器的任務處理系統與任務處理方法
US11860785B2 (en) Method and system for efficient communication and command system for deferred operation
Jo et al. Request-aware Cooperative {I/O} Scheduling for Scale-out Database Applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16910191

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16910191

Country of ref document: EP

Kind code of ref document: A1